12/11/00 JosephJ Fix for #23727

23727 wlbs drain all command should return an error message
if no port rules exist.

The problem (if you can call it that) is that if there are NO user-specified
port rules, we treat port-specific operations directed to "ALL" ports as
successful. These commands are start, stop, drain and set (adjust weights).
Fix is for Load_port_change to return IOCTL_CVY_NOT_FOUND in this case.
Note that Load_port_change does some special casing for
IOCTL_CVY_CLUSTER_DRAIN and IOCTL_CVY_CLUSTER_PLUG -- it includes
the default port rule.

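The special casing described above can be sketched as follows (hypothetical
types and signature; the real Load_port_change operates on the load-module
context and does full rule matching):

```c
#include <assert.h>

/* Sketch of the 12/11/00 fix: port operations aimed at ALL ports fail with
   IOCTL_CVY_NOT_FOUND when no user-defined port rules exist, EXCEPT for
   cluster-wide drain/plug, which also cover the implicit default port rule. */
typedef enum { IO_OK, IOCTL_CVY_NOT_FOUND } IoResult;
typedef enum { OP_START, OP_STOP, OP_DRAIN, OP_SET,
               IOCTL_CVY_CLUSTER_DRAIN, IOCTL_CVY_CLUSTER_PLUG } PortOp;

#define ALL_PORTS 0xFFFFFFFFu  /* hypothetical "all ports" sentinel */

static IoResult load_port_change(unsigned port, PortOp op, int num_user_rules)
{
    int rules = num_user_rules;

    /* Cluster-wide drain/plug include the default port rule. */
    if (op == IOCTL_CVY_CLUSTER_DRAIN || op == IOCTL_CVY_CLUSTER_PLUG)
        rules += 1;

    if (port == ALL_PORTS && rules == 0)
        return IOCTL_CVY_NOT_FOUND;  /* the fix for #23727 */

    return IO_OK;  /* actual rule matching elided */
}
```
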
07.17.01 shouse

Due to a change in user-space where we no longer disable and re-enable the
adapter when the MAC address changes, the ded_mac_addr will now ALWAYS be
the burnt-in MAC address of the adapter, whereas it had been the NLB 02-bf
MAC address because by the time NLB bound to the adapter, it had already
picked up the new MAC address. Now that is no longer the case, which
should not be a problem because all indications are that this was the way
it was in Win2k until we started disabling/enabling the adapters in
SP 1. However, an alignment issue resulted in a bug fix that appears to
rely on the fact that in unicast mode, the ded_mac_addr is the cl_mac_addr.
This fix was a hack, and doesn't seem to have really been thought out
anyway, because the code added was guaranteed to always be a no-op; it
amounted to "if (foo == 2) { foo = 2; }". Anyway, this "fix" was also
only applied in one of the three places where the exact same code resided, so
the fixed "fix" has been propagated to all three places. The fix involves
spoofing source MAC addresses in unicast mode to prevent network switches
from learning the cluster MAC address. Rather than simply casting a
pointer to a PULONG and dereferencing it to set a ULONG, which may cause
an alignment fault, we set each byte of the ULONG individually to avoid
the alignment issue.

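A minimal illustration of the byte-wise store described above (hypothetical
helper; the driver's actual MAC-spoofing code differs):

```c
#include <assert.h>

typedef unsigned long  ULONG;
typedef unsigned char  UCHAR;

/* Instead of casting an arbitrary byte pointer to a PULONG and storing
   through it -- which can raise an alignment fault on architectures that
   require naturally aligned accesses -- write the 32-bit value one byte
   at a time. Single-byte stores have no alignment requirement. */
static void store_ulong_unaligned(UCHAR *dest, ULONG value)
{
    dest[0] = (UCHAR)( value        & 0xff);  /* little-endian layout, */
    dest[1] = (UCHAR)((value >> 8)  & 0xff);  /* matching x86          */
    dest[2] = (UCHAR)((value >> 16) & 0xff);
    dest[3] = (UCHAR)((value >> 24) & 0xff);
}
```
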
10.21.01 shouse

Amendment to the above statement concerning the dedicated MAC address. It
appears that since sending a property change notification to the NIC results
in NDIS tearing down and rebuilding all bindings, by the time the adapter
is back up and running and NLB queries for the dedicated MAC address, the
adapter will have already picked up the 02-bf MAC address, so the statement
that the dedicated MAC address would now be the burnt-in MAC is not entirely
accurate.

10.21.01 shouse

Some lingering issues and their resolutions from a conversation with Bill Bain:

Dirty connections: The real question has been, "Why the seemingly arbitrary
five-minute timeout?" Well, it turns out that the value is not arbitrary,
but rather was measured and based on empirical evidence. If a large number
of connections were left dangling by NLB when a "stop" was performed, this
would result in a reset "storm" if the host was quickly added back into the
cluster. It was observed that if NLB could block this traffic to the host
with the stale data, NLB could _significantly_ reduce the reset problems. So,
while it's true that this five minutes is no silver bullet, it was based on
real, measurable data and solved the problem for a significant
number of the stale connections.

PPTP: Of course, PPTP was supposed to be supported in Windows 2000, but a
cursory look at the source code shows that tracking the calls, which are
GRE packets, did NOT work in Windows 2000. GRE packets were supposed to be
treated like TCP data packets on the PPTP tunnel (TCP connection), and since
no port numbers from the PPTP tunnel are recoverable in a GRE packet, NLB
hard-coded the source and destination ports to zero and 1723, respectively.
The 1723 corresponds to the server port number of the PPTP tunnel, and the zero
is arbitrary and as good a choice for a source port as any. So, GRE packets
would be hashed the same as the TCP tunnel in single affinity, sticking the
GRE traffic to the correct host. However, when ambiguity arose (unoptimized
mode), GRE packets were looking for a descriptor with a source port of zero
and a destination port of 1723. Because the tunnel was established with the
ephemeral port assigned by TCP on the client machine, no descriptor would
EVER be found, and the packets were discarded. What was _intended_ was to
create the descriptor for the PPTP tunnel using the same hard-coded source
port of zero. In that case, GRE packets would find a matching descriptor
when necessary. This was the small piece of logic missing in Windows 2000,
which will be added in an upcoming service pack. However, this fix eliminates
any method by which NLB could distinguish multiple PPTP tunnels from the same
client IP address (since the client ports are masked). So, a limitation of
this implementation is that clients may NOT establish multiple tunnels (which
they won't by default), and clients from behind a NAT are not supported, as
multiple clients behind a NAT would look like the same client to NLB,
differentiated only by source port, which NLB cannot distinguish.

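The descriptor mismatch described above can be made concrete with a small
sketch (hypothetical types; the real descriptors carry much more state):

```c
#include <assert.h>

enum { PPTP_PORT = 1723, GRE_FAKE_SRC_PORT = 0 };

/* 4-tuple under which connection descriptors are keyed (sketch). */
typedef struct {
    unsigned       src_ip, dst_ip;
    unsigned short src_port, dst_port;
} ConnKey;

/* Key under which the Win2k code created the tunnel descriptor: the
   client's REAL ephemeral port. */
static ConnKey tunnel_key(unsigned cip, unsigned sip, unsigned short eph_port)
{
    ConnKey k = { cip, sip, eph_port, PPTP_PORT };
    return k;
}

/* Key under which GRE packets searched: hard-coded source port 0. */
static ConnKey gre_key(unsigned cip, unsigned sip)
{
    ConnKey k = { cip, sip, GRE_FAKE_SRC_PORT, PPTP_PORT };
    return k;
}

static int keys_match(ConnKey a, ConnKey b)
{
    return a.src_ip == b.src_ip && a.dst_ip == b.dst_ip &&
           a.src_port == b.src_port && a.dst_port == b.dst_port;
}
```

With the Win2k behavior (descriptor keyed on the ephemeral port), the GRE
lookup can never match; the intended fix keys the tunnel descriptor on
port 0 as well, so the same lookup succeeds.
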
Fragmentation: NLB has had an "optimized" fragmentation mode in it that
didn't seem to make sense. The problem is that subsequent packets in a
fragmented segment will not have the TCP/UDP ports, which NLB needs in order
to properly filter them. The "unoptimized" mode said that if the packet in
question was the first packet of a fragment, then NLB can get to the port
numbers, so it will be treated normally and passed up only on the correct host.
Subsequent packets in the fragmented segment will not have the port numbers,
so NLB would pass them up on _all_ hosts in the cluster. The IP layer would
simply drop the fragments on the hosts that did not pass up the first packet
in the fragmented segment. So, other than a bit of extra stress on the IP
layer in the stack, this method should be guaranteed to work. The "optimized"
mode was a method by which to let NLB do the filtering in the limited cases
that it could. Basically, this mode asserted that if you have a single port
rule that covers all ports (0-65535), then the server port is essentially
irrelevant - you'd look up the same port rule regardless of what the port
actually was. Further, if that port rule was configured in single affinity,
then the client port was also irrelevant - it's not used in the hashing
algorithm. If the cluster is configured as such (which happens to be the
default), then NLB need not know the actual source ports to pass the packet
up ONLY on the correct host. Well, that is almost correct. It is true that
the client and server ports then become irrelevant insofar as port rule
lookup and hashing, but they ARE needed for descriptor lookup - if we're
hoping to find a matching connection descriptor in order to know which host
owns a particular connection, we need to know the _actual_ client and server
ports to match a descriptor. So, this "optimized" mode doesn't really work
after all. However, as it turns out, in Windows 2000, where it was introduced,
it DID actually work. Assuming you discount TCP, in which fragmentation is
_highly_ discouraged by setting maximum segment sizes appropriately, for
UDP/GRE/IPSec it DID work because those protocols did not utilize descriptors
at all - their ownership was based solely on who currently owned the bucket
to which the packet mapped. So, it's a bit muddled, but it did "work" in
Windows 2000. In .NET Server, however, this "optimized" mode has been removed
because it no longer works. This is because some UDP traffic, namely IPSec
(port 500), is now tracked through the use of descriptors. This failure was
actually found through IPSec testing, in which the initial fragment went up
on the correct server, but the subsequent fragment went up on the _wrong_
server (not all servers, as it would have in "unoptimized" mode). GRE and
IPSec protocol traffic use hard-coded ports in connection tracking, so they
continue to be ambivalent to fragments.

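The conditions under which "optimized" filtering of port-less fragments would
actually be safe reduce to a small predicate (a sketch of the reasoning above,
not the driver's code):

```c
#include <assert.h>

/* A port-less fragment can be filtered on a single host only when:
   - one port rule covers all ports (server port never affects rule lookup),
   - that rule uses single affinity (client port never affects hashing), and
   - the protocol's ownership never requires a descriptor lookup, which
     needs the REAL ports. The .NET Server change that tracks IPSec (UDP
     500) with descriptors is what broke the last condition. */
static int can_filter_without_ports(int single_rule_covers_all_ports,
                                    int rule_uses_single_affinity,
                                    int protocol_needs_descriptor_lookup)
{
    return single_rule_covers_all_ports &&
           rule_uses_single_affinity &&
           !protocol_needs_descriptor_lookup;
}
```
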
12.05.01 chrisdar

BUG 482284 NLB: stores its private state in wrong Ndis packet causes break
during standby

When there is no packet stack available in an NDIS packet for NLB to store
information, NLB needs to allocate an NDIS packet for its own use, copy the
information from the original packet into it, then deallocate it when we are
finished using it. One place where this happens is in a rarely executed code
path of Prot_recv_indicate. The bug was that in this code path, we subsequently
used the original packet and tried to access a packet stack that wasn't available.
The packet we allocated to get a packet stack wasn't used. The fix is to use the
allocated packet instead of the original.

While testing a private fix in the lab, I also made temporary changes to force
Prot_recv_indicate to use this code path for every received non-remote-control
packet.

1.21.02, shouse

Note: Due to recent changes in the GRE virtual descriptor tracking mechanism in
the driver, SINGLE affinity is now REQUIRED for PPTP. In general, single affinity
has always been "required" for VPN, but until this change was made, no affinity
would still have basically worked for PPTP. No affinity WILL STILL WORK for IPsec,
but it only helps in the case that clients come from behind a NAT device; if they
do not come from behind a NAT, the source and destination ports are ALWAYS UDP 500
anyway, which defeats any advantage no affinity might provide.

Why did no affinity previously work for PPTP?

When a PPTP tunnel is created, NLB hashes the TCP control tunnel just like any
other TCP connection. If the affinity is set to none, then it uses the TCP port
numbers during the hashing process. If the host owns the bucket to which the
TCP SYN hashes, it accepts the connection and creates state to track the PPTP
tunnel. When a PPTP tunnel is accepted, it is also necessary to create a virtual
GRE descriptor to track the GRE call data for this tunnel. When this descriptor
is created, since no ports exist in the GRE protocol, it uses the hard-coded ports
of 0 (source) and 1723 (destination). Since GRE is treated like TCP for the
purposes of port rule lookup and state maintenance, the GRE state creation in the
load module would certainly find the same port rule that the PPTP tunnel did: TCP
1723. However, if no affinity is set, it will NOT derive the same hashing result
that the PPTP tunnel did, because the source (client) ports are different; an
arbitrary port number in the PPTP SYN packet and a hard-coded port number of 0 in
the GRE "virtual connection". Therefore, the load module would end up "injecting"
a descriptor into a port rule and "bucket" that it MIGHT NOT EVEN OWN (because
bucket ownership is not considered when creating these virtual descriptors that
correspond to a real connection being serviced by a host). In general, that's
fine, and by the next heartbeat, the host that DOES own that bucket will notice
and stop blindly accepting traffic that hashes to that bucket (it moves into
non-optimized mode). So, while it SHOULD work in no affinity, this runs the risk
of unnecessarily shifting the cluster into non-optimized mode because hosts that
are not the bucket owners may handle connections on those buckets.

Why won't no affinity work any more?

Basically, because the second hash performed on the GRE "connection" has been
removed. Up-going PPTP tunnels used to require at least 3, and as many as 4,
calls to the NLB hash function. Because the hash function is a LARGE portion of
the NLB overhead, this was non-optimal and, as it happens, unnecessary. By moving
the virtual descriptor and descriptor cleanup intelligence from main.c to load.c,
these multiple calls to the hash function were eliminated. A single hash is now
performed on all packets. However, when GRE virtual descriptors are created now,
they use the hash value already computed as part of the PPTP TCP SYN processing.
This is a better solution, as it ensures that the PPTP TCP tunnel and the GRE
virtual "connection" both belong to the same bucket, and therefore the same host.
This prevents us from unnecessarily putting the cluster into a non-optimized
state. However, when GRE data packets do arrive and need to hash and perform a
state lookup, there is no way to regenerate the same hash value that was computed
during PPTP TCP tunnel setup if the affinity is set to none. That, of course, is
because the TCP source port of the PPTP tunnel is not recoverable from the GRE
packets. Therefore, to ensure that GRE packet lookup can re-calculate the
necessary hash value, single affinity is REQUIRED.

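Why the hash can or cannot be recomputed from a GRE packet follows directly
from what the hash consumes (illustrative sketch; the real NLB hash function
is different):

```c
#include <assert.h>

typedef enum { AFFINITY_NONE, AFFINITY_SINGLE } Affinity;

/* Placeholder hash: in SINGLE affinity only the client IP matters, so a
   GRE packet (hard-coded ports 0/1723) hashes identically to the TCP SYN
   that set up the tunnel. In NONE affinity the ports are mixed in, and
   the tunnel's ephemeral source port is unrecoverable from GRE packets. */
static unsigned nlb_hash(Affinity aff, unsigned client_ip,
                         unsigned short src_port, unsigned short dst_port)
{
    unsigned h = client_ip * 2654435761u;  /* arbitrary mixing step */
    if (aff == AFFINITY_NONE)
        h ^= ((unsigned)src_port << 16) | dst_port;
    return h;
}
```
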
02/14/2002 JosephJ Location of fake ndis usermode code...
\\winsefre\nt5src\private\ntos\tdi\tcpipmerge\1394\arp1394\tests

04/15/2002 JosephJ To temporarily build the um ndis stuff (needs cleaning up)

#ifdef TESTPROGRAM
#include "rmtest.h"
#define KERNEL_MODE
#else
#include <ndis.h>
/* For querying TCP about the state of a TCP connection. */
#include "ntddtcp.h"
#include "ntddip.h"
#endif // !TESTPROGRAM

04/24/2002 JosephJ diplist: Added skeleton diplist code
    diplist.c, diplist.h
    Also added code under .\test to component test the diplist code.

04/24/2002 JosephJ diplist: Added the fast lookup functionality.

04/25/2002 JosephJ diplist: Changed internal constants to "production" values.

#define MAX_ITEMS  32   // TODO: replace by appropriate CVY constant.
#define HASH1_SIZE 257  // size (in bits) of bit-vector (make it a prime)
#define HASH2_SIZE 59   // size of hashtable (make it a prime)

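The constants above suggest a two-stage membership test; a hedged sketch (the
actual diplist.c layout is not reproduced here) might look like:

```c
#include <assert.h>
#include <string.h>

#define HASH1_SIZE 257  /* bits in the first-stage bit vector (prime) */
#define HASH2_SIZE 59   /* slots in the second-stage table (prime) */

typedef struct {
    unsigned char bits[(HASH1_SIZE + 7) / 8];  /* stage 1: quick reject */
    unsigned      table[HASH2_SIZE];           /* stage 2: 0 == empty   */
} DipList;

static void dip_add(DipList *l, unsigned ip)
{
    unsigned h1 = ip % HASH1_SIZE;
    unsigned i;

    l->bits[h1 / 8] |= (unsigned char)(1u << (h1 % 8));
    for (i = 0; i < HASH2_SIZE; i++) {           /* linear probing */
        unsigned s = (ip + i) % HASH2_SIZE;
        if (l->table[s] == 0) { l->table[s] = ip; break; }
    }
}

static int dip_lookup(const DipList *l, unsigned ip)
{
    unsigned h1 = ip % HASH1_SIZE;
    unsigned i;

    if (!(l->bits[h1 / 8] & (1u << (h1 % 8))))
        return 0;  /* fast path: definitely absent */
    for (i = 0; i < HASH2_SIZE; i++) {
        unsigned s = (ip + i) % HASH2_SIZE;
        if (l->table[s] == ip) return 1;
        if (l->table[s] == 0)  return 0;
    }
    return 0;
}
```

The bit vector gives a cheap "definitely not present" answer on the hot path;
only hits fall through to the small table.
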
08.16.02, shouse

The driver no longer fills in the pg_rsvd array in the heartbeat because it was
discovered that it routinely produces a Wake On LAN pattern in the heartbeat that
causes BroadCom NICs to panic. Although this is NOT an NLB issue, but rather a
firmware issue in BroadCom NICs, it was decided to remove the information from the
heartbeat to alleviate the problem for customers with BroadCom NICs upgrading to
.NET. This array is UNUSED by NLB, so there is no harm in not filling it in; it
was added a long time ago for debugging purposes as part of the now-defunct
FIN-counting fix that was part of Win2k SP1.

For future reference, should we need to use this space in the heartbeat at some
future point in time, it appears that we will need to be careful to avoid
potential WOL patterns in our heartbeats where we can avoid it. A WOL pattern is:

6 bytes of 0xFF, followed by 16 identical instances of a "MAC address", and it
can appear ANYWHERE in ANY frame type, including our very own NLB heartbeats.
E.g.:

FF FF FF FF FF FF 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06

The MAC address need not be valid, however. In NLB heartbeats, the "MAC address"
in the mistaken WOL pattern is "00 00 00 00 00 00". NLB routinely fills heartbeats
with FF and 00 bytes, but it seems that by "luck" no other place in the heartbeat
is this vulnerable. For instance, in the load_amt array, each entry has a
maximum value of 100 (decimal), so there is no possibility of generating the
initial 6 bytes of FF to start the WOL pattern. All of the "map" arrays seem to
be saved by two strokes of fortune: (i) little-endianness and (ii) the bin
distribution algorithm.

(i) Since we don't use the 4 most significant bits of the ULONGLONGs used to store
each map, the most significant byte is NEVER FF. Because Intel is little endian,
the most significant byte appears last. For example:

0F FF FF FF FF FF FF FF appears in the packet as FF FF FF FF FF FF FF 0F

This breaks the FF sequence in many scenarios.

(ii) The way the bin distribution algorithm distributes buckets to hosts seems to
discourage other possibilities. For instance, a current map of:

00 FF FF FF FF FF FF 00

just isn't likely. However, it IS STILL POSSIBLE! So, it is important to note
that:

REMOVING THIS LINE OF CODE DOES NOT, IN ANY WAY, GUARANTEE THAT AN NLB HEARTBEAT
CANNOT STILL CONTAIN A VALID WAKE ON LAN PATTERN SOMEWHERE ELSE IN THE FRAME!!!

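A detector for the pattern described above is easy to sketch (hypothetical
helper, not NLB code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf contains a WOL pattern at any offset: 6 bytes of 0xFF
   followed by 16 identical 6-byte "MAC address" copies (102 bytes total).
   The MAC need not be valid -- all-zero qualifies, as in NLB heartbeats. */
static int contains_wol_pattern(const unsigned char *buf, size_t len)
{
    const size_t need = 6 + 16 * 6;
    size_t i, j;

    for (i = 0; i + need <= len; i++) {
        for (j = 0; j < 6 && buf[i + j] == 0xFF; j++)
            ;
        if (j < 6)
            continue;                    /* no 6-byte FF sync here */
        {
            const unsigned char *mac = buf + i + 6;
            int ok = 1;
            for (j = 1; j < 16 && ok; j++)
                ok = memcmp(mac, mac + 6 * j, 6) == 0;
            if (ok)
                return 1;
        }
    }
    return 0;
}
```

Such a check could be run over a candidate heartbeat layout to catch
accidental WOL patterns before they reach fragile NIC firmware.
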