Tue, 03 Feb 2009
Client-side Redundancy
Many network protocols have some kind of redundancy built into the client
side: for example, DNS can ask the second nameserver from /etc/resolv.conf,
should the first one be too slow to reply in time. For LDAP, multiple
LDAP servers can be set up in /etc/ldap.conf. The same goes for
Kerberos, SMTP, and many others. Nevertheless, I think depending
(solely) on the client-side redundancy in network protocols should be
considered harmful. There are many problems with it:
- The information about server availability is not shared even within the same computer. Should the first nameserver in resolv.conf die, every program on that computer will still try to contact it first, wait 5 seconds, and only then fall back to the second entry in resolv.conf. This was not a problem 10 years ago, but these days, users are not willing to wait five seconds for every DNS request while you reboot the DNS server for a kernel upgrade.
- The problem is much worse when the primary server is "almost" dead. Yesterday our primary LDAP server died in such a strange way that it still accepted TCP connections, but the userland was dead. So all nscd(8) daemons in our network just tried to connect, and when the connection succeeded, did not even attempt to contact the secondary LDAP server. There were no LDAP replies until the primary server was restarted.
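The "almost dead" case is easy to reproduce on one machine. The following is only a minimal sketch, not what nscd(8) or any LDAP library actually does: a listening socket that never calls accept() stands in for the server whose userland is dead, and the hypothetical query() helper shows that a client falls back only if it also puts a timeout on the reply, not just on the connection. All function names and the 0.5 second timeout are assumptions made for illustration.

```python
import socket
import threading


def start_half_dead_server():
    """Listen but never accept(): the kernel still completes the TCP
    handshake, yet no userland code ever reads or answers."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    srv.listen(5)
    return srv                   # caller keeps the reference so it stays open


def start_healthy_server():
    """A trivial server that answers every request with b'reply:' + request."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)

    def serve():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:      # listening socket was closed
                return
            with conn:
                data = conn.recv(1024)
                conn.sendall(b"reply:" + data)

    threading.Thread(target=serve, daemon=True).start()
    return srv


def query(servers, request, timeout=0.5):
    """Try each (host, port) in order. A server counts as alive only if it
    actually replies within `timeout`; a successful connect is not enough."""
    for addr in servers:
        try:
            with socket.create_connection(addr, timeout=timeout) as s:
                s.sendall(request)
                reply = s.recv(1024)   # same timeout applies to the read
                if reply:
                    return addr, reply
        except OSError:                # refused, unreachable, or timed out
            continue
    raise RuntimeError("no server answered")
```

Calling query() with the half-dead server listed first and the healthy one second returns the healthy server's address and reply, after waiting out the timeout on the first one. Note that every caller still pays that timeout on every request, which is the first problem in the list above.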
Therefore I think the redundancy for latency-sensitive services such as DNS, Kerberos, or LDAP should be maintained on the server side using things like Heartbeat and a STONITH device. This avoids the "half-dead" server state, and gives the clients a single IP address to talk to. Fortunately, many client-side protocol libraries allow a separate server to be configured for write access (such as changing the Kerberos password). So the writes can be redirected to a master server, and reads can be done from a set of two Heartbeat-redundant servers.
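From the client's point of view, the split between reads and writes comes down to two addresses. The sketch below is purely illustrative: the ServiceEndpoints class and both host names are hypothetical and do not correspond to any real library's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEndpoints:
    # Floating address held by whichever node of the redundant pair is alive.
    read_address: str
    # The single master that accepts writes (e.g. password changes).
    write_master: str

    def for_operation(self, is_write: bool) -> str:
        """Return the address a client should contact for this operation."""
        return self.write_master if is_write else self.read_address


# Hypothetical host names, for illustration only.
ldap = ServiceEndpoints(
    read_address="ldap.example.org",
    write_master="ldap-master.example.org",
)
```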
Server-side redundancy of this kind is what we currently use for DNS and DHCP, and I am thinking about doing the same for LDAP and Kerberos. Client-side redundancy can be an added bonus, but not the primary solution. How do you handle redundancy for your network services?