Bonding and arp monitoring

07-15-2005
mathias.kanstrup@ongame.com

Hi,

I have the following setup: multiple HP BL30p blade servers running Red
Hat ES3, kernel 2.4.21-32.0.1.ELsmp.
All servers in the chassis share two internal switches. Each switch has
24 ports: 16 are 'down link' ports to the servers (presented as eth0 on
Switch A and eth1 on Switch B), 2 are for interconnectivity between the
switches (disabled), and the remaining 4 are physical ports.
Each switch has one physical port configured as a trunk and connected
to an upstream Cisco 6513 chassis. All of the blade servers are in the
same VLAN.

_________________________________
|        HP BL30p servers       |
|    1  2  3  4  5  6  7  8     |
<<HP Switch A>>>-<<<HP Switch B>>
       |                |
       |                |
<Cisco Switch A>-<Cisco Switch B>
       |                |
       |                |
<<<<<<Router/Default gateway>>>>>

To get high availability on the network connection, bonding in
active/backup mode is being used.
As the 16 down link ports are internal ports and will never have a link
failure (unless the whole switch suffers a hardware error), ARP
monitoring is used to monitor the default gateway of the servers.


From /etc/modules.conf:

alias bond0 bonding
options bond0 mode=1 arp_interval=1000 arp_ip_target=Router_IP

Without simulating any link failures everything works fine, i.e. both
eth0 (HP Switch A) and eth1 (HP Switch B) can function as the primary
interface without any problems.

Running tcpdump on the bond0 interface shows a lot of ARP who-has
requests for the Router_IP.
To simulate a failure, the link between Cisco Switch A and HP Switch A
is removed. This is NOT detected by the bonding module, and it leaves
the servers that have eth0 active unreachable.

But if I lower the arp_interval to, for example, 60 ms on one of the
servers, that server will notice the above link failure.
Fine, let's lower the arp_interval to 60 ms on all of the servers: then
we are back to the same problem, and the bonding module does not detect
the failure to reach Router_IP.

Looking at /usr/src/linux-2.4/Documentation/networking/bonding.txt

1. Driver support

The ARP monitor relies on the network device driver to maintain two
statistics: the last receive time (dev->last_rx), and the last
transmit time (dev->trans_start). If the network device driver does
not update one or both of these, then the typical result will be that,
upon startup, all links in the bond will immediately be declared down,
and remain that way. A network monitoring tool (tcpdump, e.g.) will
show ARP requests and replies being sent and received on the bonding
device.

And at /usr/src/linux-2.4/drivers/net/bonding/bond_main.c

/*
 * When using arp monitoring in active-backup mode, this function is
 * called to determine if any backup slaves have went down or a new
 * current slave needs to be found.
 * The backup slaves never generate traffic, they are considered up by merely
 * receiving traffic. If the current slave goes down, each backup slave will
 * be given the opportunity to tx/rx an arp before being taken down - this
 * prevents all slaves from being taken down due to the current slave not
 * sending any traffic for the backups to receive. The arps are not necessarily
 * necessary, any tx and rx traffic will keep the current slave up. While any
 * rx traffic will keep the backup slaves up, the current slave is responsible
 * for generating traffic to keep them up regardless of any other traffic they
 * may have received.
 * see loadbalance_arp_monitor for arp monitoring in load balancing mode
 */
static void bond_activebackup_arp_mon(struct net_device *bond_dev)
{
..
..
..
..
..
    if (slave) {
        /* if we have sent traffic in the past 2*arp_intervals but
         * haven't xmit and rx traffic in that time interval, select
         * a different slave. slave->jiffies is only updated when
         * a slave first becomes the curr_active_slave - not necessarily
         * after every arp; this ensures the slave has a full 2*delta
         * before being taken out. if a primary is being used, check
         * if it is up and needs to take over as the curr_active_slave
         */

The question is: do the other blade servers' ARP queries (ARP
monitoring of the Router_IP) affect the last_rx and trans_start
timestamps on the other servers in the same VLAN/chassis?
Does the ARP monitoring function really monitor a host, or does it
simply rely on the interface's timestamps? And are those timestamps
affected by ARP queries from nearby servers?
If so, would that mean it is impossible to use ARP monitoring if other
internal traffic/broadcasting is done in the same VLAN?

Any help is appreciated. Surely this scenario must be possible to
implement?


Best regards

Mathias Kanstrup
