I have installed a Compaq NC6134 Gigabit Fibre NIC from an older server into a recently bought DL380G3 server (the dedicated backup server) in order to improve the throughput of the nightly backup regime.
Current speed of host-to-host backups using the inbuilt UTP interface varies from 20-200MB/min, depending on the size of volumes, the number of files and the age of the host being backed up.
(Servers include two DL360G1s and a dual-CPU ML570G1 with 12*72GB Ultra3 drives; the backup host is a DL380G3 running ARCserve 7 for NetWare. All servers are on NetWare 6.0 SP3.)
The UTP port is connected to a 3COM 3300 switch and therefore runs at 100Mb/s Full Duplex.
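(For scale, a rough back-of-envelope check of the wire speeds involved, ignoring protocol overhead:

    100Mb/s  = 12.5MB/s = ~750MB/min theoretical maximum
    1000Mb/s = 125MB/s  = ~7500MB/min theoretical maximum

so even the best 200MB/min backup is using barely a quarter of the existing UTP link - presumably per-file and disk overheads dominate on the slower hosts - but the faster jobs should still have headroom to gain from gigabit.)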
The 3300 switch connects to a 3COM SuperStack II 9300 Gigabit fibre switch, which connects to the other host servers via their own Compaq NC6134 Gigabit fibre NICs.
(So upgrading the ARCserve server to a Gigabit fibre card and placing it on the 9300 switch with all the other servers seemed like a logical improvement.)
All fibre NICs run at 1000Mb/s Full Duplex.
(Auto-negotiated; I have so far been unsuccessful in turning this off and using forced settings. According to the driver help screen, the NC71xx will accept manual settings in NetWare but the NC61xx will ignore them.)
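For reference, forcing speed/duplex on a NetWare LAN driver is normally done via LOAD parameters in AUTOEXEC.NCF, along the lines of the sketch below. The driver name and the SPEED=/FORCEDUPLEX= keywords here are illustrative placeholders only, not the NC6134 driver's actual parameter list (which, per the help screen, it would ignore anyway):

    # AUTOEXEC.NCF sketch - NICDRV.LAN, SPEED= and FORCEDUPLEX= are
    # placeholder names; check the driver's help screen for the
    # keywords it actually honours
    LOAD SYS:\SYSTEM\NICDRV.LAN SLOT=2 SPEED=1000 FORCEDUPLEX=2 NAME=GIG_1 FRAME=ETHERNET_II
    BIND IP TO GIG_1 ADDR=10.0.0.5 MASK=255.255.255.0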
Note that the existing servers all have the V2.75 (24 August 2001) NIC drivers for the NC6134 and have worked without any glitch or error for a long time (several years).
The server console statistics screen shows no NIC errors at all - no bad packets, no jabber, no failed RX or TX bytes.
I downloaded the latest NetWare Compaq/HP NIC driver set from HP (V7.51, Dec 2004) and applied it to the backup server only, at the same time as I installed the NC6134 NIC there (i.e. no multiple changes at once - all other servers were left untouched).
The first overnight backup was marginally faster than over the UTP interface; however, performance degraded further each subsequent night, until after three nights of incrementals a 2 hour incremental job (on 100Mb UTP) was taking 8 hours (on fibre).
Also, at this time any workstation connected to the backup server seemed to intermittently freeze for 5-10 secs every 10 secs or so in any comms operation, such as copying a large file via Windows Explorer or accessing the ARCserve management console. For the latter, the regular freeze made inspecting the backup logs unworkable, as the view would continually rescroll to the bottom of the log after each freeze episode despite the operator attempting to read further up. For the former, a straight file copy would proceed at a crawl and then fail with a timeout (e.g. a 1GB file would show a copy completion estimate of 20 mins in Windows, but after 10+ minutes a "Network drive invalid" message would appear and the copy would fatally halt; in normal operation a 1GB file copies in a few minutes server to server, and not much slower to the desktop).
Watching the server performance graphs while attempting a large file copy as above showed that both the source and destination servers were idling, as though unaware that a workstation was trying to copy a file between them, but every 10 seconds or so there would be a brief pounding spurt of traffic across the LAN and then no activity again.
Unloading and reloading the card driver has, on a few occasions, seemed to stop this behaviour each time (although nowhere near as definitely stable as switching back to the UTP interface, so I can only assume something is happening related to this fibre NIC).
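The unload/reload is just the standard console sequence, roughly as below (again with placeholder driver/board names, matching the sketch above):

    UNBIND IP FROM GIG_1
    UNLOAD NICDRV
    LOAD SYS:\SYSTEM\NICDRV.LAN SLOT=2 NAME=GIG_1 FRAME=ETHERNET_II
    BIND IP TO GIG_1 ADDR=10.0.0.5 MASK=255.255.255.0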
But another symptom has shown up when doing a weekly full backup overnight: in the middle of backing up a volume on the main file server (ML570G1) across the LAN to the backup host, a sudden sequence of
"Failed to read directory ML570SERVER/Vol:dir/dir/dir/dir/file, errno=1, NetWareErrno=a" (also "errno=16")
messages appears in the backup application log, and the volume subsequently ends its backup prematurely - although, as far as the backup application is concerned, without any error.
(In fact between 6GB and 100GB prematurely on the backup session where it has occurred!)
We do two full backups a week. After the first one failed midstream, I replaced the latest NIC driver, V7.51, with a slightly older version, V7.24 (August 2003), and did a test full backup the next night, which also failed - but then a third full backup and three incrementals over the immediately following weekend did not fail, even though no further changes had been made to any component of the infrastructure at all!?!
The time at which this failure occurred was indeterminate: once at 4:10am, once at 6:40am (from a different source server, too), and once at 9:26pm.
Has anyone encountered this problem on ProLiant servers with these or similar cards, switches and/or software?
Is the source server disconnection related to the backup host's performance and freeze problems, or is it a separate issue/coincidence?
Why would the old NIC driver be stable over a long period while the latest, far newer driver is unreliable?
Note that the figures listed above are for the network backups. A local backup of the ARCserve server itself can back up files at anywhere from 100MB/min to 2100MB/min, depending on the size and number of files on its volumes, so there does not seem to be a problem with the backup application itself, nor with the SDLT tape library hardware attached to the server via a SCSI HBA.
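To put those local figures in perspective:

    2100MB/min = 35MB/s sustained from disk to tape locally
     200MB/min = ~3.3MB/s, the best rate ever seen over the 100Mb LAN

so the SCSI/SDLT path can clearly run an order of magnitude faster than anything the network backups have delivered.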
The fact that the tape-to-file compares at the end of the backup sessions tend to run noticeably faster than the backup itself (200-2500MB/min is typical) suggests that processor-bound issues probably also come into play in the overall throughput improvements that are achievable.
The server has plenty of RAM, so exhausting cache buffers doesn't seem likely. Watching the server console during a test backup shows that it is nowhere near breaking into a sweat at any point: it barely notices the disks being accessed, and blocks going dirty are rare and quickly dealt with.
Ultimately, though, if the fibre card can't be made to work 100% reliably, then any performance improvement will have to be forgone.