Well done!
Great job rebuilding the RAID-6 array.
How did 2 disks crash simultaneously? Defective power supply?
BTW, could you teach other project teams how to keep their servers as reliable as yours? Even if Murphy sometimes comes by, you react quickly: this could also be training for other teams ;)
Since this forum seems to attract tech junkies, here are the gory details:
Einstein@Home has three large file servers. Each one is based on a dual-Opteron Supermicro motherboard, an AIC chassis with 24 hot-swap SATA disks, and an Areca 1170 RAID controller. We use 24 Western Digital 400GB RAID-edition disk drives (9.6 TB of raw disk space per file server) and run the RAID arrays in a RAID-6 configuration. This gives us 8.8 TB of usable storage space and two redundant disks per storage server.
Each AIC chassis contains six backplanes (each supporting four SATA disks). It turned out that AIC designed and manufactured these with an incorrect value of a critical resistor. When all the drives were active, the drive voltage would sometimes dip below 4.75V and one of the drives would RESET.
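As a rough check of the capacity figures above, here is a minimal Python sketch. It assumes all 24 disks sit in a single RAID-6 set with no hot spares and uses decimal units (1 TB = 1000 GB); the function name `raid6_capacity` is made up for illustration.

```python
def raid6_capacity(num_disks, disk_size_gb):
    """Return (raw_tb, usable_tb) for one RAID-6 set, in decimal TB.

    Assumption: every disk belongs to the same array and none is a hot spare.
    """
    if num_disks < 4:
        raise ValueError("RAID-6 needs at least 4 disks")
    parity_disks = 2  # RAID-6 spends two disks' worth of space on parity
    raw_tb = num_disks * disk_size_gb / 1000
    usable_tb = (num_disks - parity_disks) * disk_size_gb / 1000
    return raw_tb, usable_tb


raw_tb, usable_tb = raid6_capacity(24, 400)
print(f"raw: {raw_tb:.1f} TB, usable: {usable_tb:.1f} TB")
```

With 24 disks of 400 GB this reproduces the 9.6 TB raw and 8.8 TB usable quoted above.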
The Areca controller would then think that the drive had been removed from the array and a new drive added. It would begin to rebuild the RAID array, a process that takes several hours and loads the server and disk drives.
In the middle of one of these rebuilds, a second drive reset itself. At that point we lost all redundancy and decided to shut down to reduce the risk of actual data loss.
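To make the sequence above concrete, here is a small Python sketch that counts how much redundancy is left after each event. It is only an illustration, not how the Areca firmware works. It assumes a drive that is being rebuilt does not count as redundant until the rebuild finishes, and the names used are hypothetical.

```python
RAID6_FAULT_TOLERANCE = 2  # RAID-6 survives the loss of any two drives


def remaining_redundancy(unavailable_drives):
    """How many more drive losses the array can absorb without losing data."""
    return RAID6_FAULT_TOLERANCE - unavailable_drives


# Each event takes one more drive out of the set of fully valid members.
events = [
    "first drive resets, rebuild starts",
    "second drive resets during the rebuild",
]

unavailable = 0
history = []
for event in events:
    unavailable += 1
    history.append((event, remaining_redundancy(unavailable)))

for event, left in history:
    print(f"{event}: {left} redundant drive(s) left")
```

After the second reset the count reaches zero, which is the point where one more failure would have meant real data loss.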
We have now replaced the backplanes in the AIC cases with new ones that contain the correct resistor value. I hope that from now on these critical file servers will be stable!
Cheers,
Bruce

Congrats E@H team!
OK, hardware junkies. We have a webcam in our cluster room and you can see the Einstein@Home racks in real time.
The two racks in the middle left with lots of blinking blue lights are the Einstein@Home servers. There are six 5U machines (each with 24 disks). These are (1) database server, (2) project server, and (3) file server. All machines are duplicated in the second rack, to give us some hardware redundancy. At the base of the racks are the four UPS power supplies (each of the machines has four independent power supplies, each connected to a different UPS).
I think that the webcam can only handle four connections at a time, so please disconnect after a couple of minutes of spying!
Cheers,
Bruce
RE: RE: I just put the
On my Mac, I use a tool called djoPlayer and use the 'Open URL' menu item with http://nemocam.phys.uwm.edu/img/video.asf.
Cheers,
Bruce
RE: Dear Bruce, Thanks for
The white object is the corner of an LCD monitor. It's tied down to a small Ikea table with wheels, and there is also a keyboard tied down to the table. We call it the 'crash cart'. Typically if a machine is having problems, we will roll the crash cart over to it, and plug in the keyboard and monitor to help fix it.
Yes, from time to time you will see David Hammer and others working on the computers. David and I are still having some problems with our main file server, and I think we may switch machines to the backup file server to try and fix these problems. If that happens, the project will go offline for a while and you'll see David working on the racks.
Cheers,
Bruce