Congrats E@H team!

Anonymous
Topic 13512

Quote:

Well done !

Great job in rebuilding the raid6 array.

How come 2 disks crash simultaneously ? defective power supply ?

BTW, could you teach other project teams how to maintain their servers as reliable as yours ? Even if sometimes Murphy comes by, you react quick : this could also be a training for other teams ;)

Since this forum seems to attract tech junkies, here are the gory details:

Einstein@Home has three large file servers. Each one is based on a dual-Opteron Supermicro motherboard, an AIC chassis with 24 hot-swap SATA disks, and an Areca 1170 RAID controller. We use 24 Western Digital 400GB RAID-edition disk drives (9.6 TB of raw disk space per file server) and run the RAID arrays in a RAID-6 configuration. This gives us 8.8 TB of usable storage space and two redundant disks per storage server.
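For anyone who wants to check the capacity numbers, here is a minimal sketch of the arithmetic. It assumes all 24 disks sit in a single RAID-6 set and that sizes are the decimal units drive vendors quote; the function and constant names are made up for illustration.

```python
# Sketch of the capacity arithmetic; the figures are taken from the post above.
DISKS_PER_SERVER = 24
DISK_SIZE_GB = 400          # decimal gigabytes, as drive vendors quote them
RAID6_PARITY_DISKS = 2      # RAID-6 spends two disks' worth of space on parity


def raid6_capacity_tb(disks: int, disk_size_gb: int) -> tuple[float, float]:
    """Return (raw_tb, usable_tb) for a single RAID-6 set, in decimal TB."""
    if disks < 4:
        raise ValueError("RAID-6 needs at least four disks")
    raw_tb = disks * disk_size_gb / 1000
    usable_tb = (disks - RAID6_PARITY_DISKS) * disk_size_gb / 1000
    return raw_tb, usable_tb


raw, usable = raid6_capacity_tb(DISKS_PER_SERVER, DISK_SIZE_GB)
print(f"raw: {raw:.1f} TB, usable: {usable:.1f} TB")
```

The two parity disks account for the 0.8 TB difference between the raw and usable figures.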

Each AIC chassis contains six backplanes (each supporting four SATA disks). It turned out that AIC designed and manufactured these with an incorrect value of a critical resistor. When all the drives were active, the drive voltage would sometimes dip below 4.75V and one of the drives would RESET.

The Areca controller would then think that the drive had been removed from the array and a new drive added. It would begin to rebuild the RAID array, a process that takes several hours and loads the server and disk drives.

In the middle of one of these rebuilds, a second drive reset itself. At that point we lost all redundancy and decided to shut down to reduce the risk of actual data loss.
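The sequence of events can be sketched as a toy model. This is only bookkeeping for how much redundancy a RAID-6 array has left after each drive reset; it is not how the Areca firmware works, and the class and method names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Raid6Array:
    """Toy model: counts member disks that are missing or still rebuilding."""
    total_disks: int = 24
    parity_disks: int = 2
    degraded: int = 0  # disks currently dropped or being rebuilt

    def drive_reset(self) -> None:
        # The controller treats a reset drive as removed and then re-added as
        # new, so it offers no protection until its rebuild has finished.
        self.degraded += 1

    def rebuild_finished(self) -> None:
        self.degraded = max(0, self.degraded - 1)

    @property
    def remaining_redundancy(self) -> int:
        # Further disk losses the array can absorb; negative means data loss.
        return self.parity_disks - self.degraded


array = Raid6Array()
array.drive_reset()                # first reset starts a rebuild
print(array.remaining_redundancy)  # one disk's worth of protection left
array.drive_reset()                # second reset in the middle of the rebuild
print(array.remaining_redundancy)  # nothing left: the point where we shut down
```

With the counter at zero, one more reset or a genuine disk failure would have meant real data loss, which is why shutting down was the safer choice.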

We have now replaced the backplanes in the AIC cases with new ones that contain the correct resistor value. I hope that from now on these critical file servers will be stable!

Cheers,
Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

Congrats E@H team!

Quote:
Quote:

Lift jaw, slap face, wipe drool ... :-)

I'll second that! Sure makes the array I put together a few weeks ago look sad!

OK, hardware junkies. We have a webcam in our cluster room and you can see the Einstein@Home racks in real time.

The two racks in the middle left with lots of blinking blue lights are the Einstein@Home servers. There are six 5U machines (each with 24 disks). These are (1) a database server, (2) a project server, and (3) a file server. All machines are duplicated in the second rack, to give us some hardware redundancy. At the base of the racks are the four UPS units (each of the machines has four independent power supplies, each connected to a different UPS).

I think that the webcam can only handle four connections at a time, so please disconnect after a couple of minutes of spying!

Cheers,
Bruce

RE: RE: I just put the

Quote:
Quote:
I just put the URL inside Winamp, Quicktime, or other player that can play .asf files (Media Player hated it).

My QuickTime Player just says:

Alert

The URL is not valid.

Mac OS v10.4.8, QT v7.1.3.

On my Mac, I use a tool called djoPlayer and use the 'Open URL' menu item with http://nemocam.phys.uwm.edu/img/video.asf.

Cheers,
Bruce

RE: Dear Bruce, Thanks for

Quote:

Dear Bruce,
Thanks for satisfying some of our curiosity. By the way,
-1- what is the somewhat mobile white object in the lower right corner of the view,
-2- do operators occasionally show up in the field of view.
Sorry to bother you with such unimportant questions. With best regards, Jean.

The white object is the corner of an LCD monitor. It's tied down to a small Ikea table with wheels, and there is also a keyboard tied down to the table. We call it the 'crash cart'. Typically if a machine is having problems, we will roll the crash cart over to it, and plug in the keyboard and monitor to help fix it.

Yes, from time to time you will see David Hammer and others working on the computers. David and I are still having some problems with our main file server, and I think we may switch machines to the backup file server to try and fix these problems. If that happens, the project will go offline for a while and you'll see David working on the racks.

Cheers,
Bruce
