Next ABP generation

Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820
Topic 13810

Over the last weekend we tried to push up work generation for ABP2 such that we would process the data as fast as we get it from ARECIBO.

Many of you probably noticed the trouble our server infrastructure had keeping track of so many short-running workunits (I'd expect the critical runtime to be about 1 h).

To avoid these problems I'm currently working on 'bundling' every four current ABP2 workunits into a single new one. This means that for every 'next generation' ABP task your client would download four data files, process them one after the other, upload four result files, but report only a single task (one that runs 4x as long as a current task, but also gets 4x as much credit).
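The bundling idea can be sketched roughly like this (a minimal Python illustration; the file names and function names are my own invention, not the actual Einstein@Home client code):

```python
# Minimal sketch of a 'bundled' task: the client downloads four data files,
# processes them one after the other, and reports a single task.
# All names here are hypothetical illustrations.

def process_bundle(data_files, process_one):
    """Run the traditional per-file analysis on each data file in turn.

    Returns one result-file name per input; all of them are uploaded
    under the single bundled task.
    """
    results = []
    for data_file in data_files:
        # Each iteration corresponds to one 'traditional' ABP2 workunit.
        results.append(process_one(data_file))
    return results

# Four downloads, four uploads, but only one reported task.
uploads = process_bundle(
    ["data.0", "data.1", "data.2", "data.3"],
    lambda f: f + ".result",   # stand-in for the real analysis
)
```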

I'll try to make this backward compatible to avoid yet another application (ABP3). If all goes well, a set of new App versions will be issued in the next few days that can process both current and next generation work. Behind the scenes I'll replace the server-side daemons with versions that can also handle both kinds of results. So with luck the only things you'll notice of this change are new App versions and, later, longer-running ABP2 workunits.

BM

Bernd Machenschalk

Next ABP generation

The bottleneck is clearly the database. The shorter the tasks run, the more tasks the system needs to keep track of for the same amount of total computing time. The database server can only keep a limited number of results in memory; if there are more, the database becomes limited by the I/O speed of its disks, which is what happened over the weekend.
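The scaling can be shown with a back-of-the-envelope calculation (the numbers below are invented for illustration, not project figures):

```python
# For a fixed total computing throughput, the number of results the
# database has to track scales inversely with task runtime.
# All input numbers are made up for illustration.

def results_in_flight(core_hours_per_day, task_runtime_hours, turnaround_days):
    tasks_per_day = core_hours_per_day / task_runtime_hours
    return tasks_per_day * turnaround_days

short_tasks = results_in_flight(100_000, 1.0, 7)   # ~1 h tasks
bundled     = results_in_flight(100_000, 4.0, 7)   # 4x longer tasks
# Bundling four workunits into one cuts the tracked-result count by 4.
```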

We'll have a new, larger database server in the next weeks, but it might be that there are other bottlenecks in the BOINC system that are currently hidden behind the database limitations. I've already seen the 'db purger' growing a backlog. And it will take some time until the new server is ready to use.

The right way to fix these problems is to reduce the number of tasks the database needs to keep track of, while still doing the same amount of 'work'. This is preliminary in the sense that the number of datafiles / 'old style tasks' grouped together is tied to the speed of the application. When the app becomes faster due to code improvements, better CUDA usage or a new generation of CPUs, we may raise this number beyond four again. The code of the app and the server-side components should be flexible enough to handle this.
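A sketch of what such flexibility could look like (an assumed structure, not the real workunit generator code): the grouping factor is just a parameter, so the same code handles bundles of four, eight, or any other size:

```python
# Hypothetical sketch: group old-style data files into bundles of a
# configurable size, rather than hard-coding four.

def split_into_bundles(data_files, bundle_size):
    return [data_files[i:i + bundle_size]
            for i in range(0, len(data_files), bundle_size)]

old_style = [f"datafile_{i}" for i in range(628)]
bundles = split_into_bundles(old_style, 4)
# 628 old-style workunits become 157 bundles of four.
```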

BM

Bernd Machenschalk

RE: Bernd, You may want to

Quote:

Bernd,

You may want to take a look at this WU that seems to be having lots of problems.

All crunchers have failed to this point (3) and the max errors, etc is set at 20/20/20...meaning potentially lots of wasted crunching time.

They failed for different reasons. One is a CUDA-related error, one is a 'too many exits' error which is probably related to the problem described here, and only one is a segv that actually accumulated computing time. If there are more of these very same errors at the same point of a workunit, then I'll start to worry.

We do have a webpage that monitors workunits that have collected only client (or validate) errors and no successful result. The notification level is set to 2x, i.e. a WU needs to have 4 errors to show up there, and this level has worked quite well.

And btw. what's the relation to the subject of this thread?

BM

Bernd Machenschalk

RE: Wouldn't it be simpler

Quote:
Wouldn't it be simpler to produce a single task of 3 or 4 times the size? This way you'd have fewer tasks, and the longer run times would reduce the server traffic, without the complexity of trying to bundle 4 tasks together.

That's exactly what this bundling is trying to achieve, without the need to e.g. invent new file formats for input and output.

Quote:
Seti went through a similar issue and doubled their wu sensitivity.

Interesting. But I'm afraid that for the ABP search this wouldn't help us reach our goal. The sensitivity has been carefully adjusted to not miss a signal, and the reason for speeding up the application (ABP1 vs ABP2) is that we want to process data in 'real time' i.e. as fast as it comes in from ARECIBO.

Quote:
One bottleneck that we tried to address over there was the MySQL logging. A couple of SSD's were used for the log drives as they seem to cop a hammering, in order to spread the I/O.

This refers to the InnoDB transaction log, right?

BM

Bernd Machenschalk

RE: By the way, can you

Quote:
By the way, can you tell me how much data corresponds to one ABP2 WU? Just a few seconds? And those few seconds take 2 MB of digital storage, right?


I think the original ARECIBO data files (~4GB each) correspond to five minutes of observation time. They are pre-processed (dedispersed) such that each results in 628 workunits.

BM

Bernd Machenschalk

As you may have noticed, the

As you may have noticed, the promised new app versions are out, and also the first longer running ABP2 workunits have been sent out, actually almost a thousand, somewhat more than I intended. It looks like these error out after 25% on CUDA Apps. So if you recently (between 15:00 and 18:30 UTC today) got a bunch of these new tasks that show up with a time estimate of 4x what you know from ABP2 tasks and they have been assigned to CUDA Apps (ABP2cuda23), feel free to abort them - they'll probably error out.

BM

Bernd Machenschalk

RE: As you may have

Message 3743 in response to message 3742

Quote:
As you may have noticed, the promised new app versions are out, and also the first longer running ABP2 workunits have been sent out, actually almost a thousand, somewhat more than I intended. It looks like these error out after 25% on CUDA Apps. So if you recently (between 15:00 and 18:30 UTC today) got a bunch of these new tasks that show up with a time estimate of 4x what you know from ABP2 tasks and they have been assigned to CUDA Apps (ABP2cuda23), feel free to abort them - they'll probably error out.

I just published new CUDA Apps (x.11) that should solve the problem with the new workunits. I'll keep an eye on it.

BM

Bernd Machenschalk

RE: RE: RE: As you may

Quote:
Quote:
Quote:
As you may have noticed, the promised new app versions are out, and also the first longer running ABP2 workunits have been sent out, actually almost a thousand, somewhat more than I intended. It looks like these error out after 25% on CUDA Apps. So if you recently (between 15:00 and 18:30 UTC today) got a bunch of these new tasks that show up with a time estimate of 4x what you know from ABP2 tasks and they have been assigned to CUDA Apps (ABP2cuda23), feel free to abort them - they'll probably error out.

I just published new CUDA Apps (x.11) that should solve the problem with the new workunits. I'll keep an eye on it.

BM


Here is a completed quorum where one of the first two tasks sent out on Feb 8 was a 'CUDA app failure'. The resend was sent to one of my hosts on Feb 9 and it has now been crunched, returned and validated with the other initial task. It took pretty much exactly 4 times the old task crunch time on that host (9033 secs as opposed to 4 x 2240 = 8960 secs) and was awarded 4 times the old credit, so all seems fine now that the 'CUDA issue' has been solved.

My most recent ABP2 downloads are still 'shorties' so how soon will you be sending out more of the larger tasks?

Yes, this is indeed a 'new generation' workunit, as you can see from the 160 granted credits.

It required some work over the past week to update all the backend components (workunit generator (WUG), validator, assimilator) to deal with the new result structure - previously they relied on every result having only one file, named after the result. It was only today that I could verify that they are all working as they should, with both old and new tasks.
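The naming change can be pictured like this (the suffix scheme below is an assumption for illustration; the real file layout may differ):

```python
# Hypothetical sketch of the backend change: a 'traditional' result has
# exactly one file named after the result, while a 'next generation'
# result carries one file per bundled sub-unit.

def result_file_names(result_name, n_sub_units):
    if n_sub_units == 1:
        return [result_name]                    # traditional ABP2 result
    # assumed suffix scheme for bundled results
    return [f"{result_name}_{k}" for k in range(n_sub_units)]

old_files = result_file_names("ABP2_example_r1", 1)
new_files = result_file_names("ABP2_example_r1", 4)
```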

Since about 2h ago I have one instance of the WUG running continuously to crank out 157 (628/4) 'new generation' tasks every 70 minutes. There are still three WUG instances running that produce 3*628 'traditional' ABP2 WUs every 70 minutes. If all goes well, I'll stop generation of 'traditional' ABP2 work tomorrow, and then slowly ramp up 'new generation' workunit production over the next few days, keeping an eye on how the system behaves.
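The quoted rates check out arithmetically (plain arithmetic on the numbers in the post, nothing more):

```python
# One 'new generation' WUG instance vs. three 'traditional' ones,
# both on a 70-minute cycle (numbers taken from the post above).

new_gen_bundles = 628 // 4        # 157 bundles per cycle
traditional_wus = 3 * 628         # 1884 traditional WUs per cycle

# Each bundle covers four old-style data files, so one 'new generation'
# instance matches the data throughput of one traditional instance:
data_files_per_cycle = new_gen_bundles * 4
```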

BM

Bernd Machenschalk

RE: If all goes well, I'll

Message 3745 in response to message 3744

Quote:
If all goes well, I'll stop generation of 'traditional' ABP2 work tomorrow, and then slowly ramp up 'new generation' workunit production over the next few days, keeping an eye on how the system behaves.

Our pre-processing machine crashed this morning with a hardware problem, so I guess ABP workunit generation will get delayed a bit.

BM

Bernd Machenschalk

RE: What percentage of

Quote:
What percentage of realtime are we at currently?

The ABP status page shows 133% of the data taking rate (that's what I'd call realtime) for the last 24h, but that includes the results that have been delayed from previous days during the changes that I had to make to the backend components. Averaged over the past week I'd expect it to be something like 2/3 (66%).

BM

Bernd Machenschalk

RE: I guess one thing to

Quote:
I guess one thing to keep an eye on is how BOINC clients are coping with the new units, scheduler-wise. Are the predicted runtimes for the new units about right?

The flops estimate was multiplied by 4 with respect to that of a 'traditional' task, so the runtime prediction should be exactly as far off as it was for four 'traditional' tasks.
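In simplified form, BOINC's runtime prediction divides the workunit's estimated floating-point operations by the host's effective speed, so scaling the estimate scales the prediction linearly (the numbers below are invented for illustration):

```python
# Simplified model of BOINC's runtime estimate: estimated FLOPs for the
# workunit divided by the host's effective speed. Quadrupling the FLOPs
# estimate quadruples the predicted runtime, leaving the *relative*
# error unchanged. Numbers are illustrative only.

def predicted_runtime_secs(rsc_fpops_est, host_flops):
    return rsc_fpops_est / host_flops

old_estimate = predicted_runtime_secs(4e13, 5e9)        # traditional task
new_estimate = predicted_runtime_secs(4 * 4e13, 5e9)    # bundled task
```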

BM
