Over the last weekend we tried to ramp up work generation for ABP2 so that we would process the data as fast as we receive it from ARECIBO.
Many of you probably noticed the trouble our server infrastructure had keeping track of so many short-running workunits (I'd expect the critical runtime to be about 1 h).
To avoid these problems I'm currently working on 'bundling' every four current ABP2 workunits into a single new one. This means that for every 'next generation' ABP task your client would download four data files, process them one after the other, upload four result files, but report only a single task (which would run 4x as long as a current task, but would get 4x as much credit, too).
I'll try to make this backwards compatible to avoid yet another application (ABP3). If all goes well, a set of new App versions will be issued in the coming days that can process both current and next-generation work. Behind the scenes I'll replace the server-side daemons with versions that can also handle both kinds of results. So with luck the only things you'll notice of this change will be new App versions and, later, longer-running ABP2 workunits.
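To illustrate the shape of this bundling, here is a minimal sketch of the client-side flow. All names and the file handling below are purely illustrative, not actual BOINC client code:

```python
def process_one(data_file):
    # Placeholder for the real ABP2 science run on a single data file;
    # here it just names the result file it would produce.
    return data_file + ".result"

def run_bundled_task(input_files, bundle_size=4):
    # The client downloads `bundle_size` data files, processes them one
    # after the other, and produces one result file per input -- but the
    # whole bundle is reported (and credited) as a single task.
    assert len(input_files) == bundle_size
    return [process_one(f) for f in input_files]

uploads = run_bundled_task(
    ["data_0.bin", "data_1.bin", "data_2.bin", "data_3.bin"])
print(len(uploads))  # four result files uploaded, one task reported
```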
BM

Next ABP generation
The bottleneck is clearly the database. The shorter the tasks run, the more tasks the system needs to keep track of for the same amount of total computing time. The database server can only keep a limited number of results in memory. If there are more, the database becomes limited by the I/O speed of its disks, which is what happened over the weekend.
We'll have a new, larger database server in the coming weeks, but there may be other bottlenecks in the BOINC system that are currently hidden behind the database limitations. I've already seen the 'db purger' growing a backlog. And it will take some time until the new server is ready to use.
The right way to fix these problems is to reduce the number of tasks the database needs to keep track of, while still doing the same amount of 'work'. This is preliminary in the sense that the number of data files / 'old-style tasks' grouped together is tied to the speed of the application. When the app becomes faster due to code improvements, better CUDA usage, or a new generation of CPUs, we may raise this number from four again. The code of the app and the server-side components should be flexible enough to handle this.
BM
RE: Bernd, You may want to
They failed for different reasons. One is a CUDA-related error, one is a 'too many exits' error which is probably related to the problem described here, and only one is a segv that actually accumulated computing time. If more of these very same errors occur at the same point of a workunit, then I'll start to worry.
We do have a webpage that monitors workunits that have collected only client (or validate) errors and no successful result. The notification level is set to 2x, i.e. a WU needs to have 4 errors to show up there, and this level has worked quite well.
And btw. what's the relation to the subject of this thread?
BM
RE: Wouldn't it be simpler
That's exactly what this bundling is trying to achieve, without the need to e.g. invent new file formats for input and output.
Interesting. But I'm afraid that for the ABP search this wouldn't help us reach our goal. The sensitivity has been carefully adjusted to not miss a signal, and the reason for speeding up the application (ABP1 vs. ABP2) is that we want to process data in 'real time', i.e. as fast as it comes in from ARECIBO.
This refers to the InnoDB transaction log, right?
BM
RE: By the way, can you
I think the original ARECIBO data files (~4 GB each) correspond to five minutes of observation time. Each is pre-processed (dedispersed) to result in 628 workunits.
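For a rough sense of the raw data rate these figures imply, a back-of-the-envelope calculation (my own arithmetic from the numbers above, not an official figure):

```python
# Figures quoted above; the arithmetic below is my own rough estimate.
raw_bytes = 4 * 1024**3   # ~4 GB per original ARECIBO file
obs_seconds = 5 * 60      # five minutes of observation per file
wus_per_file = 628        # workunits after dedispersion

# Raw data rate that the 'realtime' goal has to keep up with:
data_rate_mb_s = raw_bytes / obs_seconds / 1e6
print(round(data_rate_mb_s, 1))  # roughly 14 MB/s of raw data
```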
BM
As you may have noticed, the
As you may have noticed, the promised new app versions are out, and the first longer-running ABP2 workunits have also been sent out - actually almost a thousand, somewhat more than I intended. It looks like these error out after 25% on CUDA Apps. So if you recently (between 15:00 and 18:30 UTC today) got a bunch of these new tasks that show up with a time estimate of 4x what you know from ABP2 tasks, and they have been assigned to CUDA Apps (ABP2cuda23), feel free to abort them - they'll probably error out.
BM
RE: As you may have
I just published new CUDA Apps (x.11) that should solve the problem with the new workunits. I'll keep an eye on it.
BM
RE: RE: RE: As you may
Yes, this is indeed a 'new generation' workunit, as you can see from the 160 granted credits.
It required some work over the past week to update all the backend components (workunit generator (WUG), validator, assimilator) to deal with the new result structure - previously they relied on every result having only one file that bears the name of the result. It was only today that I could verify that they are all working as they should, with both old and new tasks.
Since about 2 h ago I have had one instance of the WUG running continuously to crank out 157 (= 628/4) 'new generation' tasks every 70 minutes. There are still three WUG instances running that produce 3 x 628 'traditional' ABP2 WUs every 70 minutes. If all goes well, I'll stop generation of 'traditional' ABP2 work tomorrow and then slowly ramp up 'new generation' workunit production over the next few days, keeping an eye on how the system behaves.
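To spell out the load arithmetic behind these rates (the numbers come from the figures above; the variable names are mine):

```python
wus_per_set = 628   # 'old-style' workunits per pre-processed data set
bundle = 4          # data files per 'new generation' task
# cycle length: one WUG pass every 70 minutes

new_gen_tasks = wus_per_set // bundle   # bundled tasks per cycle
traditional_wus = 3 * wus_per_set       # from the three remaining WUG instances

# Rows the database must track per cycle vs. the equivalent
# amount of 'old-style' work actually being done:
db_rows = new_gen_tasks + traditional_wus
equivalent_work = new_gen_tasks * bundle + traditional_wus

print(new_gen_tasks, db_rows, equivalent_work)
```

Once 'traditional' generation stops and everything is bundled, the same amount of work costs the database only a quarter of the rows.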
BM
RE: If all goes well, I'll
Our pre-processing machine crashed this morning with a hardware problem, so I guess ABP workunit generation will get delayed a bit.
BM
RE: What percentage of
The ABP status page shows 133% of the data-taking rate (that's what I'd call realtime) for the last 24 h, but that includes results that were delayed from previous days during the changes I had to make to the backend components. Averaged over the past week I'd expect it to be something like 2/3 (66%).
BM
RE: I guess one thing to
The flops estimate was multiplied by 4 with respect to that of a 'traditional' task, so the runtime prediction should be exactly as far off as it had been for four 'traditional' tasks.
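In BOINC terms this is the workunit's flops estimate, which the client divides by its benchmarked speed to predict the runtime. A toy illustration (the numbers are made up; `rsc_fpops_est` is the relevant workunit field, but nothing else here is actual BOINC code):

```python
def estimated_runtime(rsc_fpops_est, host_flops):
    # BOINC's runtime prediction is essentially flops estimate / host speed.
    return rsc_fpops_est / host_flops

old_est = 1.0e14   # made-up flops estimate of one 'traditional' ABP2 task
host = 2.0e9       # made-up sustained flops of some host

t_old = estimated_runtime(old_est, host)
t_new = estimated_runtime(4 * old_est, host)  # bundled task: estimate x4
assert t_new == 4 * t_old  # relative prediction error stays unchanged
```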
BM