pending credit

Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820
Topic 13718

I'm afraid the main reason for the sometimes large amount of pending credit lies in our "monster crunchers": ATLAS and the two GRID accounts (eScience and Britta Daudert).

ATLAS' normal usage is quite bursty and barely predictable, and so is its contribution to Einstein@home. This would suggest a small work cache size, but I found that this stresses our scheduler too much at times when ATLAS runs Einstein@home almost exclusively (on >50% of the nodes). The current cache size of 4 days keeps the project healthy, but leads to a lot of tasks being held "captive" when the normal load rises and ATLAS is not running Einstein@home.

The GRID jobs run for a fixed number of seconds, and thus they request work for only that amount of time. The scheduler assigns the smallest amount of work whose estimated duration exceeds the requested one. With the large number of machines involved, there will always be many tasks "hanging over the cliff" that will never be completed once the "job" terminates. In addition, BOINC's duration prediction isn't perfect to begin with, and this kind of job scheduling is something BOINC was not designed for, which makes it even worse.
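To illustrate the effect (this is only a sketch of the behaviour described above, with made-up durations, not actual BOINC scheduler code):

```python
# Illustrative sketch: the scheduler keeps assigning tasks until the
# estimated total duration first exceeds the requested amount, so the
# last task "hangs over the cliff" past the end of the fixed-length job.

def assign_work(requested_seconds, estimated_task_seconds):
    """Return the tasks sent for a request of `requested_seconds`,
    assuming every task is estimated to run `estimated_task_seconds`."""
    tasks = []
    total = 0.0
    while total < requested_seconds:   # stop only once the request is covered
        tasks.append(estimated_task_seconds)
        total += estimated_task_seconds
    return tasks, total

# A hypothetical 6-hour GRID slot with tasks estimated at 4 hours each:
tasks, total = assign_work(6 * 3600, 4 * 3600)
overhang = total - 6 * 3600
print(len(tasks), overhang)  # 2 tasks; 2 hours of the second one overhang
```

The second task here can never finish before the slot ends, so it sits in "pending" until the deadline timeout, which is exactly the effect multiplied across thousands of GRID machines.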

With a small number of machines you probably wouldn't notice, but these accounts run literally thousands of machines and are the largest single contributors to Einstein@home, so any fluctuation there is highly noticeable to the whole project. You probably saw the pendings rise (and the oldest unsent result getting older) when we took ATLAS off E@H completely for a while.

If anybody knows a reasonably easy way to lower the pendings that arise from the facts stated above, I'd do my best to implement it.

BM

Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

pending credit

Quote:
With regard to the Einstein issue: am I right in thinking that the major problem, especially with the GRID nodes, is that once the allocated number of seconds has passed, the virtual machine is effectively wiped clean, and starts afresh the next time it is tasked with an Einstein time-slice? As contrasted with an ordinary home PC, which preserves its state and data files between runs? If so, the only solution would seem to be a mechanism for sweeping Einstein 'work-in-progress' files off to a server or NAS box before the time runs out, and retrieving them at the start of the next run.


Honestly, I'm not sure how they manage their host IDs or whether this could be easily implemented. A task reported from a different host than the one it was assigned to would not be accepted by the server.

I'm not sure whether this helps, but I'll propose that they send the client a message to abort all tasks in progress, or detach from the project, before the job actually terminates. That would at least report the tasks as client errors and create new unsent tasks immediately, instead of waiting for the deadline timeout.

BM
