'Fstat.out.ckp' not found after a workunit has been finished

Anonymous
Topic 13387

Quote:

The following log items are all from a single stderr.txt file in a crunching slot with my notes and questions in bold:

2006-07-19 08:53:17.9638 [normal]: Search finished successfully.

I restarted my computer at 10:11, then...

2006-07-19 10:11:19.0625 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.02_windows_intelx86.exe'.
2006-07-19 10:11:19.1562 [normal]: Started search at lalDebugLevel = 0
2006-07-19 10:11:23.6875 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-07-19 10:11:23.6875 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1

So why would the checkpoint file be lost, forcing this WU to start from the beginning? I've run into this situation a few times, and any explanation would be appreciated, since many hours of crunching have been wasted.

Please confirm that ALL of the above is from a SINGLE stderr.txt file in the slots/N/ directory.

This is strange -- it could indicate some problem in our code -- I'll have someone take a closer look at it. Does this happen repeatedly or was this a 'one-time' occurrence?

Bruce

Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820


Pretty strange. Which Core Client are you using?

As a first guess, I would suspect a problem with the access rights of the directories under BOINC. Are you running BOINC as a service, or as a different user than the one who installed it?
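One rough way to test the access-rights theory is to check whether the account BOINC runs under can create, write and delete a file in the slot directory. The sketch below is only an illustration in portable C: the function name `dir_is_writable` and the probe file name are made up for this example and are not part of BOINC or the Einstein@Home application.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if a file can be created, written and
 * removed inside `dir`, and 0 otherwise. Run it as the same user that
 * BOINC runs as, passing the slot directory (e.g. "slots/0"). */
int dir_is_writable(const char *dir)
{
    char path[1024];
    int n = snprintf(path, sizeof path, "%s/ckp_write_probe.tmp", dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return 0;                       /* path does not fit in the buffer */

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return 0;                       /* cannot create files in this directory */

    int ok = (fputs("probe\n", fp) >= 0);
    if (fclose(fp) != 0)
        ok = 0;                         /* data could not be flushed to disk */
    if (remove(path) != 0)
        ok = 0;                         /* file was created but cannot be deleted */
    return ok;
}
```

If this returns 0 for the slot directory, the application may be unable to write or clean up its checkpoint file there, which would be worth ruling out before looking at the science code.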

BM


Bernd Machenschalk


Quote:

I just found the following APIs in this page:

Quote:


Critical sections

void boinc_begin_critical_section();
void boinc_end_critical_section();

Call these around code segments during which you don't want to be suspended or killed by the core client. NOTE: this is done automatically while checkpointing.

So the problem, as I supposed, should lie in the science application?

YG

We actually asked David Anderson about that, and he told us what the NOTE above says (it might even have been inserted after our question).

Anyway, I'll have another look at the code. It might be that some cleanup after removing the checkpoint takes long enough that it should be treated as a critical section.
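To make the idea concrete, here is a minimal sketch of what treating the end-of-search cleanup as a critical section could look like. It is not the actual Einstein@Home code. The two `boinc_*` functions are defined here as stand-in stubs so the example compiles on its own; in a real application they come from the BOINC API. The function `finalize_search` and the temporary and final result file names are assumptions made for illustration.

```c
#include <stdio.h>

/* Stand-in stubs for illustration only. In a real application these two
 * functions are provided by the BOINC API and must not be redefined. */
static int critical_depth = 0;
void boinc_begin_critical_section(void) { critical_depth++; }
void boinc_end_critical_section(void)   { critical_depth--; }

/* Hypothetical end-of-search cleanup. The finished result is moved into
 * place first and the checkpoint is removed only afterwards, with both
 * steps inside one critical section.
 * Returns 0 on success, -1 if either step fails. */
int finalize_search(const char *tmp_result,
                    const char *final_result,
                    const char *checkpoint)
{
    int status = 0;

    boinc_begin_critical_section();
    if (rename(tmp_result, final_result) != 0) {
        status = -1;    /* result not in place: leave the checkpoint alone */
    } else if (remove(checkpoint) != 0) {
        status = -1;    /* result is in place, but the checkpoint is still there */
    }
    boinc_end_critical_section();

    return status;
}
```

With this ordering, an interruption during cleanup should leave either a usable checkpoint or a finished result on disk, rather than neither. Note that `rename` behaves differently across platforms when the destination already exists, so a real implementation would need to handle that case.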

BM

