Handling of power-outage crashes?

Anonymous
Topic 13594

The Einstein@Home App writes two files: a (temporary) output file and a checkpoint file.

The checkpoint file contains information about some internal state and about the temporary output file, too (e.g. a checksum). The Einstein@Home App writes the checkpoint to a temporary file and then renames it, overwriting a possible previous checkpoint file in a single, "atomic" operation. This way there's always a checkpoint it can recover from, which might, in rare cases, be a previous one. There is, however, no way for the App to influence how "atomic" the rename operation actually is within the operating system.
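To illustrate the write-then-rename pattern described above, here is a minimal sketch in Python. It is not the App's actual code; the function name write_checkpoint and the ".ckpt-" prefix are made up for this example, and the fsync call is an extra precaution I added that the App may or may not use.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes) -> None:
    """Write `data` to a temporary file, then rename it over `path`.

    Hypothetical helper for illustration only.
    """
    # The temporary file must live in the same directory (same filesystem),
    # otherwise the rename cannot be a single operation.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Assumption: ask the OS to push the data to disk before renaming.
            os.fsync(tmp.fileno())
        # os.replace overwrites an existing destination file.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the previous checkpoint untouched and clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader of the checkpoint then sees either the complete old file or the complete new one, as far as the operating system and filesystem honor the rename.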

The output file is normally only appended to. There is a certain operation, however, which we call "compacting", which is automatically done when the file grows larger than a certain limit. Then the output is rewritten (to a temporary file first, then renamed to the original name, overwriting the older file). You'll see a message "Compacting toplist..." in stderr output. This usually only happens a few times during a run, and less frequently with time.
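The append-then-compact behaviour can be sketched roughly as follows. This is only an illustration under assumptions: the line format ("score payload"), the rule of keeping the highest-scoring entries, the ".tmp" suffix, and the names append_and_maybe_compact, max_bytes and keep are all hypothetical and do not describe the App's real file format or limits.

```python
import os


def append_and_maybe_compact(path: str, line: str, max_bytes: int, keep: int) -> bool:
    """Append one entry; rewrite the file if it has grown past `max_bytes`.

    Returns True if a compaction happened. Hypothetical sketch only.
    """
    # Normal case: the output file is only appended to.
    with open(path, "a", encoding="utf-8", newline="\n") as f:
        f.write(line + "\n")

    if os.path.getsize(path) <= max_bytes:
        return False

    # Compaction: read everything, keep the `keep` best entries
    # (assumed here: highest leading number first).
    with open(path, "r", encoding="utf-8") as f:
        entries = [entry for entry in f.read().splitlines() if entry]
    entries.sort(key=lambda entry: float(entry.split()[0]), reverse=True)

    # Write to a temporary file first, then rename over the original.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(entries[:keep]) + "\n")
    os.replace(tmp_path, path)
    return True
```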

While the output file is being compacted, it can happen that for a short time the checkpoint does not reflect the state of the temporary output file. If the App is interrupted after it has written the new output file but before it has written a new checkpoint, the state on disk is inconsistent. It would be possible to avoid this by deleting the checkpoint file first, so that an App interrupted before writing a new one would start over from the beginning. However, the time already spent on the task would be wasted anyway in that case, and deleting the checkpoint bears the additional risk of Tasks that endlessly restart from the beginning, so I didn't take that option.
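To make the inconsistent state concrete, here is a small sketch of how a restart could compare a checkpoint against the output file. The JSON layout, the field names output_bytes and output_sha256, and the idea of checksumming only the first output_bytes bytes are assumptions made for this example, not the App's real checkpoint format.

```python
import hashlib
import json


def describe_output(output_path: str) -> dict:
    """Record length and checksum of the output file (hypothetical format)."""
    with open(output_path, "rb") as f:
        data = f.read()
    return {
        "output_bytes": len(data),
        "output_sha256": hashlib.sha256(data).hexdigest(),
    }


def checkpoint_matches_output(checkpoint_path: str, output_path: str) -> bool:
    """Return True if the output file still starts with what was checkpointed."""
    try:
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            recorded = json.load(f)
        with open(output_path, "rb") as f:
            data = f.read()
    except (OSError, ValueError):
        # Missing or unreadable files count as "no usable checkpoint".
        return False
    if not isinstance(recorded, dict):
        return False
    length = recorded.get("output_bytes")
    if not isinstance(length, int) or length < 0 or length > len(data):
        return False
    # Appended data after the checkpoint is fine; a rewritten file is not.
    prefix_hash = hashlib.sha256(data[:length]).hexdigest()
    return prefix_hash == recorded.get("output_sha256")
```

In this sketch, lines appended after the last checkpoint still pass the check, while an output file that was rewritten by compacting without a fresh checkpoint does not.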

It is unlikely, to the point of practical impossibility, that all Tasks were compacting their output at the very same time and that the power outage happened just then. More likely the failure was caused by a corruption of the directory, or of more than one file, by the operating system.

You can point me to the results in question (when they have been reported) if you want me to take a closer examination of your case.

BM