gridcpr use case
The use case for grid checkpoint recovery from the GridLab project is
as follows:
A job, consisting of one or more processes, is running on a grid machine.
In the middle of the run, the job may be forced to migrate to a different
machine, possibly with a different architecture and/or number of CPUs.
The application program may either decide by itself to migrate (e.g. because
of poor performance on the current machine) or be forced to do so, either by
the user (via an application manager) or by the local resource management
software when it wishes to evict the job.
The main purpose of Grid cpr in GridLab is thus the ability to interrupt and
migrate a job, repeatedly if necessary, until it finally terminates.
Fault tolerance is only a secondary aspect.
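
To make the interrupt-and-migrate cycle concrete, here is a minimal Python
sketch. It is not GridLab code: run_job, Machine and Trigger are hypothetical
names invented for illustration, the "work" is a trivial running sum, and the
checkpoint is assumed to be a JSON string so that it stays independent of the
machine architecture and CPU count.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    """Who asked for the migration (the three cases in the use case)."""
    APPLICATION = "application"            # job decides by itself
    USER = "user"                          # via an application manager
    RESOURCE_MANAGER = "resource_manager"  # local eviction


@dataclass
class Machine:
    name: str
    cpus: int  # may differ between sites; the checkpoint must not depend on it


def run_job(total_steps, machines, migrations):
    """Run a toy job to completion, migrating whenever a trigger fires.

    migrations maps a step number to the Trigger that fires before that step.
    Returns (result, log), where log holds (step, trigger, source, target).
    """
    pending = dict(migrations)  # copy, so each trigger fires only once
    checkpoint = json.dumps({"step": 0, "total": 0})
    site = 0
    log = []
    while True:
        # Restart on the current machine from the portable checkpoint.
        state = json.loads(checkpoint)
        while state["step"] < total_steps:
            trigger = pending.pop(state["step"], None)
            if trigger is not None:
                break  # interrupted: stop computing and migrate
            state["total"] += state["step"]
            state["step"] += 1
        else:
            return state["total"], log  # job finally terminated

        # Write the checkpoint, then pick the next machine (round robin here).
        checkpoint = json.dumps(state)
        target = (site + 1) % len(machines)
        log.append((state["step"], trigger.value,
                    machines[site].name, machines[target].name))
        site = target


if __name__ == "__main__":
    sites = [Machine("site-a", cpus=4), Machine("site-b", cpus=16)]
    plan = {3: Trigger.APPLICATION, 7: Trigger.RESOURCE_MANAGER}
    print(run_job(10, sites, plan))
```

The point of the sketch is only that the job's result does not depend on how
often, why, or to which machine it was migrated; choosing the target machine
and transferring the checkpoint are left out entirely.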
An extension of the above use case is dealing with jobs that run in parallel
on multiple grid sites.
Thilo
--
Thilo Kielmann http://www.cs.vu.nl/~kielmann/