RE: Call for usecase for Grid Checkpoint/Recovery (GridCPR)
> -----Original Message-----
> From: owner-gridcpr-wg@gridforum.org
> [mailto:owner-gridcpr-wg@gridforum.org]On Behalf Of Paul Stodghill
>
> 2. Write up your usecase(s). We are not providing a template, but
> we ask that you please be sure to highlight,
>
> - How you would use GridCPR in your scenario, and
> - What makes your scenario difficult for GridCPR.
Hi Paul, et al.
I'll just get right to it. Here is my GridCPR use case. I would emphasize
that this describes our intention, which may need to be adjusted to fit the
final design:
We would use GridCPR to:
- Provide automated fail-over for compute jobs, following hardware failure.
Specifically to:
* "notice" when a compute node has failed
* determine whether the failed node was running a user's job when it
failed
* re-schedule/re-run the user's job on sufficient, available resources
+ marshal the data/checkpoint files necessary for this
* refund any/all non-recovered computation time
- Provide manual execution control/migration of compute jobs.
Specifically to:
* provide users with an interface for manually halting a running job
* provide the infrastructure for migrating checkpoint/data files to
a different resource, either inside or outside the original host
* provide an interface for manually restarting jobs from checkpoint
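To make the automated fail-over steps above concrete, here is a minimal
sketch in Python. It is an illustration only, not a proposed GridCPR design:
the names Job, RecoveryPlan and plan_recovery are hypothetical, and the
refund is assumed to be the time charged since the last checkpoint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: str
    nodes: List[str]             # nodes the job was running on
    checkpoint_files: List[str]  # files needed to restart the job
    hours_used: float            # compute time charged so far
    hours_checkpointed: float    # compute time captured by the last checkpoint


@dataclass
class RecoveryPlan:
    job_id: str
    new_nodes: List[str]         # where the job will be re-run
    files_to_stage: List[str]    # checkpoint/data files to marshal there
    refund_hours: float          # non-recovered computation time


def plan_recovery(failed_node: str, jobs: List[Job],
                  free_nodes: List[str]) -> List[RecoveryPlan]:
    """Build a recovery plan for every job that was using the failed node."""
    available = [n for n in free_nodes if n != failed_node]
    plans = []
    for job in jobs:
        if failed_node not in job.nodes:
            continue  # this job was not affected by the failure
        needed = len(job.nodes)
        if len(available) < needed:
            continue  # insufficient resources; a real system would queue it
        new_nodes, available = available[:needed], available[needed:]
        # Assumption: refund whatever was charged after the last checkpoint.
        refund = max(0.0, job.hours_used - job.hours_checkpointed)
        plans.append(RecoveryPlan(job.job_id, new_nodes,
                                  list(job.checkpoint_files), refund))
    return plans
```

The "notice" step is deliberately left out: the sketch starts from the point
where something has already reported that failed_node is dead.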
What's hard about this is:
- Users must write robust checkpoint code.
- Compute resources must provide a means for distributing files within a
compute resource (e.g. LeMieux.psc.edu has compute nodes with node-local
file systems, so migration must take this into account).
- Checkpoint files must be written in such a way as to be readable on other
platforms -- for the "Grid" in GridCPR.
- Scheduling systems must support queries from the CPR system, to allow the
CPR system to ask "what happened to this checkpointed job", "where was
the job running", and "where will this job be restarted", etc.
- Scheduling systems must support automated job queueing instructions from
an external source (the CPR system).
- Compute resources must have some means for identifying if "something
died" (event handling).
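The scheduler-related requirements above could be sketched as a small
interface. This is a hypothetical illustration in Python, not an existing
scheduler API: SchedulerQueries, ToyScheduler and all method names are
invented here, and the in-memory class only stands in for a real scheduler.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class SchedulerQueries(ABC):
    """Hypothetical set of calls a CPR system would need from a scheduler."""

    @abstractmethod
    def what_happened(self, job_id: str) -> str:
        """Answer "what happened to this checkpointed job"."""

    @abstractmethod
    def where_was(self, job_id: str) -> List[str]:
        """Answer "where was the job running"."""

    @abstractmethod
    def where_next(self, job_id: str) -> Optional[List[str]]:
        """Answer "where will this job be restarted" (None if not planned)."""

    @abstractmethod
    def requeue(self, job_id: str, checkpoint: str, nodes: List[str]) -> None:
        """Accept an automated queueing instruction from the CPR system."""


class ToyScheduler(SchedulerQueries):
    """In-memory stand-in used only to exercise the interface."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}
        self._ran_on: Dict[str, List[str]] = {}
        self._next: Dict[str, List[str]] = {}
        self._checkpoint: Dict[str, str] = {}

    def start(self, job_id: str, nodes: List[str]) -> None:
        self._state[job_id] = "running"
        self._ran_on[job_id] = list(nodes)

    def node_died(self, node: str) -> List[str]:
        """Event handler: mark running jobs on the dead node as failed."""
        affected = []
        for job_id, nodes in self._ran_on.items():
            if node in nodes and self._state[job_id] == "running":
                self._state[job_id] = "failed"
                affected.append(job_id)
        return affected

    def what_happened(self, job_id: str) -> str:
        return self._state[job_id]

    def where_was(self, job_id: str) -> List[str]:
        return list(self._ran_on[job_id])

    def where_next(self, job_id: str) -> Optional[List[str]]:
        nodes = self._next.get(job_id)
        return list(nodes) if nodes is not None else None

    def requeue(self, job_id: str, checkpoint: str, nodes: List[str]) -> None:
        self._state[job_id] = "queued"
        self._next[job_id] = list(nodes)
        self._checkpoint[job_id] = checkpoint
```

For example, after start("job-1", ["n1", "n2"]) and node_died("n2"), the CPR
system can ask what_happened("job-1"), get "failed", and then call requeue
with the checkpoint file and a new set of nodes.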
If we can do that, we'll all be rich and famous. :-)
Cheers,
Nathan.
+-----------------------------+----------------------------------+
| Nathan T.B. Stone, Ph.D. | Pittsburgh Supercomputing Center |
| Advanced Systems Group | 4400 Fifth Avenue |
+-----------------------------+ Pittsburgh, PA 15213 |
| mailto:stone@psc.edu | phone: 412-268-4367 |
| http://www.psc.edu/~nstone/ | fax: 412-268-5832 |
+-----------------------------+----------------------------------+
PGP public key at: http://www.psc.edu/~nstone/pubkey.txt