--- Begin Message ---
- To: Paul Stodghill <stodghil@cs.cornell.edu>
- Subject: Re: [Paul Stodghill] Call for usecase for Grid Checkpoint/Recovery(GridCPR)
- From: "Yeom HeonYoung" <yeom@arirang.snu.ac.kr>
- Date: Fri, 31 Oct 2003 19:19:38 +0900
- In-reply-to: <r9965ikm4pw.fsf@barney.cs.cornell.edu>
- References: <r9965ikm4pw.fsf@barney.cs.cornell.edu>
- Reply-to: yeom@snu.ac.kr
- Xref: barney.cs.cornell.edu email.2003-10:1477
Hi Paul,
Here's my use case.
It might not be relevant to the current effort,
since we are mostly interested in system-level checkpointing
and restart for homogeneous Grid systems running
Linux and Globus, supporting applications that use MPICH-G2.
We are not considering heterogeneous systems at this stage.
Basically, what we support is user-transparent fault tolerance
for users running MPI applications on homogeneous Grid systems.
All the users have to do is relink their application
against the modified MPICH library.
What we provide is 1) a modified MPICH-G2, which we call MPICH-GF,
in which the message send/receive calls are wrapped to make them
safe for checkpointing and message logging, and 2) a job management
system, similar to DUROC, which monitors the applications, sends
checkpoint signals periodically, and restarts the applications
if they fail.
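
As a rough illustration of the wrapping idea in 1), here is a minimal Python sketch. It is not MPICH-GF code: the class CheckpointSafeComm, its transport callback, and request_checkpoint are hypothetical names, and the "checkpoint" is only a pickled copy of the message log. It shows one way a wrapper could log each outgoing message and postpone a checkpoint request that arrives in the middle of a send until the call has returned.

```python
import pickle


class CheckpointSafeComm:
    """Toy stand-in for a wrapped message layer (hypothetical, not MPICH-GF)."""

    def __init__(self, transport):
        self.transport = transport       # callable(dest, payload); an assumption
        self.in_call = False             # True while a send is in progress
        self.pending_checkpoint = False  # request that arrived during a send
        self.sent_log = []               # log of (dest, payload) pairs
        self.checkpoints = []            # pickled snapshots of the log

    def request_checkpoint(self):
        """Entry point for a checkpoint request, e.g. from a signal handler."""
        if self.in_call:
            # Not safe to checkpoint mid-call: remember the request instead.
            self.pending_checkpoint = True
        else:
            self._take_checkpoint()

    def _take_checkpoint(self):
        self.checkpoints.append(pickle.dumps(self.sent_log))
        self.pending_checkpoint = False

    def send(self, dest, payload):
        """Wrapped send: log the message, send it, then honor deferred requests."""
        self.in_call = True
        try:
            self.sent_log.append((dest, payload))
            self.transport(dest, payload)
        finally:
            self.in_call = False
        if self.pending_checkpoint:
            self._take_checkpoint()
```

A real implementation would have to save the whole process image and treat receives the same way; the sketch only covers the deferral logic on the send path.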
We provide coordinated checkpointing as well as independent
checkpointing with message logging.
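
To make the second mode concrete, here is a small Python sketch of independent checkpointing with a receiver-side message log. Everything in it is an assumption for illustration (the LoggedProcess class, an integer as process state, an in-memory list standing in for stable storage); it is not how MPICH-GF stores its logs. The point is only that a process can restore its own last checkpoint and replay the messages logged after it, without rolling back its peers.

```python
class LoggedProcess:
    """Toy process whose state depends on the order of delivered messages."""

    def __init__(self):
        self.state = 0
        self.message_log = []     # stands in for a log on stable storage
        self.checkpoint = (0, 0)  # (saved state, log length at that time)

    def _apply(self, msg):
        # Order-dependent update, so replay order matters.
        self.state = self.state * 10 + msg

    def deliver(self, msg):
        """Log the message before applying it to the process state."""
        self.message_log.append(msg)
        self._apply(msg)

    def take_checkpoint(self):
        """Independent checkpoint: no coordination with other processes."""
        self.checkpoint = (self.state, len(self.message_log))

    def recover(self):
        """Restore the last checkpoint, then replay messages logged after it."""
        saved_state, logged = self.checkpoint
        self.state = saved_state
        for msg in self.message_log[logged:]:
            self._apply(msg)


p = LoggedProcess()
p.deliver(1)
p.deliver(2)
p.take_checkpoint()  # state is 12 here
p.deliver(3)
p.deliver(4)         # state is 1234
p.state = None       # simulate losing the in-memory state
p.recover()
print(p.state)       # prints 1234
```

With coordinated checkpointing, by contrast, all processes would checkpoint together at a consistent point, so no replay log is needed, at the cost of synchronizing every process.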
You can find the details at http://dcslab.snu.ac.kr/projects/mpichgf/
We could use some facilities from GridCPR:
- health monitoring of application processes
- checkpoint storage
- job resubmission process
These are mainly management issues.
Somebody (or some process) should be in charge of all this
to make it work.
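
The kind of process that could be "in charge" might look like the following Python sketch. It is only an outline under stated assumptions: supervise and run_attempt are hypothetical names, not a GridCPR or DUROC interface, and run_attempt is assumed to run the job once from a given checkpoint and report whether it succeeded together with the newest checkpoint it produced.

```python
def supervise(run_attempt, max_restarts=3):
    """Run a job, resubmitting it from its latest checkpoint when it fails.

    run_attempt(checkpoint) -> (ok, new_checkpoint) is a hypothetical
    callback; checkpoint is None on the first submission.
    Returns (number of restarts used, final checkpoint).
    """
    checkpoint = None
    for attempt in range(max_restarts + 1):
        ok, new_checkpoint = run_attempt(checkpoint)  # health check result
        if new_checkpoint is not None:
            checkpoint = new_checkpoint               # checkpoint storage
        if ok:
            return attempt, checkpoint
        # Otherwise fall through and resubmit from the stored checkpoint.
    raise RuntimeError("job still failing after %d restarts" % max_restarts)
```

Keeping the three concerns behind one callback is just a convenience for the sketch; in practice monitoring, storage, and resubmission could be separate services.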
By the way, is there any support for file I/O at all?
Or is that up to the user as well?
--
Heon Y. Yeom
Associate Professor
School of Computer Science and Engineering
Seoul National University
Seoul, 151-742, Korea
--- End Message ---