Re: Call for usecase for Grid Checkpoint/Recovery (GridCPR)



The Grid checkpointing scenarios that we have considered, and that we
are trying to support in the context of the EDG (European DataGrid)
project, are aimed at resilience to failures: when a failure occurs,
we want to avoid rerunning the job from the beginning, so that the
processing done up to the failure is not lost and resources are not
wasted.
This is particularly important for long-running jobs, as is the case
for many HEP (High Energy Physics) applications that we have to
"support" (some of them run for many hours or many days).


The idea of EDG Grid Checkpointing is that it is up to the user to
define what the state of the job is (it must be "enough" to restart
the job from a previously saved state), and it is up to the user to
save an intermediate state from time to time.
The application must also be instrumented so that it can start the
computation from a previously saved state.
All of this is done by instrumenting the code with the proper EDG Grid
Checkpointing APIs (details on these APIs have already been circulated
on this mailing list).
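
As a rough illustration of this user-level approach, here is a minimal
Python sketch. It does NOT use the EDG Grid Checkpointing APIs: the
names save_state, load_state and run_job, the JSON state file and the
fail_at parameter are all assumptions made up for this example. The job
decides what its state is (the next event to process and a running
total), saves it every so often, and starts from a saved state when one
is found.

```python
import json
import os


def save_state(path, state):
    """Save the state atomically, so a crash mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_state(path):
    """Return the last saved state, or None if nothing was saved yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run_job(path, n_events, checkpoint_every=100, fail_at=None):
    """Toy job: sum the squares of 0..n_events-1.

    The user-defined state is "enough" to restart: the index of the next
    event and the running total. fail_at (hypothetical) simulates a
    failure just before that event is processed.
    """
    state = load_state(path) or {"next_event": 0, "total": 0}
    for i in range(state["next_event"], n_events):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += i * i
        state["next_event"] = i + 1
        if state["next_event"] % checkpoint_every == 0:
            save_state(path, state)  # intermediate state
    save_state(path, state)  # final state
    return state["total"]
```

If a first run fails at event 250, the last saved state is the one
written after event 199, so a second call to run_job with the same path
only reprocesses events 200 onwards.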


Given this framework we now support two main use-cases:

- A job, instrumented with the Grid Checkpointing APIs, runs on a
  computing resource and saves its intermediate results (i.e. its
  intermediate state) from time to time.
  Suppose that a "Grid problem" occurs, i.e. a problem external to the
  job (e.g. a failure of the computing resource where the job was
  running).
  If the Grid middleware is able to detect the failure, the EDG
  Workload Management System automatically reschedules the job (if
  the user has enabled this option, of course) and resubmits it to
  a compatible resource (a different one, if one is available).
  When the job restarts, the last saved state is retrieved, so the
  computation resumes from that point and not from the beginning.


- Suppose that a failure (of any kind) happens to a job (instrumented
  with the Grid Checkpointing APIs) running on a computing resource,
  but the failure is not detected by the Grid middleware (it is not
  easy at all to detect every possible failure and to act according
  to the specific type of problem).
  In this case we allow the user to retrieve a saved state of the
  job (usually the user is interested in the last saved state) and
  then resubmit the job (possibly after having modified/fixed
  something, e.g. in the specification of the job requirements),
  specifying that the job must start not from the beginning but from
  the specified state (i.e. the state just retrieved, which was saved
  during the previous run of the job).
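
The two use cases above can be sketched with a toy model. This is only
an illustration under stated assumptions, not how the EDG Workload
Management System is implemented: ResourceFailure, count_job, submit,
the "fails_at" field and the list used as a state store are all
hypothetical names invented for this example.

```python
class ResourceFailure(Exception):
    """A problem external to the job, as detected by the toy middleware."""


def count_job(state, store, n_steps, resource):
    """Toy job: counts up to n_steps, saving its state after each step."""
    step = state["step"] if state else 0
    while step < n_steps:
        if resource["fails_at"] == step:
            raise ResourceFailure(resource["name"])
        step += 1
        store.append({"step": step})  # save an intermediate state
    return step


def submit(job, resources, n_steps, store, start_state=None, retry=True):
    """Run the job on the first resource in the list.

    With retry enabled, a detected failure causes a resubmission to the
    next resource, starting from the last saved state (first use case).
    start_state lets the user start from a state retrieved by hand
    (second use case).
    """
    state = start_state
    for resource in resources:
        try:
            return job(state, store, n_steps, resource), resource["name"]
        except ResourceFailure:
            if not retry:
                raise
            state = store[-1] if store else None  # last saved state
    raise RuntimeError("no compatible resource left")
```

For the first use case, submitting to two resources where the first one
fails at step 3 completes on the second one without repeating steps 1
to 3. For the second use case, the user takes store[-1] after the failed
run and passes it as start_state in a new submission.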


Another scenario where job checkpointing is used in our EDG
environment is job partitioning.
The idea is that a job can be partitioned into sub-jobs, which can be
executed in parallel. A "job aggregator" is then responsible for
collecting the results of these sub-jobs (represented by their "final"
states) and providing the overall result.
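
A small Python sketch of the partitioning idea follows. It is only meant
to show the shape of the scenario: the function names (partition,
sub_job, aggregate, run_partitioned) and the use of a local thread pool
in place of Grid resources are assumptions of this example, not part of
the EDG software.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into at most n_parts contiguous chunks."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sub_job(chunk):
    """Process one partition and return its "final" state."""
    return {"count": len(chunk), "total": sum(chunk)}


def aggregate(final_states):
    """Job aggregator: combine the final states into the overall result."""
    count = sum(s["count"] for s in final_states)
    total = sum(s["total"] for s in final_states)
    return {"count": count, "mean": total / count if count else None}


def run_partitioned(data, n_parts):
    """Run the sub-jobs in parallel, then aggregate their final states."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        final_states = list(pool.map(sub_job, partition(data, n_parts)))
    return aggregate(final_states)
```

The point of the design is that the aggregator only needs the final
states of the sub-jobs, not their intermediate data, so each sub-job can
run wherever a resource is available.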



We also plan to exploit Grid Checkpointing in scenarios of job
preemption, i.e. when it is necessary to "vacate" jobs from a certain
resource for some reason (e.g. because that machine must be used to
run another job with higher priority), but we do not yet support
this functionality.



More details on the EDG Grid Checkpointing can be found at:

https://edms.cern.ch/document/347730

I will have to review this document in the next few days, but there
shouldn't be fundamental changes to the basic concepts.


					Cheers, Massimo






             \\\|///
            \\ ~ ~ //
            (/ @ @ /)
   -------oOOo-(_)-oOOo----------------------------------
                              Massimo Sgaravatto
                              INFN Sezione di Padova
                              Via Marzolo, 8
                              35131 Padova - Italy
                              Tel: ++39 0498277047   Fax: ++39 0498277102
          oooO                E-mail: massimo.sgaravatto@pd.infn.it
          (   )   Oooo        Home page: http://www.pd.infn.it/~sgaravat
   --------\ (----(   )----------------------------------
            \_)    ) /
                  (_/





On Mon, 20 Oct 2003, Paul Stodghill wrote:

> At GGF9 in Chicago, the GridCPR Working Group decided to produce a
> document that describes the usecases for checkpoint and recovery in a
> Grid environment. This document will be submitted to the GGF for
> discussion at GGF10 and will eventually become a GWD-I document. The
> purpose of this document is to provide a focus for the
> checkpoint/recovery API that will be proposed in the near future.
>
> This message is a call for usecases to be included in the document!
>
> If you are interested in providing a usecase for the working group,
> please do the following,
>
> 1. Read the working group's charter at,
>
>         http://gridcpr.psc.edu/GGF/charter/GridCPR-WG-charter.1.1.txt
>
>    In particular, the GridCPR working group is charged with defining,
>
> 	"a user-level API and associated layer of services that will
> 	permit checkpointed jobs to be recovered and continued on the
> 	same or on remote Grid resources."
>
>    Our focus is on
>
> 	"recoverability of jobs among heterogeneous Grid resources"
>
>    This means, for instance, that we are not considering system-level
>    checkpoint/recovery schemes (a la Condor) at this time.
>
>    If you are not sure whether your usecase falls under this working
>    group's charter, please submit it anyway!
>
> 2. Write up your usecase(s). We are not providing a template, but
>    we ask that you please be sure to highlight,
>
>     - How you would use GridCPR in your scenario, and
>     - What makes your scenario difficult for GridCPR.
>
> 3. Submit your usecases to the GridCPR mailing list
>    <gridcpr-wg@gridforum.org> or to the usecase document editor (Paul
>    Stodghill <stodghil@cs.cornell.edu>).
>
>    Please submit your usecase by October 31, 2003.
>
> Here is the current schedule for the production of this usecase
> document,
>
> - October 31, 2003 - Usecase submission deadline.
> - November 30, 2003 - First draft submitted to GridCPR mailing list for discussion.
> - January 15, 2004 - Draft completed and submitted to GGF
> - March 2004 - Discussion of draft at GGF10 in Frankfurt, Germany.
> - Post GGF 10 - Revise document and submit to GGF Document Editor for
> 			final approval.
>
> If you have any questions or comments, please feel free to get in touch
> with me. Thanks.
>
> Paul Stodghill <stodghil@cs.cornell.edu>
> Department of Computer Science
> 4128 Upson Hall
> Cornell University
> Ithaca, NY  14853
> USA
>