
FW: checkpointing architecture



Meant to reply to the list - finger trouble.

Stephen

-----Original Message-----
From: Stephen Pickles [mailto:stephen.pickles@man.ac.uk] 
Sent: 11 June 2003 18:58
To: 'Douglas Thain'
Subject: RE: checkpointing architecture



> -----Original Message-----
> From: owner-gridcpr-wg@gridforum.org
> [mailto:owner-gridcpr-wg@gridforum.org] On Behalf Of Douglas Thain
> Sent: 10 June 2003 17:39
> To: gridcpr-wg@gridforum.org
> Subject: Re: checkpointing architecture
> 
> 
> 
> >     - if the checkpoint recovery is because of system failure, need to
> >       refund wasted user hours
> 
> That's an interesting policy.

This policy is in force at PSC. CSAR, a UK national HPC service,
is considering adopting a similar policy.

The requirement for the architecture is that there must be some
way of flagging affected jobs, and provision for passing this
info to accounting systems. Relevant GGF WGs that are tackling
accounting head-on include UR-WG, RUS-WG, and GESA-WG.
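
To make the flagging requirement concrete, here is a rough sketch of the
kind of record a checkpoint/recovery service might hand to an accounting
system. Everything in it is an assumption for illustration only: the
JobRecord fields, the "system"/"user" cause labels, and the rule that the
refund equals processors times wall-clock hours lost since the last
checkpoint. It is not taken from the PSC or CSAR policies, nor from any
GGF usage-record schema.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # All field names are hypothetical, chosen for illustration.
    job_id: str
    processors: int
    hours_since_checkpoint: float  # wall-clock hours lost at failure
    failure_cause: str             # "system" or "user" (assumed labels)


def refund_cpu_hours(job: JobRecord) -> float:
    """Refund lost work only when the failure was the system's fault."""
    if job.failure_cause != "system":
        return 0.0
    return job.processors * job.hours_since_checkpoint


def accounting_records(jobs):
    """Build one flagged record per job for an accounting system."""
    return [
        {
            "job_id": job.job_id,
            "system_failure": job.failure_cause == "system",
            "refund_cpu_hours": refund_cpu_hours(job),
        }
        for job in jobs
    ]
```

The point is only that the flag ("system_failure") and the refundable
amount travel together, so the accounting side does not need to know how
the checkpoint was taken.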

> 
> Consider an application that is designed to run forever, and
> checkpoints using a private mechanism (such as writing to a 
> file) that is unknown to the accounting system.
> 
> Won't that give the user a free lunch?

Not really. At best the user will get the occasional rebate.
But it does incentivize users to checkpoint-enable their
applications. I for one believe that this constitutes good
practice (as we simulate bigger systems, scaling laws mean
that you usually need longer run-times _and_ more processors;
both of these make a failure during the run more likely for
a given MTBF).
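
As a back-of-envelope illustration of the scaling point, here is a small
sketch. It assumes (my assumption, not a measured model) that each
processor fails independently at a constant rate of 1/MTBF, so a job on
n processors running for t hours survives with probability
exp(-n * t / MTBF). The function name and parameters are hypothetical.

```python
import math


def failure_probability(processors: int, hours: float,
                        mtbf_hours: float) -> float:
    """Chance that at least one processor fails during the run.

    Assumes independent, exponentially distributed failures with the
    same per-processor MTBF (an idealisation, for illustration only).
    """
    return 1.0 - math.exp(-processors * hours / mtbf_hours)


# Doubling either the run-time or the processor count raises the risk
# by the same amount under this model.
small = failure_probability(processors=64, hours=12, mtbf_hours=50_000)
longer = failure_probability(processors=64, hours=24, mtbf_hours=50_000)
wider = failure_probability(processors=128, hours=12, mtbf_hours=50_000)
```

Under this model the exponent depends only on the product of processors
and hours, which is why needing more of both at once hurts so much.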

> 
> >      - if the checkpoint can be restored on another machine, perhaps
> >        should store on some data server so available if original machine
> >        does not come up
> 
> Suggest that you read Jim Basney's work on checkpointing
> domains, which discusses assignment, naming, and migration 
> between checkpoint servers in a production system:
> 
> Jim Basney, Miron Livny, and Paolo Mazzanti, "Utilizing
> Widely Distributed Computational Resources Efficiently with 
> Execution Domains", Computer Physics Communications, 2000.
> http://www.cs.wisc.edu/condor/doc/cpc.ps
>
> Doug

Good tip!

Stephen

==================== Stephen M. Pickles ====================

Software Infrastructure Manager, e-Science team
Supercomputing, Visualization and E-Science
Manchester Computing
Room G49.1, Kilburn Building       stephen.pickles@man.ac.uk
The University of Manchester           tel: +44 161 275 5974
Oxford Road                            fax: +44 161 275 6800
Manchester M13 9PL                 http://www.sve.man.ac.uk/