
RE: checkpointing architecture



Hi Folks,

	More comments on this thread.  I have also tagged each item below as
either a "service" or an "interface".  Specifically, I think we should
begin to distinguish clearly between services that are central to CPR
functionality (across the board) and features that are optional (reached
via their interfaces).

	Comments/questions invited.

Regards,
	Nathan.

 +-----------------------------+----------------------------------+
 | Nathan T.B. Stone, Ph.D.    | Pittsburgh Supercomputing Center |
 | Advanced Systems Group      | 4400 Fifth Avenue                |
 +-----------------------------+ Pittsburgh, PA 15213             |
 | mailto:nstone@psc.edu       | phone: 412-268-4367              |
 | http://www.psc.edu/~nstone/ | fax:   412-268-5832              |
 +-----------------------------+----------------------------------+
      PGP public key at: http://www.psc.edu/~nstone/pubkey.txt

> -----Original Message-----
> From: owner-gridcpr-wg@gridforum.org
> [mailto:owner-gridcpr-wg@gridforum.org]On Behalf Of Tom Goodale
>
> One thing we could do is take each of the elements below and try to come
> up with a few bullet points as to why it needs to be part of the
> architecture; then we can perhaps decide which really belong and
> see how things fit together.  I've tried to do this with some of them...
>
>  - AAA (accounting, charging etc.)
>
>     - if the checkpoint recovery is because of system failure, need to
>       refund wasted user hours

<specify as interface -- assume existing accounting infrastructure>

- Doug Thain's concern about user abuse is a valid point.  But I think it is
feasible to rely upon a system's internal integrity to track when it can or
cannot recover a run from checkpoint (and thereby to know what was
"wasted").  If some institution creates a CPR infrastructure that is very
loose, then it ought to be willing to "eat" the cost of wasted hours, even
if users got something good from it.

>  - scheduling of resources

<specify as interface -- assume existing scheduling infrastructure>

- There should certainly be at least a generic interface for feeding a job
back into the queue, so that it can be re-started automatically.
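
To make the "generic interface" idea concrete, here is a minimal sketch in
Python.  Every name in it (RestartRequest, SchedulerInterface, resubmit,
InMemoryScheduler) is hypothetical and made up for illustration; a real site
would wrap its own batch system behind something of this shape.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartRequest:
    job_id: str                   # hypothetical identifier of the original job
    checkpoint_id: Optional[str]  # checkpoint to resume from, or None to start over


class SchedulerInterface(ABC):
    """Hypothetical generic interface; each site supplies its own adapter."""

    @abstractmethod
    def resubmit(self, request: RestartRequest) -> str:
        """Put the job back in the queue and return an ID for the new entry."""


class InMemoryScheduler(SchedulerInterface):
    """Toy stand-in for a real scheduler, used only to exercise the interface."""

    def __init__(self):
        self.queue = []

    def resubmit(self, request: RestartRequest) -> str:
        self.queue.append(request)
        return f"{request.job_id}-restart-{len(self.queue)}"
```

The CPR system would only ever call resubmit(); how the job actually gets
queued stays with the existing scheduling infrastructure.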

>  - resource brokers

<specify as service -- central to CPR functionality>

- This should include the resources required to store/manage checkpoint
state (e.g. disk space and perhaps DB resources).
  * (If you get into job resources you're stepping out of the sandbox into
someone else's grass... ;-)

>  - data management for checkpoint files
>
>      - need to store them some place
>      - if the checkpoint can be restored on another machine, perhaps
>        should store on some data server so it is available if the
>        original machine does not come up

<specify as service -- central to CPR functionality>

- great idea!
- more basically, a checkpoint system should have some kind of dynamic
configuration for specifying where checkpoint files should go.
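
One way to picture "dynamic configuration" is a lookup with overrides, as in
the Python sketch below.  The environment variable CPR_CHECKPOINT_ROOT, the
config key checkpoint_root, the default directory and the path layout are all
assumptions of mine, not anything we have agreed on.

```python
import os
from pathlib import PurePosixPath

DEFAULT_ROOT = "/tmp/checkpoints"  # hypothetical fallback location


def checkpoint_path(job_id, version, config=None, env=None):
    """Resolve where a checkpoint should go.

    Precedence (an assumption): environment variable, then the config
    dictionary, then the built-in default.
    """
    env = os.environ if env is None else env
    config = config or {}
    root = (
        env.get("CPR_CHECKPOINT_ROOT")
        or config.get("checkpoint_root")
        or DEFAULT_ROOT
    )
    # Layout <root>/<job_id>/ckpt-NNNN is illustrative only.
    return str(PurePosixPath(root) / job_id / f"ckpt-{version:04d}")
```

Pointing the root at a shared data server rather than local disk is then a
configuration change, not a code change.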

> - checkpoint history (versioning for a single run)
>
>     - support rollbacks to earlier times in the application run

<specify as service -- element of CPR state management>

- This is critical to internal integrity.  It is also central to determining
whether recovery is possible and therefore if any time has been "wasted"
(see earlier discussion).
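
As a rough illustration of how a checkpoint history could answer both
questions (can we recover, and how much was "wasted"), here is a small Python
sketch.  The class and field names are hypothetical, and it assumes "wasted"
means run time elapsed since the most recent usable checkpoint.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    version: int
    run_seconds: float  # application run time elapsed when the checkpoint was taken
    valid: bool = True  # False if the checkpoint is known to be unusable


@dataclass
class CheckpointHistory:
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def record(self, run_seconds, valid=True):
        """Append a new version to the history for this run."""
        ckpt = Checkpoint(len(self.checkpoints) + 1, run_seconds, valid)
        self.checkpoints.append(ckpt)
        return ckpt

    def latest_valid(self, before=None):
        """Most recent usable checkpoint, optionally no later than `before`."""
        candidates = [
            c for c in self.checkpoints
            if c.valid and (before is None or c.run_seconds <= before)
        ]
        return max(candidates, key=lambda c: c.run_seconds) if candidates else None

    def wasted_seconds(self, failed_at):
        """Run time lost if the job failed at `failed_at` seconds."""
        last = self.latest_valid(before=failed_at)
        return failed_at - (last.run_seconds if last else 0.0)
```

Passing an earlier time to latest_valid() gives the rollback case; if it
returns None, recovery is not possible and the whole run counts as wasted.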


> - other meta data (which?)

<specify as service -- element of CPR state management>

- user/group membership, for charging/accounting info

>
> - application status monitoring

<specify as service -- central to CPR system functionality>

- ...or monitoring in general.  A complete CPR system should have some kind
of hardware (compute node) monitoring, to identify when nodes fail.
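
For node monitoring, one common approach is a heartbeat with a timeout.  The
Python sketch below assumes that approach; the NodeMonitor class and its
methods are hypothetical, and timestamps are passed in explicitly so the logic
is easy to check.

```python
class NodeMonitor:
    """Flags nodes whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of most recent heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return the sorted names of nodes silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A non-empty result from failed_nodes() would be the trigger for recovering
the affected jobs from their latest checkpoints.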


> - user job interface (->portals?)

<specify as interface -- enable users to trigger "events">

- we could provide an API into the system for user clients to feed signals
into the system.
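
A minimal sketch of what such an API might look like, in Python.  The
EventBus class, its subscribe/signal methods and the "checkpoint_now" event
name are all hypothetical, chosen only to show the shape of the idea.

```python
from collections import defaultdict


class EventBus:
    """Routes named user events to whatever handlers the CPR system registered."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event, handler):
        """Register a callable taking a job ID for the given event name."""
        self._handlers[event].append(handler)

    def signal(self, event, job_id):
        """Deliver an event for a job; returns each handler's result in order."""
        return [handler(job_id) for handler in self._handlers.get(event, [])]


# Example: the CPR system handles "checkpoint_now"; a user client triggers it.
bus = EventBus()
bus.subscribe("checkpoint_now", lambda job_id: f"checkpoint requested for {job_id}")
replies = bus.signal("checkpoint_now", "job42")
```

A portal or command-line client would sit on top of signal(); events with no
registered handler are simply ignored in this sketch.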