RE: checkpointing architecture
Hi Folks,
More comments on this thread... I have also tagged each item below as either
a "service" or an "interface". Specifically, I think we should begin to
distinguish clearly between services that are central to CPR functionality
(across the board) and features that are optional (exposed via interfaces).
Comments/questions invited.
Regards,
Nathan.
+-----------------------------+----------------------------------+
| Nathan T.B. Stone, Ph.D. | Pittsburgh Supercomputing Center |
| Advanced Systems Group | 4400 Fifth Avenue |
+-----------------------------+ Pittsburgh, PA 15213 |
| mailto:nstone@psc.edu | phone: 412-268-4367 |
| http://www.psc.edu/~nstone/ | fax: 412-268-5832 |
+-----------------------------+----------------------------------+
PGP public key at: http://www.psc.edu/~nstone/pubkey.txt
> -----Original Message-----
> From: owner-gridcpr-wg@gridforum.org
> [mailto:owner-gridcpr-wg@gridforum.org]On Behalf Of Tom Goodale
>
> One thing we could do is take each of the elements below, and try to come
> up with a few bullet-points as to why this needs to be part of the
> architecture, then we can perhaps decide which really belong or not, and
> see how things fit together. I've tried to do this with some of them...
>
> - AAA (accounting, charging etc.)
>
> - if the checkpoint recovery is because of system failure, need to
> refund wasted user hours
<specify as interface -- assume existing accounting infrastructure>
- Doug Thain's concern about user abuse is a valid point. But I think it is
feasible to rely upon a system's internal integrity to track when it can or
cannot recover a run from checkpoint (and thereby to know what was
"wasted"). If some institution creates a CPR infrastructure that is very
loose, then they ought to be willing to "eat" the cost of wasted hours, even
if users got something good from it.
> - scheduling of resources
<specify as interface -- assume existing scheduling infrastructure>
- There should certainly be at least a generic interface for feeding a job
back into the queue, so that it can be re-started automatically.
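As a rough sketch of what such a generic interface might look like (Python;
every name here, including `Scheduler`, `JobRecord` and `requeue`, is
hypothetical and not taken from any existing batch system):

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class JobRecord:
    """Hypothetical description of a job known to the CPR system."""
    job_id: str
    checkpoint_id: Optional[str] = None  # None means no usable checkpoint


class Scheduler(Protocol):
    """Assumed minimal view of an existing scheduler; not a real API."""

    def submit(self, job_id: str, restart_from: Optional[str]) -> str:
        ...


class InMemoryScheduler:
    """Toy stand-in for a site's batch system, for illustration only."""

    def __init__(self) -> None:
        self.queue = []  # list of (job_id, restart_from) pairs

    def submit(self, job_id: str, restart_from: Optional[str]) -> str:
        self.queue.append((job_id, restart_from))
        # Return a made-up queue ticket: job id plus queue position.
        return f"{job_id}.{len(self.queue)}"


def requeue(job: JobRecord, scheduler: Scheduler) -> str:
    """Feed a job back into the queue, restarting from its checkpoint if any."""
    return scheduler.submit(job.job_id, job.checkpoint_id)
```

The point of the `Protocol` is that the CPR side only needs a `submit` call;
whatever scheduling infrastructure already exists would sit behind it.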
> - resource brokers
<specify as service -- central to CPR functionality>
- This should include the resources required to store/manage checkpoint
state (e.g. disk space and perhaps DB resources).
* (If you get into job resources you're stepping out of the sandbox into
someone else's grass... ;-)
> - data management for checkpoint files
>
> - need to store them some place
> - if the checkpoint can be restored on another machine, perhaps
> should store it on some data server so it is available if the
> original machine does not come up
<specify as service -- central to CPR functionality>
- great idea!
- more basically, a checkpoint system should have some kind of dynamic
configuration for specifying where checkpoint files should go.
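One possible shape for that dynamic configuration, as a sketch only (Python;
the `CPR_CHECKPOINT_DIR` variable, the default root and the directory layout
are all assumptions made up for illustration):

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional

DEFAULT_ROOT = "/scratch/checkpoints"  # hypothetical site-wide default


def checkpoint_path(job_id: str, version: int,
                    job_override: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Resolve where one checkpoint of one job should be written.

    Precedence (an assumption): per-job override, then the
    CPR_CHECKPOINT_DIR environment variable, then the site default.
    """
    root = job_override or env.get("CPR_CHECKPOINT_DIR") or DEFAULT_ROOT
    return str(PurePosixPath(root) / job_id / f"ckpt-{version:04d}")
```

A per-job override could then point at a mounted data server, so the files
survive if the original machine does not come back up.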
> - checkpoint history (versioning for a single run)
>
> - support rollbacks to earlier times in the application run
<specify as service -- element of CPR state management>
- This is critical to internal integrity. It is also central to determining
whether recovery is possible and therefore if any time has been "wasted"
(see earlier discussion).
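To make the link between history and "wasted" time concrete, here is a
minimal sketch (Python; the `Checkpoint` and `CheckpointHistory` types and
their fields are hypothetical, and "wasted" is assumed to mean run time
since the last valid checkpoint):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    version: int
    run_seconds: float  # application run time covered when it was written
    valid: bool = True  # False if the file is known to be missing or corrupt


@dataclass
class CheckpointHistory:
    """Versioned checkpoints for a single run, oldest first."""
    entries: List[Checkpoint] = field(default_factory=list)

    def record(self, run_seconds: float, valid: bool = True) -> Checkpoint:
        cp = Checkpoint(len(self.entries) + 1, run_seconds, valid)
        self.entries.append(cp)
        return cp

    def latest_valid(self, at_or_before: Optional[int] = None) -> Optional[Checkpoint]:
        """Newest usable checkpoint, optionally rolled back to a version."""
        for cp in reversed(self.entries):
            if cp.valid and (at_or_before is None or cp.version <= at_or_before):
                return cp
        return None  # recovery is not possible

    def wasted_seconds(self, failed_at: float) -> float:
        """Run time lost if the job failed after `failed_at` seconds."""
        cp = self.latest_valid()
        recovered = cp.run_seconds if cp else 0.0
        return max(0.0, failed_at - recovered)
```

With no valid checkpoint at all, the whole run counts as wasted, which is
the case where a refund would be easiest to justify.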
> - other meta data (which?)
<specify as service -- element of CPR state management>
- user/group membership, for charging/accounting info
>
> - application status monitoring
<specify as service -- central to CPR system functionality>
- ...or monitoring in general. A complete CPR system should have some kind
of hardware (compute node) monitoring, to identify when nodes fail.
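One simple way to identify failed nodes is a heartbeat timeout. The sketch
below (Python) is only an illustration of that idea; the `HeartbeatMonitor`
class and the choice of heartbeats at all are assumptions, not something
the group has agreed on:

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Flags compute nodes that have not reported within `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
```

Time is passed in explicitly so the logic can be checked without a clock;
a real monitor would presumably feed in wall-clock timestamps.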
> - user job interface (->portals?)
<specify as interface -- enable users to trigger "events">
- we could provide an API into the system for user clients to feed signals
into the system.
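A minimal sketch of such an event API (Python; `EventBus`, the event names
and the handler signature are all hypothetical, chosen only to show the
shape of the interface):

```python
from typing import Callable, Dict, List

Handler = Callable[[str], str]  # takes a job id, returns a status message


class EventBus:
    """Lets user clients trigger named events inside the CPR system."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def on(self, event: str, handler: Handler) -> None:
        """Register a handler for an event name."""
        self._handlers.setdefault(event, []).append(handler)

    def trigger(self, event: str, job_id: str) -> List[str]:
        """Run every handler for `event`; unknown events are rejected."""
        if event not in self._handlers:
            raise KeyError(f"unknown event: {event}")
        return [h(job_id) for h in self._handlers[event]]


bus = EventBus()
bus.on("checkpoint", lambda job_id: f"checkpoint requested for {job_id}")
```

A portal or command-line client would then only need to call `trigger`
with an event name and a job id.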