
Re: Workflow



I apologize if you received this message twice.
================================================================

Thanks Hugh! This is the first white paper coming from the GCE work
group. Looks like we are making progress. And it is my understanding
that this draft is based on an actual implementation, which is very
important. To keep the ball rolling, let me offer a few comments.

1. Do we really want to use the term "workflow"? During our meeting in
  San Diego I got the impression that the majority was against it, as
  this term is being used in a somewhat different sense outside our
  community. Within my framework (Mississippi Computational Web
  Portal - MCWP) I use the term "complex task". Even you use this term
  in your abstract: " (...) a standard for the sequencing of complex
  high-performance computational tasks within a Grid".

2. Is "sequencing" not too restrictive? I am hoping for a standard
  that describes a computational graph (as AVS or other visualization
  packages often implement). Let's assume that the complex task is
  composed of "modules" or "atomic tasks". It seems to me that
  sequencing means processing one module after another. Can't we
  generalize it to a more complex graph, so that the results of one
  module can be fed to several modules running concurrently, or one
  module can be fed with data coming from two or more modules (or
  depend on them in any other way)? Actually you admit this
  problem in section 6.


3. It seems to me you are suggesting an enumeration of "atomic tasks":
  computation, resource query, data transfer, ..., etc. Again, is it
  not too restrictive? Actually I have several problems with that.

  i. Can we hope to get a complete list of atomic tasks? Isn't it an
  invitation for nonstandard extensions?

  ii. Is it the purpose of this document to define terms such as
  "computation" or "data transfer"? I think we should focus on
  describing how to compose a complex task from constituents, and not
  on describing how to process the constituents.

  iii. What about the capability of hierarchical composition of complex
  tasks? Say we want to build a task from applications A, B, and
  C. Now, each of them is performed in steps: identify resource, stage,
  compile, preprocess, run, etc. Hiding complexity in this case
  would mean building the final task descriptor from task descriptors
  for each application (A, B, C), while these are built from atomic
  tasks such as compile, run, transfer data. An additional advantage of
  such an approach is that the task descriptor for, say, application
  B can be created by a domain specialist and then "published"
  so it can be reused by less savvy users. It is my
  experience from working with Climate, Weather and Ocean modeling that
  setting up a model to run involves far too many details for a
  physicist/oceanographer/meteorologist to fully comprehend. Defining
  a task descriptor for a particular model helps dramatically.

  What would be the remedy? Instead of having an enumeration of tasks,
  let us introduce generic terms: "atomic task" and "complex task".
  The atomic task contains a reference to the application descriptor
  (a GCE-WG white paper in preparation), and the complex task
  contains references to its constituents (atomic tasks).

  A simple example (just to illustrate the idea, not to suggest syntax):
 
  atomic task:
  <task>
   <taskName name="aTask" descriptor="aDescriptor.xml" />
  </task> 

  complex task:
  <task>
   <taskName name="complexTask" />
   <task>
     <taskName name="aTask1" descriptor="aDescriptor1.xml" />
   </task> 
   <task>
     <taskName name="aTask2" descriptor="aDescriptor2.xml" />
   </task> 
  </task> 

  Note the hierarchical/recursive definition of a task: the <task> tag
  describes both a simple task and a complex task; the difference is
  in the tag attribute. Again, take it as a concept and not as a
  suggested syntax.

  What is missing here is the relationship between the tasks. In my
  work I use the concept of a port (in a sense similar to what is in
  AVS). Each module (or task) defines input ports and output
  ports. One composes a complex task by associating output ports with
  input ports. New information on an input port triggers processing
  of the module (can be .AND. or .OR., if more than one port), in the
  classical dataflow way. In the implementation I made a couple of
  years ago, an output port was an event fired by the module, and an
  input port was the method to be invoked (a particular event
  listener). Admittedly, this is very implementation specific, but I
  am sure that we can work out a more general model along these
  lines.

  This OO approach does not preclude working with legacy
  codes. First of all, my middle tier operates on application proxies,
  and these are Java objects. Then, in the simplest case, a dusty
  Fortran deck is represented by an object that has a method run
  (input port) and fires the event "done" (output port). If you bother
  to check return codes, you can easily fire the event "failure"
  when that is the case. And you can do much more in such a paradigm.
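
The proxy idea above can be sketched in a few lines. This is only an
illustration, written in Python rather than the Java of the original
middle tier; the class name TaskProxy, its connect/fire methods, and
the use of a plain callable in place of the legacy executable are all
assumptions made for this sketch.

```python
from collections import defaultdict


class TaskProxy:
    """Hypothetical proxy: methods act as input ports, fired events as output ports."""

    def __init__(self, name, runner):
        self.name = name
        # Stand-in for launching the legacy code; must return an exit code (assumption).
        self.runner = runner
        self.listeners = defaultdict(list)  # event name -> list of callbacks

    def connect(self, event, callback):
        """Associate an output port (event) with an input port (a callable)."""
        self.listeners[event].append(callback)

    def fire(self, event):
        """Output port: notify every listener registered for this event."""
        for callback in self.listeners[event]:
            callback()

    def run(self):
        """Input port: run the wrapped code, then fire 'done' or 'failure'."""
        return_code = self.runner()
        self.fire("done" if return_code == 0 else "failure")


log = []
first = TaskProxy("first", lambda: 0)    # pretends to exit with code 0
second = TaskProxy("second", lambda: 3)  # pretends to exit with code 3

first.connect("done", lambda: log.append("first:done"))
first.connect("done", second.run)        # output port -> input port
second.connect("failure", lambda: log.append("second:failure"))

first.run()
print(log)
```

Note that only events travel between the proxies; no data is passed
from one module to the next.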

  It is important to note that, unlike in AVS, my ports do not
  represent data. I am not sending data from module to
  module. Instead I am sending events. You may envision a data
  transfer module for moving the output of one module to another. But
  wait, there is more. I envision that the complex task descriptor is
  passed to a metascheduler capable of optimizing the
  task. As a byproduct, the metascheduler can automatically determine
  the need for a file transfer, so the file transfer module is not
  needed at all!

To summarize, I would not define the enumeration of atomic tasks, and
instead define a generic task that can be defined recursively. In
addition, instead of defining properties of the atomic tasks (such as
attributes of "Computation"), I would use references to application
descriptors. I feel very strongly that these belong in another
GCE-WG white paper. Finally, I would recommend that we look at
defining relationships between tasks within a complex task.

I would appreciate your comments on my thoughts about the complex task
descriptor. To be a little more constructive, I am appending the task
descriptors that I used in my work a couple of years ago, and intend to
use in the near future while developing MCWP. Again, do not look at it
as a mature draft of the standard, but rather as a bunch of ideas to be
considered.

Tomasz
 
 
 


ATD: Abstract Task Descriptor

Version 1.01, January 2000. Author: T. Haupt, Syracuse University

Introduction

A computational task requested by the user may involve many steps. Some steps can be performed concurrently, but typically there are data dependencies that force execution of the steps in some particular order. In many cases, it might be convenient to divide a particular step into smaller, "atomic" operations.

The task descriptor is abstract in the sense that it may not describe all resources needed for completion of the task. The final resolution and actual resource allocation are left to the discretion of middle-tier services.

Atomic task

An atomic task is described by its

  • name
  • descriptor (Application Descriptor, AD),
  • input and output ports

The name of the task must be unique within the document. The AD is a separate XML document that provides all necessary information on how to install and run the application, as well as on its input and output files. The task is submitted by invoking a method of the middle-tier proxy module. A method that can be used for submitting the task is called an input port. Each task must define at least one input port. Upon completing the task (successfully or not), the middle-tier proxy fires an event. Each event type signaling the end of task processing is called an output port. Each task must define at least one output port.

Example of an atomic task:
<Task>
     <TaskName name="task1" descriptor="task1.xml" />
     <InputPort method="run" />
     <OutputPort event="done" />
</Task>

Building complex tasks from atomic tasks

Atomic tasks can be grouped together to form a complex task. Optionally, their dependency can be defined by creating a computational graph. The graph is constructed by connecting output and input ports of atomic tasks. This means that an event fired by one task (output port) will cause invocation of a method of the other task (input port).

As in the case of the atomic task, a complex task must define at least one input port and one output port. However, instead of specifying events and methods, input and output ports of the constituent tasks are used to define the ports of the complex task.

Example of a complex task built from atomic tasks:

<Task>
    <TaskName name="ComplexTask" />
    <Task>
         <TaskName name="atomic_task1" descriptor="task1.xml" />
         <InputPort method="run" />
         <OutputPort event="done" />
    </Task>

    <Task>
         <TaskName name="atomic_task2" descriptor="task2.xml" />
         <InputPort method="run" />
         <OutputPort event="done" />
    </Task>

    <connection>
        <output task="atomic_task1" />
        <input task="atomic_task2" />
    </connection>
    <InputPort task="atomic_task1" />
    <OutputPort task="atomic_task2" />
</Task>

In this example, event "done" fired by atomic_task1 will result in invoking method run of atomic_task2. The complex task can be submitted by invoking the input port of atomic_task1 (because of <InputPort task="atomic_task1" />), and event done of atomic_task2 will signal completion of the complex task.
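
As a rough illustration of how a middle-tier service might read such a descriptor, the sketch below parses a two-task example with Python's standard xml.etree.ElementTree module and extracts the connections and the ports of the complex task. The function name read_connections and the tuple layout it returns are assumptions of this sketch, not part of ATD.

```python
import xml.etree.ElementTree as ET

# A self-contained copy of a two-task complex task descriptor.
ATD = """
<Task>
    <TaskName name="ComplexTask" />
    <Task>
        <TaskName name="atomic_task1" descriptor="task1.xml" />
        <InputPort method="run" />
        <OutputPort event="done" />
    </Task>
    <Task>
        <TaskName name="atomic_task2" descriptor="task2.xml" />
        <InputPort method="run" />
        <OutputPort event="done" />
    </Task>
    <connection>
        <output task="atomic_task1" />
        <input task="atomic_task2" />
    </connection>
    <InputPort task="atomic_task1" />
    <OutputPort task="atomic_task2" />
</Task>
"""


def read_connections(task):
    """Return one (source task names, target task names) pair per <connection>."""
    edges = []
    for conn in task.findall("connection"):  # direct children only
        sources = [o.get("task") for o in conn.findall("output")]
        targets = [i.get("task") for i in conn.findall("input")]
        edges.append((sources, targets))
    return edges


root = ET.fromstring(ATD)
edges = read_connections(root)
entry_tasks = [p.get("task") for p in root.findall("InputPort")]
exit_tasks = [p.get("task") for p in root.findall("OutputPort")]
print(edges, entry_tasks, exit_tasks)
```

Because findall with a bare tag name matches direct children only, the ports of the constituent tasks are not confused with the ports of the complex task itself.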

Hierarchy of tasks

Complex tasks can be grouped and connected to build an arbitrarily deep hierarchy of tasks.

<Task>
    <TaskName name="example_task" />

    <Task>
        <TaskName name="A" />
        <Task>
            <TaskName name="A1" descriptor="A1.xml" />
            <InputPort method="run" />
            <OutputPort event="done" />
        </Task>
        <Task>
            <TaskName name="A2" descriptor="A2.xml" />
            <InputPort method="run" />
            <OutputPort event="done" />
        </Task>
        <connection>
            <output task="A1" />
            <input task="A2" />
        </connection>
        <InputPort task="A1" />
        <OutputPort task="A2" />
    </Task>

    <Task>
        <TaskName name="B" descriptor="B.xml" />
        <InputPort method="run" />
        <OutputPort event="done" />
    </Task>

    <Task>
        <TaskName name="C" descriptor="C.xml" />
        <InputPort method="run" />
        <OutputPort event="done" />
    </Task>

    <Task>
        <TaskName name="D" />
        <Task>
            <TaskName name="D1" descriptor="D1.xml" />
            <InputPort method="run" />
            <OutputPort event="done" />
        </Task>
        <Task>
            <TaskName name="D2" descriptor="D2.xml" />
            <InputPort method="run" />
            <OutputPort event="done" />
        </Task>
        <connection>
            <output task="D1" />
            <input task="D2" />
        </connection>
        <InputPort task="D1" />
        <OutputPort task="D2" />
    </Task>

    <connection>
        <output task="A" />
        <output task="B" />
        <input task="C" />
    </connection>

    <connection>
        <output task="C" />
        <input task="D" />
    </connection>

    <InputPort task="A" />
    <InputPort task="B" />
    <OutputPort task="D" />
</Task>
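
Since <Task> elements nest, a consumer of the descriptor would typically walk the hierarchy recursively. The sketch below lists every atomic task together with its position in the hierarchy. It assumes that a task is atomic exactly when its TaskName carries a descriptor attribute; the function name atomic_tasks and the slash-separated path notation are inventions of this sketch.

```python
import xml.etree.ElementTree as ET

# Shortened hierarchy: complex task A (A1 -> A2) next to atomic task B.
HIERARCHY = """
<Task>
    <TaskName name="example_task" />
    <Task>
        <TaskName name="A" />
        <Task>
            <TaskName name="A1" descriptor="A1.xml" />
            <InputPort method="run" />
            <OutputPort event="done" />
        </Task>
        <Task>
            <TaskName name="A2" descriptor="A2.xml" />
            <InputPort method="run" />
            <OutputPort event="done" />
        </Task>
        <connection>
            <output task="A1" />
            <input task="A2" />
        </connection>
        <InputPort task="A1" />
        <OutputPort task="A2" />
    </Task>
    <Task>
        <TaskName name="B" descriptor="B.xml" />
        <InputPort method="run" />
        <OutputPort event="done" />
    </Task>
    <InputPort task="A" />
    <OutputPort task="B" />
</Task>
"""


def atomic_tasks(task, path=()):
    """Yield 'outer/inner/...' paths of all tasks that reference a descriptor."""
    name_element = task.find("TaskName")
    here = path + (name_element.get("name"),)
    if name_element.get("descriptor") is not None:
        yield "/".join(here)  # atomic: it points at an Application Descriptor
    for child in task.findall("Task"):  # recurse into nested tasks
        yield from atomic_tasks(child, here)


paths = list(atomic_tasks(ET.fromstring(HIERARCHY)))
print(paths)
```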

More on connecting tasks

The examples of complex tasks above follow a simple dataflow paradigm. Actually, the model presented here is more general. A proxy module representing an atomic task can define more than one input and output port. For example, the module can fire two types of events: one signaling successful completion of the task, the other signaling failure. Hence, a different action (a different task or a different method of the same task) can be defined depending on the outcome of processing the task. Note that submission of a task is more than submitting a job. It may involve host selection, file transfers, database access, mass storage access, compilation, setting environment variables, generating batch scripts, generating Globus RSL strings, and more. Different methods of the proxy module may implement different procedures for preparing a job for submission and/or postprocessing; none of these requires any modification of the code to be run at the back end.

Connecting modules by matching events and methods also allows for constructing loops: completion of one task results in submission of the other until some stopping criteria are satisfied (say, all input files are processed). If the back-end code is capable of setting flags at runtime, the flags can be used for generating custom events, which in turn can be used for asynchronous communication, or even message passing, between concurrent tasks (latency permitting).

At this time, we have no mechanism for specifying high-performance connections between tasks representing tightly coupled codes.

If a task defines more than one input port, there is a potential ambiguity as to when to submit the task: when at least one event is fired, or only when all of them are. We follow the convention that all events listed within a single <connection> tag are AND related, while events listed in different <connection> tags are OR related.

Examples:

<connection>
    <output task="a" />
    <output task="b" />
    <input task="c" />
</connection>


In the above example, both tasks a AND b must complete to trigger task c.

<connection>
    <output task="a" />
    <input task="c" />
</connection>
<connection>
    <output task="b" />
    <input task="c" />
</connection>

Here, by contrast, completion of either a OR b results in submitting task c.
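
The AND/OR convention can be stated compactly in code. In the sketch below each <connection> is represented as a pair (source tasks, target tasks), and a task is triggered when at least one connection that targets it has all of its sources completed. The function name is_triggered and this in-memory representation are assumptions of the sketch.

```python
def is_triggered(connections, target, completed):
    """Return True if `target` should be submitted.

    connections: list of (sources, targets) pairs, one per <connection>.
    completed:   set of names of tasks that have fired their event.
    AND within a connection (all sources), OR across connections (any).
    """
    return any(
        target in targets and all(source in completed for source in sources)
        for sources, targets in connections
    )


# One <connection> with two <output> tags: a AND b.
and_graph = [(["a", "b"], ["c"])]
# Two <connection> tags with one <output> each: a OR b.
or_graph = [(["a"], ["c"]), (["b"], ["c"])]

print(is_triggered(and_graph, "c", {"a"}))       # only a has completed
print(is_triggered(and_graph, "c", {"a", "b"}))  # both have completed
print(is_triggered(or_graph, "c", {"a"}))        # a alone is enough here
```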

Since each task may define more than one input and output port, the <input> and <output> tags within a <connection> tag have optional attributes, method and event respectively, to resolve possible ambiguities, as shown in the example below:

Example (multiple ports):

<Task>
    <TaskName name="ComplexTask" />
    <Task>
         <TaskName name="task1" descriptor="task1.xml" />
         <InputPort method="run" />
         <OutputPort event="done" />
         <OutputPort event="failure" />
    </Task>

    <Task>
         <TaskName name="task2" descriptor="task2.xml" />
         <InputPort method="run" />
         <OutputPort event="done" />
    </Task>
    <Task>
         <TaskName name="task3" descriptor="task3.xml" />
         <InputPort method="run" />
         <InputPort method="restart" />
         <OutputPort event="done" />
    </Task>

    <connection>
        <output task="task1" event="done" />
        <input task="task2" />
    </connection>
    <connection>
        <output task="task1" event="failure" />
        <input task="task3" method="restart" />
    </connection>
    <InputPort task="task1" />
    <OutputPort task="task2" />
</Task>

In this example, if task1 fires event "done" then method "run" of task2 is invoked. If it fires event "failure" instead, method "restart" of task3 is invoked.
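
One possible way to resolve the optional event and method attributes is sketched below: it builds a routing table from (task, event) to the list of (task, method) pairs to invoke. ATD does not say what happens when an attribute is omitted; the fallback to the first declared port of the task is purely an assumption of this sketch, as is the function name routes. The sketch handles atomic constituent tasks only and does not model the AND relationship between several <output> tags of one connection.

```python
import xml.etree.ElementTree as ET

MULTI_PORT = """
<Task>
    <TaskName name="ComplexTask" />
    <Task>
        <TaskName name="task1" descriptor="task1.xml" />
        <InputPort method="run" />
        <OutputPort event="done" />
        <OutputPort event="failure" />
    </Task>
    <Task>
        <TaskName name="task2" descriptor="task2.xml" />
        <InputPort method="run" />
        <OutputPort event="done" />
    </Task>
    <Task>
        <TaskName name="task3" descriptor="task3.xml" />
        <InputPort method="run" />
        <InputPort method="restart" />
        <OutputPort event="done" />
    </Task>
    <connection>
        <output task="task1" event="done" />
        <input task="task2" />
    </connection>
    <connection>
        <output task="task1" event="failure" />
        <input task="task3" method="restart" />
    </connection>
    <InputPort task="task1" />
    <OutputPort task="task2" />
</Task>
"""


def routes(task):
    """Map (source task, event) to a list of (target task, method)."""
    methods, events = {}, {}
    for child in task.findall("Task"):
        name = child.find("TaskName").get("name")
        methods[name] = [p.get("method") for p in child.findall("InputPort")]
        events[name] = [p.get("event") for p in child.findall("OutputPort")]

    table = {}
    for conn in task.findall("connection"):
        for out in conn.findall("output"):
            source = out.get("task")
            # Assumption: an omitted event means the first declared output port.
            event = out.get("event") or events[source][0]
            for inp in conn.findall("input"):
                target = inp.get("task")
                # Assumption: an omitted method means the first declared input port.
                method = inp.get("method") or methods[target][0]
                table.setdefault((source, event), []).append((target, method))
    return table


table = routes(ET.fromstring(MULTI_PORT))
print(table)
```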

ATD.dtd

<!ELEMENT Task (TaskName, (Task|connection)*, InputPort+, OutputPort+)>
<!ELEMENT TaskName EMPTY>
<!ATTLIST TaskName
          name CDATA #REQUIRED
         descriptor CDATA #IMPLIED>
<!ELEMENT connection (output+,input+)>
<!ELEMENT output EMPTY>
<!ATTLIST output
task CDATA #REQUIRED
event CDATA #IMPLIED>
<!ELEMENT input EMPTY>
<!ATTLIST input
task CDATA #REQUIRED
method CDATA #IMPLIED>
<!ELEMENT InputPort EMPTY>
<!ATTLIST InputPort
          task CDATA #IMPLIED
          method CDATA #IMPLIED>
<!ELEMENT OutputPort EMPTY>
<!ATTLIST OutputPort
          task CDATA #IMPLIED
          event CDATA #IMPLIED>