The Internet (and I suspect the Grid) works by having very little state in the middle.
This gives a scalable and fault-tolerant design -- and is a VERY simple design pattern.
All the state is in the client or the database (beans get persistence by saving state to a DBMS or file system).
It is sometimes a pain to design for this "queue-oriented" "loosely coupled" world
-- but the resulting designs are generally more scalable than connection-oriented schemes.
HTTP and SMTP follow this model: they are stateless.
This is the emissary-fiefdom model.
All state in the "middle" is soft.
Anyway, the .NET folks observe that a DataSet is an answer to a question.
It is self-consistent (transactional) but transactions do not span questions.
If the client wants to persist the dataset, fine!
But, it is emissary state (a stale copy of the fiefdom's data).
They provide diffgrams (optimistic concurrency control) if the client wants to update sets or make inserts or deletes.
But, being mostly web-centric, they encourage you to use web methods to update objects (transfer funds).
This should be a familiar song to the EJB folks.
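
The diffgram idea above (optimistic concurrency control) can be sketched roughly as follows. This is not the .NET DataSet API; the names Fiefdom, Diff, query and apply_diff are hypothetical and chosen only for illustration. The client keeps a possibly stale copy, sends back the value it read together with the value it wants, and the service applies the change only if the stored value still matches what the client read. No connection or lock is held between the question and the update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diff:
    """One proposed change: the value the client read and the value it wants."""
    key: str
    original: object
    proposed: object


class Fiefdom:
    """Hypothetical service that owns the authoritative data."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def query(self, keys):
        # The answer to a question: a consistent snapshot the client may keep.
        return {k: self._rows[k] for k in keys}

    def apply_diff(self, diffs):
        # Check every original value first so the update is all-or-nothing.
        for d in diffs:
            if self._rows.get(d.key) != d.original:
                return False  # someone else changed the row; client must re-read
        for d in diffs:
            self._rows[d.key] = d.proposed
        return True


service = Fiefdom({"acct-1": 100, "acct-2": 50})
snapshot = service.query(["acct-1"])  # emissary state (may go stale)
ok_first = service.apply_diff([Diff("acct-1", snapshot["acct-1"], 80)])
ok_stale = service.apply_diff([Diff("acct-1", snapshot["acct-1"], 60)])
```

The second update is rejected because the snapshot no longer matches the stored value, so the client has to ask the question again before retrying.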
Transactions are fine within a service,
but operations that span multiple autonomous services are probably best done with WS-Transaction sagas with compensation
driving queue-oriented services.
That is the mindset that motivates the design of the .NET dataset.
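
As a rough illustration of "sagas with compensation", here is a minimal in-process sketch. It shows only the compensation pattern, not the WS-Transaction protocol itself, and the function names (run_saga, reserve_seat, charge_card, and so on) are hypothetical. In a real queue-oriented system each step would be a message to an autonomous service rather than a local call.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, in reverse order, and report failure.
    """
    done = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):
                undo()
            return False
        done.append(compensation)
    return True


log = []


def reserve_seat():
    log.append("reserve seat")


def release_seat():
    log.append("release seat")


def charge_card():
    raise RuntimeError("payment service unavailable")


def refund_card():
    log.append("refund card")


succeeded = run_saga([(reserve_seat, release_seat), (charge_card, refund_card)])
```

Here the charge fails, so only the seat reservation is compensated; the refund is never run because the charge never completed.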
This is one place where the ODBC model (and the file open/read/write model) is VERY different than the internet model.
ODBC tightly couples client and server.
You can look at WebDAV to see how the internet-folks do file access (similar ideas are in CIFS/SAMBA operations packages).
There are sometimes leases (think shopping carts), but no real connections.
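
A lease of the shopping-cart kind can be sketched as soft state that expires unless the client renews it. The class LeaseTable and its methods are hypothetical, and time is passed in explicitly (the now argument) to keep the example deterministic; a real service would read a clock.

```python
class LeaseTable:
    """Soft state: entries vanish unless the client renews them."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._expires = {}  # lease id -> expiry time
        self._data = {}     # lease id -> payload (e.g. a shopping cart)

    def put(self, lease_id, payload, now):
        self._data[lease_id] = payload
        self._expires[lease_id] = now + self._ttl

    def get(self, lease_id, now):
        # Expired leases are dropped lazily; the server never waits on a client.
        if lease_id in self._expires and self._expires[lease_id] <= now:
            del self._expires[lease_id]
            del self._data[lease_id]
        return self._data.get(lease_id)

    def renew(self, lease_id, now):
        if self.get(lease_id, now) is None:
            return False
        self._expires[lease_id] = now + self._ttl
        return True


carts = LeaseTable(ttl_seconds=30)
carts.put("cart-7", ["book"], now=0)
still_there = carts.get("cart-7", now=10)
renewed = carts.renew("cart-7", now=20)  # new expiry is 50
after_renew = carts.get("cart-7", now=45)
expired = carts.get("cart-7", now=60)
```

If the client disappears, the cart simply times out; nothing in the middle has to detect a broken connection.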
You may be skeptical that large or complex apps can be built in this way.
If so, I encourage you to look at IBM's IMS (it is a queue-driven system), or look at .NET apps like SkyQuery.
Scalable designs require loose coupling and only soft state in the middle.
Much of the brittleness of some current middleware stems from forgetting this hard-learned lesson.
-----Original Message-----
From: Simon Laws [mailto:simon_laws@uk.ibm.com]
Sent: Wednesday, July 02, 2003 12:34 PM
To: Jim Gray
Cc: dais-wg@gridforum.org; Jim Gray; szalay@pha.jhu.edu; Tamas Budavari; Maria A. Nieto-Santisteban
Subject: Re: DAIS Data Sets & Transformations
Hi Jim
I'm becoming attached to "data set", so I agree, but we have, to date, combined data identification with data access into data set, and this is causing confusion.
So, when we considered data set originally it was aligned with the "part of the file you asked for" idea that you suggest below. However, many people see it simply as a mechanism for accessing data represented as a data resource, i.e. there is no implication of caching in a data set. In fact we must provide both of these functions, but we haven't yet achieved a consensus on what component is responsible for what.
Malcolm generated a list of orthogonal properties that could be applied to a data set as it stands, for example, data type, materialization policy (on-demand, eager, ...), security policy, delivery policy (synchronous, asynchronous), unit of access (all at once, iteration), lifetime (use once, use many), etc. This gives us a challenge as to how our interfaces should be factored out. For those services that represent data it is natural to build a hierarchy of interfaces based on data type. So, if data set represents data, you can imagine the hierarchy you suggest:
Data Set
    file
    xml doc
    ODBC rowset
    .NET style dataset
    cube
    HDF
    FITS
    VOtable
    CSV
If data set were simply representing access to data resources a different hierarchy of operations and properties emerges, for example,
Data Set
    Materialization
        On demand
        Eager
        Parallel
    AccessMode
        Pull
        PushToOne
        PushToSubscribers
    AccessModel
        Full
        Incremental
    Unit of access
    Lifetime
        UseOnce
        UseMany
    Etc...
These are all fairly general except for "unit of access" which is probably related to the type of the data. The next job is to come up with a proposal for how these are positioned in the DAIS model in the context of the debate around what a data set really is. I don't know the answer to this but it feels like the way that we type data and the way that we access it should be separated as is the case, for example, with file descriptors and read/write operations in Unix and File and associated streams in Java.
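
One way to picture the separation of data typing from data access is the sketch below. It is only an illustration and not part of the DAIS specification; the class names AccessPolicy and DataSetDescription are hypothetical, and the enum values simply mirror the access hierarchy listed above. "Unit of access" is left out because, as noted, it probably depends on the data type.

```python
from dataclasses import dataclass
from enum import Enum


class Materialization(Enum):
    ON_DEMAND = "on demand"
    EAGER = "eager"
    PARALLEL = "parallel"


class AccessMode(Enum):
    PULL = "pull"
    PUSH_TO_ONE = "push to one"
    PUSH_TO_SUBSCRIBERS = "push to subscribers"


class AccessModel(Enum):
    FULL = "full"
    INCREMENTAL = "incremental"


class Lifetime(Enum):
    USE_ONCE = "use once"
    USE_MANY = "use many"


@dataclass(frozen=True)
class AccessPolicy:
    """How the data is reached; says nothing about what the data is."""
    materialization: Materialization
    access_mode: AccessMode
    access_model: AccessModel
    lifetime: Lifetime


@dataclass(frozen=True)
class DataSetDescription:
    """Pairs a data type (e.g. "VOtable", "CSV") with an independent policy."""
    data_type: str
    policy: AccessPolicy


streaming = AccessPolicy(Materialization.ON_DEMAND, AccessMode.PULL,
                         AccessModel.INCREMENTAL, Lifetime.USE_ONCE)
votable = DataSetDescription("VOtable", streaming)
csv = DataSetDescription("CSV", streaming)  # same policy, different type
```

The same access policy is reused for two different data types, which is the kind of factoring the file descriptor and read/write analogy suggests.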
On the transformation point I agree that transformation can be considered to be something that falls outside of the drm, dr, das, ds structure. For example, you can apply a transformation to a data set and obtain a new data set but the transformation itself does not have to be defined by DAIS. I do think that users of this technology will want to specialize the components to present clients with tailored interfaces, for example, specialized query languages or results in consistent formats across data resources, without having to chain many grid services together to achieve the effect.
Regards
Simon
Simon Laws
IBM Hursley Services and Technology
"Jim Gray" <gray@microsoft.com> on 07/02/2003 09:34:20 AM
To: Simon Laws/UK/IBM@IBMGB, <dais-wg@gridforum.org>
cc: "Jim Gray" <gray@microsoft.com>, <szalay@pha.jhu.edu>, "Tamas
Budavari" <budavari@pha.jhu.edu>, "Maria A. Nieto-Santisteban"
<nieto@skysrv.pha.jhu.edu>
Subject: DAIS Data Sets & Transformations
"DataSet" seems like a perfectly fine name.
It will need many sub-classes (file, xml doc, ODBC rowset, .NET style dataset, cube, HDF, FITS, VOtable, CSV,....) as things progress.
For example:
DataManager == FileServer
DataResource == File
DataActivity == FileHandle
DataSet == the part of the file you asked for.
Many other sub-classes (for these 4 data super-classes) will be defined as different groups define their world inside this framework.
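
The file analogy above can be made concrete with a small sketch. It is an illustration only: the method names (resource, open, read) are hypothetical, the four DAIS super-classes are not defined here, and the file name sky.dat is made up for the example.

```python
import os
import tempfile


class FileDataSet:
    """DataSet: the part of the file you asked for."""

    def __init__(self, content):
        self.content = content


class FileHandle:
    """DataActivity: carries out a request against one resource."""

    def __init__(self, path):
        self._path = path

    def read(self, offset, length):
        with open(self._path, "rb") as f:
            f.seek(offset)
            return FileDataSet(f.read(length))


class File:
    """DataResource: the stored data itself."""

    def __init__(self, path):
        self.path = path

    def open(self):
        return FileHandle(self.path)


class FileServer:
    """DataManager: knows which resources exist."""

    def __init__(self, root):
        self._root = root

    def resource(self, name):
        return File(os.path.join(self._root, name))


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sky.dat"), "wb") as f:
        f.write(b"0123456789")
    data_set = FileServer(root).resource("sky.dat").open().read(offset=2, length=4)
```

The resulting data set holds only the four requested bytes and remains usable after the underlying file is gone, which matches the "answer to a question" view of a data set.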
I am particularly interested in the ".NET dataset equivalent" sub-class since that is what we are using in the Virtual Observatory, and that is what is needed by portals that want relational metadata in a single response package (tell me all your tables, columns, indices, ...).
Adding transformations beyond the (odbc-speak) commands presented to the data activity seems like overkill.
The commands have lots of transformations already; adding more is orthogonal to the data access issues.
-----Original Message-----
From: owner-dais-wg@gridforum.org [mailto:owner-dais-wg@gridforum.org] On Behalf Of Simon Laws
Sent: Monday, June 30, 2003 8:50 AM
To: dais-wg@gridforum.org
Subject: DAIS GGF8 Session 3 and Data Sets
Thank you to those who attended and contributed to the DAIS sessions at GGF8. This is just a short note on the conversation during Session 3 where we debated the role of Data Set. This is not the official minute but I wanted to put my recollection (and the data I captured on slides during the meeting) out there.
In the specification to date we have set out a position where the "data set" artifact describes data that is logically disconnected from a data resource. A data set can be produced by a data activity session, moved, copied, transformed and then consumed by another data activity session so updating a data resource.
In DAIS session 3 at GGF8 we discussed this and started debating data set as an interface, i.e. the data resource is a data container, the data activity session is a transformation, and the data sets are handles to input and output interfaces. Data set becomes a mechanism for packaging standard data access techniques. If there is a requirement to represent a collection of physical data then this is the job of the data resource.
Steve Tuecke gave an example where a client wishing to use GridFTP to move data can ask for a GridFTP compatible ds to be constructed to provide access to the data to be moved. This differs from the current position where the data set would BE the data to be moved rather than just an interface to it.
We started capturing likely properties and possible alternative names for a data set.
Properties Of a Data Set:
- Control / Properties
    - Use exclusive
    - Isolation levels
- Lifetime
- Open OR Use OR Connect OR Bind
- Close
- Get next item
- Type
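
Read as an interface, the properties listed above might look something like the following sketch. The names DataSetInterface, ListDataSet, next_item and use_many are hypothetical and not taken from the DAIS specification; the control properties (exclusive use, isolation levels) are omitted to keep the example short.

```python
from abc import ABC, abstractmethod


class DataSetInterface(ABC):
    """Hypothetical interface built from the properties listed above."""

    data_type: str  # Type
    use_many: bool  # Lifetime: use once vs use many

    @abstractmethod
    def open(self): ...

    @abstractmethod
    def next_item(self): ...

    @abstractmethod
    def close(self): ...


class ListDataSet(DataSetInterface):
    """Toy implementation over an in-memory list of rows."""

    def __init__(self, rows, data_type="rowset", use_many=False):
        self._rows = list(rows)
        self.data_type = data_type
        self.use_many = use_many
        self._cursor = None  # None means "not open"
        self._opened_before = False

    def open(self):
        if self._opened_before and not self.use_many:
            raise RuntimeError("use-once data set cannot be reopened")
        self._opened_before = True
        self._cursor = 0

    def next_item(self):
        if self._cursor is None:
            raise RuntimeError("data set is not open")
        if self._cursor >= len(self._rows):
            return None  # end of data
        item = self._rows[self._cursor]
        self._cursor += 1
        return item

    def close(self):
        self._cursor = None


ds = ListDataSet(["row-a", "row-b"])
ds.open()
items = [ds.next_item(), ds.next_item(), ds.next_item()]
ds.close()
try:
    ds.open()
    reopened = True
except RuntimeError:
    reopened = False
```

With the default use-once lifetime the second open is refused; passing use_many=True would allow the data set to be read again from the start.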
Possible Names for a Data Set:
- Physical
    - Data resource
- Logical
    - Data service
- Data interface
    - Data set interface
    - Data handler
    - Data provider / Data consumer
We didn't make any definitive decisions about the future of data set and I am not making a judgment here but a clarification of the position and role of data set is clearly required.
As the topic holder for "The Model" I propose to work over the next few weeks to develop the debate into a proposition and a new revision of the model section of the DAIS specification. Any thoughts, comments, ideas at this stage are of course most welcome.
Regards
Simon
Simon Laws
IBM Hursley Services and Technology