Data Management

Calling Algorithms

When calling an algorithm, the client provides URLs for the data in question. For each URL, the server checks first whether it refers to an attachment, then whether it is an internal data reference (the URL prefix should match), and finally treats it as a plain URL. The first retrieval method that succeeds supplies the algorithm's data.
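The lookup order described above can be sketched roughly as follows. This is only an illustration: the prefix value, the dictionary-based stores, and the `fetch` callback are assumptions standing in for whatever the server actually uses.

```python
from typing import Callable, Dict, Optional

# Hypothetical prefix identifying internal data references; the real value
# would come from the server's configuration.
INTERNAL_PREFIX = "https://server.example/data/"


def resolve_data(
    url: str,
    attachments: Dict[str, bytes],
    internal_store: Dict[str, bytes],
    fetch: Callable[[str], Optional[bytes]],
) -> bytes:
    """Return the data for url using the first retrieval method that succeeds."""
    # 1. Attachment: the URL names a part carried with the request itself.
    if url in attachments:
        return attachments[url]
    # 2. Internal data reference: the URL prefix matches this server's data service.
    if url.startswith(INTERNAL_PREFIX):
        key = url[len(INTERNAL_PREFIX):]
        if key in internal_store:
            return internal_store[key]
    # 3. Plain URL: fetch it like any other remote resource.
    data = fetch(url)
    if data is not None:
        return data
    raise LookupError(f"could not dereference {url!r}")
```

Because each step falls through when it fails, a URL that looks internal but is missing from the store is still tried as a plain URL.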


URL Dereferencing Flowchart

If the data resides on a remote data service, the client must move it to the client's or the server's data service, most likely by delegating its move permission to the remote data service and having that service put the data on the server. That delegation is explained in more detail below.

Data Management Operations

Primary Operations

Copy Data

This is the primary data management operation. Data at the server is sent to the Reply-To endpoint.

Receive Data

This is the secondary data management operation. It can either be used directly, or be the receiving point of a Reply-To that's different from the sender of a web services request.

Delete Data

Get rid of the data specified.

3rd Party Copy Operation (Tentative)

Secondary Operations

List Data

List data managed for the calling client. This might at some point include data the calling client did not choose to have put there, but that is a minor extension.
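To make the four operations concrete, here is a minimal in-memory sketch. The class name, method names, and the `send` callback (which stands in for a web services call to a Reply-To endpoint) are all hypothetical; nothing here is meant as the actual interface.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback standing in for a web services call:
# send(endpoint, name, payload) delivers a payload to an endpoint.
Sender = Callable[[str, str, bytes], None]


class DataManager:
    """In-memory stand-in for one instance's data service."""

    def __init__(self, send: Sender) -> None:
        self._send = send
        self._data: Dict[Tuple[str, str], bytes] = {}  # (owner, name) -> payload

    def receive(self, owner: str, name: str, payload: bytes) -> None:
        """Receive Data: store a payload on behalf of owner."""
        self._data[(owner, name)] = payload

    def copy(self, owner: str, name: str, reply_to: str) -> None:
        """Copy Data: send the stored payload to the Reply-To endpoint."""
        self._send(reply_to, name, self._data[(owner, name)])

    def delete(self, owner: str, name: str) -> None:
        """Delete Data: get rid of the data specified."""
        del self._data[(owner, name)]

    def list_data(self, owner: str) -> List[str]:
        """List Data: names of the data managed for the calling client."""
        return sorted(n for (o, n) in self._data if o == owner)
```

In this sketch a copy from one instance to another is just a `copy` on the source whose `send` callback ends in a `receive` on the destination, which mirrors how Receive Data acts as the receiving point of a Reply-To.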

Standards

SOAP with Attachments

SwA is the most likely candidate for attachments. It is part of the WS-I Basic Profile, and provides for direct setting and manipulation of the URI used to refer to the attachment. This means the attachment can be made optional, and the URI can then refer to other entities, as described above.

MTOM

MTOM is an increasingly popular standard for attachments, due to its greater transparency and ease of integration with other web standards. However, that transparency makes it harder to support the desired fallback behavior. If an MTOM-based solution that falls back appropriately can be found, MTOM would be preferable. Since the wire format for MTOM is the same as for SwA, the choice does not impact interoperability with clients that support only one of the two standards.

WS-Addressing

This is the backbone of these capabilities. Being able to specify varying endpoints and use endpoint specifications along with built-in 'smarts' to chain calls (as in the 3rd Party Copy Operation diagram, above) makes complex processing sequences involving 3rd parties much simpler.

Security

Users will specify data as either secure or unsecure. Unsecure data will not be made readily available, but a person who tries will be able to find it. Transferring unsecure data will be much simpler and more reliable, because it can be exposed automatically over URIs whenever a transfer requires it. In specific contexts data might be exposed securely over some URIs (scp, gridftp, and similar URIs, perhaps), but this would require participating instances to have a stronger and better-specified trust relationship than can be assumed for cishell interactions.

Even unsecure data exposed over URIs will be exposed over https URIs. Web services calls will be made over https.

Secure data will rely on a permissions-based scheme. Each instance of cishell will have an associated public key/private key pair, which will usually be generated by that instance for sole use by that instance. The public key will be made generally available.

After acquiring another instance's public key, through whatever method, users can specify that the holder of that public key (and particular id associated with it?) has permission to use various services (mainly algorithms and the data manager). When someone attempts to use an algorithm or other service directly, they encrypt a random token with their private key, then with the called instance's public key, then with their private key again. That token is included in the request, likely in an endpoint reference.

The triple wrapping is necessary to ensure that no one can re-wrap a partially unwrapped token and use it on a different service, and that no one can get at the token but the designated receiver.

The called instance decrypts all three layers (the first with the caller's public key, the second with its private key, and the third with the caller's public key again) and checks that the token has not been used before. If the token has been used before, the called instance should either ignore or fault on the input (TBD).

The same pattern applies for delegated invocation, since the token can be securely passed around to anyone trusted to make the call properly once, and cannot be reused.

There should likely be a timestamp included with the token. That would let a token value be reused after some interval, so the called instance need not remember used tokens indefinitely, and would provide a little extra security. This would rely on loose clock synchronization between instances; say, to within half an hour or an hour. If possible, the service to be called should also be included, as that significantly reduces the value of token-stealing by instances used as delegates: they could use the token only on the service the original caller hoped to use.
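The checks on the called instance's side (single use, timestamp window, service binding) might look something like the sketch below. It assumes the three encryption layers have already been removed and that the token carries a timestamp and a service name; the class, the parameter names, and the one-hour window are illustrative assumptions, and a rejected token is simply reported as False since the ignore-or-fault choice is still TBD.

```python
import time
from typing import Dict, Optional

MAX_SKEW_SECONDS = 3600  # assumed tolerance, taken from the "hour" suggested above


class TokenRegistry:
    """Tracks used tokens so each one is accepted at most once per window."""

    def __init__(self, max_skew: float = MAX_SKEW_SECONDS) -> None:
        self._max_skew = max_skew
        self._seen: Dict[str, float] = {}  # token -> timestamp it was issued with

    def accept(self, token: str, issued_at: float, token_service: str,
               requested_service: str, now: Optional[float] = None) -> bool:
        """Return True if the (already decrypted) token may be used for this call."""
        now = time.time() if now is None else now
        # Forget tokens so old that the timestamp check alone would reject them.
        self._seen = {t: ts for t, ts in self._seen.items()
                      if now - ts <= self._max_skew}
        if abs(now - issued_at) > self._max_skew:
            return False  # stale, or stamped too far in the future
        if token_service != requested_service:
            return False  # token was issued for a different service
        if token in self._seen:
            return False  # replay of a token that was already used
        self._seen[token] = issued_at
        return True
```

Keeping the timestamp with each remembered token is what lets the registry stay bounded: once a token falls outside the window, it can be dropped without reopening a replay opportunity.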

Attachments:

copy3rd.png (image/png)
urlflow.png (image/png)