Provenance challenge myGrid David De Roure
Date: 07.07.2017. Size: 469 b.
David De Roure, University of Southampton
Jun Zhao, Carole Goble and Daniele Turi, University of Manchester

Outline
- Workflow implementation
- Provenance schema and storage
- Provenance queries
- Suggestions
- Reflection
- Acknowledgement
Provenance Challenge Overview
- Given an abstract workflow
- Collect provenance from runs of this workflow
- Present the implemented workflow and collected provenance
- Answer a list of provenance questions and present these answers
Taverna and myGrid
A UK e-Science project to build middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications. Sequence analysis, microarray analysis, proteomics, chemoinformatics, image processing, rendering Dilbert cartoons.
- Data links
- Control links: limited support
- Failure tolerance: retry and alternative services
- Implicit iterations: cross/dot iterations
- Semantic metadata annotations
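The difference between cross and dot iterations can be illustrated outside Taverna. The following is a minimal Python sketch, assuming the usual reading that a cross iteration invokes a processor once for every combination of its input lists, while a dot iteration pairs inputs position by position. The function names and file names are hypothetical and are not part of Taverna.

```python
from itertools import product

def cross_iteration(xs, ys):
    """Cross strategy: one invocation per combination of the two input lists."""
    return list(product(xs, ys))

def dot_iteration(xs, ys):
    """Dot strategy: one invocation per pair of elements at the same position."""
    return list(zip(xs, ys))

# Hypothetical input lists for a two-input processor.
images = ["anatomy1.img", "anatomy2.img"]
headers = ["anatomy1.hdr", "anatomy2.hdr"]

cross_pairs = cross_iteration(images, headers)  # every image with every header
dot_pairs = dot_iteration(images, headers)      # image i with header i only
```

For a processor that expects an image together with its own header, the dot strategy is the appropriate one, since the cross strategy would also pair each image with the other image's header.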
What has to be done
- Design the workflow using Scufl in Taverna
- Build services (Web services, Soaplab services, local Java, or BeanShell scripts) to implement each process
- Gather and process the real data products
Doing it properly
- Process the real data as a real experiment
- Use iterations, nested workflows or interactive workflows supported by Taverna
- Real examples:
  - Chimatica (http://www.chimatica.co.uk/) supports high-throughput workflows using Taverna 1.X
  - MIAS-Grid (http://www.mias-irc.net/) uses myGrid to build medical image processing workflows
What we actually did
- Realize each procedure as a BeanShell script, to avoid real service implementation and deployment
- Pass pseudo data products rather than real image data products
- But keep the metadata about data products along with the provenance, to answer semantic questions
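The stub approach can be sketched as follows. This is an illustration in Python rather than the BeanShell scripts actually used, and every name in it (PseudoProduct, align_warp_stub, the metadata keys) is a hypothetical placeholder. The point it shows is that no image processing happens: only a named placeholder and its metadata flow from one procedure to the next.

```python
from dataclasses import dataclass, field

@dataclass
class PseudoProduct:
    """Stand-in for a real image: a name plus descriptive metadata only."""
    name: str
    metadata: dict = field(default_factory=dict)

def align_warp_stub(anatomy, reference, model):
    """Hypothetical stub for align_warp: does no image processing at all,
    it only records which inputs and which model produced the output."""
    return PseudoProduct(
        name=f"warp_of_{anatomy.name}",
        metadata={"derived_from": [anatomy.name, reference.name], "model": model},
    )

# Illustrative inputs; the model value 12 is an assumed example parameter.
anatomy = PseudoProduct("anatomy1", {"type": "image"})
reference = PseudoProduct("reference", {"type": "image"})
warp = align_warp_stub(anatomy, reference, model=12)
```

Because the metadata travels with each pseudo product, questions about parameters and inputs can still be answered even though no real image was ever produced.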
Implemented Scufl workflow in Taverna
Four aspects
- Workflow provenance
- Data provenance
- Organization provenance
- Knowledge provenance

Provenance ontology
Workflow provenance ontology
Data provenance ontology
Organization & Knowledge provenance ontology
userPredicate
- Semantic concept about a data product or a service, e.g. nucleotide_sequence
- Semantic (knowledge) relationships between two data products, e.g. similar_sequence_to
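The two kinds of statement can be pictured as subject-predicate-object triples. The sketch below uses plain Python tuples rather than an RDF library; the data identifiers and the use of rdf:type for concept annotations are assumptions made for illustration, while nucleotide_sequence and similar_sequence_to are the examples given above.

```python
# Each statement is a (subject, predicate, object) triple.
# The data identifiers are hypothetical placeholders.
triples = [
    ("data:seq1", "rdf:type", "nucleotide_sequence"),    # semantic concept
    ("data:seq2", "rdf:type", "nucleotide_sequence"),    # semantic concept
    ("data:seq1", "similar_sequence_to", "data:seq2"),   # knowledge relationship
]

def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def subjects_of_type(concept):
    """Return every data product annotated with the given concept."""
    return [s for s, p, o in triples if p == "rdf:type" and o == concept]
```

A semantic query such as "which data products are similar to seq1" then reduces to a lookup like objects("data:seq1", "similar_sequence_to").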
Collected & stored provenance
- LSIDs used to identify: data, workflows, workflow runs
- LSIDs are names of graphs
  - retrieve whole workflow runs
- Implementation in
  - Sesame2 native store: scalable, alpha release (bugs)
  - NG4J (Jena + MySQL)
- Future implementations: Oracle and Boca
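Using an LSID as the name of a graph means that one lookup returns all the provenance of a run. A minimal in-memory sketch of that idea follows, in Python rather than Sesame2 or NG4J. The example.org authority, the namespace parts of the LSIDs and the outputOf predicate are hypothetical; only the general urn:lsid:authority:namespace:object shape comes from the LSID scheme.

```python
from collections import defaultdict

# Named-graph store: graph name (an LSID) -> set of (subject, predicate, object).
store = defaultdict(set)

def add_statement(graph_lsid, subject, predicate, obj):
    """Record one provenance statement inside the graph named by graph_lsid."""
    store[graph_lsid].add((subject, predicate, obj))

def get_workflow_run(graph_lsid):
    """Retrieve the whole workflow run: every statement in that named graph."""
    return set(store.get(graph_lsid, ()))

# Hypothetical LSIDs for two runs and one data product.
RUN_1 = "urn:lsid:example.org:workflowrun:1"
RUN_2 = "urn:lsid:example.org:workflowrun:2"
ATLAS = "urn:lsid:example.org:data:atlas_x_graphic"

add_statement(RUN_1, ATLAS, "outputOf", "convert")
add_statement(RUN_2, ATLAS, "outputOf", "pnmtojpeg")

run_1 = get_workflow_run(RUN_1)
```

Keeping each run in its own graph also keeps statements from different runs apart, even when they describe the same data product.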
- Find the process that led to d0 (Atlas X Graphic), excluding everything prior to d1 (the averaging of images with softmean)
- Find the Stage 3, 4 and 5 details of the process that led to d0 (Atlas X Graphic)
- Find all invocations of procedure align_warp using p0 (a twelfth order nonlinear 1365 parameter model)
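The first query above is a backward traversal with a cut-off. A minimal sketch, assuming the provenance is available as a mapping from each item to the items it was directly derived from. The node names are a simplified, hypothetical rendering of the challenge workflow, and this is not the query mechanism used by the provenance plugin.

```python
def lineage(derived_from, target, stop_at=None):
    """Return target plus everything upstream of it.
    The traversal keeps stop_at itself but does not walk past it."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node == stop_at:
            continue  # cut-off reached: ignore whatever came before it
        stack.extend(derived_from.get(node, []))
    return seen

# Hypothetical, simplified derivation graph: item -> items it came from.
derived_from = {
    "atlas_x_graphic": ["convert"],
    "convert": ["atlas_x_slice"],
    "atlas_x_slice": ["slicer"],
    "slicer": ["atlas_image"],
    "atlas_image": ["softmean"],
    "softmean": ["resliced_1", "resliced_2"],
    "resliced_1": ["reslice_1"],
    "resliced_2": ["reslice_2"],
}

full = lineage(derived_from, "atlas_x_graphic")
trimmed = lineage(derived_from, "atlas_x_graphic", stop_at="softmean")
```

Here full contains the resliced images and reslice steps, while trimmed ends at softmean.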
7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs.
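One simple way to picture this comparison is as a set difference over the statements recorded for each run. The sketch below is an assumption-laden illustration: each run is reduced to (procedure, input, output) tuples, and the intermediate name atlas_x_ppm is invented for the example.

```python
# Statements shared by both runs (hypothetical, simplified).
shared = {("slicer", "atlas_image", "atlas_x_slice")}

# Run 1 uses convert in the final stage.
run_1 = shared | {("convert", "atlas_x_slice", "atlas_x_graphic")}

# Run 2 replaces convert with pgmtoppm followed by pnmtojpeg.
run_2 = shared | {
    ("pgmtoppm", "atlas_x_slice", "atlas_x_ppm"),
    ("pnmtojpeg", "atlas_x_ppm", "atlas_x_graphic"),
}

only_in_run_1 = run_1 - run_2   # what the second run dropped
only_in_run_2 = run_2 - run_1   # what the second run introduced
common = run_1 & run_2          # what both runs did identically
```

This only works if the two runs name their shared procedures and data products consistently, which is one reason stable identifiers matter for cross-run comparison.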
Suggested Workflow Variants
- Compare, merge and union provenance from different workflow runs
- Replay a workflow run
Categorisation of queries
Four levels:
1. queries to support the provenance browser
2. semantic queries
4. pre-canned queries to support provenance usage scenarios
- Taverna: http://taverna.sourceforge.net
- Provenance plugin and browser beta release: bundled with the Taverna release 1.4
- Provenance ontology: http://cvs.mygrid.org.uk/cgi-bin/viewcvs.cgi/mygrid/miasgrid/rdf-provenance/etc/ontology/
- System requirements:
  - Windows, Linux, Mac
  - Java 5.0
  - MySQL database (optional)
- A systematic provenance query framework is needed
- Separate data and provenance metadata
- A consensus on provenance models
- The myGrid Taverna team: Tom Oinn, Stuart Owen, Stian Soiland, David Withers, Katy Wolstencroft and June Finch
- Daniele Turi: provenance plugin
- Matthew Gamble: Taverna provenance browser
- Chris Wroe from the original myGrid project