How the Investigation-Study-Assay (ISA) framework is helping to find life science data

The open source ISA framework and tools help to manage an increasingly diverse set of life science, environmental and biomedical experiments where one technology or a combination of technologies is employed.

Built around the ‘Investigation’ (the project context), ‘Study’ (a unit of research) and ‘Assay’ (analytical measurement) data model and serialisations (tabular, JSON and RDF), the ISA framework helps researchers provide rich descriptions of the experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable.
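
As a rough illustration of the nesting described above, the following Python sketch models an Investigation that contains Studies, which in turn contain Assays, and serialises the result to JSON. The class names, field names and example values are hypothetical and chosen only for illustration; this is not the ISA-JSON schema or the ISA-tools API.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class Assay:
    """One analytical measurement (hypothetical fields, not the ISA schema)."""
    measurement_type: str
    technology_type: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """A unit of research: sample descriptions plus the assays run on them."""
    identifier: str
    title: str
    sample_characteristics: Dict[str, str] = field(default_factory=dict)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """The project context that groups related studies."""
    identifier: str
    title: str
    studies: List[Study] = field(default_factory=list)


def to_json(investigation: Investigation) -> str:
    """Serialise the nested structure to a JSON string."""
    return json.dumps(asdict(investigation), indent=2)


# Illustrative values only: one study measured with two technologies.
investigation = Investigation(
    identifier="INV-1",
    title="Example project",
    studies=[
        Study(
            identifier="STU-1",
            title="Example study",
            sample_characteristics={"organism": "Homo sapiens"},
            assays=[
                Assay("metabolite profiling", "mass spectrometry",
                      ["run1.mzML"]),
                Assay("transcription profiling", "DNA microarray",
                      ["array1.cel"]),
            ],
        )
    ],
)

print(to_json(investigation))
```

Keeping both assays under one study is the point of the sketch: the sample description is recorded once and each measurement type links back to it, which is how a single study can combine several technologies.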

This week we asked Philippe Rocca-Serra, Senior Research Lecturer at the Oxford e-Research Centre, where he co-leads the ISA-tools project for ELIXIR UK, to explain why ISA is important to life science and hence to EMBL-ABR.

____________________________________________________

What is ISA?

ISA is both a model and format to support the description of experimental studies in the field of biology.

Why does it matter for those of us in life science research?

It matters because it is part of the answer to the reproducibility problem in bioinformatics, which is regularly in the headlines.

Awareness of data formats and specifications helps to create data management plans, which are key to moving data management from a retrospective activity to a prospective one.

Who is it for?

For scientists who need to look after their data, for students who need to learn about managing their digital output, for librarians and data managers who handle digital assets, and for bioinformaticians.

Who is using it?

Data repositories and publishers now rely on this format. At EMBL-EBI, the MetaboLights repository uses the format to collect study descriptions for metabolomics studies. It is also at the core of the Horizon 2020 PhenoMeNal project. Publishers such as Nature Publishing Group and BioMed Central have chosen ISA to back new titles such as Scientific Data and GigaScience.

The Harvard Stem Cell Discovery Engine has chosen ISA for managing datasets because of its distinctive ability to support multiple data acquisition modalities. This feature sets it apart from many formats, which are often tied to a single data silo. At the same time, ISA maintains mappings to other formats, allowing conversion and possible deposition of datasets to the corresponding public repositories.

How is it relevant to bioinformatics in Australia?

Genomics and metabolomics techniques are becoming increasingly important to clinical applications. This is also true in translational research, biotechnology and the food industry, which are key areas for countries such as Australia. ISA has also been extended by a number of groups. One is caNanoLab, a data sharing portal designed to facilitate information sharing across the international biomedical nanotechnology research community, in order to expedite and validate the use of nanotechnology in biomedicine. Another is MIAPPE, which is devising a Minimum Information document listing the attributes that might be necessary to fully describe a plant phenotyping experiment. It is important for researchers to be aware of such initiatives.

How do we get involved?

ISA forum (isaforum@googlegroups.com) – email for help and support

User community – join the discussion

ISA GitHub repository – contribute code and find all the different code repositories for the tools, specifications and Docker containers for the various micro-services being developed for Galaxy integration under PhenoMeNal.

At the ISA-tools website, learn how to:

  • collect and curate, following standards: describe the experimental steps using community-defined minimum reporting requirements and ontologies, where possible
  • store and browse, locally or publicly: create your own repository to search and browse the experimental description and associated data, hosted openly or privately
  • submit to public repositories: when required, reformat experiments for submission to supported public repositories or directly export to those already using ISA formats
  • analyse with existing tools: upload experimental descriptions and associated data to a growing number of well-known analysis systems that ISA formats connect with
  • release, reason and nano-publish: explore and reason over your experiments, open them to the linked data universe, or publish nano-statements of your discoveries
  • publish data alongside your article: directly export your experiments to a new generation of data journals that are accepting submissions in ISA formats.

References

http://www.isacommons.org/

http://bioinformatics.oxfordjournals.org/content/26/18/2354.abstract

http://www.nature.com/ng/journal/v44/n2/full/ng.1054.html

http://nar.oxfordjournals.org/content/early/2012/10/28/nar.gks1004

http://www.nature.com/nnano/journal/v8/n2/full/nnano.2013.12.html

http://link.springer.com/article/10.1007%2Fs11306-015-0879-3

______________________________________________________________________

Biosketch: Philippe Rocca-Serra received a PhD in Molecular Biology from the University of Bordeaux, moving to the field of bioinformatics upon joining the Microarray Informatics Team at EMBL-EBI, Cambridge. There, while working to establish ArrayExpress, he became an active member of several standardisation efforts aimed at promoting the vision for open data and open science. As part of several EU projects in toxicogenomics and nutrigenomics, he coordinated the development of the ISA project [1], which now continues at the University of Oxford e-Research Centre. He is also a developer of several semantic resources, such as the Ontology for Biomedical Investigations [2] and, more recently, STATO [3], an ontology to support reporting of statistical results, and he is a member of the OBO Foundry [4] editorial board. As part of his collaboration with Roche under IMI eTRIKS [5], he has coordinated the development of the eTRIKS standards starter pack, continuing work around trial data standardisation. More recently, he has been involved in two NIH-funded Big Data to Knowledge (BD2K) projects: bioCADDIE, which recently released DataMed [6], the ‘PubMed’ for datasets, and CEDAR [7], for building smart data collection forms in collaboration with the BioSharing catalogue of standards [8].

[1] http://isa-tools.org/

[2] http://obi-ontology.org

[3] http://stato-ontology.org

[4] http://obofoundry.github.io/

[5] https://www.etriks.org/standards-starter-pack/

[6] http://datamed.biocaddie.org/

[7] http://cedar.metadatacenter.net/

[8] http://www.biosharing.org