How did you get into bioinformatics?
I suppose I’ve been doing bioinformatics since before the term became common currency. I started working on computational analysis of DNA and RNA sequences in the mid-1980s when I went to work at the University of Cambridge in Gabriel Dover’s lab. Diethard Tautz had just led the sequencing of the Drosophila rDNA repeat unit and I used a method he and Martin Trick had developed to analyse repetitiveness across the repeat unit (I might have done my first bit of FORTRAN tweaking it to let it run on what was, then, a very long sequence). I followed this up by developing my own software to predict RNA secondary structure to help write the associated papers and understand the patterns I saw. Since then my interest in bioinformatics has always been in what it can tell me about biological molecules and evolutionary processes, so my interest is more in application than algorithm development.
What are the challenges you see for life scientists in the data-driven science era?
The difficult thing for life scientists, scientists who are interested in understanding living systems, is knowing what methods are available and whether they do what the life scientist needs. A lot of modern bioinformatics is about data management (I include much of sequencing informatics in that). This is important, and indeed, given the large datasets now being produced, modern biology wouldn’t be possible without it, but we must never lose sight of the fact that life science is essentially about understanding living systems (and building on that to benefit mankind).
Would you say this is different for actual bioinformaticians? Do they face different challenges?
Perhaps the biggest challenge for bioinformaticians is straddling two worlds. When I started doing bioinformatics I was clearly a biologist using computational methods to learn things about DNA, RNA and protein sequences. There was no expectation that any piece of code would meet high standards of software development (indeed those standards probably didn’t even exist then). Now a bioinformatician is expected to produce useful, easily digested results in biology and produce high quality code that is ready to share.
What is open data, and what does it mean to you?
As someone who has used open datasets to inform biological insights for the best part of 30 years, I see open data as an essential prerequisite for modern bioscience. This is not so much about re-analysing someone’s data and finding something they’ve missed or done wrong, which is what people always seem to be afraid of, as pulling together data from different sources and generating new knowledge that wouldn’t be accessible otherwise. For me, all data should be made open as soon as possible, although of course people should be able to publish. Publication is about new knowledge, though, not raw data. I think there are relatively few examples of researchers losing out having released data “too soon”.
What is currently missing in the field of bioinformatics AND life sciences?
There are so many things. If we want to combine data from different sources we need the techniques to do so. We probably need new modelling techniques to make use of disparate data. But we also need to get down to the nitty-gritty and make sure that when data is published there is enough metadata associated with it that others can use it usefully. There’s absolutely no point sharing data without adequate metadata – it’s a total waste of time and energy.
It is early days yet, but what would you like to see EMBL-ABR become and achieve?
Growing training and infrastructure together is vital. The measure of success will be when your work is completely transparent – the tools, training and infrastructure you have built are enabling users to do high quality life science research, exchanging data, interacting seamlessly across continents, knowing what agreed standards people require of data so that it is accessible to all. Cloud computing will obviously be instrumental in this too. All the effort needs to come from those of us working behind the scenes to build working collaborations and master new technologies. And high quality training for users is essential to bring this all together.
Biosketch: After a PhD in Genetics at Edinburgh, John’s career in research in computational biology has included being a group leader at the MRC Clinical Sciences Centre, Reader in Computational Biology at Royal Holloway University of London (where he taught Computer Science and Bioinformatics) and Head of Bioinformatics at the MRC Mammalian Genetics Unit.
John was bioinformatics lead on the EUMODIC and EUMORPHIA mouse phenotyping projects and the preparatory phases of INFRAFRONTIER and the International Mouse Phenotyping Consortium. He was a coordinator of the CASIMIR project that produced high profile recommendations on data and biological material sharing and the Interphenome group that worked on standardising phenotype descriptions for mouse.
John’s research interests include evolutionary genome analysis with special emphasis on repetitive DNA and protein sequences, and semantic biology, especially ontologies for phenotype description and infrastructure for phenotype data. In 2013 he moved to BBSRC as Head of Strategy for Genomics, Data and Technologies and in 2015 to TGAC to take up his role as Node Coordinator for ELIXIR-UK.