Poster submission is open (and you can also submit late oral presentation abstracts)
Conference Abstracts and presentations
The majority of the Conference on 4-5 February will be filled with presentations from the community, including oral presentations, posters and lightning talks. Presentations can span the full range of topics relevant to data-intensive biology.
BIO Day Oral presentations
Keynote: Chromosome conformation in context
Enhancing Knowledge Discovery from Cancer Genomics Data with Galaxy
The field of cancer genomics has demonstrated the power of massively parallel sequencing techniques to inform on the genes and specific alterations that drive tumour onset and progression. Although large comprehensive sequence data sets continue to be made increasingly available, data analysis remains an ongoing challenge, particularly for laboratories lacking dedicated resources and bioinformatics expertise. To address this, we have produced a collection of Galaxy tools that represent many popular algorithms for detecting somatic genetic alterations from cancer genome and exome data. We developed new methods for parallelization of these tools within Galaxy to accelerate runtime and have demonstrated their usability on cloud-based infrastructure and commodity hardware. Some tools represent extensions or refinements of existing toolkits to yield visualizations suited to cohort-wide cancer genomic analysis. For example, we present Oncocircos and Oncoprintplus, which generate data-rich summaries of exome-derived somatic mutations. Workflows that combine these tools for data integration and visualization are demonstrated on a cohort of 96 diffuse large B-cell lymphomas, where they enabled the discovery of multiple candidate lymphoma-related genes. Our toolkit is available from our GitHub repository as Galaxy tool and dependency definitions and has been deployed using virtualization on multiple platforms including Docker.
Using Galaxy for Microbial Research Consortium
MetaSUB (Metagenomics & Metadesign of Subways & Urban Biomes) is a project to assess microbial communities on the surfaces of built environments, especially in high-traffic areas such as subways. Researchers in various cities around the world are collecting and sharing microbial information to create a worldwide database. Microbial data obtained from subway trains and stations in New York and Boston are already publicly available. These datasets will benefit smart city design, public health, and the discovery of new microbial species and genes. The work of generating data and sharing the analysis results attracts not just researchers but also the general public.
To promote such ‘citizen science’, web platforms to manage datasets and their analysis workflows are essential. The requirements for such platforms are as follows: 1) the platform should have a user-friendly interface, since the users are not data analysis professionals; and 2) the platform should provide exactly the same data analysis workflows to all users, and the workflows should be robust and reproducible. Galaxy is one of the best candidates for such a workflow management platform.
Therefore, we are now building Galaxy tools and their hosting environments for sharing datasets and tools among people affiliated with international consortia, including the MetaSUB consortium. In this session, we will discuss good practices and future tasks for using Galaxy in global collaborations.
Genomics Virtual Lab – Queensland
Nick Hamilton, University of Queensland Research Computing Centre, Australia
Igor Makunin, University of Queensland Research Computing Centre, Australia
Derek Benson, University of Queensland Research Computing Centre, Australia
The Genomics Virtual Laboratory – Queensland is a set of managed services allowing genomics researchers to analyse data using a range of tools within a single easy-to-use interface that does not require them to have programming skills. Genomics Virtual Lab – Queensland services include: a public Galaxy server and associated storage for analysis of next-generation sequencing data; an RStudio server; a full mirror of the 1000 Genomes project data; and a demonstration Beacon server that serves as a tutorial on how researchers may set up their own server on cloud resources. This presentation will provide an overview of these GVL-QLD services, their uptake by researchers, and our associated training and skills development programs.
Genomics Virtual Lab – Queensland is operated by the University of Queensland Research Computing Centre using funding from QCIF and the National Collaborative Research Infrastructure Nectar and Research Data Services projects.
Building Galaxy Community VM
Biomedical researchers face two barriers when they start using Galaxy. First, while Galaxy tools and workflows are shared in public repositories, new users can hardly find information on how other research institutes use those tools and design workflows. Second, they are often unable to reproduce workflows they have used before: it is difficult for individual researchers or small laboratories to maintain their systems, so their Galaxy environments often become unrecoverable when they change the settings or rebuild their computers.
To solve these problems, Galaxy Community Japan holds a monthly meet-up to share our workflows. We also distribute a virtual machine image, on which we configured Galaxy with necessary tools and workflows, and make available our practical know-how about these tools and workflows on our website. Users can download and run the virtual machine on their own PC or launch it on AWS, so they can immediately try pre-installed analysis workflows with their own data.
The latest version of this virtual machine is running on our public test site, while older versions remain available for download. As a result, users can run the same workflows on different computational infrastructures and always reconstruct the Galaxy environments they have used before. This will also help developers advertise their new tools to potential users. We would like to introduce several newly developed tools on our Galaxy, as well as our experience with local activities such as Galaxy Workshop Tokyo.
Galaxy Training Network
Doing Speech Science with Galaxy
Alveo is a Virtual Laboratory providing services for language researchers; it holds large collections of speech, text and video data that are used as source material in research, and provides an API to support tools that work on the data. We have been exploring Galaxy as a workflow platform as it meets our goal of providing an easy-to-use interface to complex tools for non-technical users.
This talk will give an overview of some of the tools and workflows we have developed and raise some issues that we are thinking about in our use of Galaxy. One of these is the common pattern of work in our field where a workflow is applied to a large number of source documents to generate data for analysis. For example, I may put thousands of audio files through a signal processing pipeline to generate a table of results that is then used in a machine learning workflow. I’ll describe some of the challenges we’ve faced in implementing these kinds of pipelines within Galaxy.
GigaGalaxy, and publishing workflows for publishing workflows
There is growing awareness of the reproducibility crisis in science due to the actual artefacts which support scholarship (the data and methods) remaining largely inaccessible, and of the need for these artefacts to be FAIR: Findable, Accessible, Interoperable, and Reusable. GigaScience is a journal trying to change this, where the focus is on assessing research on reproducibility and re-usability rather than subjective impact. This is facilitated by GigaScience’s database, GigaDB, where the data, workflows and tools used for analysis are hosted and made publicly available with citable DOIs. GigaScience also has a GigaGalaxy server that helps in hosting, visualizing and reproducing workflows. Experience with many other reproducible ways of sharing research, through the publication of dynamic Knitr documents, iPython workbooks, virtual machines and Docker containers, has shown that this approach to publishing is applicable beyond genomics, with published VMs of end-to-end metabolomics pipelines, image processing algorithms and more. Finally, we have investigated the extent to which the results from articles published in GigaScience can be made reproducible using Galaxy, in a pilot project based on a previously published paper on the SOAPdenovo2 genome assembler. To quantify and test how reproducible this was, this study was subjected to a number of FAIR data models, including ISA-, ResearchObjects, and nanopublications. This presentation will present GigaScience’s experiences of what should be best practice in publishing FAIR and reproducible research, particularly when using Galaxy workflows. In addition, it will highlight problems that we have encountered while doing this, and the results of our reproducibility exercise.
Is there a simple solution to integrate the multi-omics data? From Lab book to databases!
For many decades, the natural approach to understanding complex biological processes has been to look for the similarities and/or differences observed between various organisms. Later, the focus shifted to cross-comparison of similarities and differences within the same organism but across different species, down to the sub-cellular level. After many decades of this steady but slow progress, a revolution came with the human genome project, ushering in what we refer to as the post-genomic era. Using the divide-and-conquer principle, we have succeeded in understanding biology to a certain extent at the level of genes, proteins, transcripts, metabolites and many other individual components and processes. However, these individual sub-domain studies are still far from giving a complete answer to many complex biological problems. To bring potential answers to the surface, it is necessary to adopt the concept of ‘studying the system as a whole’. This not only gives an idea of the functional aspects of an organism at the global level, but also gives complete knowledge of how the cellular constituents function together to accomplish their particular tasks. The question is how to get there, and what we have to do to bring together the information obtained from many approaches and capitalize on the potential these individual domains hold. To address this issue, we are trying to develop an integrated platform that combines genomic, proteomic and metabolomic resources and helps enhance our understanding of biology.
Keynote: Dynamics of epigenetic modifications
INFO Day Oral presentations
Keynote: Adventures in scaling Galaxy
Integration of Galaxy with a Supercomputer
The National Institute of Genetics, Japan (NIG) operates a supercomputer designed for data-intensive computation in life science and medical research. Primarily, the NIG supercomputer is used for the construction of a comprehensive database of all publicly available DNA sequences. The NIG hosts the DNA Data Bank of Japan (DDBJ), which is a member of the International Nucleotide Sequence Database Collaboration (INSDC), together with the US-based National Center for Biotechnology Information (NCBI) and the UK-based European Bioinformatics Institute (EBI). In addition, the NIG supercomputer is offered as a high-performance computing resource for life science research and education to researchers belonging to universities and national and public research institutes in Japan. As of Nov 2016, it has almost 2,800 registered researchers and the number is increasing rapidly, reflecting the recent data deluge.
Since the majority of the NIG supercomputer users (about 80%) perform NGS analysis, with the remaining registrants using it for simulation, database construction, natural language processing, and so on, Galaxy is a good starting point for constructing NGS analysis pipelines for those users. The challenge is how to integrate Galaxy with a parallel computing environment while achieving effective resource management. In this report, we introduce our hardware and software stack for integrating Galaxy and HPC, in which each bioinformatics tool is encapsulated in a Docker container and invoked on suitable compute nodes, with the aid of enterprise resource managers including Univa Grid Engine and Apache Mesos.
Building Virtual Laboratory Software Infrastructure
The GVL has been in active development for the last 4 years and has seen significant adoption. Deployments have spread around the world with thousands of launched instances and replicas built on several national clouds. As the user base has grown, expectations have outpaced the original system design goals. For example, users and deployers have come to expect the GVL to be extensible and customisable, with their own tools and for their own specialised requirements. One example is the MicrobialGVL, which extends the base GVL to suit the microbial bioinformatics community. Increasingly, a diverse suite of clouds needs to be supported for round-the-globe availability, and pre-building the underlying resources for each supported cloud is no longer a scalable option.
To meet these expectations, we have distilled the core layers required to build a flexible, scalable, maintainable, and robust system. We have built the layers necessary to support major cloud providers and a launcher layer that offers an extensible application deployment platform. This makes it possible to deploy on custom clouds of choice and to extend a running instance of the GVL with user-selected applications or application suites, with current examples including the SMRT Analysis suite, LOVD and IRIDA.
In this talk, we will present an overview of the challenges of building Virtual Laboratory software infrastructure, and the work done so far to meet those challenges. We will present the new features from an end-user perspective as well as describe how developers and deployers can use it to define and deploy applications.
GalaxyP: A Galaxy Proteomics Tools Community
GVL and Nectar
Resource planning on the Cloud: exploring the scalability spectrum
Cloud computing resources have become the informatics backbone for scalable, accessible, customizable, and secure computing, with bioinformatics continuing to benefit from this computational model. What started as a handful of applications ported to the Cloud has grown into the creation of Virtual Laboratories and Cloud Pilot projects funded by national funding agencies. Today, a typical end-user scenario for the cloud is to acquire a set of virtual machines with pre-installed software from a cloud provider and perform the needed data analysis. In the process, the user needs to make cost-effective decisions about what resources to acquire and how many. These decisions have a direct impact on the outcome of the analysis: with insufficient resources, it may be impossible to complete the analysis or it may take extra time, while excessive resources waste project funds or merit allocation credits and can cause resource contention on academic clouds.
To shine some light on this topic, we performed a number of experiments with the Galaxy CloudMan project to explore the tradeoffs among resource types and sizes across the Amazon Web Services infrastructure. Using published next generation sequencing data, we identified resource requirements and limits on resource classes, and observed actual resource utilization for RNA-seq and ChIP-seq pipelines. These results can be used to help users gauge what resources to use when using cloud machinery. They can also be used by academic cloud infrastructure projects to determine what type of underlying infrastructure is needed by users. In this talk, we will detail our findings.
Utilizing the Google Cloud to Scale and Reproduce Galaxy Instances
In this talk you’ll learn patterns to optimize the use of Google Cloud resources in scaling and reproducing Galaxy environments for research. Topics covered include how best to use cloud-based virtual machines (Google Compute Engine) to scale Galaxy instance objects (workflows, etc.), as well as the creation of virtual machine disk images for development environment reproducibility via Docker containers.
You will also see a demo of the Google Genomics API, which can be used to store, process, explore and share genomic data using the standards defined by the Global Alliance for Genomics and Health. This includes support for managing datasets, reads and variants; searching and slicing; and setting access control for sharing. Google BigQuery will also be included in this demo, showing interactive variant result analysis.
The talk will also include gcloud scripts that demonstrate best practices for reducing the cost of cloud services, such as stopping GCE instances (VMs) when not in use and starting them again when needed, along with other cost-saving methods. Security, auditing, logging and monitoring using GCP tools and processes, such as GCP Stackdriver, will also be covered.
The key takeaway is how to effectively use either cloud-based virtual machines or Docker containers for scalability and reproducibility of research environments for Galaxy.
Galaxy in Public Health: the Microbial Genomics Virtual Laboratory
The uptake of genomics in public health and clinical microbiology laboratories is being slowed by the perceived requirement that each laboratory establish and evaluate its own tools and infrastructure, a counterproductive approach that will result in a lack of standardisation of methods.
An easily instantiated computer image based around Galaxy with a defined set of microbial-specific tools and reference data is an ideal solution for enabling standardisation between laboratories. We have established the Microbial Genomics Virtual Laboratory (mGVL) to empower laboratories to establish their own private operating environment using standard software and analysis methods that are suited to government accreditation.
The mGVL consists of machine images for performing genomics analyses in a scalable, reproducible manner, plus web tools for instantiating and managing the images on multiple cloud architectures. The images incorporate a number of pre-configured genomic analysis platforms including Galaxy, the Linux command line, RStudio and JupyterHub.
The mGVL images are constructed from Ansible scripts, which makes them straightforward to customise. This talk will focus on tailoring the Galaxy environment to microbial genomics via the Ansible scripting, and on making the Galaxy file system available to other web services and the Linux command line.
Connecting Galaxy and GenomeSpace on OpenStack
Yousef Kowsar, VLSCI, University of Melbourne, Australia
Nuwan Goonasekera, VLSCI, University of Melbourne, Australia
Derek Benson, University of Queensland, Australia
Igor Makunin, University of Queensland, Australia
Andrew Isaac, VLSCI, University of Melbourne, Australia
Anna Syme, VLSCI, University of Melbourne, Australia
Simon Gladman, VLSCI, University of Melbourne, Australia
Nigel Ward, Queensland Cyber Infrastructure Foundation, Australia
Andrew Lonie, University of Melbourne and EMBL-ABR, Australia
Galaxy is a sophisticated workflow platform for bioinformatics analysis and visualisation, allowing complex, reproducible, compute-intensive interactive and scheduled analysis, but it is not focussed on the management of user data. Researchers typically work with a number of unstructured files representing data, metadata, analysis intermediates and outcomes, documents, etc., that are generally accessed ad hoc through a read-write file system. GenomeSpace is a file-centric user workspace platform tailored for bioinformatics; it allows users to view and manage their files and data in an accessible, interactive web environment, and brokers the connection and transfer of a user’s data to and from a suite of analysis and visualisation tools. The combination of a data-centric workspace like GenomeSpace connected to a workflow-and-tool-centric computational analysis environment such as Galaxy is a powerful paradigm for highly accessible genomics analysis.
We demonstrate an implementation of GenomeSpace and Galaxy, connected and implemented as services on the OpenStack-based Australian Research Cloud. We identify the different infrastructure drivers of a user workspace (data-centric, object-based, unstructured file collections, persistent) and a computational analysis environment (workflow-centric, scalable, POSIX-based, transient), and conclude that both are necessary to provide a flexible, user-focussed, web-accessible platform for bioinformatics.
Automatic Configuration of Galaxy Servers over Multiple Cloud Platforms Using Virtual Cloud Provider
Kento Aida, National Institute of Informatics, Japan
Atsuko Takefusa, National Institute of Informatics, Japan
Yoshinobu Masatani, National Institute of Informatics, Japan
Shigetoshi Yokoyama, National Institute of Informatics, Gunma University, Japan
As a result of the performance improvement of cloud computing platforms, there is a growing interest in the use of cloud computing for running data analysis platforms such as Galaxy servers. However, there are several issues in configuring application environments on cloud platforms. First, building such environments is not easy for users who are not familiar with computer engineering; configuring the network settings of a computer cluster and deploying various software is also time-consuming. Second, a single cloud platform does not normally fulfil the diverse application requirements. For example, some applications may use data resources distributed across distant locations. Therefore, it is necessary to configure application environments over multiple cloud platforms. Third, the reproducible configuration of computational experiments is important for applications. In the configuration, we have to carefully unify the versions of OSes, libraries, and other underlying software.
In this talk, we will introduce the Virtual Cloud Provider (VCP) architecture, which automatically configures data analysis platforms over multiple cloud platforms. The key ideas of the VCP are (1) configuring a virtual network over multiple cloud platforms using a high-performance L2VPN service such as the SINET5 (Japanese academic network) L2VPLS service, and (2) using Linux container technologies to achieve easy and fast deployment of software. We also show how the VCP prototype deploys an application environment consisting of a Galaxy server and an Apache Mesos cluster, and that it can produce reproducible results.
Keynote: Galaxy’s and communities in one Universe
Submit an abstract
The Galaxy community is dedicated to the principles of open access, and as part of that commitment all GAMe presentations will be made available online after the conference. By submitting an abstract you:
- Agree to make your slides/posters freely available online no later than 14 February 2017
- Agree to have your presentations videotaped/photographed and made publicly available during and/or after the meeting
BIO / INFO
The meeting will have a BIO / INFO split, with the first day emphasising biological research and applications, and the second focussing on technical aspects of Galaxy – for bioinformaticians, tool and software developers, and research infrastructure providers. Some presentations will map clearly to one day or the other, but some will apply equally well to both themes. The abstract submission form asks which day(s) you would like to present on.
Posters will be presented during two poster sessions. All posters will be visible for the full two days of the Conference 4-5 Feb 2017. Poster content can overlap with content presented in other formats at the meeting.
Poster abstract submission is open until we run out of poster space. Abstracts are limited to 250 words or fewer. Poster abstracts are reviewed on a rolling basis and submitters will be notified of the decision within two weeks of the abstract’s submission date.
Accepted talks will be presented during the 2-day Conference. Talks will take 15-20 minutes. The meeting is single track and talks will be presented to all attendees.
The oral presentation abstract submission deadline was 14 December 2016. However, you can still submit late abstracts, which will be considered as openings/cancellations occur. Abstracts are limited to 250 words or fewer.
Not yet ready to present your work in a 20-minute presentation? You can still share your work! Lightning talks are short, focused presentations (5-7 minutes) featuring topics submitted by meeting participants during the meeting itself.
The call for lightning talks will go out shortly before events start.