2016 EMBL-ABR Survey of Bioinformatics and Computational Needs in Australia

Survey of Bioinformatics and Computational needs in Australia, 2016

Demographics, Career Stage & Bio-domain

To gain a snapshot of the bioinformatics and computational biology needs among life scientists and medical researchers, we conducted a national survey (July-September 2016). In total, 123 responses were received, distributed across all Australian states (Figure 1A).

Figure 1A and 1B

The largest share of respondents was from Victoria (37%), followed by New South Wales (21%), Queensland (18%) and South Australia (15%). Postdoctoral fellows constituted 31% of respondents, principal investigators 27% and PhD students 26%. Respondents were asked to identify their institution, and more than 35 different institutions, universities and research institutes were named (Figure 1B).

Asked to self-assign to a research academic discipline, respondents selected the most appropriate terms from a supplied list. More than 32 biological domains were represented, with a majority of responses in the fields of Evolutionary Biology, Cell Biology, Biochemistry, Agriculture, Microbiology, Biomedical Research, Molecular Biology and Genetics (Figure 2). Almost 50% of respondents selected “Bioinformatics” in addition to their other domain choice.

Figure 2

Respondents were asked to identify the datatypes they work with in their research. As shown in Figure 3, 82% work with sequences (DNA/RNA/proteins) and 50% work at the level of interactions and pathways. Respondents were also asked which biological systems they work on: most selected humans (52%) and model systems across plants, animals and microbes, while non-model organisms were selected by approximately 20% of the total pool. About three-quarters of respondents (76%) declared that they work with large data sets (e.g. genome sequences, RNA-Seq or ChIP-Seq). A similar proportion (75%) reported working with more than one datatype (40 work with two types, 32 with three, 15 with four and 5 with five), while only 31 (25%) work with a single datatype.
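The datatype percentages quoted above can be rechecked from the reported counts. The following is a minimal sketch, assuming the five counts in the text cover all 123 respondents; the variable and function names are illustrative and not part of the survey materials.

```python
# Respondents by number of datatypes they work with, as reported in the text.
datatype_counts = {1: 31, 2: 40, 3: 32, 4: 15, 5: 5}

# Total respondents implied by the counts.
total = sum(datatype_counts.values())

# Respondents working with more than one datatype.
multi = sum(n for k, n in datatype_counts.items() if k > 1)


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


single_pct = pct(datatype_counts[1], total)
multi_pct = pct(multi, total)

print(f"total respondents: {total}")
print(f"more than one datatype: {multi} ({multi_pct}%)")
print(f"single datatype: {datatype_counts[1]} ({single_pct}%)")
```

The counts sum to the 123 responses analysed, and the rounded shares agree with the 75% and 25% figures given in the text.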

Figure 3

Respondents were also invited to classify their level of bioinformatics skills on a four-point scale: never use bioinformatics tools, beginner, intermediate or advanced. Surprisingly, seven respondents declared they “never use bioinformatics tools”, with others being 40% beginners, 32% intermediate and 32% advanced. When asked how much data they work with per month (on average), 28% declared they work with 10-100 GB, 26% with 100 GB-1 TB, and 13% with more than 1 TB. A further 20% of respondents selected less than 10 GB, while 13% declared that they did not know.

Data analysis needs

Respondents were asked to indicate how important the following aspects were for their research:
1. updated analysis software
2. multi-step analysis workflows or pipelines
3. high-performance or cluster computing and Cloud computing (remote, configurable, on-demand computers).

They were also asked whether their institution or a national service meets their current needs, and how important each aspect would be to their research in three years. The results in Table 1 show that updated analysis software and multi-step analysis workflows or pipelines are considered important both now and in the future, but are perceived as not well resourced at present. Responses on HPC/cluster computing and cloud computing reflected the relatively recent emergence of the latter, with around 30% of respondents unsure whether cloud computing was well resourced (31%) or would be important in three years (29%). High-performance or cluster computing was considered important (76%) but relatively well resourced (65%) compared with all other categories.
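One way to make the comparison between importance and resourcing concrete is to subtract the "need met" percentage from the "important now" percentage for each row of Table 1. The sketch below does this with the "Yes" values transcribed from Table 1. The gap measure and the names `table1` and `unmet_gap` are assumptions introduced for illustration; the survey authors do not report such an indicator.

```python
# "Yes" percentages transcribed from Table 1: (important now, need met).
table1 = {
    "Updated analysis software": (91, 58),
    "Multi-step analysis workflows or pipelines": (82, 37),
    "High performance or cluster computing": (76, 65),
    "Cloud computing": (57, 52),
}


def unmet_gap(rows):
    """Illustrative indicator: importance minus provision, in percentage points."""
    return {need: important - met for need, (important, met) in rows.items()}


gaps = unmet_gap(table1)

# Rank needs from largest to smallest gap.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for need, gap in ranked:
    print(f"{need}: {gap} percentage points")
```

On this rough measure, workflows and pipelines show the largest gap and cloud computing the smallest, which is in line with the reading of Table 1 given above. The two percentages come from separate questions, so the difference is only indicative.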

Table 1: Summary of national survey responses to questions about the respondents’ current and future needs for and current access to updated analysis software, multi-step analysis workflows or pipelines, HPC or cluster computing and cloud computing. Response rate to these questions was 97-100% (total 123 respondents).

| Research need | Question | Yes (%) | No (%) | Don’t know (%) |
| --- | --- | --- | --- | --- |
| Updated analysis software | Important now | 91 | 4 | 5 |
| | Your institution or national service meets this need | 58 | 21 | 21 |
| | Important in 3 years | 91 | 0 | 9 |
| Multi step analysis workflows or pipelines | Important now | 82 | 10 | 8 |
| | Your institution or national service meets this need | 37 | 40 | 23 |
| | Important in 3 years | 85 | 2 | 13 |
| High performance or cluster computing | Important now | 76 | 16 | 7 |
| | Your institution or national service meets this need | 65 | 13 | 22 |
| | Important in 3 years | 79 | 2 | 18 |
| Cloud computing | Important now | 57 | 32 | 11 |
| | Your institution or national service meets this need | 52 | 17 | 31 |
| | Important in 3 years | 66 | 5 | 29 |

Data Storage, Discovery and Sharing Needs

The survey asked a series of questions regarding data storage, discovery, and sharing needs. Researchers were asked how important it was for their research to have sufficient data storage, to search for data and discover relevant datasets, to share data with colleagues, and to publish data to the community and/or archives. These were all rated as important, with 80-94% of respondents indicating current importance in each case. These numbers increased slightly in relation to importance in three years, to 86-97%.

A reasonable conclusion here is that needs in this area are not all being met, since only 76% responded that institutional or national data storage provision was satisfactory, 61% had needs met for data sharing with colleagues, 53% for publishing data to the community and/or archives, and 42% for the data discovery question. This highlights four areas that should be re-examined to ensure existing resources are appropriate for life scientists and that the life scientists in turn are aware of the resources.

When asked to describe with whom they aim to share their data, 70% of respondents reported sharing with supervisors, colleagues and/or collaborators without mentioning the wider scientific community, public data sharing, or public data repositories. One respondent claimed to share data with no-one, and three were unsure. However, 25% of respondents mentioned making data accessible to “everyone”, “the scientific community” or similar (in most cases in addition to sharing within their research group or collaboration). The majority of respondents reported using email, local servers, Dropbox, Google Drive or even portable hard drives to share data. Four respondents mentioned the Australian CloudStor service, and eight reported using other file transfer protocols (presumably to transfer data from local servers). Some respondents were unsure of the sharing method used in their group, and in some cases confusion was evident about what data sharing means (an example comment: “… High Performance Computing… would give clinicians access to my findings”). Four respondents claimed to share data only in publications.

Table 2: Summary of national survey responses to questions about the respondents’ current and future needs for and current access to data storage, data discovery approaches, data sharing and data publishing. Response rate to these questions was 99-100% (total 123 respondents).  

| Data need | Question | Yes (%) | No (%) | Don’t know (%) |
| --- | --- | --- | --- | --- |
| Sufficient data storage | Important now | 93 | 6 | 1 |
| | Your institution or national service meets this need | 75 | 17 | 7 |
| | Important in 3 years | 98 | 0 | 2 |
| Data discovery | Important now | 80 | 15 | 5 |
| | Your institution or national service meets this need | 41 | 29 | 30 |
| | Important in 3 years | 85 | 2 | 13 |
| Sharing data with colleagues | Important now | 89 | 10 | 1 |
| | Your institution or national service meets this need | 61 | 20 | 19 |
| | Important in 3 years | 93 | 1 | 7 |
| Publishing data | Important now | 82 | 15 | 3 |
| | Your institution or national service meets this need | 52 | 19 | 30 |
| | Important in 3 years | 87 | 4 | 9 |

Computational support and training needs

Bioinformatics and analysis support was considered important by 90% of respondents for their current research, with a similar percentage considering it would be important in three years. However, there is a clear lack of support services being currently offered, since only 46% responded that this was currently provided by either their institution or a national service, with 42% declaring it is not provided and 12% declaring they didn’t know.

We then asked respondents about their training needs in four specific topics, namely:

  • basic computing (Linux) and scripting (Python, R)
  • data management and metadata
  • integrating multiple data types
  • scaling analysis to cloud or HPC.

The results shown in Table 3 indicate that training in these four areas is considered relatively important, with 78% of respondents indicating so for scripting and Linux. However, fewer than half of respondents considered that their current training needs are being met by either their institution or a national service, suggesting that bioinformatics training remains a very high priority. It is important to note that the training needs described here are not confined to PhD students, but extend to postdocs, technicians and researchers.
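The "fewer than half" statement can be checked directly against the Table 3 values for the four training topics. This is a minimal sketch using the "Yes" percentages transcribed from Table 3; the dictionary layout and variable names are illustrative assumptions.

```python
# "Yes" percentages transcribed from Table 3 for the four training topics.
training = {
    "Basic programming and scripting": {"important_now": 78, "met": 46},
    "Data management and metadata": {"important_now": 72, "met": 20},
    "Integrating multiple datatypes": {"important_now": 67, "met": 17},
    "Scaling analysis for cloud and high-performance computing": {
        "important_now": 55,
        "met": 23,
    },
}

# True if, for every topic, under 50% of respondents report the need as met.
all_below_half = all(row["met"] < 50 for row in training.values())

# Topic with the lowest share of respondents reporting the need as met.
least_met = min(training, key=lambda topic: training[topic]["met"])

print(f"every topic met for fewer than half of respondents: {all_below_half}")
print(f"least-met topic: {least_met} ({training[least_met]['met']}%)")
```

With these values, every topic falls below 50% for needs met, and integrating multiple datatypes has the lowest share.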

Table 3: Summary of national survey responses to questions about the respondents’ current and future needs for and current access to training. Response rate to these questions was 97-100% (total 123 respondents).  

| Training | Question | Yes (%) | No (%) | Don’t know (%) |
| --- | --- | --- | --- | --- |
| Bioinformatics and analysis support | Important now | 89 | 8 | 2 |
| | Your institution or national service meets this need | 46 | 42 | 12 |
| | Important in 3 years | 91 | 3 | 6 |
| Basic programming and scripting | Important now | 78 | 20 | 2 |
| | Your institution or national service meets this need | 46 | 38 | 16 |
| | Important in 3 years | 78 | 12 | 10 |
| Data management and metadata | Important now | 72 | 20 | 8 |
| | Your institution or national service meets this need | 20 | 49 | 30 |
| | Important in 3 years | 80 | 5 | 16 |
| Integrating multiple datatypes | Important now | 67 | 19 | 14 |
| | Your institution or national service meets this need | 17 | 50 | 33 |
| | Important in 3 years | 74 | 3 | 23 |
| Scaling analysis for cloud and high-performance computing | Important now | 55 | 22 | 22 |
| | Your institution or national service meets this need | 23 | 28 | 49 |
| | Important in 3 years | 63 | 7 | 30 |

Finally, we asked respondents to describe what tools, workflows, or other services they would most like to see provided. Only three respondents (2.5%) claimed to need no extra tools, workflows or services beyond what they currently have. The need for training was further highlighted in the answers to this question, with 40% of responses requesting training provision of some kind. Only 9% of responses specified a need for basic or introductory programming training. One respondent requested less training – “Less training, more resource funding and allocation” – which may indicate a frustration with general under-resourcing for bioinformatics as a whole. Responses covered a diverse range of domain-specific software tools, methods and workflows. Some notable commonalities were references to the need for support in, provision of or training in pipelines/workflows (22% of responses); in statistics, downstream analysis or interpretation of results (9%); in data discovery, management or storage (7%) and in understanding best-practice approaches, identifying “good, validated” bioinformatics tools/pipelines or accessing expert advice (7%).

Survey Data

The summary above is based on the 123 responses collected by the end of September 2016. Since then we have received nine more responses and the actual data can be found here.

Acknowledgements

This survey was originally developed by Cold Spring Harbor Laboratory on behalf of CyVerse.

How to Cite

Schneider, Maria Victoria; Flannery, Madison; Griffin, Philippa (2016): Survey of Bioinformatics and Computational Needs in Australia 2016.pdf. figshare.
https://dx.doi.org/10.6084/m9.figshare.4307768.v1