ePlants pipeline and navigator for accessing and integrating multi-level ‘omics data for 15 agronomically important species for hypothesis generation

Overview

In the past five years alone, huge amounts of data have been generated for 15 plant species important for Canada, including poplar, maize, rice, barley, wheat, soybeans and tomatoes. Being able to efficiently use these data will be key to improving and managing these crops to feed, shelter and power a world of 9 billion people by the year 2050.

The ePlant Framework, developed under a previous Genome Canada grant, permits researchers to easily see where and when a gene is “active” and whether there are natural genetic variants that might allow it to do its “job” better. It is currently populated with only one species and now needs data from more. Lead researcher Dr. Nicholas Provart (University of Toronto) plans to develop an ePlant Pipeline that makes it possible to create any ePlant from genomic or exome sequence data. The ePlant Navigator will permit cross-cultivar and cross-species comparisons, supporting robust hypothesis generation. Easy access to these data sets will enable researchers to explore genetic diversity, gene expression, and other data for important genes towards crop improvement.

Kamphir: A versatile framework to fit models to phylogenetic tree shapes

Overview

Phylodynamics is a new and rapidly growing field that combines epidemiology and computational biology to combat infectious disease outbreaks. The field stems from the concept of phylogeny, in which a tree represents how different populations – of virus infections, for example – are related through a series of common ancestors. The genetic similarities among populations are used to reconstruct these ancestral relationships back in time. This is particularly important for viruses, which evolve so quickly that each infection becomes genetically unique within weeks or months of being transmitted from the previous host. Consequently, the virus phylogeny can be used to estimate how the infections spread through the host population. Phylodynamics has already had an enormous impact on our understanding of outbreaks including HIV, hepatitis C virus, and Ebolavirus. Further progress is stymied, however, by simple models that can’t accommodate large data sets.

Dr. Art F.Y. Poon of Western University, Ontario, is developing a completely new approach to phylodynamics that adapts a method from pattern recognition to enable computers to “see” the shared features of different tree shapes. This approach will have an unprecedented capacity for more realistic models and larger data sets, improving global public health initiatives for infectious disease management and eradication.

Dockstore: A platform for sharing cloud-agnostic tools with the research community

Overview

An unintended consequence of the development of genomics has been the proliferation of massive datasets, making analysis increasingly difficult. A further problem is the lack of standardization in how analysis tools are packaged, described and executed across computer environments. Drs. Vincent Ferretti and Lincoln Stein of the Ontario Institute for Cancer Research, in collaboration with Dr. Brian O’Connor of the University of California, Santa Cruz, have developed a web application called the Dockstore, which addresses the challenge of encapsulating and sharing bioinformatics tools so that they can be moved from environment to environment.

Now the researchers are adding key features to the Dockstore to continue to enhance and evolve the platform. They will also integrate bioinformatics tools and workflows from the Global Alliance for Genomics and Health (GA4GH) for redistribution to the larger research community and will work with collaborators to facilitate the registration of their high-quality tools into the Dockstore. Finally, the researchers will work with other projects to enable sharing of tools across genomic repositories. These activities will drive increased usage of the Dockstore, thereby increasing tool sharing among scientists in fields as diverse as agriculture, energy and human health.

Consolidated epigenetic landscape for congenital, developmental and childhood disorders

Overview

Epigenetics is the study of both genetic and external factors, such as environmental exposure or lifestyle choices by parents or grandparents, which affect gene expression. Epigenetic disruptions play a key role in disease. Finding epigenetic biomarkers, however, is complicated by the complexity of epigenetic signaling in cells or tissues, as well as the fact that many different genetic disorders, such as pediatric developmental disorders, can show similar clinical symptoms. Despite the wealth of data being generated by new technologies, there is a dearth of diagnostic tools that can consolidate epigenetic data collected by diverse groups using different experimental platforms. These tools are essential to relate molecular patterns to clinical presentation.

Drs. Michael Brudno and Rosanna Weksberg of Toronto’s Hospital for Sick Children are developing a novel web-based resource for analyzing epigenetic datasets together with complete clinical information, focusing on developmental disorders such as intellectual disability and autism. Their system will provide a rich context for exploring epigenetic dysregulation in a growing number of childhood epigenetic diseases.

Enhanced and automated visualization of complex data

Overview

Modern genomics research generates massive amounts of data. But these data sets are too big and complex to be useful on their own. Researchers must first analyze and interpret biological data to better understand them and turn them into meaningful information. This information can then be used to help solve real-world problems, such as developing new tools or strategies to better diagnose and treat patients, increasing crop yields or monitoring the environment. Increasingly, the ability of the human end-user to interpret the data is the key factor limiting researchers from delivering these much-needed solutions more quickly.

Dr. Paul C. Boutros of the Ontario Institute for Cancer Research is leading a team developing ways of making “big data” results more easily understood by improving the way they are visualized and interpreted. The team will create interactive visualization tools that will integrate tightly with databases scientists already use routinely. The team will use crowdsourcing to capture the best visualization ideas from a broad community of scientists, graphic designers and citizen-scientists. The project will build on the human brain’s ability to interpret images, to make the conclusions of biological data more readily accessible and accelerate the rate of biological discovery and innovation.

Extracting Signal from Noise: Big Biodiversity Analysis from High-Throughput Sequence Data

Overview

Surveying biodiversity is critical for environmental health and for managing natural resources. It helps not only to assess the impact of resource development, but also to identify pests, invasive species, and pathogens in a rapid and cost-effective manner. It is essential to Canada’s economic growth in the forestry, agriculture, and fishery sectors and to decision-making in public health. Genetic methods of surveying biodiversity, such as high-throughput sequencing, are being broadly adopted, but bioinformatics has not kept pace with the data being generated. In addition, current methods are geared toward bacteria and similar organisms, rather than the multi-celled plants and animals that need monitoring as well.

Drs. Sarah Adamowicz and Paul Hebert, along with colleagues from the University of Guelph, are creating new bioinformatics tools that will facilitate the rapid and accurate processing of DNA data resulting from high-throughput sequencing. The tools will enable the simultaneous analysis of bulk samples, which are made up of many different species. They will include a de-noising tool to detect errors; a method to cluster DNA sequences into species-like units to permit biodiversity analysis; and a method for assigning sequencing data to higher taxonomic categories to unlock functional biological information. The team will combine these various tools into a biodiversity informatics pipeline that can be incorporated into existing web-based platforms for uptake by a broad variety of users.

The new biodiversity informatics tools will support large-scale biodiversity research by academics; efficient, accurate, and cost-effective environmental assessments for the mining and pulp-and-paper industries; enhanced capacity and accuracy of regulation; and more rapid and accurate biodiversity data for government and private-sector decision-makers.

From ePlants to eEcosystems: New Frameworks and Tools for Sharing, Accessing, Exploring and Integrating ‘Omic Data from Plants

Overview

Major advances in plant biology over the past decade are in large part thanks to new technologies for DNA sequencing and phenotyping (i.e. mapping the physical expression of genetic traits). The resulting datasets allow researchers to determine how different plants develop and respond to changes in their environment. Yet, in order to take advantage of the tremendous amount of new data, innovative tools are required to integrate and visualize the large number of individual data points in different datasets. The original ePlant system, developed as part of a previous Genome Canada effort, integrates many data types but was not configured for phenotype data. Amongst their many applications, phenotype data provide important information on traits of interest to plant breeders.

Drs. Nicholas Provart of the University of Toronto and Jörg Bohlmann of the University of British Columbia are developing a new module to integrate the wide variety of data available, including ecosystem data, phenotypes and genotypes, into ePlant. This will be done for the already existing ePlant species and any new ePlant species to be developed as part of this project. The researchers will also open the ePlant system to the research community to build a larger ePlant ecosystem of information. This online system will act as a resource where plant biologists will be able to share their datasets.

Ultimately, these tools can help to accelerate the task of identifying useful genes to feed, shelter and power a world of nine billion people by the year 2050.

Dockstore 2.0: Enhancing a community platform for sharing cloud-agnostic research tools

Overview

With Genome Canada support, Dr. Lincoln Stein of the Ontario Institute for Cancer Research successfully developed Dockstore, a system that enables complex computational biology algorithms to be run reliably and reproducibly across multiple platforms. It has been adopted as the leading packaging technology by the Global Alliance for Genomics and Health and is now used by numerous third-party bioinformatics groups. Marc Fiume of the Canadian company DNAstack is collaborating with Dr. Stein and his team to maximize the utility of Dockstore.

The aim of these enhancements is to promote greater collaboration and sharing among computational biology software developers. Specifically, the enhancements will make Dockstore easier to use, make its packages more powerful and expressive, increase its interoperability and enable these packages to run more easily on a wide range of systems and hardware architectures. The bioinformatics and computational biology community will benefit from this software, while the research results derived from it will be reproducible, portable and reusable.

CReSCENT: CanceR Single Cell ExpressioN Toolkit

Overview

Tumours are complex mixtures of cancer, immune, and normal cells that interact and change during treatment. The interplay of all three types of cells can dictate development of cancer over time, as well as response or resistance to treatments. Recent advances in microfluidic and DNA sequencing technologies have enabled researchers to simultaneously analyze tens of thousands of single cells from complex tissues, including tumours. Interpreting these data is challenging, due to the lack of high-quality reference sets of each cell type in the body and a lack of methods to link these data back to tumour biology.

Drs. Trevor Pugh of the Princess Margaret Cancer Centre and Michael Brudno of The Hospital for Sick Children are developing the CanceR Single Cell ExpressioN Toolkit (CReSCENT), a scalable and standardized set of novel algorithmic methods, tools, and a data portal deployed on cloud computing infrastructure. To allow comparison of cells in cancerous and healthy tissues, the system will aggregate single-cell genomic data generated by cancer researchers and connect them to international reference data generated by experts from around the world as part of the Human Cell Atlas. This data sharing and aggregation system is a key differentiating factor in CReSCENT that will increase researcher productivity by accelerating execution and comparison of computational methods, as well as providing contextual data for understanding how cells behave within tumour tissues.

This platform, which will be useable by any researcher on any computing platform, will assemble a crucial data resource to navigate the upcoming wave of single cell cancer genomics research. CReSCENT will bring together researchers across a broad spectrum of scientific areas and disease types and increase the impact of data generated across research programs. In the long term, this system will pave the way for novel single cell diagnostics and discovery of new drug strategies for improved health care.

Software for Peptide Identification and Quantification from Large Mass Spectrometry Data using Data Independent Acquisition

Overview

Precision medicine gives patients the opportunity to tailor medical and treatment decisions at the individual level to maximize outcomes and minimize adverse effects. It can be used to treat a wide variety of diseases, including cancer. Decisions are often based on the presence and quantity of biomarkers such as proteins in the blood or tissue samples.

Advances in mass spectrometry instruments have made it feasible to discover and measure protein biomarkers, but researchers lack the necessary bioinformatics software to analyze the data. Drs. Bin Ma of the University of Waterloo and Michael Moran of the Hospital for Sick Children are developing this software to enable more sensitive and accurate protein identification and quantification from the mass spectrometry data generated using a method called data independent acquisition (DIA). They expect that their software will significantly increase the total number of proteins identified and quantified in comparison to existing DIA analytical software. It will be especially effective with post-translational modifications (PTMs), which are critical biomarkers in a protein’s function and degradation.

The free availability of the software to academic labs coupled with its superior performance can help health researchers discover and trace disease biomarkers. Within the next decade, the software could become an indispensable tool for many proteomics labs performing DIA analysis throughout the world. The new software may also help commercial partners create value-added new products, services and jobs.

Ultimately, this will lead to improvements in human health and reduction in healthcare costs by enabling early disease detection and diagnosis and by facilitating the selection of optimal treatment for individual patients.