Enhanced and automated visualization of complex data

Overview

Modern genomics research generates massive amounts of data. But these data sets are too big and complex to be useful on their own. Researchers must first analyze and interpret biological data to better understand them and turn them into meaningful information. This information can then be used to help solve real-world problems, such as developing new tools or strategies to better diagnose and treat patients, increasing crop yields or monitoring the environment. Increasingly, the ability of the human end-user to interpret the data is the key factor limiting researchers from delivering these much-needed solutions more quickly.

Dr. Paul C. Boutros of the Ontario Institute for Cancer Research is leading a team developing ways of making “big data” results more easily understood by improving the way they are visualized and interpreted. The team will create interactive visualization tools that will integrate tightly with databases scientists already use routinely. The team will use crowdsourcing to capture the best visualization ideas from a broad community of scientists, graphic designers and citizen-scientists. The project will build on the human brain’s ability to interpret images, to make the conclusions of biological data more readily accessible and accelerate the rate of biological discovery and innovation.

Extracting Signal from Noise: Big Biodiversity Analysis from High-Throughput Sequence Data

Overview

Surveying biodiversity is critical for environmental health and for managing natural resources. It helps to assess the impact of resource development, but also to identify pests, invasive species, and pathogens in a rapid and cost-effective manner. It is essential to Canada’s economic growth in the forestry, agriculture, and fishery sectors and to decision-making in public health. Genetic methods of surveying biodiversity, such as high-throughput sequencing, are being broadly adopted, but bioinformatics has not kept pace with the data being generated. In addition, current methods are geared toward bacteria and similar organisms, rather than multi-celled plants and animals that need monitoring as well.

Drs. Sarah Adamowicz and Paul Hebert, along with colleagues from the University of Guelph, are creating new bioinformatics tools that will facilitate the rapid and accurate processing of DNA data resulting from high-throughput sequencing. The tools will enable the simultaneous analysis of bulk samples, which are made up of many different species. They will include a de-noising tool to detect errors; a method to cluster DNA sequences into species-like units to permit biodiversity analysis; and a method for assigning sequencing data to higher taxonomic categories to unlock functional biological information. The team will combine these various tools into a biodiversity informatics pipeline that can be incorporated into existing web-based platforms for uptake by a broad variety of users.
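As a rough illustration of what clustering DNA sequences into species-like units can involve, the sketch below groups sequences by a simple identity threshold. It is not the team's method: the function names, the greedy seed-based strategy, the assumption that reads are already aligned to equal length, and the example threshold are all hypothetical choices made for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Put each sequence in the first cluster whose seed it matches at >= threshold.

    The first member of each cluster acts as its seed; a sequence that matches
    no existing seed starts a new cluster.
    """
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Toy reads (hypothetical): two near-identical pairs from two different "species".
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTGGCCAATT", "TTGGCCAATA"]
clusters = greedy_cluster(reads, threshold=0.85)
for i, members in enumerate(clusters):
    print(i, members)
```

In this toy run, reads that differ at a single position fall into the same unit, while unrelated reads seed a new one. Real pipelines would also need alignment, error handling and a biologically justified threshold.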

The new biodiversity informatics tools will support large-scale biodiversity research by academics; efficient, accurate, and cost-effective environmental assessments for the mining and pulp-and-paper industries; enhanced capacity and accuracy of regulation; and more rapid and accurate biodiversity data for government and private-sector decision-makers.

From ePlants to eEcosystems: New Frameworks and Tools for Sharing, Accessing, Exploring and Integrating ‘Omic Data from Plants

Overview

Major advances in plant biology over the past decade are in large part thanks to new technologies for DNA sequencing and phenotyping (i.e. mapping the physical expression of genetic traits). The resulting datasets allow researchers to determine how different plants develop and respond to changes in their environment. Yet, in order to take advantage of the tremendous amount of new data, innovative tools are required to integrate and visualize the large number of individual data points in different datasets. The original ePlant system, developed as part of a previous Genome Canada effort, integrates many data types but was not configured for phenotype data. Amongst their many applications, phenotype data provide important information on traits of interest to plant breeders.

Drs. Nicholas Provart of the University of Toronto and Jörg Bohlmann of the University of British Columbia are developing a new module to integrate the wide variety of data available, including ecosystem data, phenotypes and genotypes, into ePlant. This will be done for the already existing ePlant species and any new ePlant species to be developed as part of this project. The researchers will also open the ePlant system to the research community to build a larger ePlant ecosystem of information. This online system will act as a resource where plant biologists will be able to share their datasets.

Ultimately, these tools can help to accelerate the task of identifying useful genes to feed, shelter and power a world of nine billion people by the year 2050.

Dockstore 2.0: Enhancing a community platform for sharing cloud-agnostic research tools

Overview

With Genome Canada support, Dr. Lincoln Stein of the Ontario Institute for Cancer Research successfully developed Dockstore, a system that enables complex computational biology algorithms to be run reliably and reproducibly across multiple platforms. It has been adopted as the leading packaging technology by the Global Alliance for Genomics and Health and is now used by numerous third-party bioinformatics groups. Marc Fiume of the Canadian company DNAstack is collaborating with Dr. Stein and his team to maximize the utility of Dockstore.

The aim of these enhancements is to promote greater collaboration and sharing among computational biology software developers. Specifically, the enhancements will make Dockstore easier to use, make its packages more powerful and expressive, increase its interoperability and enable these packages to run more easily on a wide range of systems and hardware architectures. The bioinformatics and computational biology community will benefit from this software, while the research results derived from it will be reproducible, portable and reusable.

CReSCENT: CanceR Single Cell ExpressioN Toolkit

Overview

Tumours are complex mixtures of cancer, immune, and normal cells that interact and change during treatment. The interplay of all three types of cells can dictate development of cancer over time, as well as response or resistance to treatments. Recent advances in microfluidic and DNA sequencing technologies have enabled researchers to simultaneously analyze tens of thousands of single cells from complex tissues, including tumours. Interpreting these data is challenging, due to the lack of high-quality reference sets of each cell type in the body and a lack of methods to link these data back to tumour biology.

Drs. Trevor Pugh of the Princess Margaret Cancer Centre and Michael Brudno of The Hospital for Sick Children are developing the CanceR Single Cell ExpressioN Toolkit (CReSCENT), a scalable and standardized set of novel algorithmic methods, tools, and a data portal deployed on cloud computing infrastructure. To allow comparison of cells in cancerous and healthy tissues, the system will aggregate single-cell genomic data generated by cancer researchers and connect them to international reference data generated by experts from around the world as part of the Human Cell Atlas. This data sharing and aggregation system is a key differentiating factor in CReSCENT that will increase researcher productivity by accelerating execution and comparison of computational methods, as well as providing contextual data for understanding how cells behave within tumour tissues.

This platform, which will be usable by any researcher on any computing platform, will assemble a crucial data resource to navigate the upcoming wave of single-cell cancer genomics research. CReSCENT will bring together researchers across a broad spectrum of scientific areas and disease types and increase the impact of data generated across research programs. In the long term, this system will pave the way for novel single-cell diagnostics and discovery of new drug strategies for improved health care.

Software for Peptide Identification and Quantification from Large Mass Spectrometry Data using Data Independent Acquisition

Overview

Precision medicine gives patients the opportunity to tailor medical and treatment decisions at the individual level to maximize outcomes and minimize adverse effects. It can be used to treat a wide variety of diseases, including cancer. Decisions are often based on the presence and quantity of biomarkers such as proteins in the blood or tissue samples.

Advances in mass spectrometry instruments have made it feasible to discover and measure protein biomarkers, but researchers lack the necessary bioinformatics software to analyze the data. Drs. Bin Ma of the University of Waterloo and Michael Moran of the Hospital for Sick Children are developing this software to enable more sensitive and accurate protein identification and quantification from the mass spectrometry data generated using a method called data independent acquisition (DIA). They expect that their software will significantly increase the total number of proteins identified and quantified in comparison to existing DIA analytical software. It will be especially effective with post-translational modifications (PTMs), which are critical biomarkers in a protein’s function and degradation.

The free availability of the software to academic labs coupled with its superior performance can help health researchers discover and trace disease biomarkers. Within the next decade, the software could become an indispensable tool for many proteomics labs performing DIA analysis throughout the world. The new software may also help commercial partners create value-added new products, services and jobs.

Ultimately, this will lead to improvements in human health and reduction in healthcare costs by enabling early disease detection and diagnosis and by facilitating the selection of optimal treatment for individual patients.

SYNERGx: a computational framework for drug combination synergy prediction

Overview

When just one drug is used to treat cancer, the patient may not respond, or may develop resistance to it. Combination therapy, where two or more drugs are used in treatment, is more likely to be successful. Yet, it is impossible to test all drug combinations in clinical trials due to the high cost of required resources and certain ethical considerations. Computational techniques are therefore required to model the large amount of available data to improve current cancer treatment strategies and propose more efficient combinations of drugs.

Dr. Benjamin Haibe-Kains of the Princess Margaret Cancer Centre is developing SYNERGx, a new computational platform that will integrate multiple pharmacogenomic datasets. These datasets will be used to predict possible combinations of known drugs that can act in synergy, meaning that their combined therapeutic efficacy is greater than the sum of their individual effects.
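To make the idea of synergy concrete, the sketch below scores a drug pair against a reference for "no interaction". It uses the Bliss independence model, a commonly used reference that differs slightly from a plain sum of effects because it keeps the expected combined effect below 100%. This is an illustrative assumption, not a description of how SYNERGx scores combinations. The function names and example numbers are hypothetical.

```python
def bliss_expected(effect_a: float, effect_b: float) -> float:
    """Expected combined effect if the two drugs act independently.

    Effects are fractions of cells inhibited, each between 0 and 1.
    """
    for e in (effect_a, effect_b):
        if not 0.0 <= e <= 1.0:
            raise ValueError("effects must be fractions between 0 and 1")
    return effect_a + effect_b - effect_a * effect_b


def bliss_excess(effect_a: float, effect_b: float, observed: float) -> float:
    """Observed minus expected effect: > 0 suggests synergy, < 0 antagonism."""
    return observed - bliss_expected(effect_a, effect_b)


# Hypothetical measurements: each drug alone, then the combination.
excess = bliss_excess(effect_a=0.30, effect_b=0.40, observed=0.75)
print(f"Bliss excess: {excess:.2f}")
```

A positive excess means the combination inhibited more cells than independent action would predict. In practice such scores are computed across a full dose matrix and need replicate measurements to be interpreted.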

The platform will implement analytic tools to improve modeling of synergistic drug effects. Users will have access to highly curated drug-combination pharmacogenetics data and an open-source machine-learning pipeline for drug synergy prediction. SYNERGx will also implement a new way to optimize drug-screening studies to identify novel synergistic combinations that can be further validated in preclinical studies and then in clinical trials.

SYNERGx will provide an efficient way to leverage massive investments in pharmacogenomics studies by allowing the integration of otherwise disparate datasets. It represents a major step forward in the design of new therapeutic strategies for cancer.

Computational tools for Data-Independent Acquisition (DIA) for quantitative proteomics and metabolomics

Overview

When cells lose control over their own behaviour or communication with other cells, diseases such as diabetes or cancer can arise. Proteins and small-molecule metabolites are responsible for cells’ behaviour, so identifying and quantifying these molecules is key to understanding how disease happens and how to prevent it.

Mass spectrometry has become the workhorse for proteomics and metabolomics. Drs. Anne-Claude Gingras of the Lunenfeld-Tanenbaum Research Institute and Hannes Röst of the Donnelly Centre for Cellular & Biomolecular Research at the University of Toronto are working with a technology called Data-Independent Acquisition (DIA), in which the mass spectrometer systematically identifies and quantifies the proteins and metabolites present in a sample. DIA has been shown to improve quantitative accuracy, reproducibility and throughput over other methods. Since its introduction, however, this approach has only been applied to small-scale studies and in a relatively small number of laboratories. Limitations to this method are due to the lack of user-friendly software that could enable a scalable analysis of the complex data generated in large-scale biomedical and medical research.

The project builds on the team’s proven strength in DIA data analysis and software development and will result in an integrated set of tools available under an open-source license. To encourage uptake of these tools, documentation, webinars and workshops will be made available to potential users. The results of the project could have long-lasting impact on the health sector in Canada by facilitating research into the root causes of disease and assisting with clinical questions such as patient stratification.

BridGE-SGA: A novel computational platform to discover genetic interactions underlying human disease

Overview

The ability to sequence the entire human genome at increasingly lower cost has led to a fundamental change in biomedical research. But there is a gap between the amount of data available and our ability to understand and interpret that data. Addressing this gap is essential to realize the promise of precision medicine.

Dr. Charles Boone and Dr. Brenda Andrews of the Donnelly Centre for Cellular and Biomolecular Research at the University of Toronto, and Dr. Chad Myers of the University of Minnesota, have worked together to discover that a significant part of our inability to interpret genomic data likely stems from the reality that disease generally arises from complex genetic interactions. While all humans essentially have the same set of genes, most have around five million unique genetic variants. The effect of any one variant depends on its interactions with other variants. So we need to understand not just the millions of genetic differences that affect gene function, but also how all those genes interact with each other. Current computational methods and technologies lack the statistical power to do so.

Drs. Boone, Andrews and Myers have developed the first complete genetic interaction map for any organism, and have built a computational method, BridGE, to discover genetic interactions. The team is now working to develop an innovative computational platform for genome sequencing data, BridGE-SGA, to enable the discovery of disease-associated genetic interactions from large-scale human genotype data. Their goal is to discover genetic interactions for a variety of diseases. Identifying and understanding these key genetic interactions will improve our ability to interpret data from whole genome sequencing and identify novel gene targets for drug discovery and development.

Stratifying and targeting pediatric medulloblastoma through genomics (2010)

Overview

Understanding childhood brain cancer. Brain cancer is the leading cause of pediatric cancer deaths. Children who survive have a much poorer quality of life due to the aggressive treatment used to fight the disease. This results in a staggering burden of suffering for them and their families as well as economic costs of over $100 million annually to the health system. Studies indicate that children with a good prognosis are often over-treated and could be spared complications by reducing the amount of treatment they receive. At the same time, children with a poor prognosis are often subjected to painful treatments which may, in fact, be futile.

With support from Genome Canada, scientists are using genome-wide approaches to study medulloblastomas, the most common form of childhood brain cancer, to develop markers that will more accurately classify the tumours for treatment. Researchers are also identifying genetic changes that may reveal the risk factors that predispose children to this type of cancer.

As they unravel the genetic basis of brain cancer, the research team is also working with families to determine what additional risks they are willing to assume in reducing therapy to improve quality of life. It is anticipated that the results of this research will lead to new ways to treat childhood brain cancers more effectively and to enhance the quality of life of children struck by this devastating disease.