Advancing Big Data Science in Genomics Research

In 2013, the Natural Sciences and Engineering Research Council of Canada (NSERC), Genome Canada, the Canadian Institutes of Health Research (CIHR) and the Canada Foundation for Innovation (CFI) partnered on a Discovery Frontiers call for proposals focused on advancing big data science in genomics research. This initiative was designed to support the development of tools and methodologies for integrating currently available complex data sets in the ’omics sciences with each other, as well as with phenotypic data and data from other related fields of biological sciences. It aimed to build on past and ongoing investments in this area, the most recent being the Bioinformatics and Computational Biology Request for Applications launched in June 2012 by Genome Canada and CIHR.

The results of the competition were announced on April 30, 2014. Federal funding totaling $5.6 million was awarded to an Ontario-led project to support an unprecedented collaboration – both in Canada and internationally – to develop tools that can effectively manipulate vast amounts of data to help find cures for cancer.

The Cancer Genome Collaboratory

Project Leader: Lincoln Stein
Institution: Ontario Institute for Cancer Research
Total Project Funding: $5.6 million

This project will set up a unique cloud computing facility that will enable research on the world’s largest and most comprehensive cancer genome dataset. Using the facilities of the Cancer Genome Collaboratory, researchers will be able to run complex data mining and analysis operations across 10 to 15 petabytes of cancer genome sequences and their associated donor clinical information.

Using advanced metadata tagging, provenance tracking, and workflow management software, researchers will be able to execute complex analytic pipelines, create reproducible traces of each computational step, and share methods and results. This represents a fundamental reversal in the current practice of genome analysis. Rather than requiring researchers to spend weeks downloading hundreds of terabytes of data from a central repository before computations can begin, researchers will upload their analytic software into the Collaboratory cloud, run it, and download the compiled results in a secure fashion.
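The "weeks of downloading" figure can be checked with a back-of-envelope calculation. The short Python sketch below is illustrative only: the function name transfer_days and the assumed sustained link speed of 1 gigabit per second are hypothetical choices, not figures from the project, and real transfer rates vary widely between institutions.

```python
def transfer_days(size_terabytes: float, link_gbit_per_s: float) -> float:
    """Days needed to move size_terabytes over a link sustaining link_gbit_per_s."""
    bits = size_terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9)  # gigabits per second -> bits per second
    return seconds / 86_400                   # 86,400 seconds in a day


if __name__ == "__main__":
    # Assumption: a sustained 1 Gbit/s connection (hypothetical, for illustration).
    for size_tb in (100, 300, 500):
        days = transfer_days(size_tb, 1.0)
        print(f"{size_tb} TB at 1 Gbit/s: {days:.1f} days")
```

Under that assumption, 100 terabytes alone takes about nine days of uninterrupted transfer, so datasets of several hundred terabytes do run to weeks. Uploading analysis software to the data avoids this step.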

Because the genetic data used in the Collaboratory are detailed enough to permit personal identification, privacy issues are central to the project’s design. A special team of computer scientists will investigate ways to guard the privacy of everyone whose data are analyzed. These will include techniques to make genetic profiles anonymous without the loss of details that would render the profiles overly vague, and techniques to structure queries from health researchers so they can be processed via secure data storage sites.