Visualizing genomic data: techniques and challenges.

Posted on 27 June, 2017 by Carlo Pecoraro

The advent of Next Generation Sequencing technologies has allowed the rapid production of massive amounts of genomic data, and created a corresponding need for new tools and methods for visualizing and interpreting these data in order to extract relevant biological information.

Visualizing genomic data requires more than simply plotting data. It requires a decision: what message should a particular plot convey? It also often presents a choice: which methods should be used to represent the results clearly and accurately for users? Interpreting and visualizing genomic data, which often consist of thousands to billions of data points, and extracting biological meaning from them remain a serious challenge.

We run a workshop on Genomic Data Visualization and Interpretation (Sept 11-15, 2017) with the twin scientists Dr. Malachi Griffith (Malachi's website) and Dr. Obi Griffith (Obi's website), who lead a combined research group (The Griffith lab) at the McDonnell Genome Institute at Washington University. They will be joined by Mr. Zachary Skidmore, a staff scientist in their group and expert in genomic visualization. The Griffith lab’s research is focused on developing methods of applied bioinformatics for genomic data analysis, personalized medicine and improved cancer care. They have participated in dozens of genomic studies from the earliest sequencing of full-length human genes and whole genomes, initial surveys of the mutational landscape of multiple cancer subtypes, and some of the first proof-of-principle applications of sequencing to cancer precision medicine. They have a strong commitment to developing open-source tools and knowledgebases for cancer gene and variant interpretation including pVAC-seq (github.com/griffithlab/pVAC-Seq), CIViC (civicdb.org), DGIdb (dgidb.org), and DoCM (docm.info). They are also the creators of GenVisR (bioconductor.org/packages/release/bioc/html/GenVisR.html), a Bioconductor package, which provides a user-friendly, flexible and comprehensive suite of tools for visualizing complex genomic data for multiple species of interest. In this Workshop we will explore a number of best-in-class visualization tools, and we will provide working examples that demonstrate important principles of ‘omic interpretation strategies.

Here is the course page, where you can find all the tools available to visualize your genomic data: http://genviz.org/

We had the opportunity to discuss this topic with them:

1) You are both leading the Griffith Lab – could you please tell us about the different research topics of your group? How has the combination of your different bioinformatics skills and knowledge been important in accomplishing scientific objectives and key results?

Our lab’s activities can be divided into four broad categories: cancer genomics studies, personalized medicine efforts, bioinformatics tool development, and educational efforts such as this workshop. Our group conducts data analysis of a wide array of data types. Major sources of data include whole genome, exome and RNA-seq. We are also involved in many projects that involve a wide variety of custom gene panel assays. We are increasingly working with T-cell receptor (TCR) sequencing and cell-free DNA (cfDNA) sequence data. We don’t focus on a particular cancer but have many projects involving AML, lung, head and neck cancer, liver cancer, and breast cancer. To help translate results from these next generation sequencing assays to clinical application we develop open source tools and online resources for interpretation of data in a personalized genomics context. These include the drug-gene interaction database (www.dgidb.org), the database of curated mutations (www.docm.info), the personalized Variant Antigens by Cancer Sequencing (pVAC-seq) pipeline (github.com/griffithlab/pVAC-Seq), and the Clinical Interpretation of Variants in Cancer (CIViC) resource (www.civicdb.org).

The products of our lab are really the result of the hard efforts of the ~25 trainees, bioinformaticians, data analysts, and software developers in our group. We have both been doing bioinformatics at large Genome Institutes for close to 15 years. Our training is similar but with some specializations in areas such as survival analysis, machine learning, and software development. While we were trained mostly at the same institutions, our exposure to different PhD supervisors, and distinct post-doctoral fellowships, provide unique perspectives that are now complementary.

2) In your opinion, what remains the main bottleneck between genomic data generation and their subsequent visualization and interpretation?

The bioinformatics community has standardized file types for preliminary processing of data (fastq, bam, etc.). However, this advantage disappears the further you go downstream in the analysis workflow. Fusion and structural variant callers are a great example of this: there are many fusion callers available, each presenting the same data in a slightly different format. This requires the research scientist to take the time to normalize and reformat the data before additional analysis and interpretation. Furthermore, the large scale of data often prevents simple analysis approaches and may require more advanced knowledge of computational techniques. Finally, as genomic data generation has been increasingly automated, limited knowledge of the molecular principles and assumptions underlying the data can also be an obstacle to appropriate interpretation. A principal goal of our lab, and this workshop, is to help overcome this bottleneck.
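To make the reformatting step concrete, here is a minimal Python sketch of normalizing output from two fusion callers into one common record. Everything in it is an assumption made for illustration: "caller A" and "caller B", their column names, the gene names and the coordinates are all hypothetical placeholders and do not correspond to any real tool's format or to real breakpoints.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Fusion:
    """Common record that every caller's output is converted into."""
    gene5: str   # 5' partner gene
    gene3: str   # 3' partner gene
    chrom5: str
    pos5: int
    chrom3: str
    pos3: int


def parse_breakpoint(text):
    """Split a 'chrom:pos' string such as 'chr1:1000' into ('chr1', 1000)."""
    chrom, pos = text.split(":")
    return chrom, int(pos)


def from_caller_a(handle):
    """Hypothetical caller A: tab-separated, one 'GENE5--GENE3' name column
    and two 'chrom:pos' breakpoint columns."""
    for row in csv.DictReader(handle, delimiter="\t"):
        gene5, gene3 = row["fusion_name"].split("--")
        chrom5, pos5 = parse_breakpoint(row["left_breakpoint"])
        chrom3, pos3 = parse_breakpoint(row["right_breakpoint"])
        yield Fusion(gene5, gene3, chrom5, pos5, chrom3, pos3)


def from_caller_b(handle):
    """Hypothetical caller B: comma-separated, every field in its own column."""
    for row in csv.DictReader(handle):
        yield Fusion(row["gene_a"], row["gene_b"],
                     row["chr_a"], int(row["pos_a"]),
                     row["chr_b"], int(row["pos_b"]))


# Made-up example output describing the same fusion event in both formats.
CALLER_A_OUTPUT = (
    "fusion_name\tleft_breakpoint\tright_breakpoint\n"
    "GENEA--GENEB\tchr1:1000\tchr2:2000\n"
)
CALLER_B_OUTPUT = (
    "gene_a,gene_b,chr_a,pos_a,chr_b,pos_b\n"
    "GENEA,GENEB,chr1,1000,chr2,2000\n"
)

fusions_a = list(from_caller_a(io.StringIO(CALLER_A_OUTPUT)))
fusions_b = list(from_caller_b(io.StringIO(CALLER_B_OUTPUT)))
```

Once both outputs are expressed as the same record type, they can be compared, merged or plotted with a single code path, which is the kind of normalization work described above.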

3) Your course is mainly based on R and Bioconductor – Could you please explain to us why they are so flexible and powerful for interpreting and visualizing genomic data?

The Bioconductor resource is one of the main factors that makes R so attractive to the bioinformatics community. With over 1,350 packages devoted to computational biology/bioinformatics, no other programming language has such a specialized resource. In addition to Bioconductor, many other fundamentally useful statistical approaches and algorithms have been implemented in R and made freely available by a vibrant and active community. This allows a researcher to rapidly reuse others’ genomic analysis tools. It also allows us to apply cutting-edge analysis and statistical techniques developed in other fields to genomics data in novel and innovative ways. In our lab we regularly use a combination of C/C++, Java, Python, Ruby/Rails, JavaScript (Angular/D3), TypeScript, Perl, and other languages. Each of these is used as appropriate for the task, but R remains a major tool for data visualization and figure generation, data exploration for hypothesis generation, and statistics for hypothesis testing.

4) You have created a new Bioconductor package “GenVisR”. What led you to create this package? What are the main advantages in using it for visualizing genomic data?

In conducting many genome data analyses and preparing figures for publications we found that we were creating specific types of plots over and over, a process that could take hours to days. We also receive an increasing number of requests for such figures from current and prospective collaborators, far more than we could realistically satisfy. GenVisR is our attempt to streamline this process to allow for more time interpreting data and less time performing repetitive coding tasks. We also hope that it allows research labs with genomic data to generate their own visualizations without needing to always collaborate with a bioinformatics lab.

5) Apart from GenVisR, which are your favourite packages for visualizing your genomic data?

ggplot2 and its derivatives (ggvis, plotly) are great packages. GenVisR is actually built on ggplot2. Nothing allows for more flexibility and customization while producing publication-quality graphics. ggbio is also a great Bioconductor package for visualizing genomic data on tracks. If you want to learn all about the many visualization tools we use on a daily basis, please join the course!

6) During the workshop, attendees will learn to visualize and interpret results from real human genome data sets generated at the McDonnell Genome Institute at Washington University School of Medicine. Could those tools also be used for non-model species datasets?

Yes, GenVisR supports multiple species, including non-model species, as do many of the tools we will cover.

7) The continued development of sequencing technologies will lead to production of increasingly large genomic data sets – in your opinion which kind of tools will we use in the future to interpret and visualize this growing amount of data?

Interactive graphics have been an exciting area for a while, allowing researchers to explore data in real time in an intuitive format. We think this area will continue to expand and we’ll see more scientists displaying results with microservers such as Shiny in R or even through traditional websites using modern graphical application frameworks such as D3. Large datasets will also increasingly drive data analysts to cloud computing platforms, where the scalability and elasticity of the compute environment can be leveraged.

Many thanks Obi, Malachi and Zachary for your time and see you in Berlin!

https://www.physalia-courses.org/courses/course14/