The power and promise of genome-wide association studies

Posted on 26 July, 2017 by Carlo Pecoraro

The genetic sources of phenotypic variation have been a major focus of both plant and animal studies aimed at identifying the causes of disease, improving agriculture and understanding adaptive processes. To address these questions, genome-wide association studies (GWASs) emerged with the recent advent of next-generation sequencing (NGS) technologies. This approach scans hundreds of thousands of SNPs across the genome, surveying the most common genetic variation for a role in a disease or identifying heritable quantitative traits that are risk factors for diseases.
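To make the idea of scanning SNPs for association concrete, here is a minimal Python sketch; it is not taken from the interview or the course material. It assumes a simple case-control design with simulated allele counts, a 1-df allelic chi-square test per SNP and a Bonferroni-corrected threshold. The function names, sample sizes, effect size and the index of the "causal" SNP are all hypothetical choices made for illustration.

```python
import math
import random


def allelic_chi2_pvalue(case_alt, case_total, ctrl_alt, ctrl_total):
    """P-value of a 1-df chi-square test on a 2x2 table of allele counts."""
    a, b = case_alt, case_total - case_alt   # cases: alt, ref alleles
    c, d = ctrl_alt, ctrl_total - ctrl_alt   # controls: alt, ref alleles
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # monomorphic SNP: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def simulate_alt_count(n_people, alt_freq, rng):
    """Count alt alleles among 2 * n_people independently drawn alleles."""
    return sum(rng.random() < alt_freq for _ in range(2 * n_people))


def scan(n_snps=100, n_cases=2000, n_controls=2000, causal_index=17, seed=1):
    """Simulate one p-value per SNP; only `causal_index` differs in cases."""
    rng = random.Random(seed)
    pvalues = []
    for i in range(n_snps):
        freq = rng.uniform(0.1, 0.5)
        # Assumed effect: the causal SNP is 10 percentage points more
        # frequent in cases than in controls.
        case_freq = freq + 0.10 if i == causal_index else freq
        case_alt = simulate_alt_count(n_cases, case_freq, rng)
        ctrl_alt = simulate_alt_count(n_controls, freq, rng)
        pvalues.append(
            allelic_chi2_pvalue(case_alt, 2 * n_cases, ctrl_alt, 2 * n_controls)
        )
    return pvalues


pvalues = scan()
threshold = 0.05 / len(pvalues)  # Bonferroni correction for the number of tests
hits = [i for i, p in enumerate(pvalues) if p < threshold]
print("SNPs passing the corrected threshold:", hits)
```

Real analyses typically rely on dedicated software and regression models with covariates; the sketch only shows the one-test-per-SNP logic and why the significance threshold has to shrink as the number of SNPs grows.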

We will run a course on "Practical GWAS using Linux and R" from the 23rd to the 27th of October 2017, with Dr. Jing Hua Zhao (https://www.physalia-courses.org/courses-workshops/course15/). Here we had the opportunity to discuss with him the power, efficiency, comprehensiveness, interpretation and analysis of GWASs.

Dr. Zhao joined the MRC Epidemiology Unit, University of Cambridge, to work on the design and analysis of GWASs such as EPIC-Norfolk, Fenland and InterAct. He has also participated in numerous genetic analysis workshops involving both simulated and real data, such as those from the Framingham Heart Study. Besides methodological development, data analysis and other academic activities, he has given tutorials on the genetic dissection of complex traits, with a focus on GWAS, at the useR! 2008, 2009 and 2010 conferences, and contributed a Henry Stewart talk on genetic association with R.

1) When did you start using a genome-wide association approach in your work? And how has this approach changed since then?

JHZ: The first GWAS was reported back in 2005, when I joined the MRC Epidemiology Unit following six years of work on statistical genetics at King’s College London and three years on social and genetic epidemiology at University College London. I had come across many concepts and techniques in genetic epidemiology earlier; for example, Morton et al. (1983) Methods in Genetic Epidemiology was the bible of the field. Since then, there have been a lot of discussions on study design and statistical analysis at the Unit, especially with respect to the EPIC-Norfolk GWAS involving ~4,000 individuals in England and the InterAct project involving many institutions across Europe. Initially, GWAS was only about single nucleotide polymorphisms (SNPs) on genechips such as those from Affymetrix, but with eminent studies such as the Wellcome Trust Case-Control Consortium (WTCCC) the number of polymorphisms was substantially enlarged, first by the HapMap, then the 1000 Genomes, followed by UK10K, etc. The EPIC-Norfolk GWAS used a case-cohort design in which the cohort is a representative sample of the general population, allowing GWAS results to contribute to a variety of consortia, whereas the focus of InterAct was gene-environment interaction. As the technology developed, we were able to genotype all individuals in EPIC-Norfolk using genechips and measure methylation on a subsample. The InterAct and Fenland studies also involved genotypes from multiple platforms. Recent developments in GWAS have come in many forms, including deeper analyses of next-generation sequencing data, rare variant analysis, pathway analysis and transcriptome-wide/epigenome-wide association studies, among others. The findings from GWAS have also been used in causal inference through Mendelian randomisation and in prediction. Precision medicine, biobanks and government/corporate genomics initiatives are much covered in the media.
We are currently analysing UK Biobank, with about half a million individuals genotyped on Axiom chips and imputed to 1000 Genomes to include a large number of SNPs. Nowadays it is considerably faster to share your ideas and software via platforms such as GitHub (rather than the usual WWW). Many methods have been developed for GWAS summary statistics.

2) Statistically speaking, what are the main difficulties in detecting SNPs significantly associated with specific diseases or quantitative traits?

JHZ: There have been many reviews on this topic, so I will be brief. A practical difficulty lies in the inability to handle the amount of data, and also in interpreting results, especially false positives not seen in traditional statistical analysis, i.e., statistical significance due to artefact. Another has to do with measurement error in disease ascertainment. Eric Lander and Nicholas Schork published a review in Science in 1994 documenting these issues in the genetic dissection of complex traits.
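As an illustration of "statistical significance due to artefact", the Python sketch below simulates one commonly cited artefact, population stratification. This is our choice of example, not necessarily the one Dr. Zhao has in mind. It assumes two hypothetical subpopulations whose allele frequencies differ slightly at every SNP, no causal SNP at all, and alleles drawn independently; all names and parameter values are made up for the example. The genomic inflation factor (median chi-square statistic divided by 0.4549, the median of a 1-df chi-square distribution) is used to summarise how far the test statistics drift from their null distribution.

```python
import random
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a 1-df chi-square distribution


def allelic_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """1-df chi-square statistic for a 2x2 table of allele counts."""
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # monomorphic SNP: no evidence either way
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def draw_alleles(n_alleles, freq, rng):
    """Number of alt alleles among n_alleles independent draws."""
    return sum(rng.random() < freq for _ in range(n_alleles))


def genomic_inflation(case_share_pop_a, drift_sd=0.1, n_snps=300,
                      n_per_group=500, seed=7):
    """Inflation factor for null SNPs.

    A fraction `case_share_pop_a` of case alleles comes from population A;
    controls get the mirror-image share, so 0.5 means no stratification.
    """
    rng = random.Random(seed)
    n_alleles = 2 * n_per_group
    case_a = round(n_alleles * case_share_pop_a)  # case alleles from pop A
    ctrl_a = n_alleles - case_a                   # control alleles from pop A
    stats = []
    for _ in range(n_snps):
        freq_a = rng.uniform(0.2, 0.8)
        # Population B differs from A by random drift at every SNP.
        freq_b = min(max(freq_a + rng.gauss(0.0, drift_sd), 0.01), 0.99)
        case_alt = (draw_alleles(case_a, freq_a, rng)
                    + draw_alleles(n_alleles - case_a, freq_b, rng))
        ctrl_alt = (draw_alleles(ctrl_a, freq_a, rng)
                    + draw_alleles(n_alleles - ctrl_a, freq_b, rng))
        stats.append(allelic_chi2(case_alt, n_alleles, ctrl_alt, n_alleles))
    return statistics.median(stats) / CHI2_1DF_MEDIAN


lambda_balanced = genomic_inflation(case_share_pop_a=0.5)
lambda_stratified = genomic_inflation(case_share_pop_a=0.7)
print("inflation, balanced sampling:  ", round(lambda_balanced, 2))
print("inflation, stratified sampling:", round(lambda_stratified, 2))
```

With balanced sampling the factor should stay close to 1, whereas unequal sampling from the two subpopulations inflates it even though no SNP influences the disease, which is exactly the kind of spurious signal that has to be ruled out before interpreting a hit.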

3) Can rare alleles be detected by association methods?

JHZ: This is relatively less developed, as it is much easier to study common variants predisposing to a disease or trait. In our experience over the past ten years or so, there have been only a few calls from consortia to study rare variants. Gradually it will be picked up; for instance, at the moment GLGC/GIANT (Global Lipids Genetics Consortium/Genetic Investigation of Anthropometric Traits) are working on a range of traits and models using the software rvtests, which implements many methods for rare variants. I am sure other consortia will follow suit. A lot of statistical methods have been developed to boost power.

4) Your course is mainly based on the use of R and Linux: could you please explain to us the advantages of using them to analyse those data?

JHZ: It is hard to imagine working without the Linux system, which hosts a great deal of software for office automation, data management, Internet access and statistical analysis, including R. Most importantly, nothing compares with its efficiency and ease of use for GWAS work. I published reviews in Human Genetics and Current Bioinformatics, both of which appeared in 2006. There has been an explosion of R package development (now over 10,000 packages on CRAN, not to mention those hosted by Bioconductor) and dissemination over the past 15 years, as featured in journals such as Nature (Nature Careers) and newspapers such as the New York Times (January 6, 2009). My course will cover software that is not in R but runs under Linux; with R alone, without the Linux component, the scope is limited.

5) What are your suggestions for newcomers to this field? And how could they benefit from attending your course?

JHZ: My initial training was in medicine/public health, and then statistics. However, even in a very established environment it was difficult to slot into a position perfect for me; rather, you soon find yourself working on many statistical and computational problems, including their setup. I understand that the focus for many is interpretation, but you can only get there once you have a good sense of your own data analysis. While I work mostly under Linux at work, my home computers have VirtualBox running Linux systems such as Fedora and Ubuntu, which helps me a lot in tackling problems at work and understanding what is available on the Internet.

6) In your opinion, what are the future directions of GWAS?

JHZ: It will certainly get larger and wider. The fusion of findings from GWAS with other -omics and related public health projects will continue to be the driving force for biomedical research. Examples include genomics and biobank projects across the world. The 100,000 Genomes Project aims to bring the benefits of personalised medicine to the NHS. To make sure patients benefit from innovations in genomics, the UK Government has committed to sequencing 100,000 whole human genomes, from 70,000 patients, by the end of 2017. The Cambridge Biomedical Campus, where our Unit is located together with Addenbrooke’s Hospital, the Laboratory of Molecular Biology and the Cambridge Cancer Institute, is being actively developed and will soon include the AstraZeneca HQ and Papworth Hospital.

Many thanks, Jing Hua, and see you in Berlin!