Population genomic inference from low-coverage whole-genome sequencing data

Dates

21-24 October 2024

To foster international participation, this course will be held online.

Overview

Low-coverage sequencing provides a cost-effective means to survey variation across the entire genome at a population scale, with broad applications in population, evolutionary, and medical genetics. However, accurate inference from such data requires a probabilistic framework, because high genotyping uncertainty prevents the use of standard analysis programs. In this course, we will explore workflows built around genotype likelihoods, applicable to both whole-genome and reduced-representation studies, and examine the rationale behind producing, processing, and analyzing low-coverage sequencing data for population genomic inference. We will primarily cover methods and algorithms implemented in the ANGSD software package and associated programs, providing best-practice guidelines and discussing how participants can make maximal use of low-coverage genomic re-sequencing data in their studies.

Target audience

The course is aimed at researchers who have some previous experience with next-generation sequencing (NGS) data (e.g. exome/RAD/pooled sequencing) and wish to explore the potential of low-coverage whole-genome sequencing for their studies. Researchers who want an introduction to the ANGSD software package and related software based on genotype likelihoods, and an understanding of their inherent probabilistic framework, will also benefit from this course.

Prerequisites

We will assume that participants have a basic background in population genomics and basic familiarity with NGS data. Previous experience with the Unix command line and R is also an advantage. We will not have time to comprehensively introduce these computing environments during the course, so we ask participants without previous experience in Unix and R to go through suggested tutorials on their own prior to the course. All hands-on exercises will be run in a Linux environment on remote servers. Statistical analyses and data visualization will be run in R.

Outcomes

After attending the course, the participants will appreciate the use of whole-genome sequencing for population genomics. The participants will be able to demonstrate the challenges associated with low-coverage sequencing data and will have an intuition about the statistical framework implemented in ANGSD/ngsTools/Atlas. They will be familiar with building a bioinformatic pipeline to process low-coverage sequencing data to perform different types of population genomic analyses, such as inference of demographic histories and detection of signatures of natural selection.

Teaching format

The course will comprise a mix of interactive lectures with small exercises followed by a longer independent practical each day. Data will be provided for exercises.

Program

Monday. Classes from 2 to 8 pm Berlin time

We will discuss the rationale behind the use of low-coverage whole-genome sequencing for population genomic inference. We will then walk through typical workflows for going from sample to raw sequencing data, and from raw sequencing data to processed alignment files ready for downstream analysis.

Tuesday. Classes from 2 to 8 pm Berlin time

We will introduce the concept of sequence data uncertainty and genotype likelihoods. We will then discuss the currently available software for population genetic analysis from low-coverage data. Finally, we will explore how to estimate allele frequencies and perform SNP calling with ANGSD.
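To give a flavour of what a genotype likelihood is, the sketch below computes normalised likelihoods for the three genotypes at a diploid, biallelic site from a handful of reads. It is a minimal illustration under stated assumptions: a simple per-read error model in which a sequencing error is equally likely to produce any of the three other bases, independent reads, and Phred-scaled base qualities. The function name and arguments are hypothetical and are not part of ANGSD or any other package covered in the course.

```python
import math

def genotype_likelihoods(bases, quals, ref, alt):
    """Normalised likelihoods for genotypes carrying 0, 1 or 2 copies of `alt`.

    Illustrative model only (an assumption, not any specific tool's code):
    each read samples one of the two chromosomes at random, and a base with
    Phred quality q is wrong with probability 10^(-q/10), with the error
    spread evenly over the three other bases.
    """
    log_likes = []
    for n_alt in (0, 1, 2):
        frac_alt = n_alt / 2  # chance a read comes from an alt chromosome
        log_like = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)
            p_if_ref = 1 - err if base == ref else err / 3
            p_if_alt = 1 - err if base == alt else err / 3
            log_like += math.log(frac_alt * p_if_alt + (1 - frac_alt) * p_if_ref)
        log_likes.append(log_like)
    # Rescale so the three values sum to one (done in log space for stability).
    top = max(log_likes)
    rel = [math.exp(x - top) for x in log_likes]
    total = sum(rel)
    return [r / total for r in rel]

# One read showing the reference base: the heterozygote remains plausible.
print(genotype_likelihoods("A", [30], ref="A", alt="G"))
# Two reads, one of each allele: the heterozygote is strongly favoured.
print(genotype_likelihoods("AG", [30, 30], ref="A", alt="G"))
```

With a single read, the heterozygous genotype still carries about a third of the normalised likelihood, which is why calling a hard genotype at low coverage discards information that a probabilistic analysis can retain.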

Wednesday. Classes from 2 to 8 pm Berlin time

We will explore how to infer population structure and demographic parameters from low-coverage sequencing data using ANGSD. We will focus on performing principal component analysis and admixture inference. We will also discuss how to estimate the site frequency spectrum and inbreeding coefficients.

Thursday. Classes from 2 to 8 pm Berlin time

We will explore how to estimate summary statistics to detect signals of natural selection. Specifically, we will explain the protocol for estimating indices of genetic variation genome-wide and in sliding windows with ANGSD. We will focus on both single-population metrics, such as Tajima’s D and linkage disequilibrium, and multi-population metrics, such as FST and PBS.

Instructors

Dr. Nina Overgaard Therkildsen (Cornell University, US)

Dr. Tyler Linderoth (University of Cambridge, UK)

Dr. Arne Jacobs (University of Glasgow, UK)

Dr. Nicolas Lou (University of California, Berkeley, US)

Cost overview

Package 1

530 €

Cancellation Policy:

More than 30 days before the start date: 30% cancellation fee.

Less than 30 days before the start date: no refund.

Physalia-courses cannot be held responsible for any travel fees, accommodation, or other expenses you incur as a result of the cancellation.