Metabarcoding: Transforming the way in which biodiversity can be surveyed

Posted on 09th January, 2018 by Elizabeth Bourne

In the last 15 years, DNA barcoding has transformed the traditional way we can study biodiversity. Now the field is shifting from barcoding individuals to metabarcoding communities. Metabarcoding is a rapid method of biodiversity assessment that involves high-throughput DNA sequencing, bioinformatics pipelines, computational infrastructure, and experimental designs.

Here we have the opportunity to discuss this approach with Dr. Vasco Elbrecht (University of Guelph, Canada) and Dr. Owen S. Wangensteen (University of Salford, UK).

How have you applied metabarcoding to your research fields, and how has this field developed since you started?

Vasco: When I started working with metabarcoding in 2014, I wanted to use this tool to identify freshwater macroinvertebrates I had collected in a mesocosm field experiment testing the influence of multiple stressors. However, when I wanted to apply the method, I realized that scientists were starting to apply metabarcoding to answer ecological questions while methodological biases remained largely unexplored. Since then I have been exploring potential issues like primer bias, biomass influence, tagging systems and other steps in the workflow. I am very happy that method validation has moved more into the focus of the metabarcoding community, and we are now at a stage where metabarcoding can be reliably and routinely used in monitoring and research. However, there are still many aspects of metabarcoding that should be further improved and validated. Additionally, the increased usage of and demand for metabarcoding methods generates excellent career opportunities for aspiring students and researchers. It is close to my heart to keep the growing metabarcoding community open, friendly and supportive, so I share and discuss my views on the field wherever I can.

Owen: I started working with very complex communities, aiming at analysing the whole eukaryotic diversity of marine benthos. When we started, nobody wanted to use COI as a marker. They mostly worked on eukaryotic microbial samples using almost exclusively 18S, arguing that no truly universal primers for COI could be developed. This approach had the big limitation of the low taxonomic resolution that less variable markers, such as 18S, offer compared to COI. Since then, both nearly-universal primers and taxon-specific primers have been developed for COI, which are able to detect most of the eukaryotic diversity (including virtually all metazoans) with reduced primer bias. Also, we are now aware of the great practical differences that exist between performing metabarcoding of extra-organismal eDNA from low-abundance species and performing community-DNA metabarcoding of tissue-enriched bulk samples. These differences must be considered right from the beginning of the project, since sampling design, reproducibility, number of replicates needed, sample pre-treatment or probability of contamination can be very different between the two approaches.

What are the most important points to keep in mind when setting up a metabarcoding survey/experiment?

Vasco: Having a good research question, and being certain that metabarcoding is the right tool, is key before starting a project. Additionally, the project should go through several rounds of feedback from your colleagues (including some who are not familiar with the project), and it should be verified that the metabarcoding methodology established in your lab is sufficiently validated for the research question and taxonomic group you are targeting. If not, consider running a small-scale validation study first; such studies make for easy-to-write papers that can still have a major scientific impact. Once you are confident the metabarcoding methodology you are planning to use is reliable and appropriate, remember to include negative/positive controls and possibly replicates in your study. This will make your life easier in the bioinformatics step while giving the readers of your study confidence that your results are reliable. Also, make sure to sequence each sample with sufficient depth: cut down the number of samples in your study rather than sequencing too shallowly to answer your research question.

Owen: There are many important aspects to look at! For me, the most important single point to consider is: do I want to perform a survey of low-abundance species based on very low concentrations of extra-organismal environmental DNA, or do I want to perform an assessment based on well-preserved community-DNA from tissue-enriched bulk samples? This will have profound implications for the choice of marker, the sampling design (including the number of replicate samples needed), the analytical pre-treatments and the bioinformatics pipelines to use. For example, if I want to detect sharks from seawater samples, I will need a very specific primer set for sharks (otherwise, I will get mainly useless reads from non-specific amplification of bacterial or micro-eukaryotic DNA). I will not be so interested in quantification, but just in detection of presence/absence. And I will need many replicates, both ecological and technical PCR replicates, given the high levels of stochasticity expected for my data. Conversely, if I want to amplify and detect bulk DNA from a mix of blended insect specimens, I will need more universal primers, capable of amplifying most species with low primer bias. I will probably be more interested in the quantitative value of my data, and I will probably need less replication, since the amplification of a tissue-enriched mix with universal primers is a less random process than detecting a small piece of shark DNA in the vastness of the sea.
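
Owen's point about replication can be made concrete with a small calculation. The sketch below assumes that every replicate detects the target independently and with the same probability, which is a simplification of real eDNA sampling. The function names and the 20% per-replicate detection probability are hypothetical and do not come from the interview.

```python
import math


def detection_probability(p_single: float, n_replicates: int) -> float:
    """Chance of detecting a taxon in at least one of n independent replicates."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_replicates < 0:
        raise ValueError("n_replicates must not be negative")
    # A taxon is missed only if every single replicate misses it.
    return 1.0 - (1.0 - p_single) ** n_replicates


def replicates_needed(p_single: float, target: float = 0.95) -> int:
    """Smallest number of replicates whose combined detection chance reaches target."""
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    if p_single == 1.0:
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


# Hypothetical rare target: detected in 20% of individual replicates.
print(replicates_needed(0.2, 0.95))   # 14
# Hypothetical bulk-sample taxon: detected in 90% of individual replicates.
print(replicates_needed(0.9, 0.95))   # 2
```

Under these assumptions, a weak and patchy signal needs many more replicates than a tissue-enriched bulk sample to reach the same confidence of detection.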

There are many clustering methods/algorithms available for generating OTUs. Could you provide some guidance/an overview of some of these, and do you have any favoured approaches?

Vasco: That's a tough question. Personally, I prefer the UPARSE-OTU algorithm by Robert Edgar, as it includes advanced chimera removal and is quite quick. However, one has to be aware that fixed-threshold algorithms might merge several species with low genetic diversity into a single OTU, or cluster sequences from a single species into several OTUs due to high diversity or sequencing errors. Here clustering methods like Swarm, which employs a more flexible clustering threshold (but might oversplit OTUs), can help. At the end of the day, awareness of the potential biases and limitations of the chosen clustering algorithm is key. No algorithm is perfect, and results are highly dependent on the targeted organisms, laboratory protocols and bioinformatic data filtering steps. As OTU numbers are affected by many parameters other than clustering, they should be treated with caution. Even when assigning taxonomy to OTUs using reference databases, it should be kept in mind that not all sequences in them might have the correct taxonomy assigned. Finding and choosing an appropriate clustering algorithm is critical, but the metabarcoding steps leading up to the clustering step are just as important.

Owen: Fixed-threshold clustering algorithms are widespread in prokaryotic metabarcoding, and they are actually used as an operational definition for MOTU in microbiology. However, I have become convinced that no clustering procedure based on a fixed threshold can reliably represent the true biological diversity of a eukaryotic sample. Eukaryotic lineages show great variability in sequence diversity at many levels, including taxonomic groups, times of divergence of different lineages, and evolutionary mutation rates. Morphological variation is uncoupled from sequence variation, and identity thresholds between species can thus vary a lot. After working with taxonomically wide datasets, I have come to the conclusion that step-by-step aggregation algorithms, such as Swarm v2, are most useful to reflect the expected diversity of eukaryotic samples. Using Swarm v2, the resulting MOTU networks can feature low variability (e.g. 99% identity) or wide variability (e.g. 90% identity), reflecting the natural diversity of the different lineages. And the best news is that these calculations can be performed in a repeatable, deterministic way, with very short computation times! To me, Swarm v2 is currently the best clustering solution, and it can be applied to diverse markers, if you choose the right values for the distance parameter.
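
The difference between the two families of methods discussed above can be illustrated with a toy example. The sketch below is neither UPARSE nor Swarm: it contrasts a generic greedy fixed-radius clustering with a plain single-linkage aggregation that links sequences differing by at most `d` positions, and it leaves out chimera removal as well as Swarm's abundance-based refinements. The sequences, abundances and function names are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def by_abundance(abundances: dict) -> list:
    """Sequences ordered from most to least abundant (ties broken alphabetically)."""
    return sorted(abundances, key=lambda s: (-abundances[s], s))


def fixed_threshold_clusters(abundances: dict, max_diff: int) -> list:
    """Greedy clustering: join the first centroid within max_diff, else found a new one."""
    centroids, clusters = [], {}
    for seq in by_abundance(abundances):
        for centroid in centroids:
            if hamming(seq, centroid) <= max_diff:
                clusters[centroid].append(seq)
                break
        else:
            centroids.append(seq)
            clusters[seq] = [seq]
    return [clusters[c] for c in centroids]


def stepwise_clusters(abundances: dict, d: int = 1) -> list:
    """Single-linkage aggregation: chains of sequences at most d apart form one cluster."""
    seqs = by_abundance(abundances)
    parent = {s: s for s in seqs}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    for a, b in combinations(seqs, 2):
        if hamming(a, b) <= d:
            parent[find(b)] = find(a)
    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())


# Hypothetical lineage varying one step at a time, plus one unrelated sequence.
toy = {"AAAA": 100, "AAAT": 50, "AATT": 20, "ATTT": 10, "TTTT": 5, "GGGG": 8}
print(len(fixed_threshold_clusters(toy, max_diff=1)))  # 4
print(len(stepwise_clusters(toy, d=1)))                # 2
```

In this toy data the fixed radius cuts the gradually varying lineage into three pieces, while the stepwise aggregation follows the chain of single differences and keeps it together. Whether that outcome reflects the biology still depends on the marker and on the chosen distance parameter.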

Metabarcoding can introduce a number of biases into the data, at various points in the protocol. How do you manage each of these? E.g. Sample collection, replicates, primer choices, library preparation, bioinformatics and statistical analysis?

Vasco: Metabarcoding can be affected by a multitude of biases, many of which we can reduce but not completely avoid (e.g. primer bias). How to deal with these biases strongly depends on the research question as well as the financial and time resources available. For example, samples can vary strongly in specimen biomass. As small and rare specimens contribute only little DNA when extracting bulk samples, they might remain undetected with metabarcoding, especially if they are not well amplified by the primer set used. Here sorting and processing samples in several size classes can help to increase overall taxa detection rates. While we clearly demonstrated this in a study, we rarely apply specimen sorting in practice as it increases the laboratory workload substantially. It is very important to be aware of the biases affecting your metabarcoding dataset, and then decide which ones should be reduced, e.g. by using optimised primer sets, sample replication or higher sequencing depth. This, however, depends very much on the research question and the data accuracy needed.

Owen: Indeed, metabarcoding data can be biased and the resulting datasets are different from morphological results. However, we need to state that morphological methods are not exempt from their own biases. We have been considering morphological counts as the gold standard for biomonitoring and biodiversity assessment, and we have spent a lot of time trying to calibrate our metabarcoding data against this morphological standard. However, morphological counts performed by under-trained operators, using incomplete zoological keys and guides, will fail to detect some species, just as metabarcoding primer bias does. Aliquoting big morphological samples to analyse just a part of them is the equivalent of detection failures due to low sequencing depth, and the occurrence of cryptic species complexes is the morphological equivalent of primers with low taxonomic resolution. Moreover, in most cases, incomplete zoological keys are the norm, especially for understudied groups such as nematodes, meiofaunal or edaphic faunal species, or for understudied geographical regions, which is the equivalent of incomplete reference databases for metabarcoding assignment. In all of these cases, a metabarcoding approach, with all its biases, may actually outperform any morphological assessment. And results from metabarcoding will always be more repeatable, more objective and independent of the expertise of the analyst.

So, why don't we change the paradigm? For most ecological and biomonitoring applications, metabarcoding will yield more complete, objective, repeatable and useful results than morphology, even if we fail to detect some species due to a small degree of primer bias. We will get qualitative information that can be orders of magnitude more useful than morphology: even if we cannot assign a species name to a given sequence, we can still use that sequence as a bioindicator of pristine or impacted habitats. And we will have hundreds or thousands of those ecologically informative sequences! We just need to rethink what the initial aim of our ecological analyses was. If our aim was to get objective, ecologically relevant information in a fast and cost-effective way, rather than to build a long, exhaustive catalogue of all morphospecies present in the area, I would undoubtedly favour a metabarcoding approach, even with all of its potential biases.

To what extent can metabarcoding data generate quantitative estimates?

Vasco: It depends! In most cases, absolute abundance of taxa can't be estimated, as specimens vary in biomass. Additionally, not all specimens are amplified with the same efficiency due to primer bias, substantially skewing sequence abundance between taxa. This effect can be reduced by using optimised ecosystem- and group-specific primers or by applying correction factors to the metabarcoding data. However, even with reduced biases, this still can't solve the problem of taxa remaining completely undetected or of deviations from the actual specimen biomass. Thus, in my opinion, estimating biomass from metabarcoding data is also tricky. However, biases between samples should be fairly reproducible, enabling the comparison of relative sequence abundance between samples that were processed with the same laboratory and bioinformatic methods. This enables (in my opinion) semi-quantitative estimates. To some degree, metabarcoding data can tell you that a specific taxon is more abundant in sample A than in sample B, but not exactly how this taxon compares to the rest of the community abundance-wise. However, how and what kind of abundance data can or should be derived from metabarcoding data is still a hot topic and actively debated, with no clear consensus having emerged in the community.

Owen: This depends a lot on the type of marker that you want to use, and on whether you are assessing extra-organismal environmental DNA or bulk community-DNA. If your aim is detecting extra-organismal environmental DNA (e.g. fish DNA in the sea), you had better forget about quantitative value altogether. Conversely, if you're using universal primers, with low primer bias, on bulk community samples, you will get some results that have, at least, some semi-quantitative value. Of course, the number of reads will never be proportional to the number of individuals. But it could be proportional to the biomass of mitochondrial DNA from each species (if you're using an unbiased mitochondrial marker), or to the copy number of chloroplast genomes per cell (if you're using a chloroplast marker). In some cases, correction coefficients can be calculated for every species, and then the metabarcoding results can be fairly quantitative. For example, this is currently being studied in the case of river diatom analyses using chloroplast markers and the results seem to be very promising.
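
The two ideas mentioned in these answers, relative sequence abundance and per-taxon correction coefficients, can be sketched in a few lines. This is a minimal illustration, not a published correction method: the taxon labels, read counts and the correction factor of 3.0 are hypothetical, and in practice such coefficients would have to be calibrated for the primers and taxa in question.

```python
def relative_abundance(read_counts: dict) -> dict:
    """Convert raw read counts per taxon into proportions of the sample total."""
    total = sum(read_counts.values())
    if total <= 0:
        raise ValueError("sample contains no reads")
    return {taxon: count / total for taxon, count in read_counts.items()}


def corrected_abundance(read_counts: dict, correction_factors: dict) -> dict:
    """Scale reads by a per-taxon factor (default 1.0), then renormalise."""
    scaled = {
        taxon: count * correction_factors.get(taxon, 1.0)
        for taxon, count in read_counts.items()
    }
    return relative_abundance(scaled)


# Hypothetical read counts for two samples processed with the same protocol.
sample_a = {"taxon_A": 600, "taxon_B": 300, "taxon_C": 100}
sample_b = {"taxon_A": 200, "taxon_B": 300, "taxon_C": 500}

# Hypothetical assumption: taxon_C amplifies poorly, so its reads are weighted up.
factors = {"taxon_C": 3.0}

rel_a = relative_abundance(sample_a)
rel_b = relative_abundance(sample_b)
# Semi-quantitative comparison of the same taxon between samples.
print(rel_a["taxon_A"] > rel_b["taxon_A"])  # True

print(corrected_abundance(sample_a, factors))
```

Comparing one taxon across samples relies on the bias being the same in both, whereas the correction step tries to remove the bias between taxa within a sample. Neither step can recover taxa that produced no reads at all.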

In your opinion, what are the current strengths and limitations of metabarcoding as a tool in both research and biodiversity monitoring?

Vasco: Metabarcoding is a fantastic tool to investigate the composition of environmental samples rapidly and with relatively little effort, often at species level. While metabarcoding also has clear limitations in comparison to morphological identification methods, e.g. not all taxa in the sample will be detected, this should not stop us from utilizing this tool to answer pressing ecological questions and investigate environmental issues.

Owen: The current strengths of metabarcoding, compared to morphological methods, are objectivity (independence from the degree of expertise of the analyst), repeatability, the shorter time required for the analysis and the ability to detect rare species in the samples. For biomonitoring purposes, the number of potential ecologically informative sequences (from a wide taxonomic range) is usually some orders of magnitude higher than in any approach based on morphological data from just one taxon. Another important point is that some population genetics information can be retrieved from metabarcoding data (if you use hypervariable markers, such as COI). The limitations: most metabarcoding approaches cannot provide some important ecological and biological information about the different species present in the samples that could be retrieved from classical morphological analyses, such as maturation state, physiological condition, reproductive status, individual sizes or an accurate estimate of the total biomass of each species.

Thanks guys for your time. See you in Berlin!!