Tuesday 17 January 2017

A Model of Sampling from a Clone Library

Clone libraries are usually used to work out which bacteria are found in different places, but it can be tempting to infer abundances from the data as well. To illustrate why this is a potential pitfall in the interpretation of clone libraries, I wrote a Python program which creates a model bacterial community and then samples it randomly, mimicking the process of sampling from a clone library. All figures can be found in their full size in this imgur album.

I created 5 different model communities:

Full sized images can be found in this album
The bar graphs above show the percentage abundance of each OTU in the models. Two variables were changed to produce the different model communities: number of OTUs and Growth Factor (GF). GF determined the evenness of the model, with a lower GF producing a more even population.

Model 1: 20 OTUs, GF = 1
Model 2: 35 OTUs, GF = 1
Model 3: 50 OTUs, GF = 0.5
Model 4: 50 OTUs, GF = 1
Model 5: 50 OTUs, GF = 2
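The program itself isn't shown in this post, but here's a minimal sketch of how such a community might be generated. The function name and the exact role of GF (raising uniform random weights to the power of GF) are my assumptions, not the post's actual code; the real program may weight OTUs differently:

```python
import random

def make_community(n_otus, gf, seed=0):
    """Return relative abundances for n_otus OTUs.

    gf = 0 gives a perfectly even community; larger gf values
    concentrate abundance in fewer OTUs (less even).
    """
    rng = random.Random(seed)
    # Raising uniform weights to the power gf skews the distribution:
    # higher gf pushes most weights towards zero, so a few OTUs dominate.
    weights = [rng.random() ** gf for _ in range(n_otus)]
    total = sum(weights)
    return [w / total for w in weights]

model_4 = make_community(50, 1.0)   # e.g. Model 4: 50 OTUs, GF = 1
```

With gf = 0 every OTU gets an identical share, which matches the idea that a lower GF produces a more even population.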

I then created a virtual clone library from each modeled community by randomly selecting OTUs. Each clone library was formed of 1,000 clones. From these 1,000 clones, I "sequenced" 100 clones to produce a Sampled Abundance to compare to the modeled abundance, producing the following bar graphs.
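Sampling from the virtual library boils down to weighted random selection. A sketch of the idea (using `random.choices`; the counting and the dict layout are my own choices, not necessarily the original program's):

```python
import random
from collections import Counter

def sample_clones(abundances, n_clones, seed=1):
    """Pick n_clones clones, each OTU chosen with probability
    proportional to its abundance in the model community."""
    rng = random.Random(seed)
    otus = list(range(len(abundances)))
    picks = rng.choices(otus, weights=abundances, k=n_clones)
    return Counter(picks)

# "Sequence" 100 clones and convert counts to a sampled abundance
counts = sample_clones([1 / 20] * 20, 100)
sampled_abundance = {otu: c / 100 for otu, c in counts.items()}
```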

Full sized images can be found in this album
It's worth looking at these bar graphs more closely. They're a nice demonstration of how clone libraries are not an especially good way of estimating the abundance of different OTUs. If we sequenced 10,000 clones per library, we would get a good idea of the actual abundance of each OTU in the community, but few labs have the resources for that. Just from looking at the graphs it's hard to draw any meaningful conclusions, but we can tell how many OTUs weren't sampled:

Model Number    % OTUs Unsampled
1               5%
2               8.6%
3               12%
4               32%
5               38%

It appears that both having more OTUs and a more uneven community will leave a greater percentage of OTUs unsampled. Our sample size isn't brilliant, but we'll see if we can do something about that in a minute.

But first, this is a good opportunity to see the analysis discussed in the previous blog post in action. We can draw accumulation curves of the sampling process:

Full sized images can be found in this album

What do these accumulation curves tell us? If we look at Models 1 and 2, the curves are beginning to plateau after sampling 100 clones from the libraries. Although we've no doubt missed some species, the plateau signals that we've got a pretty good idea of what OTUs are in the community. Models 3 and 4 haven't yet plateaued, although 3 is getting there, and this is reflected in the % of unsampled OTUs in the table above. However, Model 5 reveals a trap of accumulation curves. This curve looks like it's plateauing, but we know that 38% of the OTUs were actually never sampled. The difference between Model 5 and Models 3 and 4 is the evenness. Model 5 is supremely uneven, with one OTU hogging most of the abundance and lots of other OTUs with only a 1% share. While we've sampled all of the major species, we've left out a lot of the minor species in the community. So accumulation curves are more an indication of whether the major species of a sample have been sampled, and they will tend to ignore minor species.
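An accumulation curve is easy to compute from the sampled clones in the order they were drawn; each point is simply the number of distinct OTUs seen so far:

```python
def accumulation_curve(clone_otus):
    """Number of distinct OTUs observed after each successive clone."""
    seen = set()
    curve = []
    for otu in clone_otus:
        seen.add(otu)
        curve.append(len(seen))
    return curve

accumulation_curve([3, 1, 3, 2, 1, 4])  # -> [1, 2, 2, 3, 3, 4]
```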

We can overcome this by looking at the coverage.

Model Number    Coverage

As you would expect, the coverages for the models with 50 OTUs aren't great, and they more accurately reflect the percentage of OTUs which have been sampled. The moral of the story is: don't trust an accumulation curve without a coverage value when looking at clone libraries.
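The post doesn't give its coverage formula, but assuming it's Good's coverage (the usual choice for clone libraries), it's just one minus the fraction of clones whose OTU was seen only once:

```python
from collections import Counter

def goods_coverage(clone_otus):
    """Good's coverage estimator: C = 1 - F1/N, where F1 is the number
    of OTUs sampled exactly once and N is the total number of clones."""
    counts = Counter(clone_otus)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1 - singletons / len(clone_otus)

goods_coverage([1, 1, 2, 3, 3, 3, 4, 5])  # -> 0.625 (3 singletons / 8 clones)
```

Intuitively, lots of singletons means lots of rare OTUs still lurking unseen, so coverage is low.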
Coverage is also important when considering whether or not we can trust any other indices generated from the analysis of the clone library. Take the Chao1 Estimator, a method for estimating the total number of species in the community based on what was sampled from the clone library.

Model Number    Chao1 Estimator
1               20.5
2               34.5
3               51.5
4               76.7
5               39.0

The estimator is pretty good at guessing how many species there were in the original model, except in the two models with the lowest coverage, where we get a terrible overestimation followed by a horrible underestimation. Unsightly.
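For reference, the classic Chao1 formula is S_obs + F1²/(2·F2), where F1 and F2 are the numbers of OTUs seen exactly once and twice. A sketch (the fallback when F2 = 0 uses the bias-corrected form; the post doesn't say which variant it used):

```python
from collections import Counter

def chao1(clone_otus):
    """Chao1 richness estimate from a list of sampled clone OTUs."""
    counts = Counter(clone_otus)
    s_obs = len(counts)                             # observed OTUs
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons
    if f2 == 0:
        # Bias-corrected form avoids division by zero
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 ** 2 / (2 * f2)
```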

The same is true for the Shannon and Simpson indices:

Model Number    Simpson's Index of Diversity    Difference    Shannon Index    Difference

The models with the lowest coverage have the highest differences between the Simpson's Index of Diversity and the Shannon Index calculated for the model and the sample.
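Both indices can be computed straight from the clone counts. A sketch, using the natural-log Shannon index and the 1 - Σp² form of Simpson's Index of Diversity (the post doesn't say which log base or Simpson variant it used, so treat those as assumptions):

```python
import math
from collections import Counter

def shannon_simpson(clone_otus):
    """Return (Shannon index, Simpson's Index of Diversity)."""
    counts = Counter(clone_otus)
    n = len(clone_otus)
    props = [c / n for c in counts.values()]        # proportional abundances
    shannon = -sum(p * math.log(p) for p in props)
    simpson = 1 - sum(p * p for p in props)
    return shannon, simpson
```

Running the same functions on both the model abundances and the sampled abundances gives the "Difference" columns above.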

As I said earlier, our sample is very small: we've only got one replication of sampling from each model... but because I'm using a computer program to do it, I could replicate it 10,000 times. That's a bit excessive, so let's just sample each model 100 times and see what we get. However, we'll have to leave that for the next post, as I should probably get on with doing some actual work!
