Monday 23 January 2017

A Model of Sampling Clone Libraries 2

This post carries on from the previous one, examining a model of clone libraries to work out which analyses we can rely on to give us an accurate idea of the microbial community that was sampled.

To produce these graphs, I ran the same simulation as before (create a virtual microbial community with 2 variables, number of OTUs and Growth Factor; create a virtual clone library; randomly sample the clone library and finally 'sequence' the samples), but this time each model was sampled 100 times. The results for various parameters and analyses are presented below in box and whisker plots. As a reminder, here are the abundance charts for models 1-5.

Model 1: 20 OTUs, GF = 1
Model 2: 35 OTUs, GF = 1
Model 3: 50 OTUs, GF = 0.5
Model 4: 50 OTUs, GF = 1
Model 5: 50 OTUs, GF = 2

First of all, let's have a look at what percentage of the OTUs in the model community appear in the sampled community.

Figure 1 - Percentage of Unsampled OTUs

As you would expect, as the number of OTUs in the modeled community increases, so does the percentage left unsampled. At 20 OTUs a median of 10% of the OTUs went unnoticed, but at 50 OTUs that rises to between 17% and 25%. Likewise, in more uneven communities the percentage of unsampled OTUs was larger. Indeed, in one instance nearly 45% of the OTUs in Model 5 went unsampled. So imagine what it's going to be like in a complex community that may harbor hundreds of OTUs.
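For anyone who wants to play along, here is a rough Python sketch of the sample-and-count step. It is not the code behind these figures: the abundance rule in `make_community` (weight 1 / rank^GF), the library size of 100 clones and all the function names are my own assumptions for illustration.

```python
import random
from collections import Counter

def make_community(n_otus, growth_factor):
    """Relative abundances for a hypothetical model community.

    Assumption: the OTU at rank i (1-based) gets weight 1 / i**growth_factor,
    so a larger growth factor gives a more uneven community. The real
    model's abundance rule may well differ.
    """
    weights = [1.0 / (rank ** growth_factor) for rank in range(1, n_otus + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_clones(abundances, n_clones, rng):
    """Randomly pick n_clones clones; return clone counts keyed by OTU index."""
    picks = rng.choices(range(len(abundances)), weights=abundances, k=n_clones)
    return Counter(picks)

def percent_unsampled(n_otus, counts):
    """Percentage of the community's OTUs that never turn up in the sample."""
    return 100.0 * (n_otus - len(counts)) / n_otus

rng = random.Random(2017)  # fixed seed so the run is repeatable
abundances = make_community(n_otus=50, growth_factor=2)
# Sample the same model 100 times (100 clones per sample is an assumption).
results = [
    percent_unsampled(50, sample_clones(abundances, 100, rng))
    for _ in range(100)
]
print(min(results), max(results))
```

The spread between the minimum and maximum is the thing to watch: the same community can look quite different from one random sample to the next.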

Figure 2 - Chao1 Richness Estimator

The Chao1 Richness Estimator tries to give us an idea of how many OTUs were present in our original community based on the sample (there's a full explanation in a previous post).

In general, the Chao1 Estimator gives a more or less accurate estimate of how many OTUs were present in the model community. For Model 1, the interquartile range is only 19-21, with the median sitting bang on 20. However, there are some examples where it was very wrong. Once it estimated the number of OTUs in the model community as 41, roughly double the correct number. This is mirrored in the results of the other Models. Generally the interquartile range is relatively small with the median sitting on the correct number of OTUs, but there are a few outlying points where the Chao1 was very, very wrong.
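A quick sketch of why those outliers happen. This uses the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and exactly twice. I'm assuming this variant for illustration (it may not be the one used for the figure), and the clone counts are made up.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU clone counts."""
    observed = sum(1 for c in otu_counts if c > 0)
    singletons = sum(1 for c in otu_counts if c == 1)  # F1
    doubletons = sum(1 for c in otu_counts if c == 2)  # F2
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Made-up sample: 10 observed OTUs, 4 singletons, 2 doubletons.
typical = [12, 7, 5, 3, 2, 2, 1, 1, 1, 1]
# Made-up unlucky sample: 7 observed OTUs, 6 singletons, no doubletons.
unlucky = [10, 1, 1, 1, 1, 1, 1]

print(chao1(typical))  # 12.0
print(chao1(unlucky))  # 22.0
```

A sample that happens to contain lots of singletons and hardly any doubletons pushes the estimate well above the observed richness, which is one way a single unlucky draw can end up far from the true number.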

Figure 3 - Coverage

Coverage tries to give us an idea of how well the community has been sampled.
And as you would expect, it goes down as the number of OTUs and the evenness in the model increase. We'll come back to the coverage later.
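For reference, a minimal sketch of one common coverage measure, Good's coverage: C = 1 - F1 / N, where F1 is the number of OTUs seen exactly once and N is the total number of clones sequenced. I'm assuming that is the measure meant here, and the counts are made up.

```python
def goods_coverage(otu_counts):
    """Good's coverage: 1 - (number of singleton OTUs / total clones)."""
    total = sum(otu_counts)
    if total == 0:
        raise ValueError("cannot compute coverage for an empty sample")
    singletons = sum(1 for c in otu_counts if c == 1)
    return 1.0 - singletons / total

# Made-up sample: 35 clones in total, 4 OTUs seen only once.
sample = [12, 7, 5, 3, 2, 2, 1, 1, 1, 1]
print(round(goods_coverage(sample), 3))
```

The more OTUs that show up only once, the more likely it is that others were missed entirely, so the coverage drops.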

Figures 4 & 5 - Shannon Index and Simpson Index of Diversity

These two plots give quite a nice illustration of how the Shannon Index and Simpson Index of Diversity (SIoD) take into account both the number of species and their evenness. As you can see, both indices increase as the number of OTUs increases, but then begin to decrease in less even communities.
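To see the evenness effect in isolation, here is a small sketch using the standard formulas: Shannon H = -Σ p ln p, and SIoD = 1 - Σ n(n - 1) / (N(N - 1)). Both made-up samples have 4 OTUs and 20 clones, so only the evenness differs. I'm assuming the natural log and the finite-sample form of Simpson's index.

```python
import math

def shannon_index(otu_counts):
    """Shannon index H = -sum(p * ln p) over the observed OTUs."""
    total = sum(otu_counts)
    proportions = [c / total for c in otu_counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

def simpson_index_of_diversity(otu_counts):
    """SIoD = 1 - sum(n * (n - 1)) / (N * (N - 1))."""
    total = sum(otu_counts)
    return 1.0 - sum(c * (c - 1) for c in otu_counts) / (total * (total - 1))

even = [5, 5, 5, 5]      # 4 OTUs, perfectly even
uneven = [17, 1, 1, 1]   # 4 OTUs, one dominant

for name, counts in (("even", even), ("uneven", uneven)):
    print(name, shannon_index(counts), simpson_index_of_diversity(counts))
```

Same richness, same number of clones, but both indices come out lower for the uneven sample.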

Can We Use Coverage To Judge Accuracy?

One of the questions that the previous post threw up was whether or not we can use Coverage to judge if other indices (Chao1, Shannon and SIoD) calculated for the sampled community reflect those of the original community. Below are 3 scatter plots showing the difference between the index of the model and sampled communities plotted against the coverage of the sample. Our initial results suggested that we might be able to, but that doesn't seem to hold up very well.

There is a moderate negative correlation (Spearman's Rank = -0.479) between the difference in Shannon Index and the coverage. So a low coverage is moderately correlated with a bigger difference between the Shannon Index of the model and the sample.

There is a weak negative correlation (Spearman's Rank = -0.389) between coverage and the difference in the Chao1 Richness Estimator. So a low coverage would lead us to believe that the Chao1 is inaccurate, but that's not always the case.

Finally, there is nearly no correlation (Spearman's Rank = -0.225) between the difference in SIoD and coverage, so a high coverage doesn't necessarily mean an accurate SIoD.
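For anyone unfamiliar with Spearman's Rank, it is simply the Pearson correlation computed on the ranks of the two variables. The sketch below is a from-scratch version (tied values get their average rank), and the five coverage/difference pairs are made up purely to show the mechanics. They are not data from the model.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Made-up example: coverage of five samples and the matching absolute
# difference between the model's index and the sample's index.
coverage = [0.95, 0.90, 0.85, 0.80, 0.70]
index_difference = [0.05, 0.12, 0.10, 0.20, 0.35]
print(spearman(coverage, index_difference))
```

A value near -1 would mean low coverage reliably goes with a big difference; values like the ones reported above mean the relationship is there but far too loose to lean on for any single sample.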

Coverage Is Best At Indicating % of Unsampled OTUs

As you can see from the scatter plot below, coverage is most strongly correlated (Spearman's Rank = -0.497) with the % of unsampled OTUs. So a low coverage generally means that the sample is less representative of the original community.

I think that's enough about Clone Libraries... next stop Denaturing Gradient Gel Electrophoresis... the fun never stops.
