Monday 23 January 2017

A Model of Sampling Clone Libraries 2


This post is carrying on from the previous one, examining a model of clone libraries to work out which analysis we can rely on to give us an accurate idea of the microbial community that was sampled.

To produce these graphs, I ran the same simulation as before (create a virtual microbial community with 2 variables, number of OTUs and Growth Factor; create a virtual clone library; randomly sample the clone library and finally 'sequence' the samples), but this time each model was sampled 100 times. The results for various parameters and analyses are presented below in box and whisker plots. As a reminder, here are the abundance charts for models 1-5.

Model 1: 20 OTUs, GF = 1
Model 2: 35 OTUs, GF = 1
Model 3: 50 OTUs, GF = 0.5
Model 4: 50 OTUs, GF = 1
Model 5: 50 OTUs, GF = 2

First of all, let's have a look at what percentage of the OTUs in the model community appear in the sampled community.

Figure 1 - Percentage of Unsampled OTUs

As you would expect, as the number of OTUs in the modeled community increases, so does the percentage left unsampled. At 20 OTUs a median of 10% of the OTUs went unnoticed, but for 50 OTUs that goes up to between 17% and 25%. Equally in more uneven communities, the unsampled OTUs was also larger. Indeed, there was one instance when nearly 45% of the OTUs in Model 5 were unsampled. So imagine what it's going to be like in a complex community which may harbor hundreds of OTUs.

Figure 2 - Chao1 Richness Estimator

The Chao1 Richness Estimator tries to give us an idea of how many OTUs were present in our original community based on the sample (there's a full explanation in a previous post)


In general, the Chao1 Estimator is giving a more or less accurate estimate of how many OTUs were present in the model community. For Model 1, the interquartile range is only 19-21, with the median sitting bang on 20. However, there are some examples there where it was very wrong. Once it estimated the number of OTUs in the original sample as 41, double the correct number. And this is mirrored in the results of the other Models. Generally the interquartile range is relatively small with the median sitting on the correct number of OTUs, but there are a few outlying points where the Chao1 was very, very wrong.

Figure 3 - Coverage

Coverage tries to give us an idea of how well the community has been sampled.
And as you would expect, it goes down as the number of OTUs and the evenness in the model increases. We'll come back to the coverage later.

Figures 4 & 5 - Shannon Index and Simpson Index of Diversity

These two plots give quite a nice illustration of how the Shannon Index and Simpson Index of Diversity (SIoD) take into account both the species diversity, but also the evenness. As you can see, both indices increase as the number of OTUs increases, but then begin to decrease in less even communities.

Can We Use Coverage To Judge Accuracy?

One of the questions that the previous post threw up was whether or not we can use Coverage to judge if other indices (Chao1, Shannon and SIoD) calculated for the sampled community reflect that of the original community. Below are 3 scatter plots showing the difference between the index of the model and sampled communities plotted against the coverage of the sample. Our initial results suggested that we might be able to, however, that doesn't seem to hold up very well.


There is a moderate negative correlation (Spearman's Rank = -0.479) between the Shannon Index and the Coverage. So a low coverage is moderately correlated to a bigger difference between the Shannon Index of the model and the sample. 

There is a weak negative correlation (Spearman's Rank = -0.389) between coverage and Chao1 Richness Estimator. So a low coverage would lead us to believe that the Chao1 is inaccurate, but that's not always the case.



Finally, there is nearly no correlation (Spearman's Rank = -0.225) between the difference in SIoD and coverage, so a high coverage doesn't necessarily mean an accurate SIoD.

Coverage Is Best At Indicating % of Unsampled OTUs

As you can see from the scatter plot below, coverage is best correlated (Spearman's Rank = -0.497) to the % of unsampled OTUs. So a low coverage generally means that the sample is less representative of the original community.


I think that's enough about Clone Libraries... next stop Denaturing Gradient Gel Electrophoresis... the fun never stops.

Tuesday 17 January 2017

A Model of Sampling from a Clone Library


Clone libraries are usually used to try and define which bacteria are found in different places. However, it might be tempting to infer abundances from this data. To illustrate how this is a potential pitfall in the interpretation of clone libraries, I wrote a Python program which creates a model bacterial community and then samples it randomly, mimicking the process of sampling from a clone library. All figures can be found in their full size in this imgur album.

I created 5 different model communities:

Full sized images can be found in this album
The bar graphs above show the percentage abundance of each OTU in the models. Two variables were changed to produce the different model communities: number of OTUs and Growth Factor (GF). GF determined the evenness of the model, with a lower GF producing a more even population.

Model 1: 20 OTUs, GF = 1
Model 2: 35 OTUs, GF = 1
Model 3: 50 OTUs, GF = 0.5
Model 4: 50 OTUs, GF = 1
Model 5: 50 OTUs, GF = 2

I then created a virtual clone library from each modeled community by randomly selecting OTUs. Each clone library was formed of 1,000 clones. From these 1,000 clones, I "sequenced" 100 clones to produce a Sampled Abundance to compare to the modeled abundance, producing the following bar graphs.

Full sized images can be found in this album
It's worth having a look at these bar graphs more closely. They're a nice demonstration of how clone libraries are not an especially good way of estimating the abundance of different OTUs. If we sampled 10,000 clones per library, we would get a good idea of the actual abundance of each OTU in the community, but no one has the resources. Just from looking at the graphs it's hard to draw any meaningful conclusions. We can tell how many OTUs weren't sampled:

Model Number % OTUs Unsampled
1 5%
2 8.60%
3 12%
4 32%
5 38%

It appears that both having more OTUs and a more uneven community will leave a greater percentage of OTUs unsampled. Our sample size isn't brilliant, but we'll see if we can do something about that in a minute.

But first, this is a good opportunity to see the analysis discussed in the previous blog post in action. We can draw accumulation curves of the the sampling process:

Full sized images can be found in this album

What do these accumulation curves tell us? If we look at Model 1 and Model 2, the curves are beginning to plateau after sampling 100 clones from the libraries. Although we've no doubt missed some species, the plateau signals that we've got a pretty good idea of what OTUs are in the community. Models 3 and 4 haven't yet plateaued, although 3 is getting there, and this is reflected in the % of unsampled OTUs in the table above. However, Model 5 reveals a trap of accumulation curves. This curve looks like it's plateauing, but we know that actually 38% of the OTUs. The difference between Model 5 and Models 3 and 4 is the evenness. Model 5 is supremely uneven, with 1 OTU hogging most of the abundance and lots of other OTUs with only a 1% share. While we've sampled all of the main species, actually, we've left out a lot of the minor species in the community. So accumulation curves are more of an indication of whether the major species of a sample have been sampled and will tend to ignore minor species.

We can overcome this by looking at the coverage.

Model Number
Coverage
1
0.842
2
0.780
3
0.680
4
0.530
5
0.613

As you would expect, the coverages for models with 50 OTUs isn't great and more accurately reflects the percentage of OTUs which have been sampled. The moral of the story is: Don't trust an accumulation curve without a coverage when looking at clone libraries
Coverage is also important when considering whether or not we can trust any other indices generated from the analysis of the clone library. Take the Chao1 Estimator, the method for estimating the total number of species in the community based on what was sampled from the clone library.

Model Number Chao1 Estimator
1 20.5
2 34.5
3 51.5
4 76.7
5 39.0

The estimator is pretty good at guessing how many species there were in the original model, except in the two models where coverage was lowest where we have a terrible overestimation followed by a horrible underestimation. Unsightly.

The same is true for the Shannon and Simpson indices:

Model NumberSimpson's Index of DiversityDifferenceShannon IndexDifference
1Model0.9210.0012.670.03
Sample0.9202.64
2Model0.9670.0033.350.09
Sample0.9643.26
3Model0.9840.0033.770.12
Sample0.9813.65
4Model0.9740.0363.670.57
Sample0.9383.10
5Model0.9250.0233.320.46
Sample0.9022.86





The models with the lowest coverage have the highest differences between the Simpson's Index of Diversity and the Shannon Index calculated for the model and the sample.










As I said earlier we, our sample is very small. We've only got 1 replication of sampling from each model... but because I'm using a computer program to do it, I could replicate it 10,000 times. That's a bit excessive, so let's just sample each model 100 times and see what we get. However, we'll have to leave that for the next post, as I should probably get on with doing some actual work!







Wednesday 11 January 2017

Phylogenetic Trees, Coverage and Diversity

To summarise the process so far, we've:

1. Taken a sample and either stored it or immediately extracted the DNA from the bacteria present.
2. Amplified the 16S rRNA genes in the DNA sample using PCR.
3. Built a clone library using the amplified 16S rRNA genes.
4. Taken random samples from the clone library and sequenced the 16S rRNA genes using chain termination sequencing.

Now we have hundreds of letters strung together and have to produce some publishable results. Let's have a look at a paper to help us decide what we should do now. 

"16S rRNA gene-based analysis of mucosa-associated bacterial community and phylogeny in the chicken gastrointestinal tracts: from grops to ceca" - Gong et al. (2007) is a nice one to have a look at first, and the PDF is here.

Figure 3 is the simplest to look at and decipher:


The caption says the following:

"Fig. 3. Unrooted phylogenetic tree of bacteria on the wall of crops and gizzards constructed by a neighbor-joining method. The scale bar represents a sequence divergence of 10%. Clones represented by CpA and GzA were generated from this study. CpA36 had <97% similarity to the existing database sequences of 16S rRNA genes. This clone has GenBank accession number AY654970 with the name CCRP36."

I've highlighted the important parts that we're going to go into more detail with.

Unrooted phylogenetic tree - A phylogenetic tree is designed to demonstrate how similar genetic sequences are between species. Unrooted simply refers to the fact that we can't trace the species back to a common ancestor. 

The line at the bottom with "0.1" on it is the scale bar. We can use this to gauge the sequence divergence between different 16S rRNA genes from the clone library. Sequences found in the study are the ones that appear as CpA or GzA, the other branches are bacterial species for which we know the 16S rRNA sequence. 

All of the sequences recovered from the study will have been compared to databases of known 16S rRNA gene sequences. Some of the sequences in the clone library will correspond to known species and some won't. We can use the known species to infer what genus or family the unknown species are likely to be. In this particular tree, there are no completely unknown sequences that were recovered.

For example, we can see that GzA1, which was found in 22 clones, is identical to the sequence for Lactobacillus aviarius. From this, we can say that GzA1  is L. aviarius. Look at little further up and there are some clustered sequences: GzA21, CpA11 and CpA31. Using the scale we can see that GzA21 has the same sequence as the 16S rRNA gene for L. johnsonii. CpA11 and CpA31 have their own branches, but they're very, very close together. This shows that although they weren't 100% the same as the textbook L. johnsonii sequence, they're so similar as to be the same species.

We can also get some information about how closely related different species are. From the phylogenetic tree we can see that L. aviarius and L.salivarius are more closely related to each other than to L. johnsonii.

The numbers at each branch have nothing to do with the sequence divergence. These are bootstrap values. Phylogenetic trees are not drawn by hand. Instead, all the data is handed over to a computer which can use different methods to produce a phylogenetic tree. The computer doesn't just draw one tree, it draws lots and then compares them to decide which is the most likely. The bootstrap value refers to how reliable a node in the tree is. A value of 100 means that the node appeared in 100% of the trees which the computer drew, so we can be fairly sure that it is accurate. The lower the bootstrap value, the less reliable the node is. The lowest bootstrap value here is 90, so we can be fairly sure that the tree is an accurate representation of the phylogeny of the sample. 

The neighbour-joining method refers to the method that was used to draw the phylogenetic tree... if you're looking for an explanation of what's going on you're going to have to go elsewhere. There are two predominant methods of drawing phylogenetic trees: neighbour-joining and maximum likelihood. The maths behind both is currently beyond my understanding, but there may be a future post about it at some point! For the moment it's enough to know simply that that was the method used to draw the tree.

So our sequencing of the genes in the clone library has allowed us to draw a phylogenetic tree and see what the relationships are between the species present. We can also produce some diversity indices.

Coverage and Sampling

One of the other questions to ask when looking at studies that have used clone libraries is how representative our library is of the sampled community. We need to make sure that we've analysed enough clones from the library to say that we've got valid results. For example, if I've sampled a very homogenous community, I might only need to analyse 50 clones to get a representative result, but for diverse communities I might need to analyse hundreds.

One of these measurements is coverage, and is calculated using this equation:

Cx = 1 - (nx/N)

nx = the number of OTUs recovered from the library
N = The total number of clones analysed


If the coverage is closer to 0, it's considered to be poor. For example, if I sample 10 clones from the library and get 8 unique OTUs then you can assume that I'm missing a lot of OTUs that are in the original sampled community and that I need to analyse more clones to get a representative result. This is reflected in the coverage which would be 0.2 in this case. For more information about using coverages and different ways of calculating the coverage, refer to this paper here.

Another way is to use an accumulation curve. These are simple graphs which plot the number of clones sampled against the number of different OTUs found (or to use the nomenclature above nx against N). 

An Accumulation Curve being plotted

You would expect the initial gradient to be steep as the initial clones are likely to be from different species. However, as the sample becomes more representative of the actual community, the line begins to plateau. An accumulation curve which has plateaued is indicative of a representative result and sampling can stop.

Diversity Indices

In terms of clone libraries, diversity indices are richness estimates are only useful when comparing two libraries. Say for example, I wanted to see if the community of bacteria in my armpit was more diverse that that of my nostril, I could use a diversity or richness index to show this. However, because of the biases introduced in the sampling, amplification and cloning processes, you can't say that your calculated diversity index accurately reflects that of the sampled community.

So, in brief, there are 3 commonly used indices, one for richness and the other two for diversity. Richness only takes into account the number of different species whereas diversity also takes into account the "evenness" (Is there one dominant species and 9 others with a much smaller abundance or are there 10 species each with a 10% share of the abundance):

Chao1 Richness Estimator

The Chao1 Estimator is a way of estimating how many species are in a community based on the sample you've taken. The equation is:

Sest = Sobs + F12/ 2F2

 Sobs = Number of OTUs sampled
 F1 = Number of OTUs sampled once (singletons)
F2 = Number of OTUs sampled twice (doubletons)

If there are lots of species that have only been sampled once, you would assume that the sample is not representative and that there are still lots of species in the sample that weren't found.

Simpson's Index

This index is calculated by the equation:

D = Σ n(n-1) / N(N-1)

n = the total number of times an OTU was found
N = the total number of clones sampled

Simpson's index is a measure of dominance. It's based on the probability of selecting two individuals of the same species from the population at random. If there are a few dominant species (the community has poor diversity) then it will tend to 1, since it's more likely that two individuals randomly selected will be from the same species. If there are lots of species with a fairly equal share of the abundance (a diverse community) it tends to 0. Ideally, we want a number bigger to represent greater diversity so Simpson's Index is usually reported as Simpson's Index of Diversity or Simpson's Reciprocal IndexSimpson's Index of Diversity (1 - D) where diversity is greater when closer to 1 and Simpson's Reciprocal Index (1/D) which starts at 1 and increases with increasing diversity up to a maximum of the total number of different species in sample (5 different species means a maximum value of 5).

Simpson's indices give a greater weight to more abundant species (it prioritises richness over evenness), so introducing more species with low abundance doesn't change the diversity very much.

Shannon Index

The Shannon Index is calculated by:

H = - Σ pi lnpi

pi = the proportion of clones from an OTU (n/N)

Usually the index is between 1.5 and 3.5, with values higher than 4 rare. 

The Shannon Index gives a better overview of richness and evenness. This can be seen as beneficial, however, it does make it more difficult to compare communities that have very different richness levels.

I think that's enough for today...