Phylogenetic Trees, Coverage and Diversity

To summarise the process so far, we've:

1. Taken a sample and either stored it or immediately extracted the DNA from the bacteria present.

2. Amplified the 16S rRNA genes in the DNA sample using PCR.

3. Built a clone library using the amplified 16S rRNA genes.

4. Taken random samples from the clone library and sequenced the 16S rRNA genes using chain termination sequencing.

Now we have hundreds of letters strung together and have to produce some publishable results. Let's have a look at a paper to help us decide what we should do now.

"16S rRNA gene-based analysis of mucosa-associated bacterial community and phylogeny in the chicken gastrointestinal tracts: from crops to ceca" - Gong et al. (2007) is a nice one to have a look at first, and the PDF is here.

Figure 3 is the simplest to look at and decipher:

The caption says the following:

"Fig. 3. Unrooted phylogenetic tree of bacteria on the wall of crops and gizzards constructed by a neighbor-joining method. The scale bar represents a sequence divergence of 10%. Clones represented by CpA and GzA were generated from this study. CpA36 had <97% similarity to the existing database sequences of 16S rRNA genes. This clone has GenBank accession number AY654970 with the name CCRP36."

I've highlighted the important parts that we're going to go into more detail with.

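Unrooted phylogenetic tree - A phylogenetic tree shows how similar genetic sequences are to one another, and so how closely related the species they come from are likely to be. Unrooted simply means that the tree doesn't show where the common ancestor sits, so we can see which sequences group together but not the order in which they diverged.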

The line at the bottom with "0.1" on it is the scale bar. We can use this to gauge the sequence divergence between different 16S rRNA genes from the clone library. Sequences found in the study are the ones that appear as CpA or GzA; the other branches are bacterial species for which we know the 16S rRNA sequence.
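To make "sequence divergence" a bit more concrete, here is a minimal sketch in Python that counts the fraction of positions at which two aligned sequences differ. The two sequences are invented for illustration and are not from the paper, and the function name p_distance is my own. Tree-drawing programs typically apply a correction on top of this raw figure, so treat it as the idea behind the scale bar rather than the exact number a tree would show.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where either sequence has a gap character.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


# Hypothetical 20-base fragments, invented for illustration only.
clone = "ACGTACGTACGTACGTACGT"
reference = "ACGTACGTACGAACGTACGA"

divergence = p_distance(clone, reference)
print(f"divergence = {divergence:.2f}, identity = {1 - divergence:.0%}")
```

Two differences in 20 positions gives a divergence of 0.10, which is the distance the "0.1" scale bar stands for.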

All of the sequences recovered from the study will have been compared to databases of known 16S rRNA gene sequences. Some of the sequences in the clone library will correspond to known species and some won't. We can use the known species to infer which genus or family the unknown species are likely to belong to. In this particular tree, there are no completely unknown sequences that were recovered.

For example, we can see that GzA1, which was found in 22 clones, is identical to the sequence for Lactobacillus aviarius. From this, we can say that GzA1 is L. aviarius. Look a little further up and there are some clustered sequences: GzA21, CpA11 and CpA31. Using the scale we can see that GzA21 has the same sequence as the 16S rRNA gene for L. johnsonii. CpA11 and CpA31 have their own branches, but they're very, very close together. This shows that although they weren't 100% the same as the textbook L. johnsonii sequence, they're so similar as to be the same species.

We can also get some information about how closely related different species are. From the phylogenetic tree we can see that L. aviarius and L. salivarius are more closely related to each other than to L. johnsonii.

The numbers at each branch have nothing to do with the sequence divergence. These are bootstrap values. Phylogenetic trees are not drawn by hand. Instead, all the data is handed over to a computer, which can use different methods to produce a phylogenetic tree. The computer doesn't just draw one tree: it resamples the sequence data many times, draws a tree from each resampled data set and then checks how often each grouping turns up. The bootstrap value refers to how reliable a node in the tree is. A value of 100 means that the node appeared in 100% of the trees which the computer drew, so we can be fairly sure that it is accurate. The lower the bootstrap value, the less reliable the node is. The lowest bootstrap value here is 90, so we can be fairly sure that the tree is an accurate representation of the phylogeny of the sample.
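The resampling idea behind bootstrap values can be sketched in a few lines of Python. This is a heavily simplified illustration and not what the paper's software does: instead of building a full tree for every replicate, it only asks which two sequences are closest to each other. The alignment, the sequence names and the function names are all invented for the example.

```python
import random


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def closest_pair(alignment):
    """Return the pair of names with the smallest p-distance.

    Ties go to whichever pair comes first in sorted name order.
    """
    names = sorted(alignment)
    pairs = [(names[i], names[j])
             for i in range(len(names)) for j in range(i + 1, len(names))]
    return min(pairs, key=lambda p: p_distance(alignment[p[0]], alignment[p[1]]))


def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Percentage of resampled alignments in which `pair` is the closest pair."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    length = len(next(iter(alignment.values())))
    hits = 0
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(seq[c] for c in columns)
                     for name, seq in alignment.items()}
        if closest_pair(resampled) == pair:
            hits += 1
    return 100 * hits / replicates


# Hypothetical aligned fragments, invented for illustration only.
toy_alignment = {
    "clone_1": "ACGTACGTACGTACGTACGT",
    "clone_2": "ACGTACGTACGTACGTACGA",
    "reference_x": "TCGAACTTAGGTACCTAGGA",
}

support = bootstrap_support(toy_alignment, ("clone_1", "clone_2"))
print(f"clone_1 + clone_2 grouped in {support:.0f}% of replicates")
```

A grouping that survives almost every resampling gets a value close to 100, which is how the numbers on the tree should be read.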

The neighbour-joining method refers to the method that was used to draw the phylogenetic tree... if you're looking for an explanation of what's going on you're going to have to go elsewhere. There are two predominant methods of drawing phylogenetic trees: neighbour-joining and maximum likelihood. The maths behind both is currently beyond my understanding, but there may be a future post about it at some point! For the moment it's enough to know simply that that was the method used to draw the tree.

So our sequencing of the genes in the clone library has allowed us to draw a phylogenetic tree and see what the relationships are between the species present. We can also produce some diversity indices.

Coverage and Sampling

One of the other questions to ask when looking at studies that have used clone libraries is how representative the library is of the sampled community. We need to make sure that we've analysed enough clones from the library to say that we've got valid results. For example, if I've sampled a very homogeneous community, I might only need to analyse 50 clones to get a representative result, but for diverse communities I might need to analyse hundreds.

One way of measuring this is coverage, which is calculated using this equation:

C_x = 1 - (n_x/N)

n_x = the number of OTUs recovered from the library
N = The total number of clones analysed

If the coverage is closer to 0, it's considered to be poor. For example, if I sample 10 clones from the library and get 8 unique OTUs then you can assume that I'm missing a lot of OTUs that are in the original sampled community and that I need to analyse more clones to get a representative result. This is reflected in the coverage which would be 0.2 in this case. For more information about using coverages and different ways of calculating the coverage, refer to this paper here.
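As a quick check of the arithmetic, here is a small Python sketch of the coverage equation exactly as it is defined in this post. The function name coverage and the second set of numbers are my own, made up for illustration.

```python
def coverage(n_otus: int, n_clones: int) -> float:
    """Coverage C_x = 1 - (n_x / N), using the definitions given in this post."""
    if n_clones <= 0:
        raise ValueError("at least one clone must have been analysed")
    if not 0 <= n_otus <= n_clones:
        raise ValueError("the OTU count must be between 0 and the clone count")
    return 1 - n_otus / n_clones


# The example from the text: 10 clones analysed, 8 OTUs recovered.
print(f"{coverage(8, 10):.2f}")

# A hypothetical, much more thoroughly sampled library: 50 clones, 5 OTUs.
print(f"{coverage(5, 50):.2f}")
```

The first call prints 0.20, matching the poor coverage in the example, and the second prints 0.90.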

Another way is to use an accumulation curve. These are simple graphs which plot the number of different OTUs found against the number of clones sampled (or, to use the nomenclature above, n_x against N).

An Accumulation Curve being plotted

You would expect the initial gradient to be steep as the initial clones are likely to be from different species. However, as the sample becomes more representative of the actual community, the line begins to plateau. An accumulation curve which has plateaued is indicative of a representative result and sampling can stop.
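The points on an accumulation curve are easy to generate. The sketch below, in Python, walks through a list of clones in the order they were sequenced and records how many different OTUs have been seen so far. The library is hypothetical (20 clones labelled A to D) and the function name is my own; plotting the result is left out to keep it short.

```python
import random


def accumulation_curve(clone_otus):
    """Number of distinct OTUs seen after each successive clone."""
    seen = set()
    curve = []
    for otu in clone_otus:
        seen.add(otu)
        curve.append(len(seen))
    return curve


# Hypothetical library: one dominant OTU and three rarer ones.
clones = ["A"] * 12 + ["B"] * 5 + ["C"] * 2 + ["D"]
random.Random(1).shuffle(clones)  # fixed seed so the order is repeatable

curve = accumulation_curve(clones)
for n_clones, n_otus in enumerate(curve, start=1):
    print(n_clones, n_otus)
```

Each printed pair is one point on the graph: clones sampled on the x-axis, OTUs found on the y-axis. Once the second number stops rising, the curve has plateaued.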

Diversity Indices

In terms of clone libraries, diversity indices and richness estimates are only useful when comparing two libraries. Say, for example, I wanted to see if the community of bacteria in my armpit was more diverse than that of my nostril: I could use a diversity or richness index to show this. However, because of the biases introduced in the sampling, amplification and cloning processes, you can't say that your calculated diversity index accurately reflects that of the sampled community.

So, in brief, there are 3 commonly used indices, one for richness and the other two for diversity. Richness only takes into account the number of different species, whereas diversity also takes into account the "evenness" (is there one dominant species and 9 others with a much smaller abundance, or are there 10 species each with a 10% share of the abundance?):

Chao1 Richness Estimator

The Chao1 Estimator is a way of estimating how many species are in a community based on the sample you've taken. The equation is:

S_est = S_obs + F₁² / (2F₂)

S_obs = Number of OTUs sampled

F₁ = Number of OTUs sampled once (singletons)
F₂ = Number of OTUs sampled twice (doubletons)

If there are lots of species that have only been sampled once, you would assume that the sample is not representative and that there are still lots of species in the community that weren't found.
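Here is the Chao1 equation above as a short Python sketch. The clone counts are invented for illustration and the function name chao1 is my own. Note that the formula as written can't be evaluated when there are no doubletons, so the sketch simply refuses in that case rather than guessing.

```python
def chao1(otu_counts):
    """Chao1 estimate S_est = S_obs + F1^2 / (2 * F2).

    otu_counts holds the number of clones found for each OTU.
    """
    s_obs = sum(1 for count in otu_counts if count > 0)
    f1 = sum(1 for count in otu_counts if count == 1)  # singletons
    f2 = sum(1 for count in otu_counts if count == 2)  # doubletons
    if f2 == 0:
        raise ValueError("no doubletons: the formula would divide by zero")
    return s_obs + f1 ** 2 / (2 * f2)


# Hypothetical library: 8 OTUs, of which 4 are singletons and 2 are doubletons.
counts = [12, 5, 2, 2, 1, 1, 1, 1]
print(chao1(counts))
```

With 8 OTUs observed, 4 singletons and 2 doubletons, the estimate is 8 + 16/4 = 12, so roughly four more OTUs are expected in the community than were actually found.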

Simpson's Index

This index is calculated by the equation:

D = Σ n(n-1) / (N(N-1))

n = the number of clones found for a given OTU (the sum runs over all OTUs)
N = the total number of clones sampled

Simpson's index is a measure of dominance. It's based on the probability of selecting two individuals of the same species from the population at random. If there are a few dominant species (the community has poor diversity) then it will tend to 1, since it's more likely that two individuals randomly selected will be from the same species. If there are lots of species with a fairly equal share of the abundance (a diverse community) it tends to 0. Ideally, we want a bigger number to represent greater diversity, so Simpson's Index is usually reported as either Simpson's Index of Diversity or Simpson's Reciprocal Index. Simpson's Index of Diversity (1 - D) runs from 0 to 1, with diversity greater the closer it is to 1. Simpson's Reciprocal Index (1/D) starts at 1 and increases with increasing diversity, up to a maximum of about the total number of different species in the sample (5 different species means a maximum value of about 5).

Simpson's indices give a greater weight to more abundant species (they prioritise evenness over richness), so introducing more species with low abundance doesn't change the diversity very much.
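To see how the three versions of Simpson's index behave, here is a small Python sketch using the equation above. Both libraries are invented for illustration (20 clones and 4 OTUs each), and the function name simpson_d is my own.

```python
def simpson_d(otu_counts):
    """Simpson's index D = sum(n * (n - 1)) / (N * (N - 1))."""
    total = sum(otu_counts)
    if total < 2:
        raise ValueError("need at least two clones")
    return sum(n * (n - 1) for n in otu_counts) / (total * (total - 1))


# Two hypothetical libraries with the same richness but different evenness.
libraries = {
    "even": [5, 5, 5, 5],
    "dominated": [17, 1, 1, 1],
}

for name, counts in libraries.items():
    d = simpson_d(counts)
    print(f"{name}: D = {d:.3f}, 1 - D = {1 - d:.3f}, 1/D = {1 / d:.2f}")
```

The even library gives a low D (about 0.21) and the dominated one a high D (about 0.72), even though both contain four OTUs. Because this form of the equation uses n(n-1) and N(N-1) rather than squared proportions, the reciprocal for a small, perfectly even sample can come out a little above the number of OTUs (4.75 here).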

Shannon Index

The Shannon Index is calculated by:

H = - Σ p_i ln(p_i)

p_i = the proportion of clones from an OTU (n/N)

Usually the index is between 1.5 and 3.5, with values higher than 4 rare.

The Shannon Index gives a better overview of both richness and evenness. This can be seen as beneficial; however, it does make it more difficult to compare communities that have very different richness levels.
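The Shannon equation is also short enough to sketch in Python. As before, the clone counts are made up for illustration and the function name shannon_h is my own.

```python
import math


def shannon_h(otu_counts):
    """Shannon index H = -sum(p_i * ln(p_i)), where p_i = n / N."""
    total = sum(otu_counts)
    if total <= 0:
        raise ValueError("need at least one clone")
    # OTUs with zero clones contribute nothing, so they are skipped.
    proportions = [n / total for n in otu_counts if n > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical libraries of 20 clones each.
print(f"{shannon_h([5, 5, 5, 5]):.3f}")             # four evenly shared OTUs
print(f"{shannon_h([17, 1, 1, 1]):.3f}")            # same richness, one dominant OTU
print(f"{shannon_h([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]):.3f}")  # ten evenly shared OTUs
```

For a perfectly even library H is just ln of the number of OTUs, so it rises both when more OTUs are added and when the existing ones are shared out more evenly.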

I think that's enough for today...

Wednesday, 11 January 2017
