Thursday 23 February 2017

Constructing UPGMA Trees from DGGE Profiles


You've taken your sample, extracted the DNA, amplified the 16S rDNA in a PCR machine, run a DGGE gel and you end up with something like the picture below.

Bioguz at English Wikipedia, 16S PCR DGGE, CC BY-SA 3.0
You can see that in some lanes there are more bands than others, some bands are darker than others.. it's a start, but it's not like you can publish that as a brilliant discovery. It's time to do some analysis! As we discussed earlier, analysing the "darkness" of the bands to quantify how much DNA is there isn't a reliable analysis. The DGGE profile is essentially a fingerprint of the microbial community, so we have to rely on comparing different fingerprints and deciding how different they are.

We can compare fingerprints objectively and decide that "Column A looks different to B, it's got 3 bands in different places. Also, it looks more similar to B than it does to C", but that's about it.

Quantifying the Qualitative
What we really need for convincing results is to quantify the differences. The first step is to change our DGGE profile into a matrix. Each column is a sample, and each row is where a band appears in at least one of the columns. If a band is present we assign the value 1 and if not it gets a 0.

Converting a DGGE profile to a Binary Matrix. 
Now that we've assigned numbers to everything, we can get a quantitative idea of "distance" between different samples. To do this, we can either use a Jaccard index or Dice's coefficient. I'm going to focus on the Jaccard index. The formula is:

J(A,B) = |A∩B| ÷ |A∪B|

If you love maths, feel free to read about ∩ and ∪ here. The plain English translation is:


Percentage similarity = (|Number of rows where both columns are 1| ÷ |Number of rows where at least one of the columns is 1|) x 100
If we compare A and B from the profile above, there are 4 different rows which have bands in both A and B and there are 6 different rows which have bands in either A or B. So our percentage similarity is:
(4 ÷ 6) x 100 = 66.67
A and B are 66.67% similar. If we compare every sample with every other, we can build a similarity matrix. The similarity matrix for the profile above is:


A B C D E

A 100 -- -- -- --

B 66.67 100 -- -- --

C 42.86 25.00 100 -- --

D 16.67 33.33 14.29 100 --

E 25.00 11.11 66.67 33.33 100
This similarity matrix is a way of quantifying how different each of the samples is to each of the others. We can see quite easily that B and E are the most different and that C and E / A and B are the most similar samples. The rest of the data gets lost in the table. We want to present it in a way that's going to be easy for someone to look at and instantly see the relationships between different samples.

UPGMA Trees
Now that we've got a similairty matrix, we can build a UPGMA tree.
The first step is to identify which of our samples are most similar. A and B have a similarity of 66.67%, and so do C and E. We can plot this on a graph as shown below: 

The vertical lines are drawn at 66.6%. Now that AB and CE are joined together, we need to make a new similarity matrix to reflect this, starting with AB. We do this by averaging the similarities of A and B to all the other points as per the original similarity matrix. For example, in the original matrix A is 42.86% similar to C and B is 25% similar to C. Therefore the similarity between AB and C is (42.86 + 25)/2 or 33.93%. 



AB C D E

AB 100 -- -- --

C 33.93 100 -- --

D 25 14.29 100 --

E 18.06 66.67 33.33 100
Now that we've calculated the similarity between AB and the other samples, we can do the same for CE.



AB CE D


AB 100 -- --


CE 26 100 --


D 25 23.81 100
With our new similarity matrix, we look to see which points out of AB, CE and D are the closest together. AB and CE come out on top with 26% similarity. We can plot this on the graph:

Now AB and CE are joined together, so we need to make another similarity matrix to reflect this. Again, we'll need to re-calculate the similarity of ABCE to D by averaging the similarities of AB to D and CE to D which would be (25 + 23.81)/2.




ABCE D



ABCE 100 --



D 24.405 100


We can plot out final lines on the graph, joining ABCE and D at 24.4% similarity to produce the finished UPGMA tree.
This gives us a visually easy way of interpreting the data. Straight away we can group our 5 samples into 3 families which are roughly equally different from each other: AB, CE and D.
Is band intensity a measure of bacterial abundance?
Yes and no, but probably more no than yes. It's possible to analyse the intensity of bands in a DGGE gel to get a semi-quantitative idea of the abundance of bacteria from the sample. This requires imaging software which compares the intensity of each band to the total intensity in the lane (the sum of all the intensities of all the bands in the lane). 
What you're getting there is relative abundance so you can't conclude:
"Band X has an intensity value of 3, therefore there must have been 2.0x106 Pasturella multocida in the sample"
but rather:

"Band X contributes 50% of the total band intensity, so roughly 50% of the bacteria in this sample contributed to Band X."
Of course, the waters are further muddied by the inconveniences that contribute to PCR biases like some bacteria having more than one copy of the 16S rRNA gene which may or may not all have the same sequence.
And then there's even more mud added when you consider some of the problems with DGGE like gel to gel variation. So at the end rather than a crystalline answer you've got something that looks decidedly murky.
In short:
Yes, you can use DGGE as a quantitative method BUT it's only semi-quantitative and why would you want to when you've got Next Generation Sequencing or quantitative PCR which are way better?
Some studies will list a Shannon index (discussed at the bottom of this blog). This analysis will be based on the intensity of the bands, since the Shannon index cannot work with the binary data that we used to draw a UPGMA tree. This is a valid analysis if you're comparing samples within your study, since all samples were analysed on the same gel and using the same software. However, remember it's not an absolute value so it can't be compared across studies, especially if it was calculated from a DGGE gel.

Thursday 2 February 2017

Denaturing and Temperature Gradient Gel Electrophoresis

Denaturing and temperature gradient gel electrophoresis (DGGE and TGGE) are methods of generating fingerprints of microbial communities. DGGE was first described by in 1979⁠, but wasn’t applied to profiling microbial communities until 1993⁠.

How does DGGE work?

The technique takes advantage of the different melting domains of double stranded DNA based on sequence and that a partially melted DNA molecule will stop moving through an agarose gel allowing for separation of sequences based on a difference of one base pair. A gradient of either temperature or a denaturant such as formamide or urea is created over an agarose gel. When a sample of different 16S rRNA sequences is run through the gel, they will separate according to differences in nucleotide composition (and therefore phylogeny) rather than size. In general, the higher the G-C content of a sequence, the further through the gel it will advance⁠.

Each microbial community will produce a different pattern of bands depending on its composition so the technique can be applied to investigate differences in genetic diversity between microbial communities or in the same community over time⁠. DGGE will detect microorganisms which contribute to more than 1% of the total microbial population. If we want to examine less abundant groups, we could do a PCR on our original samples using group specific primers.

After electrophoresis has separated different sequences, they can be cut out and sequence the DNA that's creating the band. If DGGE is going to be used to analyse the products of PCR amplification, primers with a GC-clamp (a 40 base pair sequence composed entirely of GC base pairs) are required. The GC-clamp prevents complete separation of DNA strands to form single stranded DNA which would continue to advance through the agarose gel and form extra bands. 

What you get out the other end is something that looks like this:

Bioguz at English Wikipedia, 16S PCR DGGE, CC BY-SA 3.0


Biases and Limitations

As with any technique there are going to be some limitations and other factors to take into account when we're analysing the results. At this stage, those pesky heteroduplexes and chimeras from the PCR are going to start causing problems.

Diversity may be overestimated because of:
  1. Chimeras – formed during PCR and will behave as a normal rDNA sequence. Removal and sequencing of a band could allow for detection of chimeric sequences with the relevant software.
  2. Intraspecific rDNA heterogeneities – Since single species may have rRNA operons which differ by as much as 1%, these can form separate bands in the gel.
  3. Heterduplexes – Since heteroduplexes are less stable and they can appear as unspecific bands within the gel. However, it is generally thought that heteroduplexes do not interfere with DGGE analysis.

It's also possible that we might lose some diversity because of:
  1. Comigration of fragments – As the separation of fragments if based on G-C content and not nucleotide sequence, it is possible that fragments with a different sequence will migrate to the same position in the gel⁠.
  2. Merging of bands – Two rDNA fragments with similar sequences can migrate to adjacent positions on the gel. If the bands are close enough, they may merge and appear as a single band.

As well as these biases, DGGE has some limitations to the technique:
  1. Small fragment size – While DGGE allows for excision and sequencing of bands, rDNA fragments are only 500 basepairs. This provides limited information for phylogenetic anaylsis⁠. This can be overcome by relying co-analysis with clone libraries containing longer rDNA fragments which can then be used to construct phylogenetic trees⁠.
  2. Gel-to-gel variation – Subtle differences in gel preparation limit the viability of comparing samples run on different gels. This limits the number of samples that can be compared as they must all be run on the same gel. Software is now available to help minimise the effect of gel-to-gel variation.
  3. DGGE is semi-quantitative – As we mentioned before when talking about differential PCR amplification, some bacteria have more than one copy of the 16S rRNA gene. If the different DNA fragments migrate to the same point on the gel, this will give an inaccurate intensity of that band. Equally, different DNA fragments from the same species can migrate to form different bands on the gel, giving a false impression of diversity.