Monday 19 December 2016

Chain Termination Sequencing

The environmental DNA has been amplified, the clone library has been built, but there's still one more step before we can start working out what was in our sample. Now we need to sequence the genes which are safely tucked away in the plasmids of the clone library.

Although chain termination sequencing has been largely superseded by Next Generation Sequencing techniques like pyrosequencing and the Illumina platform, it's a good place to start. It shares some characteristics with a lot of the more modern platforms and a lot of older material about microbial communities is based on these techniques.

Dideoxy Chain Termination (Sanger) Sequencing

Frederick Sanger developed this technique in 1977, and it remained the dominant method for 39 years. In its original form it was a very labour intensive process and many a PhD student spent long hours poring over electrophoresis gels to sequence viral genomes. However, once it was automated, it provided the basis for the Human Genome project. 

The method is very similar to PCR. You need template DNA, DNA polymerase, a primer and some nucleotides, but there are a few differences. Your initial DNA template is single stranded and you need to mix in a few dideoxynucleotides (ddNTPs) with your normal deoxynucleotides (dNTPs). A ddNTP lacks the hydroxyl group on the 3' carbon, making it impossible for another nucleotide to be joined to it. Once a ddNTP has been added to a growing DNA strand, that's it: the strand can't be extended any further. It has been terminated.

Figure 1 - The structure of dNTPs (left) and ddNTPs (right). ddNTPs lack the 3' hydroxyl group which is needed to attach the next nucleotide.

Let's imagine that we stick the following template into the reaction:


We add a mixture of normal dNTPs and some ddTTP to terminate the chain wherever there's a T nucleotide. We would end up with a mixture of the following fragments at the end:


Now we need to visualise what's in our sample. We know we can reliably separate different lengths of DNA using gel electrophoresis. Also, if we put a radioactive label on the ddNTP we can develop the gel and see where the fragments are. Run 4 different reactions (one for each ddNTP), load them in adjacent lanes of the same gel, and you can read off the sequence (Figure 2).

Figure 2 - Gel electrophoresis of the products of 4 reactions with ddATP, ddTTP, ddCTP and ddGTP allows for the sequencing of the initial DNA fragment.
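
The logic of reading a gel like the one in Figure 2 can be sketched in a few lines of Python. This is only an illustration: the sequence NEW_STRAND is made up (it is not the template from the figures), the function names are my own, and I'm assuming every possible termination position shows up as a band.

```python
# Hypothetical newly synthesised strand, written 5'->3' (made up for illustration).
NEW_STRAND = "GATTCAGTC"


def terminated_lengths(strand, dd_base):
    """Fragment lengths from the reaction spiked with one ddNTP.

    A fragment ends wherever dd_base is incorporated. We assume the ddNTP
    gets used at every possible position in at least some molecules.
    """
    return [i + 1 for i, base in enumerate(strand) if base == dd_base]


def read_gel(strand):
    """Run the four reactions, then read the bands from shortest to longest."""
    lanes = {base: terminated_lengths(strand, base) for base in "ACGT"}
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


print(terminated_lengths(NEW_STRAND, "T"))  # band positions in the ddTTP lane
print(read_gel(NEW_STRAND))                 # sequence read off the four lanes
```

Each lane only tells you where one base occurs. It's the ordering of bands by length across all four lanes that gives you the sequence.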

This original method was very time consuming both in calibrating the reaction and electrophoresis, and manually reading the sequence off the gels. Luckily, it lends itself nicely to automation. Replace the radioactive tag with a fluorescent one, and you can run 4 reactions in one and still differentiate between the 4 different nucleotides that were added. Teach a computer that green = A, red = T, yellow = G, blue = C and you can multiply the productivity by many orders of magnitude. 

This technology was eventually miniaturised so it could run in a capillary tube and be fully automated in a machine. The machines pictured below can run 96 capillaries at the same time.

Flickr user jurvetson, DNA-Sequencers from Flickr 57080968, CC BY 2.0

Sanger sequencing remains relevant, and if anything is more accurate, but it can't provide the depth that NGS techniques can. While each machine pictured above can sequence 96 pieces of DNA at a time, at a total rate of about 6Mb (megabases; 1Mb = 1,000,000 bases) per day (1), a modern Illumina machine can sequence millions of DNA fragments.

Why don't we use Sanger Sequencing anymore?

The error rate for Sanger sequencing tends to be pretty low, with one error for every 10,000 to 100,000 nucleotides sequenced (1). As with other sequencing platforms, the error rate increases with longer DNA fragments. NGS techniques have higher error rates than Sanger sequencing. So why has Sanger sequencing been replaced? The quick answer is cost and throughput. Microbiome research, especially metagenomics and proteomics, involves sequencing all the DNA in a sample. That's a massive amount of data and would take much longer using chain termination sequencing machines. Sanger sequencing costs about $500/Mb and produces about 6Mb of data per day. With pyrosequencing you can get 750Mb per day for $20/Mb. Illumina sequencing will provide me with 5000Mb per day and only cost me $0.50/Mb (1). Sanger sequencing has its place, but that place is no longer in microbial community research.
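
To put those numbers side by side, here is a rough back-of-the-envelope sketch in Python. The throughput and cost figures are the ones quoted above from reference (1), and I'm assuming all three throughput figures are per machine per day. The 3000Mb project size and the function name are made up for illustration.

```python
# Figures quoted in the text (reference 1): throughput in Mb per day, cost in $ per Mb.
PLATFORMS = {
    "Sanger": {"mb_per_day": 6, "usd_per_mb": 500.0},
    "Pyrosequencing": {"mb_per_day": 750, "usd_per_mb": 20.0},
    "Illumina": {"mb_per_day": 5000, "usd_per_mb": 0.50},
}


def run_estimate(target_mb, platform):
    """Return (machine-days, cost in dollars) to produce target_mb of sequence."""
    figures = PLATFORMS[platform]
    days = target_mb / figures["mb_per_day"]
    cost = target_mb * figures["usd_per_mb"]
    return days, cost


TARGET_MB = 3000  # hypothetical project size, chosen only for illustration

for name in PLATFORMS:
    days, cost = run_estimate(TARGET_MB, name)
    print(f"{name}: {days:.1f} machine-days, ${cost:,.0f}")
```

For this made-up 3000Mb project, a single Sanger machine would need 500 days and about $1.5 million, against well under a day and about $1,500 on the Illumina figures.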

However, if you're reading a paper about clone library analysis which has used chain termination sequencing, then you can be pretty sure that the sequences they've produced are fairly accurate. You'll still have to bear in mind the errors and biases introduced by lysis, extraction, PCR and clone library preparation, but sequencing is unlikely to have skewed the data very much.

Of course, the key to clone libraries is analysing the sequence data which is produced: drawing phylogenetic trees and interpreting diversity indices, amongst other things. This will be the focus of the next post.

1. High-throughput DNA sequencing - Concepts and limitations. Kircher, M., Kelso, J. Bioessays. 2010; 32(6):524-536.

Saturday 10 December 2016

Building Clone Libraries

Let's have a look at the Materials and Methods of a paper:

"Microbial Community Composition of the Ileum and Cecum of Broiler Chickens as Revealed by Molecular and Culture-Based Techniques"

There's a PDF copy here.

First they describe the animals used, what they were fed, how they were kept, etc. Then there's a bit about culturing some bacteria. They extract and purify their DNA, then go on to talk about 16S rDNA Amplification and Cloning. There's a big bit about PCR, so far so good... but then there's this part:

"The products were then purified using a QIAquick PCR purification kit (Qiagen GmbH, Hilden, Germany) and stored at −20°C. The blunt-end PCR products were cloned into linearized pCR-blunt vectors (Invitrogen, Carlsbad, CA), and 1 shot TOP10 competent Escherichia coli cells were transformed using a Zero Blunt PCR cloning kit (Invitrogen) according to the manufacturer’s instructions. Cells were grown on low-salt Luria-Bertani (LB) agar plates (Invitrogen) for 18 to 24 h at 37°C. Colonies were picked randomly and transferred to 1.3 mL SOB-Zeocin medium (Invitrogen) and grown for 24 h at 37°C. Plasmids were purified using a QIAprep 96 Turbo Miniprep kit (Qiagen) using a QIAvac vacuum manifold"

Here's another paper:

"Diversity and Succession of the Intestinal Bacterial Community of the Maturing Broiler Chicken" (PDF)

"The amplified PCR products were purified with the Wizard PCR product purification kit (Promega, Madison, Wis.). The purified products were ligated into pGEM-T Easy (Promega). Ligation was done at 4°C overnight, followed by transformation into competent E. coli JM109 cells by heat shock (45 s at 42°C). The clones were screened for α-complementation of β-galactosidase by using X-Gal (5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside) and IPTG (isopropyl-β-D-thiogalactopyranoside) (5)." ... "DNA preparations for sequencing were made with the QIAprep spin plasmid kit (Qiagen, Valencia, Calif.) as specified by the manufacturer. Plasmids were eluted with 50 μl of water and stored at −70°C."

What does it mean? One paper's using QIAquick, another one's talking about Wizard. There's some E. coli getting involved there as well... It's confusing. Actually those little paragraphs underlie a lot of lab work.

What is a Clone Library anyway?

Clone libraries are a method of separating rDNA from a PCR sample and creating enough copies to sequence. The basic premise is that rDNA fragments amplified by PCR are inserted into vector DNA (usually plasmids, loops of DNA commonly found in bacteria), which are then taken up by Escherichia coli. In this context, "competent" means that the bacteria are able to undergo "transformation", that is, to take up DNA from another source and replicate it. The bacteria are plated out and screened in such a way that each colony on a plate is made up of bacteria carrying a vector with the same bit of rDNA. These can be cultured further to provide enough plasmids to sequence the inserted rDNA or stored for later analysis.

User:Spaully on English wikipedia, Plasmid (english), CC BY-SA 2.5

Purifying DNA from PCR

Although this step is considered optional, it can improve cloning results. At the end of PCR, you've got lots of rDNA, but also lots of other crud from the reactions like primers and enzymes. These and other nonspecific products of PCR can be separated from amplified 16S rDNA using agarose gel electrophoresis (1) or a commercial kit like Wizard or QIAquick.

Sticky vs Blunt ended PCR products

PCR using Taq polymerase will leave what is known as an A-overhang artefact on amplified DNA. This is an A residue attached to the 3' end of the DNA strand. This A-overhang is exploited in some commercial cloning kits to insert amplified rDNA into vector DNA with a complementary T-overhang (TA cloning) (2,3). This is a "sticky ended PCR product". It should be noted that purification of PCR products using agarose gel electrophoresis removes the A-overhang, so a short 3' adenylation step is required after purification (1).

Vishnu2011, Tacloning, CC BY-SA 3.0

If you haven't used Taq polymerase, you don't have the A-overhang so your PCR products are blunt ended.

Vector DNA
The purpose of vector DNA is to stabilise and replicate an rDNA molecule within a bacterial host. All vector DNA must:

  1. Be able to replicate along with the inserted PCR amplicon.
  2. Contain unique restriction endonuclease cleavage sites.
  3. Contain a marker to distinguish vectors with inserted rDNA from those without, and to distinguish hosts carrying a vector from hosts without one.
  4. Be relatively easy to extract from the host cell (4).

The procedure for creating a clone library is outlined in the figure below.

1. Plasmids are mixed with 16S rDNA sequences (iii) and the two spliced together. The insertion point for the 16S rDNA is in the lacZ gene, which codes for the α subunit of β-galactosidase enzyme (ii). This results in insertional inactivation of the lacZ gene (2)⁠. The method for inserting the 16S rDNA into the plasmid will depend on the commercial kit which is being used. The plasmid also contains a gene for antibiotic resistance (i).

2. After insertion of 16S rDNA, the sample contains two kinds of plasmids: Plasmids with the 16S rDNA insertion (i) and plasmids without the 16S rDNA insertion (ii).

3. Escherichia coli are used as host bacteria and are stimulated to take up the plasmids. After this step, there is a mixture of 3 types of E. coli: Those with plasmid type (i), those with plasmid type (ii) and those with no plasmid (iii).

4. The bacteria are then cultured on a medium containing antibiotics to exclude bacteria of type (iii) (2, 3). The medium also contains an inducer for the lacZ gene (so it's expressed) and a substrate for β-galactosidase which turns blue when broken down. Colonies of bacteria which still have an intact lacZ gene, because the rDNA failed to insert into the plasmid, will turn blue. Colonies carrying a type (i) plasmid stay white and can be picked out (2).

5. Bacteria from selected colonies are grown overnight in nutrient broth.
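
The selection logic in steps 3 and 4 boils down to two yes/no questions, which can be written out as a small Python sketch. The function name and the wording of the outcomes are my own. It's a simplification that assumes the antibiotic and the blue/white screen both work perfectly.

```python
def colony_outcome(has_plasmid, has_insert):
    """Predict what a transformed E. coli cell does on the selective plate.

    has_plasmid: the cell took up a plasmid (and so the resistance gene).
    has_insert:  that plasmid carries the 16S rDNA insert within lacZ.
    """
    if not has_plasmid:
        # Type (iii): no antibiotic resistance gene, so the cell can't grow.
        return "no growth"
    if has_insert:
        # Type (i): lacZ is disrupted by the insert, so the substrate stays uncoloured.
        return "white colony"
    # Type (ii): lacZ is intact and the substrate is broken down to a blue product.
    return "blue colony"


for plasmid, insert in [(False, False), (True, False), (True, True)]:
    print(plasmid, insert, "->", colony_outcome(plasmid, insert))
```

Only the white colonies are worth picking for step 5.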

So there we have it! Now we've got a theoretically unlimited supply of our 16S rDNA which we extracted from the environmental sample and amplified using PCR. We can now extract the plasmid and inserted DNA using a commercial kit and then sequence it. But wait, as with any technique there are caveats...

Biases of Cloning

There is only one report of a potential bias introduced during the cloning procedure. This focused on comparing the two methods of inserting the 16S rDNA sequences into the plasmid vector, blunt end and sticky end cloning. It was reported that the two methods produced different results when screened using dot-blot hybridisation. However, no phylogenetic details are provided, so it is difficult to draw conclusions (5).

Remember those Heteroduplexes and Chimeras?

Clone libraries may exacerbate the problem of heteroduplex molecules produced during PCR. During cloning in the host bacteria, E. coli DNA repair mechanisms identify the heteroduplex and attempt to repair the mismatched bases. In normal cells, DNA methylation identifies one strand as the correct parent strand. Since neither strand of the inserted DNA is methylated, the repair mechanisms randomly choose one to use as a template. For each mismatched base pair, a different strand may be used as the template. The repaired sequences that result are composites of the two original strands, referred to as ‘mosaics’ (6). These are harder to identify than chimeras and so will artificially increase the apparent phylogenetic diversity of a clone library.
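
A toy Python model shows how a mosaic arises. It treats each mismatch as an independent coin flip, which is a simplification of what the repair machinery really does. The two sequences are made up, and both are written in the same orientation so they can be compared position by position.

```python
import random


def repair_heteroduplex(strand_a, strand_b, rng):
    """Resolve each mismatch by copying the base from a randomly chosen strand."""
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must be aligned and of equal length")
    repaired = []
    for a, b in zip(strand_a, strand_b):
        # Matching positions are left alone; mismatches are a coin flip.
        repaired.append(a if a == b else rng.choice((a, b)))
    return "".join(repaired)


# Two hypothetical 16S fragments that annealed into one heteroduplex.
STRAND_A = "ACGTTGCAAGCT"
STRAND_B = "ACCTTGGAAGTT"

rng = random.Random(0)  # seeded so the run is repeatable
for _ in range(3):
    print(repair_heteroduplex(STRAND_A, STRAND_B, rng))
```

With three mismatches there are 2 x 2 x 2 = 8 possible outcomes, and only two of them are sequences that actually existed in the sample.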

Chimeras also present a problem when analysing clone libraries. An analysis of 17 large clone libraries (100 or more clones) of 16S rRNA genes submitted to public databases in 2005 found an average chimera content of 9.0% with one library containing 45.8%. Nine of the libraries had already been checked for chimeras using software (7). This highlights the importance of screening sequences from PCR using reliable chimera hunting software.

How Reliable are Clone Libraries?

As a result of these biases, and those introduced previously by DNA extraction and PCR, it is worth questioning whether clone libraries provide an accurate representation of the qualitative and quantitative composition of microbial communities. Analysis of clone libraries must be viewed objectively and considered as only part of the puzzle of microbial ecology (8)⁠.

Most studies using clone libraries will examine no more than 100 clones, and while this may identify the main taxa present, it is unlikely to represent the true diversity of the original sample (8). While clone libraries may not represent true diversity, they benefit from producing longer 16S rDNA fragments for sequencing, which provide greater phylogenetic resolution. Comparing results from different clone libraries is often confounded by the use of different hypervariable regions of 16S rDNA. In light of this, the results of studies using clone libraries should be considered a semi-quantitative analysis which only superficially explores the true diversity of microbial communities (8).


1. Leigh MB, Taylor L, Neufeld JD. Clone Libraries of Ribosomal RNA Gene Sequences for Characterization of Bacterial and Fungal Communities. Handbook of Hydrocarbon and Lipid Microbiology. 2010. p. 3971–90.

2. Osborn M a, Smith CJ. Molecular Microbial Ecology. Vol. 51. 2009. 370 p.

3. Makkar, H. P S MCS. Methods in Gut Microbial Ecology for ruminants. 2005. 1-223 p.

4. Mullis KB. Recombinant DNA technology and molecular cloning. Sci Am. 1990;Chapter 8:26.

5. Rainey F a., Ward N, Sly LI, Stackebrandt E. Dependence on the taxon composition of clone libraries for PCR amplified, naturally occurring 16S rDNA, on the primer pair and the cloning system used. Experientia. 1994;50(9):796–7.
6. Thompson JR, Marcelino L a, Polz MF. Heteroduplexes in mixed-template amplifications: formation, consequence and elimination by “reconditioning PCR”. Nucleic Acids Res. 2002;30(9):2083–8.

7. Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. New screening software shows that most recent large 16S rRNA gene clone libraries contain chimeras. Appl Environ Microbiol. 2006;72(9):5734–41.
8. Stackebrandt E, Pukall R, Ulrichs G, Rheims H. Analysis of 16S rDNA clone libraries: part of the big picture. Proc 8th Int Symp Microb Ecol Microb Biosyst new Front Atl Canada Soc Microb Ecol Halifax, Nov Scotia, Canada [Internet]. 1999;1–9

Friday 2 December 2016

PCR Biases - Differential Amplification

As well as telling us how many bacterial species make up a microbial community, 16S rRNA gene analysis can give us an idea of the abundance of each species. It would be great if things were simple. Let's say I analyse 100 sequences from an environmental sample. My analysis shows that 50 sequences are from the phylum Firmicutes, 30 are Proteobacteria and 20 belong to Bacteroidetes. So logically, I can assume that of all the bacteria in the sample 50% are Firmicutes, 30% are Proteobacteria and 20% are Bacteroidetes. However, if I've used PCR to amplify the DNA in my sample, I have to assume that all of the genes are amplified equally. This assumption is wrong. Some sequences will be easier to copy using PCR than others. This discrepancy is called differential or preferential amplification. Differential amplification is a bias introduced by PCR which cannot always be corrected. Several factors have been identified as causing differential amplification of rDNA.

  1. The rRNA gene copy number (rrn operon number) and genome size differ between species.
      Bacteria can have between 1 and 10 copies of the rRNA gene within their genome (1). What’s more, the copy number of rRNA genes doesn’t necessarily correspond to a regular increase in PCR product so even if we knew the rrn operon number of each species, we couldn't correct for it. Other factors such as density of rRNA genes and the percentage of the genome composed of rRNA genes have also been theorised to affect the efficiency of PCR amplification (2,3). There are online databases that have information on the rrn operon number of different bacterial species. For example, here we can see that Lactobacillus acidophilus has four 16S rRNA gene copies in its genome, according to two different studies.

  2. Not all rRNA genes from the same species have exactly the same sequence.
      By reviewing pairs of sequences from the same species in databases of rRNA gene sequences, it has been estimated that up to 48% of sequence pairs have more variation than would be expected from sequencing errors. This variation is different between taxa, so there is no easy mathematical correction for this observation (4). Another study of differences in sequence found that 16S rRNA gene sequences from strains of Paenibacillus polymyxa differed from each other by one to eight nucleotides at ten places in the V6 to V8 regions (5). Intraspecific heterogeneity (differences within the same species) can complicate the quantification of bacteria and lead to an overestimation of diversity (1, 2).

  3. Differences in G+C content between sequences
      The G-C content of a DNA sequence is the proportion of base pairs that are G-C instead of A-T. The G-C content is important as it defines how stable a DNA molecule will be at higher temperatures. This is the central principle by which temperature and denaturing gradient gel electrophoresis separate DNA molecules based on their sequences. Basically, DNA molecules with a higher G-C content are more thermostable than those with a low G-C content. This is because of the stacking of the base pairs, which is beyond the scope of this article, but keen biochemists can read up about it here. rDNA sequences that have a lower G-C content denature more readily in the PCR and so may be preferentially amplified. This effect can be reduced by adding 5% acetamide which also stops primers from binding preferentially to different DNA strands (6).

      The top strand has a lower G-C content than the bottom one and so will denature more readily.
  4. Sequences outside the rRNA gene can inhibit amplification.
      Other DNA sequences and secondary structural features of the bacterial genome that serves as the original template can inhibit PCR amplification of the rRNA gene. DNA isn't a straight molecule. It curls up on itself and has other proteins bound to it. These secondary structural features can physically get in the way of primer binding. The inhibitory effect of these secondary structures varies depending on which variable section is targeted by the primers (7–9). One group found they couldn't overcome the inhibitory effect by using DNA denaturing cosolvents such as DMSO and glycerol or other techniques such as touchdown PCR. Instead, they suggested that the effect can be minimised by using at least two primer sets targeting different variable sections of the rRNA gene in separate PCRs, then comparing the results (7).

      A strand of DNA wrapped around a DNA binding protein which could obstruct PCR amplification
      Thomas Splettstoesser, Nucleosome1, CC BY-SA 3.0
  5. Increasing template concentration reduces the rate of amplification.
      In the typical description of PCR, the DNA strands denature and the primer binds. But what's stopping the DNA strands from just reannealing to each other instead of a primer once the temperature drops? The answer is... nothing, except that usually the other DNA strand has floated away a bit and the nearest thing to bind to is a primer. However, a critical concentration of template DNA exists at which reannealing of DNA strands is favoured over primer binding. When the concentration of template DNA reaches and goes over this critical concentration, amplification is reduced. This allows other rDNA templates to be more effectively amplified in subsequent PCR cycles and will alter the relative abundance of rDNA sequences within the sample. This amplification bias is less likely to occur in samples with a wide variety of rDNA sequences at relatively low concentrations (9)⁠.

  6. Specificity of primers to the template DNA.
      Even if universal primers are used, there is evidence to suggest that there is differential binding between primers and template DNA from different bacterial species. Even single mismatches between primers and template DNA can reduce binding (10)⁠. Suboptimal binding will result in decreased amplification of the respective template compared to others (11)⁠. While lowering the annealing temperature will allow for mismatches, it can increase non-specific primer binding and unwanted products (12)⁠.
  7. DNA contamination of PCR.
      Introduction of DNA to the sample can occur either through unintentional transfer of DNA from previous amplifications (tube-to-tube contamination) or by contamination of PCR reagents (11)⁠. This is a particular problem for reagents such as DNA polymerase whose manufacture involves the use of Escherichia coli (1)⁠. To protect against this, a negative control must always be included which is handled the same as other samples, except that no template DNA is added. Reagents should also be pre-treated with UV light or uracil DNA glycosylase to remove contaminating DNA (13)⁠.
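
As a small aside on the G-C content factor in the list above, the proportion is easy to calculate for any sequence. This is a minimal Python sketch: the function name and the two example sequences are made up, and it assumes the input contains only the letters A, C, G and T.

```python
def gc_content(seq):
    """Proportion of bases in seq that are G or C (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


# Two hypothetical fragments: the first is A-T rich, the second G-C rich.
LOW_GC = "ATATGCAT"
HIGH_GC = "GCGCATGC"

print(gc_content(LOW_GC))   # the fragment expected to denature more readily
print(gc_content(HIGH_GC))  # the more thermostable fragment
```

In a mixed-template PCR it is the fragment with the lower value that is expected to denature more readily.
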
Let's have a look at this paper investigating poultry intestinal bacteria using denaturing gradient gel electrophoresis. They have this to say about their PCR:

"Primers (50 pmol of each per reaction mixture; primer 2, 5′-ATTACCGCGGCTGCTGG-3′, and primer 3 with a 40-base G-C clamp (Sheffield et al., 1989; Muyzer et al., 1993), 5′-CGCCCGCCGCGCGCGGCGGGCGGGGCGGGGGCACGGGGGGCCTACGGGAGGCAGCAG-3′) were mixed with Jump Start Red-Taq Ready Mix, according to the kit instructions, 250 ng of pooled (50 ng/chicken) template DNA from five chickens in each group, and 5% (wt/vol) acetamide to eliminate preferential annealing (Reysenbach et al., 1992). Amplifications were on a PTC-200 Peltier Thermal Cycler with the following program: 1) denaturation at 94.9°C for 2 min; 2) subsequent denaturation at 94.0°C for 1 min; 3) annealing at 67.0°C for 45 s, −0.5°C per cycle [touchdown to minimise spurious by-products (Don, 1991; Wawer and Muyzer, 1995)]; 4) extension at 72.0°C for 2 min; 5) repeat steps 2 to 4 for 17 cycles; 6) denaturation at 94°C for 1 min; 7) annealing at 58.0°C for 45 s; 8) repeat steps 6 to 7 for 12 cycles; 9) extension at 72.0°C for 7 min; 10) 4.0°C final."

Although they've taken precautions to minimise certain factors that contribute to differential amplification (adding acetamide and using a touchdown programme), it's impossible to correct for others, such as differences in rrn operon number or intraspecific heterogeneity. In light of this, any experiment using PCR will introduce some biases and won't produce a 100% accurate picture of the microbial community being studied.
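
Going back to the thermal cycling programme quoted above, the touchdown part is easier to picture if you list the annealing temperature for each cycle. This Python sketch is my reading of the quoted programme, not something taken from the paper. I'm assuming the first cycle anneals at 67.0°C, that "17 cycles" is the total number of touchdown cycles, and the function name is made up.

```python
def touchdown_annealing_temps(start_c=67.0, step_c=-0.5, touchdown_cycles=17,
                              final_c=58.0, final_cycles=12):
    """Annealing temperature (degrees C) for every cycle of the programme."""
    # Touchdown phase: start stringent, then relax by step_c each cycle.
    temps = [start_c + step_c * cycle for cycle in range(touchdown_cycles)]
    # Fixed phase: the remaining cycles all anneal at the same temperature.
    temps += [final_c] * final_cycles
    return temps


temps = touchdown_annealing_temps()
for cycle, temp in enumerate(temps, start=1):
    print(f"cycle {cycle:2d}: anneal at {temp:.1f} C")
```

Starting at a stringent temperature means the earliest copies come from the best-matched primer binding sites. The later, more permissive cycles then mostly amplify those copies instead of spurious products.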

Although PCR is an imperfect technique, it is currently the only reliable way of amplifying DNA from environmental samples. After amplification, the DNA from a sample can either be analysed directly using fingerprinting techniques such as DGGE, TGGE and T-RFLP or individual DNA fragments can be sequenced to identify the bacteria present and build phylogenetic trees. While modern sequencing platforms like Illumina and 454 pyrosequencing require no additional steps after PCR, older studies which relied on chain-termination sequencing had to build clone libraries of sampled DNA. The creation of a clone library is a lengthy process and can also introduce biases which affect results.


1. Osborn M A, Smith CJ. Molecular Microbial Ecology. Vol. 51. 2009. 370 p.

2. Stackebrandt E, Pukall R, Ulrichs G, Rheims H. Analysis of 16S rDNA clone libraries: part of the big picture. Proc 8th Int Symp Microb Ecol Microb Biosyst new Front Atl Canada Soc Microb Ecol Halifax, Nov Scotia, Canada

3. Farrelly V, Rainey FA, Stackebrandt E. Effect of genome size and rrn gene copy number on PCR amplification of 16S rRNA genes from a mixture of bacterial species. Appl Environ Microbiol. 1995;61(7):2798–801.

4. Clayton RA, Sutton G, Hinkle Jr. PS, Bult C, Fields C. Intraspecific variation in small-subunit rRNA sequences in GenBank: why single sequences may not adequately represent prokaryotic taxa. Int. J. Syst. Bacteriol. 1995;45:595–9.

5. Nubel U, Engelen B, Felske A, Snaidr J, Wieshuber A, Amann RI, et al. Sequence Heterogeneities of Genes Encoding 16S rRNAs in Paenibacillus polymyxa Detected by Temperature Gradient Gel Electrophoresis. J Bacteriol. 1996;178(19):5636–43.

6. Reysenbach AL, Giver LJ, Wickham GS, Pace NR. Differential amplification of ribosomal RNA genes by polymerase chain reaction. Appl Env Microbiol [Internet]. 1992;58(10):3417–8.

7. Hansen MC, Tolker-Nielsen T, Givskov M, Molin S. Biased 16S rDNA PCR amplification caused by interference from DNA flanking the template region. FEMS Microbiol Ecol. 1998;26(2):141–9.

8. Rainey F. A., Ward N, Sly L. I., Stackebrandt E. Dependence on the taxon composition of clone libraries for PCR amplified, naturally occurring 16S rDNA, on the primer pair and the cloning system used. Experientia. 1994;50(9):796–7.

9. Suzuki MT, Giovannoni SJ. Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl Environ Microbiol. 1996;62(2):625–30.

10. Dahllöf I. Molecular community analysis of microbial diversity. Curr Opin Biotechnol. 2002;13(3):213–7.

11. Wintzingerode F, Göbel UB, Stackebrandt E. Determination of microbial diversity in environmental samples: pitfalls of PCR-based analysis. FEMS Microbiol Rev. 1997;21:213–29.

12. Ishii K, Fukui M. Optimization of Annealing Temperature to Reduce Bias Caused by a Primer Mismatch in Multitemplate PCR. Appl Environ Microbiol. 2001;67(8):3753–5.

13. Niederhauser C, Höfelein C, Wegmüller B, Lüthy J, Candrian U. Reliability of PCR decontamination systems. Genome Res. 1993;4(2):117–23.