Behind the paper: 16Stimator

In scientific manuscripts, we tell stories of our research, generally in straight-line fashion with clear motivations and results. This type of research is rare (in my experience), with stories, motivations, and applications only realized post hoc. This is the nature of science, and our recent ISMEJ publication is no different.

With 16Stimator: statistical estimation of ribosomal gene copy numbers from draft genome assemblies, we introduce an exciting method to generate 16S rRNA gene (16S) copy number estimates for bacterial genomes based on comparison of sequencing read depths of ribosomal and single copy gene regions. Application of this method resulted in 16S copy number estimates for hundreds of bacterial species without closed genome representatives. This extended database of known 16S copy numbers combined with phylogenetic based normalization methods [PICRUSt] for 16S amplicon sequencing studies will lead to more accurate organismal abundance measurements. Note: Our article is not open access but the code is freely available.

These are valid and important motivations and applications, but really, we just wanted to know the 16S copy numbers for a handful of isolates so we could properly measure their abundances by amplicon sequencing in controlled community studies.

So here is the actual development route of 16Stimator:

A caveat of 16S amplicon sequencing studies is that, due to variation in bacterial 16S copy number, sequencing read and organismal abundances are not equivalent. For our controlled community experiments using leaf endophytic bacteria originally isolated from Arabidopsis thaliana, we needed to determine each isolate’s 16S copy number. We chose whole genome sequencing and assembly for this task. That was a horrible choice.

Current assembly algorithms do a poor job resolving repetitive genomic regions. Longer reads or larger insert sizes can overcome this limitation, but alas, we had short read, Illumina sequencing libraries with insert sizes smaller than ribosomal rRNA gene regions. After assembly, the 16S rRNA gene was found in one to few contigs. When we mapped reads back to the assembly, the coverage of the 16S contig was much greater than the average genomic coverage, so we sought to use read-depths to resolve 16S copy numbers. By statistical coverage comparisons of 16S to single copy, conserved genes, we were able to accurately estimate copy numbers.

16Stimator pipeline overview.
16Stimator pipeline overview.

Though the focus of the paper is on the sequencing read-depth approach, we did confirm 16S copy numbers experimentally, using an efficient qPCR approach. We compared amplification of 16S to single copy, conserved genes to determine copy number. The IDT-DNA gBlocks provided a convenient alternative to plasmid construction for creating standards with a 1:1 ratio of 16S to single copy gene.

16S copy number estimates from de novo assemblies. For each endophytic isolate, paired-end sequencing reads (R1, R2) were generated on the Illumina HiSeq 2000 from short (~250 bp) and long (~2500) insert libraries (Short_Ins and Long_Ins, respectively). For closed-genome controls, similarly generated sequencing reads were downloaded from SRA: Escherichia coli TY-2482 (GCA_000217695.2, SRR292678, SRR292862), Bacteroides fragilis HMW 615 (GCA_000297735.1, SRR488169, SRR488170), Pseudomonas aeruginosa PAO1 (GCA_000006765.1, SRR032420, SRR032832) and Staphylococcus aureus KPL1828 (GCA_000507725.1, SRR835799, SRR958927). The 16Stimator pipeline was used to estimate 16S copy number as the ratio of median coverage for 16S and single-copy genes. Confidence intervals (95%) were either calculated as in Price and Bonett (2002) (PB), or via permutations (Perm). For endophytic isolates, 16S copy numbers were independently verified by absolute quantification via qPCR with the mean and standard deviation of technical replicates shown. For closed-genome controls, each horizontal line marks the rrnDB (Stoddard et al., 2014) consensus 16S copy number for each species. Note: the short-insert library for MEDvA23 and the long-insert library for MEB061 did not meet quality thresholds. 16S copy number was not experimentally determined by qPCR for E. coli TY-2482, B. fragilis HMW 615, P. aeruginosa PAO1 and S. aureus KPL1828.

Only after resolving 16S copy numbers for our isolates of interest did we realize that this method could be applied to thousands of other draft genomes. We scaled 16Stimator to process tens of thousands of sequencing libraries deposited in SRA, resulting in 16S copy number estimations for hundreds of species without closed genome representatives. A large and diverse database of 16S copy numbers combined with methods to correct for copy number bias in 16S amplicon sequencing studies will ultimately result in more accurate abundance and diversity estimates. If sequencing reads are publicly deposited along with draft genome sequences, then the database can continue to grow.

Though we did not initially intend to create a method to estimate 16S copy numbers from draft genomes, science threw us a curveball and 16Stimator was our response. All the scripts and data are publicly available at https://bitbucket.org/perisin/16stimator. We look forward to feedback on our method to continue to improve and generate 16Stimates!

This post first appeared on microbe.net