Quality of mammoth mitogenomes from different high-throughput DNA sources

by Régis Debruyne

Published abstract

With the rise of second-generation sequencing techniques, the illusion has spread that the sequence data produced in paleogenetics/paleogenomics would soon become error-free. It has repeatedly been claimed that the new methods are more « performant » than their predecessors at overcoming the inherent difficulties of ancient DNA. More than the techniques themselves, it is the remarkable sequencing depth they allow that largely improves our ability to detect both contamination and post-mortem chemical alteration in our sequences.

However, the new techniques have also brought new recurrent types of sequencing errors (such as incorrect homopolymer length) and average error rates at the individual nucleotide level much higher than those observed with « old-fashioned » capillary Sanger sequencing. Furthermore, the availability of huge sequence datasets has led to a serious behavioral shift among end-users, for whom individual sequence-read analysis has been replaced by mass-statistical validation, increasingly remote from the actual nucleotide sequence produced.

To date, no paleogenomic case study has assessed the actual quality of sequencing data by comparing the final « consensus » assemblies obtained with different sequencing techniques. To address this topic, all available Pleistocene mammoth mitogenomes have been analyzed. The dataset of over 25 mammoth mitogenomes indeed provides a unique paleogenomic framework in which different amplification and sequencing strategies were used, encompassing Sanger capillary sequencing, 454 sequencing, Illumina sequencing and, in certain cases, a blend of these techniques.

Based on an analysis at the individual nucleotide level, a systematic review of the quality of those genomes is proposed, which pinpoints the drastic effect of sequencing coverage. It surveys (I) point variation errors and false polymorphism, (II) the length of homopolymer stretches, (III) potential chimeras due to contamination of libraries/extracts, and (IV) errors in sequence assembly. The results of phylogenetic analyses and molecular clock dating are compared before and after removal of the putative errors.
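As an illustration of point (II) only, the following minimal sketch shows one way homopolymer stretches could be located and their lengths compared between two consensus sequences. It is not the method used in the study: the function name `homopolymer_runs`, the `min_len` threshold and the two toy sequences are assumptions made for this example, and the sequences are invented rather than taken from any mammoth mitogenome.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for each run of a single base.

    Hypothetical helper for illustration: only runs of A, C, G or T
    that are at least `min_len` long are reported; `start` is 0-based.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if base in "ACGT" and length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Invented toy consensus sequences (not real data): the second one
# carries one T fewer in the first homopolymer stretch.
consensus_a = "ACGTTTTTTGACCCCA"
consensus_b = "ACGTTTTTGACCCCA"

runs_a = homopolymer_runs(consensus_a)
runs_b = homopolymer_runs(consensus_b)

# Pair the runs in order of occurrence and keep those whose length differs.
# This naive pairing assumes both sequences contain the same runs in the
# same order; real data would require a proper alignment first.
length_conflicts = [
    (a, b) for a, b in zip(runs_a, runs_b) if a[1] == b[1] and a[2] != b[2]
]

for a, b in length_conflicts:
    print(f"{a[1]} stretch: length {a[2]} vs {b[2]}")
```

In such a comparison, each conflicting stretch would still have to be checked against the underlying reads, since a length difference alone does not show which of the two sequences is in error.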
