The Bauxite Mines (1942), mural by Julius Woeltz for the U.S. Post Office in Benton, Arkansas
Homer, The Iliad, Book XII, line 493. Bryant's translation.
We are living in the Golden Age of genome engineering. It is now possible for researchers to read DNA sequences, and then directly modify existing DNA sequences to investigate gene function, or create new DNA sequences to synthesize novel biological organisms. These advances are due to technological developments over the last decade. The cost and time required to sequence, or read, genomes have fallen by orders of magnitude, from tens of millions of dollars and years to less than a thousand dollars and hours. As a result, precision medicine is a reality. Similarly, the time and cost to modify DNA sequences have fallen dramatically, with much of the improvement due to the discovery and adoption of the CRISPR/Cas9 system.
Systems for genome engineering are often introduced into target cells using plasmids. Mainly found in bacteria, natural plasmids are small, circular, mobile pieces of DNA that replicate independently from the host's chromosomal DNA, and provide some benefit to the host, for example, resistance to antibiotics. Researchers now routinely engineer plasmids to study the function and expression of genes, for targeted delivery of medicines into cells, and recently for diagnostics. As a result, plasmids have become important for biological research, biomedicine, national security, and public health purposes.
Addgene, a non-profit repository created to help scientists share plasmids, improves access to these materials and related information to accelerate research and discovery. During its quality control process, Addgene sequences the DNA of all plasmids. During DNA sequencing, the plasmid DNA is broken into fragments, and both ends of each fragment are sequenced. Each end sequence is termed a "read", so reads are produced in pairs. Then, to determine the plasmid DNA sequence, these read pairs need to be "assembled". The DNA sequencing technology adopted by Addgene produces so-called "short reads", which makes assembly more difficult. It is this difficulty that introduces a bottleneck in the Addgene quality control process. Thus, a new assembly approach is required.
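To make the idea of read pairs concrete, here is a minimal simulation sketch. It is not Addgene's or the sequencing provider's process: the function names, fragment length, and read length are illustrative assumptions, and real sequencers add errors and quality scores that are ignored here.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_read_pair(plasmid: str, fragment_len: int, read_len: int,
                     rng: random.Random) -> tuple[str, str]:
    """Draw one fragment from a circular plasmid and read both of its ends."""
    if not 0 < read_len <= fragment_len <= len(plasmid):
        raise ValueError("need 0 < read_len <= fragment_len <= len(plasmid)")
    start = rng.randrange(len(plasmid))
    # Doubling the sequence lets a fragment run across the arbitrary
    # start/end of the linear string, as it can on a circular molecule.
    fragment = (plasmid + plasmid)[start:start + fragment_len]
    forward = fragment[:read_len]
    # The second read comes from the opposite strand at the other end.
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse


# Illustrative values only (hypothetical plasmid, fragment, and read sizes).
rng = random.Random(0)
plasmid = "".join(rng.choice("ACGT") for _ in range(300))
pairs = [sample_read_pair(plasmid, 120, 40, rng) for _ in range(5)]
```

Each pair covers only the two ends of its fragment, so an assembler has to infer everything in between from overlaps among many such pairs. Shorter reads give it less overlap to work with.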
Déjà Vu: The Novella
Currently, no publicly available, automated assembly approach exists that can overcome the difficulties of assembling plasmid short-read sequencing data. In a project we call Déjà Vu: The Novella, IQT Labs partnered with Addgene to explore how this bottleneck can be addressed.
We set out to collect curated plasmid sequences and corresponding short-read sequencing data, select and compare candidate assemblers, and build a scalable assembly pipeline. Addgene's collection of curated sequences, manually inspected, or created, by a team of top-level scientists, is among the largest and best anywhere. At IQT Labs' request, Addgene collected all curated sequences and short-read sequencing data as well as the initial plasmid sequences produced by their sequencing provider for all plasmids deposited during 2018. In total, 14,826 curated and initial sequences and short-read data sets were collected.
Addgene scientists identified several candidate assemblers:
- SPAdes, the St. Petersburg genome assembler
- Shovill, based on SPAdes
- NOVOPlasty, an organelle assembler and heteroplasmy caller
The IQT Labs team identified a number of other candidate assemblers:
- Unicycler, also based on SPAdes
- SKESA, Strategic Kmer Extension for Scrupulous Assemblies
- MaSuRCA, the Maryland Super Read Cabog Assembler
This set of assemblers provided a good selection of the common graph algorithms used for assembly, showed promising initial performance, and came recommended by researchers developing assemblers. Note that since plasmids are circular DNA, some assemblers require an additional step to complete (or "circularize") the assembly.
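One simple way to picture the circularization step is as trimming a contig whose end repeats its beginning. The sketch below is an assumption-laden illustration, not the method used by any of the assemblers above or by this project: it requires an exact overlap, and `min_overlap` is an arbitrary illustrative threshold.

```python
def circularize(contig: str, min_overlap: int = 20) -> tuple[str, bool]:
    """Trim a contig whose tail repeats its head; report whether it closed."""
    # Try the longest candidate overlap first so the whole duplicated
    # region is removed, not just part of it.
    for size in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:size] == contig[-size:]:
            return contig[:-size], True
    return contig, False


def same_circular_sequence(a: str, b: str) -> bool:
    """True if b is a rotation of a (same strand only)."""
    return len(a) == len(b) and b in a + a
```

Because a circular sequence has no natural starting point, comparing an assembly with a reference also has to allow for rotation, which is what `same_circular_sequence` checks. Practical tools typically tolerate mismatches and consider the reverse strand as well.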
Given the scale of the data and the number of assemblers, we realized that running all six assemblers on every short-read data set would require deploying each assembler either to a machine with many cores or to a cluster of machines. Based on an earlier investigation by IQT Labs at Addgene, the IQT Labs team selected Toil, a scalable pipeline management system developed in Python at the UC Santa Cruz Genomics Institute. To deploy assemblers to our development and processing environments, we decided to use Docker containers.
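As a rough illustration of running an assembler from a container, the sketch below builds a `docker run` command for SPAdes on one read pair. The image name, directory, and file names are hypothetical placeholders, and the project's actual invocation and options are not given here. Only the standard `docker run` flags (`--rm`, `-v`) and SPAdes options (`-1`, `-2`, `-o`, `-t`) are used.

```python
def spades_docker_command(image: str, data_dir: str, reads_1: str,
                          reads_2: str, out_dir: str, threads: int) -> list[str]:
    """Build (but do not run) a docker command that assembles one read pair."""
    mount = "/data"  # where the host directory appears inside the container
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:{mount}",     # data_dir should be an absolute path
        image,
        "spades.py",
        "-1", f"{mount}/{reads_1}",      # first read of each pair
        "-2", f"{mount}/{reads_2}",      # second read of each pair
        "-o", f"{mount}/{out_dir}",      # output directory
        "-t", str(threads),              # threads for this assembly
    ]


# Hypothetical image and file names, for illustration only.
cmd = spades_docker_command(
    image="example/spades:latest",
    data_dir="/tmp/plate01",
    reads_1="A01_R1.fastq.gz",
    reads_2="A01_R2.fastq.gz",
    out_dir="A01_spades",
    threads=1,
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed. Building the command as a list avoids shell quoting problems with file names.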
Biocontainers
The first step in using Docker containers is to either find an image or create a Dockerfile to build an image. In looking for a good source for both, IQT Labs identified Biocontainers, a community-driven project that provides the infrastructure and guidelines to create, manage, and distribute bioinformatics packages (using Conda) and containers (using Docker or Singularity). We liked the extensive list of tools (over 8,000), the partner organizations (like Bioconda and Nextflow), and the documentation. We customized the Biocontainers base image (based on Ubuntu 20.04) to include common processing and development resources. Then we standardized our Dockerfiles, building software from source when possible and forking repositories when needed for versioning.
Toil
We found Toil to be a great framework for prototyping a bioinformatics pipeline, especially for handling input and output files locally or remotely, and for scaling across all available cores or across a cluster. Toil is not bound to any bioinformatics codebase, allows for defining task dependencies from within task methods, and offers a strong focus on cloud execution. We isolated each tool in a custom job class, allowed data to be loaded from the local filesystem or from S3, and provided a command-line interface to each job. Finally, we used the Python unit testing framework to provide a test for each job.
What we found
We deployed the assembler containers and job classes to an AWS EC2 c5.24xlarge instance with 96 vCPUs. Happily, the 96 vCPUs mapped well to the short-read data sets, which were collected using 96-well plates. Using our processing framework, we were able to assemble ten plates, or 960 plasmids, in one hour and 14 minutes, at a cost of approximately $5. Truly light was the task where many shared the toil!
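For a sense of scale, the figures above can be turned into per-plasmid numbers and a back-of-the-envelope estimate for the full 2018 collection. The extrapolation is an assumption, not a reported result: it supposes that time and cost scale linearly with the number of plasmids.

```python
PLASMIDS_PER_RUN = 960      # ten 96-well plates
RUN_MINUTES = 74            # one hour and 14 minutes
RUN_COST_USD = 5.0          # approximate cost of that run
TOTAL_DATA_SETS = 14_826    # data sets collected for 2018

# Wall-clock time and cost per plasmid for the reported run.
seconds_per_plasmid = RUN_MINUTES * 60 / PLASMIDS_PER_RUN   # 4.625 s
cost_per_plasmid = RUN_COST_USD / PLASMIDS_PER_RUN          # about $0.005

# Assumed linear scaling to the whole collection (an estimate only).
runs_needed = TOTAL_DATA_SETS / PLASMIDS_PER_RUN            # about 15.4 runs
total_hours = runs_needed * RUN_MINUTES / 60                # about 19 hours
total_cost = runs_needed * RUN_COST_USD                     # about $77

print(f"{seconds_per_plasmid:.3f} s and ${cost_per_plasmid:.4f} per plasmid")
print(f"~{total_hours:.1f} h and ~${total_cost:.0f} for {TOTAL_DATA_SETS} data sets")
```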
Table 1 summarizes the performance of the selected assemblers. The values in the table give the percentage of plasmids that assembled or aligned to Addgene's curated sequence. For those assemblers that required an additional step to circularize the assembly, corresponding values are shown after that step. In the table, the Standard Operating Procedure (SOP) planned by Addgene involves trying a selected set of assemblers in order and stopping at the first circularized assembly. The results indicate that no assembler works for all cases. The SOP succeeds 82.9% of the time, which is adequate for operational purposes. However, the remaining 17.1% of cases require manual intervention, which reduces the efficiency of the Addgene quality control process.
Table 1: Performance of selected assemblers and assembly SOP
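| Assembler | Assembly: Assembled (%) | Assembly: Aligned (%) | Circularized Assembly: Assembled (%) | Circularized Assembly: Aligned (%) |
|---|---|---|---|---|
| MaSuRCA | 88.0 | 18.4 | 18.5 | 16.2 |
| NOVOPlasty | 68.5 | 59.8 | | |
| Shovill | 92.2 | 39.8 | 38.1 | 31.6 |
| SKESA | 90.5 | 18.5 | | |
| SPAdes | 91.5 | 52.1 | 52.1 | 41.7 |
| Unicycler | 92.2 | 13.3 | | |
| SOP (ensemble) | 92.3 | 73.9 | 82.9 | 75.7 |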
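The SOP described above, trying assemblers in order and stopping at the first circularized assembly, can be sketched as a short loop. This is an illustration of the control flow only: the `Assembly` type, the `Assembler` callable signature, and the stub assemblers are hypothetical, and the order in which Addgene tries the assemblers is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Assembly:
    assembler: str
    sequence: str
    circular: bool


# An assembler here is any callable that takes the paths of the two read
# files and returns an Assembly, or None if it produced nothing.
Assembler = Callable[[str, str], Optional[Assembly]]


def run_sop(assemblers: Iterable[Assembler], reads_1: str,
            reads_2: str) -> Optional[Assembly]:
    """Try assemblers in order; return the first circularized assembly."""
    for assemble in assemblers:
        result = assemble(reads_1, reads_2)
        if result is not None and result.circular:
            return result
    return None  # nothing circularized: flag for manual intervention


# Stub assemblers standing in for real tools (hypothetical behavior).
def fails(r1: str, r2: str) -> Optional[Assembly]:
    return None


def linear_only(r1: str, r2: str) -> Optional[Assembly]:
    return Assembly("linear_only", "ACGT", circular=False)


def closes_circle(r1: str, r2: str) -> Optional[Assembly]:
    return Assembly("closes_circle", "ACGTAC", circular=True)


best = run_sop([fails, linear_only, closes_circle], "R1.fastq", "R2.fastq")
```

Returning `None` when no assembler produces a circularized result corresponds to the cases that need manual intervention.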
What's next?
Since repeated subsequences, which are common in plasmids, are known to cause assembly issues, we believe a new assembly approach could be developed by combining a recent repeat-finding approach, implemented in the tool REPdenovo, with the existing assembly approaches applied here. We hope that this new approach would successfully assemble many of the plasmids in the remaining 17.1% of cases. Whether that is true remains for others to determine. Nevertheless, the combination of the pipeline framework Toil, container technology Docker, and Biocontainers demonstrates one fruitful and reproducible approach for bioinformatics. In future work, we plan to investigate another pipeline framework, Nextflow.
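To see why repeats are troublesome, it helps to count subsequences of length k (k-mers) that occur more than once in a plasmid. Graph-based assemblers cannot tell from a repeated k-mer alone which copy a read came from. The sketch below is a simple exact-match counter on a known sequence, written for illustration. It is not the REPdenovo algorithm, which works from reads, and the example sequence is made up.

```python
from collections import Counter


def repeated_kmers(plasmid: str, k: int) -> dict[str, int]:
    """Count k-mers that occur more than once on a circular sequence."""
    if not 0 < k <= len(plasmid):
        raise ValueError("k must be between 1 and the plasmid length")
    # Append the first k-1 bases so k-mers spanning the origin are counted.
    wrapped = plasmid + plasmid[:k - 1]
    counts = Counter(wrapped[i:i + k] for i in range(len(plasmid)))
    return {kmer: n for kmer, n in counts.items() if n > 1}


# Made-up sequence containing the 7-mer GATTACA twice (forward strand only).
example = "AAACGATTACACCCGGATTACATTTG"
repeats = repeated_kmers(example, 7)
```

A plasmid whose repeats are longer than the reads will show such duplicated k-mers for any k the assembler can use.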