496code: SPATS Performance Optimization


We currently maintain the open-source SPATS codebase. SPATS is a tool that analyzes collections of (typically millions of) DNA sequences, determining how they align to an experimental template and gathering statistics.

The previous implementation of SPATS used industry-standard tools for sequence alignment, written in C for "optimal" performance. For the particular needs of the SPATS experiment, several tools needed to be combined; a typical analysis took rougly 3.5 hours.

After understanding the problem, we developed custom algorithms that specifically addressed the needs of the SPATS analysis. Our Python implementation performed the same analysis in under 4 minutes. For fun, we created a C++ version to do the same analysis: bulk fread(), FASTQ-tuned parser, lockless multithreading, custom sequence-lookup data structures. The code was limited by disk read speed -- in other words, we were able to perform alignments faster than we could read sequence data from the FASTQ files. This final version performed the same analysis in 13.5 seconds.

This demonstrates the kinds of 1000x speedups that are possible when a full-system understanding is applied to performance optimization.


Home | Contact: info@496code.com