Cunningham & Autoscaling

Several CloVR tracks now use autoscaling in the cloud to improve performance. Efficient autoscaling is hard to get right, and the requirements vary from pipeline to pipeline. If we underestimate the number of instances needed, the pipeline takes significantly longer to finish; if we overestimate, we end up wasting resources (and money, in the case of Amazon EC2).

In our workflows, some form of BLAST usually accounts for the majority of CPU time. Fortunately, this work can be parallelized fairly easily by partitioning the query dataset (a rough partitioning sketch follows the list below). But a few questions arise:

  • How finely should we partition the query data?
  • How many parallelized jobs do we schedule?
  • How many instances do we need to request without being wasteful?
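
To make the partitioning step concrete, here is a minimal sketch of splitting a multi-FASTA query into a fixed number of chunks so each chunk can be handed to a separate BLAST job. This is illustrative only, not CloVR's actual implementation; the round-robin assignment and the function name split_fasta are our own choices for the example.

    # Illustrative sketch: split a multi-FASTA query into n_chunks files,
    # assigning records round-robin so the chunks end up roughly equal.
    import os

    def split_fasta(query_path, n_chunks, out_dir):
        """Write query records round-robin into n_chunks files; return their paths."""
        os.makedirs(out_dir, exist_ok=True)
        outs = [open(os.path.join(out_dir, "chunk_%03d.fasta" % i), "w")
                for i in range(n_chunks)]
        try:
            idx = -1
            with open(query_path) as fh:
                for line in fh:
                    if line.startswith(">"):   # new record -> move to next chunk
                        idx = (idx + 1) % n_chunks
                    if idx >= 0:               # skip anything before the first header
                        outs[idx].write(line)
        finally:
            for f in outs:
                f.close()
        return [f.name for f in outs]

    # e.g. split_fasta("query.fasta", 64, "chunks") yields 64 files,
    # one per BLAST job to be scheduled across the cluster.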

To help answer these questions in CloVR, we’ve developed a BLAST runtime estimator called Cunningham. Cunningham computes statistics about the sequence composition shared between a database and a query dataset in order to estimate how many CPU hours the corresponding BLAST job will take. With this estimate, each pipeline determines how many instances it needs overall and how to partition the data.
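The exact heuristics live in each pipeline, but the arithmetic is essentially the following back-of-the-envelope calculation: convert the estimated CPU hours into a core count for a target wall-clock time, round up to whole instances, and cut the query into one chunk per core. The parameter names (cores_per_instance, target_wall_hours) are illustrative assumptions, not CloVR settings.

    # Back-of-the-envelope sketch (not the pipeline's actual logic):
    # turn a Cunningham-style CPU-hour estimate into an instance request
    # and a query partition count.
    import math

    def plan_blast_run(est_cpu_hours, cores_per_instance=8, target_wall_hours=2.0):
        """Return (n_instances, n_chunks) for a BLAST job estimated at est_cpu_hours."""
        total_cores = max(1, math.ceil(est_cpu_hours / target_wall_hours))
        n_instances = math.ceil(total_cores / cores_per_instance)
        n_chunks = n_instances * cores_per_instance   # one query chunk per core
        return n_instances, n_chunks

    # e.g. a job estimated at 120 CPU hours, targeting ~2 wall-clock hours
    # on 8-core instances: plan_blast_run(120.0) -> (8, 64)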

Currently, we support BLAST{N,P,X} runtime estimates against several well-known databases, including NCBI COGs, eggNOG, KEGG genes, NCBI-NR, SILVA, and RefSeq (microbial genomes). Cunningham is lightweight and quite fast, typically requiring only a few minutes to run. If you’re interested, check out the Cunningham white paper in Nature Precedings.

The latest version of Cunningham is freely available through SourceForge.
