Background

In recent years, genomics has evolved from an independent research discipline into a tool with widespread application both in the basic biological research community and in the public health sector. Genomic and metagenomic approaches are being applied to study individual isolates and entire communities of viral, bacterial and eukaryotic organisms found in environmental and clinical samples. Currently, an NIH Roadmap initiative, the Human Microbiome Project, is underway to study complex microbial communities from different body niches associated with health and disease. Other applications targeted at the human host include genome-wide association studies (GWAS) and RNA sequencing (RNA-Seq). In the future, sequencing-based applications are expected to become increasingly relevant in clinical diagnostics and forensic science. With the advent of affordable, large-scale new sequencing platforms, genomics applications that were long confined to a few large sequencing centers have become accessible to smaller research laboratories and health care providers. However, the need for extensive technical infrastructure, including computational hardware and the expertise to install and maintain complex software pipelines that process large amounts of sequence data, poses a significant bottleneck for genomics applications. Our goal is to provide push-button automated analysis pipelines for non-technical users through the use of two technologies: Virtual Machines and Cloud Computing.

Virtual Machines

A Virtual Machine (VM) is a piece of software that encapsulates an entire operating system and can be bundled with pre-installed and pre-configured software. A VM can be distributed over the Internet and executed anywhere in the world. Most importantly, virtualization bypasses the time-consuming and complicated process of installing and configuring software packages directly on the host operating system. The VM concept allows us to remove a significant bottleneck in current and next-generation sequence analysis: the complexity, platform dependency and maintenance burden of bioinformatics tools and sequence analysis pipelines. The CloVR VM runs as a virtual appliance on Cloud Computing platforms and user desktops.

Cloud Computing

Cloud Computing represents a variant of grid and cluster computing that provides easy, dynamic access to computer resources over the Internet. In addition, Cloud Computing providers have embraced the use of Virtual Machine technology, thus giving application developers tremendous flexibility over what can be run “in the Cloud”. Access to large computational infrastructure, such as Clouds, is increasingly necessary for analyzing the data produced by the current generation of genome sequencers. The CloVR VM currently supports the Amazon EC2 Cloud and Nimbus Science Clouds.