CloVR Blog

CloVR and IonTorrent

Great post about using CloVR for microbial genome annotation: “Microbial Annotation in the Cloud: CloVR and Ion Torrent”.

Posted in Blog | Comments Off

CloVR 454 and Illumina assembly

Assembly of genomes in CloVR is currently handled via two separate pipelines: the clovr_assembly_celera pipeline operates on sequences generated via a 454 machine (FLX or Titanium), while the clovr_assembly_velvet pipeline handles sequences generated from Illumina GA machines. This blog post details the procedures required to run assembly using both of these pipelines. The two datasets used in this demonstration follow:

Using the CloVR Dashboard to run assembly pipelines

The CloVR dashboard is a new graphical interface that makes running pipelines extremely easy. The dashboard can be found by pointing a web browser at the CloVR VM’s IP address like so:

http://[yourclovrvmip]/clovr

Once loaded the page should look like the following:

This is the dashboard’s main page, from which we can configure our pipeline. We can start by tagging our data set by hitting the Add button found in the lower left corner of the dashboard (denoted in the figure by the number one). Clicking on the Add button presents us with a new dialog window.

We can select data found under the /mnt/data/user_data/ folder here via the Select file from image button, or upload data from another location on our computer by using the Browse… button. Once a file has been uploaded it can be checked off in the right-hand pane of the dialog box. For this demonstration I will be running the 454 assembly pipeline, so the file type is set to SFF via the drop-down File Type menu. A name and description should also be given to the dataset. Once the proper data has been entered the dataset can be tagged by clicking the Tag button.

We should now see our dataset in the CloVR dashboard dataset left-hand pane and can proceed to configure our pipeline. We run the 454 assembly pipeline through the CloVR microbe pipeline:

The dataset we just tagged can be selected from a drop-down menu, and we must select the Assembly Only option as we are just interested in assembly. Any other pipeline settings may also be changed at this point. Once satisfied we can hit the Validate button to ensure that our configuration is OK, followed by the Submit button to submit the pipeline to run.

The pipeline steps can be monitored through the dashboard with more fine-grained messages concerning the status of each individual step available by clicking on a specific pipeline that is running.

Once the pipeline has completed successfully, output can also be downloaded via the data sets pane on the left-hand side of the dashboard.

Command-line 454 Assembly in CloVR

Assembly of sequences generated via a 454 machine is handled via the clovr_assembly_celera pipeline. As the name indicates, this pipeline makes use of the Celera Assembler software (version 5.4) in assembling sequences.

To demonstrate assembly in CloVR I will be running this pipeline on an E. coli data set containing 500,000 reads (the data set is included at the bottom of this blog post in the appendix) and executing the run on commodity home/office hardware:

Intel(R) Core(TM)2 Duo CPU     E6850  @ 3.00GHz (2 Cores)
MemTotal:      3980400 kB (4GB)

1.) Preparing input and pipeline configuration

To start off with we’ll want to be logged into a running CloVR image. Datasets in CloVR must be ‘tagged’ before they are ready for use. In essence, tagging a dataset is akin to associating a unique identifier with it, which any CloVR component can then call upon to retrieve the data. Tagging is done via the vp-add-dataset command:

vp-add-dataset --tag-name=454_assembly_test /mnt/data/454_500k.sff -o

If the command has executed successfully you’ll receive some status information regarding the tagging process:

Task: tagData-1296774703.25 Type: tagData       State: completed     
Num: 1/1 (100%) LastUpdated: 2011/02/03 23:11:43 UTC

With our data tagged we can move on to configuring our pipeline through the use of a configuration file generated by the vp-describe-protocols command:

vp-describe-protocols -p clovr_assembly_celera \
  -c input.INPUT_SFF_TAG=454_assembly_test \
  -c params.OUTPUT_PREFIX=454_500k \
  -c cluster.CLUSTER_NAME=local \
  -c cluster.CLUSTER_CREDENTIAL=local \
  -c cluster.TERMINATE_ONFINISH=false \
  -c pipeline.PIPELINE_DESC="CloVR 454 assembly test" \
   > /mnt/data/clovr_assembly_celera.conf

The output of this command is a configuration file for the CloVR 454 assembly pipeline configured to use the dataset we tagged above. Options are configured using the -c flag and must match a parameter found in the configuration file. Some common parameters are described below; please note that not all of these parameters are required to run the 454 assembly pipeline.

  • input.INPUT_SFF_TAG – Remember our first step where we tagged our input file (I named mine 454_assembly_test)? That value is entered under this option.
  • params.OUTPUT_PREFIX – The output prefix is used when naming files generated by the assembly pipeline. It is practical here to use the organism name or some other descriptive text such as the 454_500k value I will be using.
  • params.TRIM – This option controls how stringently we ignore low-quality base pairs past the trim points set by the 454 instrument. The recommended value to use here is chop, which does not use any base pairs past the trim points.
  • params.CLEAR – Controls where the clear range (or trim points) are specified for our input sequence data. 454 is the recommended setting here.
  • params.LINKER – The linker sequence to use. Set to either flx or titanium depending on the instrument used to generate the sequence.
  • params.INSERT_SIZE – This option should only be used when dealing with paired-end datasets, which this dataset is. The format for this option is ‘i d’ where mates are on average i +/- d base pairs apart. Our example dataset will use the value ‘8000 1000’.
  • cluster.CLUSTER_NAME – Set to local if this pipeline is being run locally (on a single machine or a local grid) or to the name of a cluster created using the vp-start-cluster command. We’re running this pipeline on our desktop so I will keep this parameter set to the value of local.
  • cluster.CLUSTER_CREDENTIAL – This parameter should be set to local if this pipeline is being run on a local machine/grid, or otherwise to the name of a credential created with the vp-add-credential command. I am keeping this one on local to indicate a local run.
  • cluster.EXEC_NODES – This parameter controls the number of nodes brought up in our cluster. If this is a local run it is recommended that this number not exceed the number of CPU cores available. If being run on EC2 this will control how many EC2 instances are brought up. For this assembly example I will have this value set to 0 as we are running on one machine.
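As a rough illustration of how the optional parameters above fit into the same -c mechanism, the sketch below prints (rather than runs) a vp-describe-protocols call that sets them. The helper name celera_config_cmd is made up for this example, the titanium linker is an assumption about the instrument, and the exact values each option accepts should be checked against the generated configuration file.

```shell
# Hypothetical helper: print a vp-describe-protocols command for the 454
# pipeline without running it, so the options can be reviewed first.
#   $1 = dataset tag, $2 = output prefix, $3 = linker (flx or titanium),
#   $4 = insert size as 'i d'
celera_config_cmd() {
  cat <<EOF
vp-describe-protocols -p clovr_assembly_celera \\
  -c input.INPUT_SFF_TAG=$1 \\
  -c params.OUTPUT_PREFIX=$2 \\
  -c params.TRIM=chop \\
  -c params.CLEAR=454 \\
  -c params.LINKER=$3 \\
  -c params.INSERT_SIZE="$4" \\
  -c cluster.CLUSTER_NAME=local \\
  -c cluster.CLUSTER_CREDENTIAL=local
EOF
}

# Values from this post; "titanium" is assumed, not stated for this dataset.
celera_config_cmd 454_assembly_test 454_500k titanium "8000 1000"
```

Printing the command first makes it easy to eyeball the quoting of values that contain spaces, such as the insert size, before redirecting the real command's output into a .conf file.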

2.) Running the pipeline

Once all our configuration parameters have been set to our satisfaction we can start our pipeline by invoking the vp-run-pipeline command:

vp-run-pipeline --print-task-name \
       --pipeline-config /mnt/data/clovr_assembly_celera.conf \
       --overwrite

If the command is successfully executed, a task number should be returned which can be used to monitor the overall status of the pipeline. We can view the progress of the pipeline using the Ergatis interface that is installed on the CloVR VM. Ergatis displays our pipeline in a block-like visualization, with each analysis step displayed as a component of the pipeline. In order to view Ergatis we must first obtain the IP address for our cluster using the vp-describe-cluster command:

vp-describe-cluster

STATE   running
MASTER  local   clovr-10-90-135-181     running
GANGLIA http://clovr-10-90-135-181/ganglia
ERGATIS http://clovr-10-90-135-181/ergatis

The URL to the Ergatis interface running off of our image will be listed in the output of the vp-describe-cluster command under the ERGATIS parameter. Taking this and tossing it into a web browser should take us to the following page:

From here clicking on the clovr link on the left hand navigation will take us to a list of all our current running pipelines. The pipeline we just started should be here and viewable in the component-by-component view Ergatis is capable of displaying.

It is also possible to monitor cluster usage during a pipeline run by using the URL provided in the GANGLIA parameter from the vp-describe-cluster output:
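Rather than copying these URLs by hand, a small helper can pull them out of the vp-describe-cluster output. The sketch below is only an illustration: the function name cluster_url is made up, and it assumes the output keeps the whitespace-separated "KEY value" layout shown above.

```shell
# Hypothetical helper: read vp-describe-cluster output on stdin and print
# the value that follows the given key (e.g. ERGATIS or GANGLIA).
cluster_url() {
  awk -v key="$1" '$1 == key { print $2; exit }'
}

# Sample output copied from this post; on a live VM one would pipe
# vp-describe-cluster into the function instead.
cluster_url ERGATIS <<'EOF'
STATE   running
MASTER  local   clovr-10-90-135-181     running
GANGLIA http://clovr-10-90-135-181/ganglia
ERGATIS http://clovr-10-90-135-181/ergatis
EOF
```

With the sample input this prints http://clovr-10-90-135-181/ergatis. On a running image the equivalent would be something like vp-describe-cluster | cluster_url GANGLIA.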

3.) Pipeline Output

Once our assembly pipeline run has finished, CloVR automatically downloads any output back to our VM and deposits it in the output directory defined in the pipeline configuration file. Navigating to this directory should provide a tarball file containing our assembly scaffolds as well as a QC file containing metrics for the assembly (e.g. N50).
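For readers unfamiliar with N50, here is a minimal sketch of how the metric is defined: sort the sequences from longest to shortest and report the length at which the running total first reaches half of all assembled bases. This is not how CloVR produces its QC file (the post does not describe that); the function names and the scaffolds.fasta file name mentioned afterwards are made up for illustration.

```shell
# Print one sequence length per FASTA record read from stdin.
fasta_lengths() {
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
       { len += length($0) }
       END { if (seen) print len }'
}

# Read one length per line on stdin and print the N50: walking from the
# longest sequence down, the length at which the running total first
# reaches half of all bases.
n50() {
  sort -rn | awk '{ len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        run += len[i]
        if (run * 2 >= total) { print len[i]; exit }
      }
    }'
}

# Toy example: lengths 2..10 sum to 54, and 10 + 9 + 8 = 27 reaches half.
printf '%s\n' 2 3 4 5 6 7 8 9 10 | n50   # prints 8
```

After unpacking the output tarball, something like fasta_lengths < scaffolds.fasta | n50 would give a quick cross-check against the value reported in the QC file.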

Command-line Illumina Assembly in CloVR

Illumina assembly in CloVR follows the same workflow as the 454 assembly did, save for running the data through a separate pipeline and having to run the pipeline on Amazon’s EC2 compute cloud.

1.) Preparing input and pipeline configuration

This example will make use of an input data set of 4 million paired-end reads (provided in the appendix at the bottom of this post) that must be tagged as input just like the 454 example data set.

vp-add-dataset --tag-name=ecoli_4M_fastq \
                 /mnt/data/illumina_4M_1.fastq \
                 /mnt/data/illumina_4M_2.fastq -o

Before moving on to generation and modification of the pipeline configuration file, it should be noted that due to the size of our input dataset it will require a machine with at least 16GB of RAM. During peak processing of this dataset our RAM usage will eclipse 11GB, and this will lock up machines that do not have the required amount of memory. Because of the memory requirements of this dataset we will be making use of Amazon’s EC2 compute cloud and the m2.xlarge instance type containing 16GB of memory.
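A quick way to decide between a local run and EC2 is to compare the machine's MemTotal against the peak usage quoted above. The helper below is a sketch under a few assumptions: the name has_enough_ram is made up, it reads a Linux /proc/meminfo-style file, and it treats 1 GB as 1024 * 1024 kB. Note that a machine sold as "16GB" usually reports slightly less than that in MemTotal, so the example checks against the 11GB peak rather than the nominal 16GB.

```shell
# Hypothetical helper: succeed if the MemTotal line of a meminfo-style file
# reports more than the given number of GB (1 GB taken as 1024 * 1024 kB).
#   $1 = required GB, $2 = file to read (defaults to /proc/meminfo, - = stdin)
has_enough_ram() {
  awk -v need_gb="$1" '
    $1 == "MemTotal:" { found = 1; ok = ($2 > need_gb * 1024 * 1024) }
    END { exit !(found && ok) }
  ' "${2:-/proc/meminfo}"
}

# The desktop used earlier in this post reports 3980400 kB, which is well
# short of the 11GB peak quoted above.
if printf 'MemTotal:      3980400 kB\n' | has_enough_ram 11 -; then
  echo "enough memory for a local run"
else
  echo "not enough memory, use a larger machine or EC2"
fi
```

On the VM itself the check would simply be has_enough_ram 11, which reads /proc/meminfo directly.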

In order to do this we must generate an EC2 cluster credential using the vp-add-credential command coupled with the private key and certificate provided with an Amazon AWS account. Instructions to register for an Amazon AWS account and place the subsequent keys can be found here.

Once our keys are in place we can create our cluster credential by using the following command:

vp-add-credential --cred-name illumina_ec2 --cert /mnt/keys/ec2.cert \
                  --pkey /mnt/keys/ec2.pkey

With our credential created we can move on to generating our pipeline config by running the vp-describe-protocols command, the key difference being that our pipeline name is clovr_assembly_velvet:

vp-describe-protocols -p clovr_assembly_velvet \
       -c input.SHORT_PAIRED_TAG=ecoli_4M_fastq \
       -c cluster.CLUSTER_NAME=illumina_ec2_cluster \
       -c cluster.CLUSTER_CREDENTIAL=illumina_ec2 \
       -c cluster.MASTER_INSTANCE_TYPE="large" \
       -c cluster.TERMINATE_ONFINISH=false \
       -c pipeline.PIPELINE_DESC="CloVR illumina test run" \
       > /mnt/data/clovr_assembly_velvet_ec2.conf

An explanation of the parameters used follows:

  • input.SHORT_PAIRED_TAG – This parameter tells the pipeline where to find the dataset we created when running the vp-add-dataset command. It should be set to whatever was specified in the tag-name flag, which for this example is ecoli_4M_fastq.
  • cluster.CLUSTER_NAME – A unique name given to our cluster. If a local cluster is being run this should be left as the default value of local. For our example I will set this to illumina_ec2_cluster.
  • cluster.CLUSTER_CREDENTIAL – Here we will use the credential we generated above, illumina_ec2.
  • cluster.MASTER_INSTANCE_TYPE – This parameter controls the type of master node used in an EC2 cluster. Because this Illumina run requires a high-memory machine we set the parameter to large to indicate we want to use a machine with 16GB of memory.

If we had a local machine with 16GB of memory or more on which we could run this pipeline, our call to vp-describe-protocols would look like this:

vp-describe-protocols -p clovr_assembly_velvet \
       -c input.SHORT_PAIRED_TAG=ecoli_4M_fastq \
       -c cluster.CLUSTER_NAME=local \
       -c cluster.CLUSTER_CREDENTIAL=local \
       -c cluster.TERMINATE_ONFINISH=false \
       -c pipeline.PIPELINE_DESC="CloVR illumina test run" \
       > /mnt/data/clovr_assembly_velvet_local.conf

The three parameters that differ between this and our EC2 run are cluster.CLUSTER_NAME, cluster.CLUSTER_CREDENTIAL, and cluster.MASTER_INSTANCE_TYPE. Here we do not need a master instance type, and our cluster name and cluster credential need to be set to local.

2.) Running the pipeline

With our configuration file in place we can go ahead and launch our pipeline using the vp-run-pipeline command:

vp-run-pipeline --print-task-name \
  --pipeline-config /mnt/data/clovr_assembly_velvet_ec2.conf \
  --overwrite

Likewise if we wanted to run this pipeline locally we could substitute the clovr_assembly_velvet_ec2.conf file with the clovr_assembly_velvet_local.conf generated above.

After successfully executing this command we can monitor the pipeline in Ergatis, just as described in the 454 portion of this post, using the vp-describe-cluster command to grab the IP of our cluster. The only difference here is that we must supply our cluster name because we are running on EC2.

vp-describe-cluster --name illumina_ec2_cluster

MASTER ec2 ec2-184-72-196-20.compute-1.amazonaws.com running
GANGLIA http://ec2-184-72-196-20.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-184-72-196-20.compute-1.amazonaws.com/ergatis

3.) Pipeline Output

When the pipeline has completed, its output can be found in the /mnt/output directory.

Post-assembly

Although it wasn’t covered in this blog post, CloVR does support some basic visualization of assemblies through the Hawkeye program from the AMOS package. This functionality can be enabled by setting the SKIP_BANK parameter in the pipeline configuration to 1. Once this is enabled, an AMOS bank file is generated and included in the output at the conclusion of the pipeline run.


CloVR interface demo video available

A video has been added to the CloVR YouTube Channel that demonstrates using CloVR from image startup to results download in 5 steps:

  1. Start CloVR Virtual Machine
  2. Add Data
  3. Add Credential Account
  4. Run Analysis
  5. Download Results

Check it out:


Cunningham & Autoscaling

Several CloVR tracks now utilize autoscaling in the cloud to improve performance. Efficient autoscaling can be difficult to achieve and requirements vary for each pipeline. If we underestimate the number of instances needed, the pipeline will take significantly longer. If we overestimate how many instances we need, then we end up wasting resources (and money in the case of Amazon EC2).

In our workflows, some form of BLAST often takes up the majority of CPU time. Fortunately this process can be parallelized fairly easily by partitioning the query dataset. But a few questions arise:

  • How much should we partition the query data?
  • How many parallelized jobs do we schedule?
  • How many instances do we need to request without being wasteful?

To help answer these questions in CloVR, we’ve developed a BLAST runtime estimator called Cunningham. Cunningham computes statistics about the shared sequence composition between a database and a query dataset in order to estimate how many CPU hours a corresponding BLAST job would take. With this information, each pipeline determines how many instances it will need overall and how to partition the data.
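To make the last step concrete, here is a minimal sketch of turning a CPU-hour estimate into an instance count. This is not Cunningham's or CloVR's actual logic, which the post does not spell out; the function name, the target wall-clock time, and the assumption of perfectly even scaling across cores are all ours.

```shell
# Hypothetical sizing rule (not Cunningham's actual logic): given an
# estimated number of CPU hours, work out how many instances would finish
# the job within a target wall-clock time, capped at what is available.
#   $1 = estimated CPU hours   $2 = cores per instance
#   $3 = target hours          $4 = maximum instances allowed
instances_needed() {
  per_instance=$(( $2 * $3 ))                       # CPU hours one instance delivers
  n=$(( ($1 + per_instance - 1) / per_instance ))   # ceiling division
  [ "$n" -gt "$4" ] && n=$4
  echo "$n"
}

instances_needed 1000 8 4 32   # 1000 / (8 * 4) = 31.25, rounded up: prints 32
instances_needed 100 8 4 32    # 100 / 32 = 3.125, rounded up: prints 4
```

Underestimating the CPU hours here stretches the run past the target time, while overestimating requests instances that sit idle, which is the trade-off described at the top of this post.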

Currently, we support BLAST{N,P,X} runtime estimates against several well-known databases including NCBI COGs, eggNOG, KEGG genes, NCBI-NR, SILVA, and RefSeq (microbial genomes). Cunningham is lightweight and quite fast, typically requiring a few minutes to run. If you’re interested, check out the Cunningham white paper in Nature Precedings.

The latest version of Cunningham is freely available through SourceForge.


CloVR On The DIAG

IGS has been working to get the DIAG up and running.  One portion of the DIAG is a cloud implementation using Nimbus.  This week we fired off our first big test on the DIAG, a pretty massive BLAST run.  The platform is still in testing but CloVR-Search ran just as expected.  The pipeline automatically scaled to the full 32 instances available.

DIAG is a free academic cluster that will include an EC2-like Cloud environment thanks to the Nimbus software.  The CloVR virtual machine is designed to be portable across clouds and has been run on Amazon EC2, Argonne Magellan, and the DIAG.

Here is a screen shot from Ganglia:

CloVR Running A BLAST On The DIAG

Here is a screen shot showing the day view. You can see the autoscaling; it is the red line in the top-left graph.

A day view of CloVR running on the DIAG


Screencast

Here is a screencast that uses the command-line interface to run a small metagenomics analysis for demonstration purposes. This mock demo was presented at the Beyond the Genome Cloud Computing Workshop in Boston in October 2010.

(no audio for now)


On-demand 1280 CPU cluster using EC2

We are using Amazon EC2 to quickly deploy clusters and run searches. We’ve been stress testing this with CloVR over the past few months to make sure our platform scales out as expected. We’ve been very pleased with the results. In one test, we launched a cluster with 160 c1.xlarge instances to run a BLASTX search. This gave us 1280 CPUs for processing.

Here is a screenshot from Ganglia during the scale out:

We stopped the pipeline early to save credits after gathering some stats. Scale down looked good too, leaving just a master instance up at the end.

We used rsync over HPN-SSH to transfer the NCBI nr database out to each instance. We’ve set this up to run peer-to-peer so that any instance can send a copy of the database once it is ready. Using this, we saw network throughput top 1GB/sec on our cluster.

The graph shows the throughput step up as additional instances came online. A single c1.xlarge instance has been giving us <30MB/sec.
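Two back-of-the-envelope calculations help explain why peer-to-peer distribution pays off. The sketch below uses an idealized model that is our assumption rather than a measurement: in each round, every instance that already holds the database sends it to one new instance, so the number of copies doubles. The function names are made up for illustration.

```shell
# Idealized model (an assumption, not a measurement): every instance that
# already holds the database sends it to one new instance per round, so the
# number of copies doubles each round. Print the rounds needed to reach $1.
p2p_rounds() {
  have=1
  rounds=0
  while [ "$have" -lt "$1" ]; do
    have=$(( have * 2 ))
    rounds=$(( rounds + 1 ))
  done
  echo "$rounds"
}

# Smallest number of senders needed to reach an aggregate rate, given a
# per-instance rate in the same units (ceiling division).
#   $1 = aggregate rate, $2 = per-instance rate
min_senders() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

p2p_rounds 160        # prints 8 (1, 2, 4, ... 256 after eight doublings)
min_senders 1000 30   # prints 34, taking 1GB/sec as 1000MB/sec
```

Under this model a 160-instance cluster is fully seeded in eight rounds instead of 160 sequential copies from the master, and since a single instance gave us less than 30MB/sec, an aggregate of 1GB/sec implies at least 34 instances sending at once.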

One other interesting observation during our tests is that a single request can move the spot market price, at least for our tests on m1.xlarge in the US East region. We’ve been using the Amazon spot market instances for testing since they are usually ~1/3 the price of on-demand instances (~$0.22-$0.25 versus the on-demand price of $0.68). During one of our tests in July, we were monitoring the market closely and submitted a single request for 150 m1.xlarge instances. We saw the price spike to $0.68 immediately after our request was submitted. Two days later we repeated the same experiment and got the same outcome.

The price dropped back to $0.23 after our run.
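To put those prices in perspective, here is a small sketch of the hourly bill for a 150-instance request at the two prices mentioned in this post. The helper name is made up, and the calculation ignores everything except the per-instance hourly price.

```shell
# Hourly cost of a request: $1 = number of instances, $2 = price per
# instance-hour in dollars. awk is used because the shell has no floats.
hourly_cost() {
  awk -v n="$1" -v price="$2" 'BEGIN { printf "%.2f\n", n * price }'
}

hourly_cost 150 0.23   # spot price quoted above: prints 34.50
hourly_cost 150 0.68   # on-demand price:         prints 102.00
```

So a spike from the usual spot price to the on-demand price roughly triples the cost of each hour the cluster is up, which is why it is worth watching the market when submitting large requests.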
