CloVR 454 and Illumina assembly

Assembly of genomes in CloVR is currently handled via two separate pipelines: the clovr_assembly_celera pipeline operates on sequences generated by a 454 machine (FLX or Titanium), while the clovr_assembly_velvet pipeline handles sequences generated by Illumina GA machines. This blog post details the procedures required to run assembly with both of these pipelines. The two datasets used in this demonstration follow:

Using the CloVR Dashboard to run assembly pipelines

The CloVR dashboard is a new graphical interface that makes running pipelines extremely easy. The dashboard can be found by pointing a web browser at the CloVR VM’s IP address like so:

http://[yourclovrvmip]/clovr

Once loaded, the page should look like the following:

This is the dashboard’s main page, from which we can configure our pipeline. We can start by tagging our dataset by hitting the Add button found in the lower left corner of the dashboard (denoted in the figure by the number one). Clicking on the Add button presents us with a new dialog window.

Here we can select data found under the /mnt/data/user_data/ folder via the Select file from image button, or upload data from another location on our computer by using the Browse… button. Once a file has been uploaded it can be checked off in the right-hand pane of the dialog box. For this demonstration I will be running the 454 assembly pipeline, so the file type is set to SFF via the File Type drop-down menu. A name and description should also be given to the dataset. Once the proper data has been entered, the dataset can be tagged by clicking the Tag button.

We should now see our dataset in the left-hand dataset pane of the CloVR dashboard and can proceed to configure our pipeline. We run the 454 assembly pipeline through the CloVR microbe pipeline:

The dataset we just tagged can be selected from a drop-down menu, and we must select the Assembly Only option as we are only interested in assembly. Any other pipeline settings may also be changed at this point. Once satisfied, we can hit the Validate button to ensure that our configuration is OK, followed by the Submit button to submit the pipeline to run.

The pipeline steps can be monitored through the dashboard with more fine-grained messages concerning the status of each individual step available by clicking on a specific pipeline that is running.

Once the pipeline has completed successfully, the output can be downloaded via the datasets pane on the left-hand side of the dashboard.

Command-line 454 Assembly in CloVR

Assembly of sequences generated via a 454 machine is handled by the clovr_assembly_celera pipeline. As the name indicates, this pipeline makes use of the Celera Assembler software (version 5.4) to assemble sequences.

To demonstrate assembly in CloVR I will be running this pipeline on an E. coli dataset containing 500,000 reads (the dataset is included in the appendix at the bottom of this blog post) and executing the run on commodity home/office hardware:

Intel(R) Core(TM)2 Duo CPU     E6850  @ 3.00GHz (2 Cores)
MemTotal:      3980400 kB (4GB)

1.) Preparing input and pipeline configuration

To start off with, we’ll want to be logged into a running CloVR image. Datasets in CloVR must be ‘tagged’ before they are ready for use. In essence, tagging a dataset associates a unique identifier with it, which any CloVR component can then use to retrieve the data. Tagging is done via the vp-add-dataset command:

vp-add-dataset --tag-name=454_assembly_test /mnt/data/454_500k.sff -o

If the command has executed successfully you’ll receive some status information regarding the tagging process:

Task: tagData-1296774703.25 Type: tagData       State: completed     
Num: 1/1 (100%) LastUpdated: 2011/02/03 23:11:43 UTC
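
If tagging is being scripted rather than typed by hand, it can help to check the reported state before moving on. The following is a minimal sketch that assumes the status text looks like the sample above; task_state is a made-up helper name and not part of CloVR.

```sh
#!/bin/sh
# Hypothetical helper: print the word that follows "State:" in task status
# text read from stdin. Assumes the layout shown in the sample output above.
task_state() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "State:") { print $(i + 1); exit }
  }'
}

# Example using the status text from this post, fed in via printf.
state=$(printf '%s\n' \
  'Task: tagData-1296774703.25 Type: tagData       State: completed' \
  'Num: 1/1 (100%) LastUpdated: 2011/02/03 23:11:43 UTC' | task_state)

if [ "$state" = "completed" ]; then
  echo "dataset tagged"
else
  echo "tagging not finished (state: $state)" >&2
fi
```

Whether vp-add-dataset writes this text to standard output so that it can be piped straight into such a helper is an assumption worth verifying on your own VM.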

With our data tagged we can move on to configuring our pipeline through the use of a configuration file generated by the vp-describe-protocols command:

vp-describe-protocols -p clovr_assembly_celera \
  -c input.INPUT_SFF_TAG=454_assembly_test \
  -c params.OUTPUT_PREFIX=454_500k \
  -c cluster.CLUSTER_NAME=local \
  -c cluster.CLUSTER_CREDENTIAL=local \
  -c cluster.TERMINATE_ONFINISH=false \
  -c pipeline.PIPELINE_DESC="CloVR 454 assembly test" \
   > /mnt/data/clovr_assembly_celera.conf

The output of this command is a configuration file for the CloVR 454 assembly pipeline configured to use the dataset we tagged above. Options are configured using the -c flag and must match a parameter found in the configuration file. Some common parameters are described below; please note that not all of these parameters are required to run the 454 assembly pipeline.

  • input.INPUT_SFF_TAG - Remember our first step where we tagged our input file (I named mine 454_assembly_test)? That value is entered under this option.
  • params.OUTPUT_PREFIX – The output prefix is used when naming files generated by the assembly pipeline. It is practical here to use the organism name or some other descriptive text, such as the 454_500k value used in the command above.
  • params.TRIM – This option controls how low-quality base pairs past the trim points set by the 454 instrument are handled. The recommended value to use here is chop, which does not use any base pairs past the trim points.
  • params.CLEAR – Controls where the clear range (or trim points) are specified for our input sequence data. 454 is the recommended setting here.
  • params.LINKER – The linker sequence to use. Set to either flx or titanium depending on the instrument used to generate sequence.
  • params.INSERT_SIZE – This option should only be used when dealing with paired-end datasets, which this dataset is. The format for this option is ‘i d’, where mates are on average i +/- d base pairs apart. Our example dataset will use the value ‘8000 1000’.
  • cluster.CLUSTER_NAME – Set to local if this pipeline is being run locally (on a single machine or a local grid) or to the name of a cluster created using the vp-start-cluster command. We’re running this pipeline on our desktop, so I will keep this parameter set to local.
  • cluster.CLUSTER_CREDENTIAL – This parameter should be set to local if this pipeline is being run on a local machine/grid, or otherwise to the name of the credential created with the vp-add-credential command. I am keeping this one on local to indicate a local run.
  • cluster.EXEC_NODES – This parameter controls the number of nodes brought up in our cluster. If this is a local run it is recommended that this number not exceed the number of CPU cores available. If being run on EC2 this will control how many EC2 instances are brought up. For this assembly example I will have this value set to 0 as we are running on one machine.
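
To see how the parameters above fit together, here is a small sketch that assembles a fuller vp-describe-protocols call from a list of settings and prints it instead of running it. The helper name build_describe_cmd is made up for this post, and the titanium linker value and the quoting used for params.INSERT_SIZE are assumptions; check them against your own data and the generated configuration file.

```sh
#!/bin/sh
# Print a vp-describe-protocols command line built from KEY=VALUE settings.
# Nothing is executed; the command is only echoed so it can be reviewed first.
# Parameter names come from the list above. The value given for params.LINKER
# and the quoting of params.INSERT_SIZE are assumptions for illustration.
build_describe_cmd() {
  protocol=$1
  shift
  printf 'vp-describe-protocols -p %s' "$protocol"
  for setting in "$@"; do
    printf " -c '%s'" "$setting"
  done
  printf '\n'
}

build_describe_cmd clovr_assembly_celera \
  "input.INPUT_SFF_TAG=454_assembly_test" \
  "params.OUTPUT_PREFIX=454_500k" \
  "params.TRIM=chop" \
  "params.CLEAR=454" \
  "params.LINKER=titanium" \
  "params.INSERT_SIZE=8000 1000" \
  "cluster.CLUSTER_NAME=local" \
  "cluster.CLUSTER_CREDENTIAL=local"
```

Printing the command first makes it easy to spot a mistyped parameter name before a configuration file is written.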

2.) Running the pipeline

Once all our configuration parameters have been set to our satisfaction we can start our pipeline by invoking the vp-run-pipeline command:

vp-run-pipeline --print-task-name \
       --pipeline-config /mnt/data/clovr_assembly_celera.conf \
       --overwrite

If the command is successfully executed, a task name should be returned which can be used to monitor the overall status of the pipeline. We can view the progress of the pipeline using the Ergatis interface that is installed on the CloVR VM. Ergatis displays our pipeline in a block-like visualization, with each analysis step shown as a component of the pipeline. In order to view Ergatis we must first obtain the address of our cluster using the vp-describe-cluster command:

vp-describe-cluster

STATE   running
MASTER  local   clovr-10-90-135-181     running
GANGLIA http://clovr-10-90-135-181/ganglia
ERGATIS http://clovr-10-90-135-181/ergatis

The URL of the Ergatis interface running on our image is listed in the output of the vp-describe-cluster command under the ERGATIS parameter. Opening it in a web browser should take us to the following page:

From here, clicking on the clovr link in the left-hand navigation will take us to a list of all our currently running pipelines. The pipeline we just started should be here and viewable in the component-by-component view Ergatis is capable of displaying.
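
If you would rather script this step than copy the URL by hand, the relevant line can be picked out of the vp-describe-cluster output with awk. This is only a sketch: cluster_field is a hypothetical helper, and it assumes the output keeps the label-then-value layout shown in the sample above.

```sh
#!/bin/sh
# Hypothetical helper: print the value listed next to a label (for example
# ERGATIS or GANGLIA) in vp-describe-cluster style output read from stdin.
cluster_field() {
  awk -v key="$1" '$1 == key { print $2; exit }'
}

# Sample output copied from this post; in practice the text would come from
# running vp-describe-cluster on the CloVR VM.
sample='STATE   running
MASTER  local   clovr-10-90-135-181     running
GANGLIA http://clovr-10-90-135-181/ganglia
ERGATIS http://clovr-10-90-135-181/ergatis'

ergatis_url=$(printf '%s\n' "$sample" | cluster_field ERGATIS)
echo "Open this in a browser: $ergatis_url"
```

The same helper works for the GANGLIA line by passing GANGLIA as the label.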

It is also possible to monitor cluster usage during a pipeline run by using the URL provided in the GANGLIA parameter from the vp-describe-cluster output:

3.) Pipeline Output

Once our assembly pipeline run has finished, CloVR automatically downloads any output back to our VM and deposits it in the output directory defined in the pipeline configuration file. Navigating to this directory should reveal a tarball containing our assembly scaffolds as well as a QC file containing metrics for the assembly (e.g. N50).
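
For readers who have not met N50 before, the sketch below shows how the metric can be worked out from a plain list of contig lengths. It illustrates the definition only and is not the code CloVR uses to produce its QC file; the lengths are made up.

```sh
#!/bin/sh
# N50: the contig length at which contigs of that length or longer account
# for at least half of the total assembled bases.
# Reads one contig length per line on stdin.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) { print len[i]; exit }
      }
    }'
}

# Six made-up contig lengths totalling 290 bases. The two longest contigs
# sum to 150, which is at least half of 290, so this prints 70.
printf '%s\n' 80 70 50 40 30 20 | n50
```

A higher N50 for the same total length generally means the assembly is made up of fewer, longer pieces.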

Command-line Illumina Assembly in CloVR

Illumina assembly in CloVR follows the same workflow as the 454 assembly, except that the data goes through a separate pipeline and, for this example, the pipeline must be run on Amazon’s EC2 compute cloud.

1.) Preparing input and pipeline configuration

This example will make use of an input dataset of 4 million paired-end reads (provided in the appendix at the bottom of this post) that must be tagged as input just like the 454 example dataset.

vp-add-dataset --tag-name=ecoli_4M_fastq \
                 /mnt/data/illumina_4M_1.fastq \
                 /mnt/data/illumina_4M_2.fastq -o

Before moving on to generating and modifying the pipeline configuration file, it should be noted that due to the size of our input dataset it will require a machine with at least 16GB of RAM. During peak processing of this dataset our RAM usage will eclipse 11GB, and this will lock up machines that do not have the required amount of memory. Because of this memory requirement we will be making use of Amazon’s EC2 compute cloud and the m2.xlarge instance type containing 16GB of memory.
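
To decide between a local run and EC2, the memory figure can be checked in a script. The sketch below is an illustration under a couple of assumptions: has_enough_ram is a made-up helper, and the threshold is set slightly below a nominal 16GB because the MemTotal value reported by Linux excludes memory reserved by the kernel.

```sh
#!/bin/sh
# Hypothetical helper: succeed only if the MemTotal value in
# /proc/meminfo-style text on stdin is at least the given number of kB.
has_enough_ram() {
  awk -v need="$1" '
    $1 == "MemTotal:" { found = 1; ok = ($2 + 0 >= need + 0) }
    END { exit !(found && ok) }'
}

# Assumed threshold: 15GB expressed in kB, a little under a nominal 16GB.
required_kb=$((15 * 1024 * 1024))

# The desktop used for the 454 run earlier in this post reports 3980400 kB.
if printf '%s\n' 'MemTotal:      3980400 kB' | has_enough_ram "$required_kb"; then
  echo "enough memory for a local run"
else
  echo "not enough memory; run on EC2 instead"
fi

# On a Linux machine the real figure can be checked with:
#   has_enough_ram "$required_kb" < /proc/meminfo
```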

In order to do this we must generate an EC2 cluster credential using the vp-add-credential command coupled with the private key and certificate provided with an Amazon AWS account. Instructions to register for an Amazon AWS account and place the subsequent keys can be found here.
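
A quick check that both key files are present and readable can save a failed credential step later. This is a minimal sketch; check_key_files is a made-up helper, and the paths are the ones used in the vp-add-credential example in this post.

```sh
#!/bin/sh
# Hypothetical helper: report whether each given file exists, is readable
# and is not empty. Returns non-zero if any file fails the check.
check_key_files() {
  status=0
  for f in "$@"; do
    if [ -r "$f" ] && [ -s "$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Paths match the vp-add-credential example in this post.
check_key_files /mnt/keys/ec2.cert /mnt/keys/ec2.pkey \
  || echo "place the AWS certificate and private key before continuing" >&2
```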

Once our keys are in place we can create our cluster credential by using the following command:

vp-add-credential --cred-name illumina_ec2 --cert /mnt/keys/ec2.cert \
                  --pkey /mnt/keys/ec2.pkey

With our credential created we can move on to generating our pipeline config by running the vp-describe-protocols command, the key difference being that our pipeline name is clovr_assembly_velvet:

vp-describe-protocols -p clovr_assembly_velvet \
       -c input.SHORT_PAIRED_TAG=ecoli_4M_fastq \
       -c cluster.CLUSTER_NAME=illumina_ec2_cluster \
       -c cluster.CLUSTER_CREDENTIAL=illumina_ec2 \
       -c cluster.MASTER_INSTANCE_TYPE="large" \
       -c cluster.TERMINATE_ONFINISH=false \
       -c pipeline.PIPELINE_DESC="CloVR illumina test run" \
       > /mnt/data/clovr_assembly_velvet_ec2.conf

An explanation of the parameters used follows:

  • input.SHORT_PAIRED_TAG – This parameter tells the pipeline where to find the dataset we created when running the vp-add-dataset command. It should be set to whatever was specified in the tag-name flag, which for this example is ecoli_4M_fastq.
  • cluster.CLUSTER_NAME – A unique name given to our cluster. If a local cluster is being run this should be left as the default value of local. For our example I will set this to illumina_ec2_cluster.
  • cluster.CLUSTER_CREDENTIAL – Here we will use the credential we generated above, illumina_ec2
  • cluster.MASTER_INSTANCE_TYPE – This parameter controls the type of master node used in an EC2 cluster. Because this Illumina run requires a high memory machine we set the parameter to large to indicate we want to use a machine with 16GB of memory.

If we had a local machine with 16GB of memory or more on which to run this pipeline, our call to vp-describe-protocols would look like this:

vp-describe-protocols -p clovr_assembly_velvet \
       -c input.SHORT_PAIRED_TAG=ecoli_4M_fastq \
       -c cluster.CLUSTER_NAME=local \
       -c cluster.CLUSTER_CREDENTIAL=local \
       -c cluster.TERMINATE_ONFINISH=false \
       -c pipeline.PIPELINE_DESC="CloVR illumina test run" \
       > /mnt/data/clovr_assembly_velvet_local.conf

Three parameters differ between this and our EC2 run: cluster.CLUSTER_NAME, cluster.CLUSTER_CREDENTIAL, and cluster.MASTER_INSTANCE_TYPE. Here we do not need a master instance type, and our cluster name and cluster credential are both set to local.

2.) Running the pipeline

With our configuration file in place we can go ahead and launch our pipeline using the vp-run-pipeline command:

vp-run-pipeline --print-task-name \
  --pipeline-config /mnt/data/clovr_assembly_velvet_ec2.conf \
  --overwrite

Likewise if we wanted to run this pipeline locally we could substitute the clovr_assembly_velvet_ec2.conf file with the clovr_assembly_velvet_local.conf generated above.

After successfully executing this command we can monitor the pipeline in Ergatis, just as described in the 454 portion of this post, using the vp-describe-cluster command to grab the address of our cluster. The only difference here is that we must supply our cluster name because we are running on EC2.

vp-describe-cluster --name illumina_ec2_cluster

MASTER ec2 ec2-184-72-196-20.compute-1.amazonaws.com running
GANGLIA http://ec2-184-72-196-20.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-184-72-196-20.compute-1.amazonaws.com/ergatis

3.) Pipeline Output

When the pipeline has completed, its output can be found in the /mnt/output directory.

Post-assembly

Although it wasn’t covered in this blog post, CloVR does support some basic visualization of assemblies through the Hawkeye program from the AMOS package. This functionality can be enabled by setting the SKIP_BANK parameter in the pipeline configuration to 1. Once this is enabled, an AMOS bank file is generated and included in the output at the conclusion of the pipeline run.
