CloVR-Microbe Walkthrough (command-line)

Introduction

To run CloVR-Microbe version 1.0, cloud support is recommended. Cloud use is optional, and CloVR-Microbe, like any other CloVR protocol, can be run on a local computer. However, several steps of the protocol either have high RAM requirements (assembly) or are computationally intensive (annotation), which can make local execution practically impossible due to long runtimes (e.g. weeks for a standard E. coli genome project). The BLAST and HMMER searches that are part of the annotation component of CloVR-Microbe benefit significantly from parallelization across multiple processors on the cloud. Assembly of 454 sequence data with the Celera Assembler requires 4 GB of RAM, whereas RAM in excess of 4 GB is required for the assembly of Illumina sequence data with Velvet.

Total runtimes of the CloVR-Microbe pipeline depend mostly on the number of predicted protein-coding genes that require functional annotation, which can vary significantly depending on the assembly output. A typical run of CloVR-Microbe (e.g. a standard E. coli genome consisting of <100 contigs with a total length of <5 Mbp), independent of the type of input sequence data, usually finishes in under 24 hours. See our publication in PLoS ONE for details.

To use the cloud to run CloVR protocols, you must obtain credentials from one of the supported cloud providers, and CloVR must be configured to use these credentials. If you want to use the Amazon Elastic Compute Cloud (EC2), be sure to have configured your Amazon EC2 credentials. Usage on Amazon EC2 is charged per hour, and care must be taken to terminate instances after a protocol has completed (see the vp-terminate-cluster command below).

 

Download Test Datasets and Output

Test datasets are .tar archives of SFF and FASTQ files and need to be extracted before use. Extracted datasets should be copied to the user_data folder of the CloVR VM. From within the VM, the shared folder is accessible as /mnt/user_data.
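The extract-and-copy step can be sketched as follows. The snippet builds a throwaway archive so it is self-contained; the archive name (example.tar) and the /tmp working directory are placeholders for your actual download and the shared user_data folder.

```shell
# Self-contained sketch of the extract-and-copy step. With real data,
# extract the downloaded .tar and copy the files into the shared
# user_data folder configured for your VM.
mkdir -p /tmp/clovr_demo/user_data
cd /tmp/clovr_demo
printf 'reads' > clovr_acinetobacter_example.sff   # stand-in for real SFF data
tar -cf example.tar clovr_acinetobacter_example.sff
rm clovr_acinetobacter_example.sff

tar -tf example.tar                                # inspect the archive contents
tar -xf example.tar                                # extract before use
cp clovr_acinetobacter_example.sff user_data/      # copy into the shared folder
```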

 

At this point, you should be able to access the following file from within your VM:

/mnt/clovr_acinetobacter_example.sff

Or:

/mnt/partial_reads_1.fastq
/mnt/partial_reads_2.fastq

If you are using VirtualBox and are having problems accessing your data in the shared folder, check “Troubleshooting on Virtual Box“.
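Before tagging, it can help to confirm that the files are actually visible from inside the VM; a minimal check (the paths are the ones used in this walkthrough):

```shell
# Report which of the expected test files are visible from inside the VM;
# prints a hint for any that are missing instead of failing.
for f in /mnt/clovr_acinetobacter_example.sff \
         /mnt/partial_reads_1.fastq \
         /mnt/partial_reads_2.fastq; do
    if [ -e "$f" ]; then
        echo "found:   $f"
    else
        echo "missing: $f (check the shared-folder setup)"
    fi
done
```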

Tagging Input

To specify input for the pipeline, it must first be tagged. Note that multiple files can be tagged at once. See the Configuration File section for more information on what types of input the pipeline can accept.

In our 454 example, there is only one input SFF file, which can be tagged as follows:

vp-add-dataset -o --tag-name acinetobacter_sff /mnt/clovr_acinetobacter_example.sff

In our Illumina example, there are two FASTQ files. These files are paired-end short reads and can be tagged as follows:

vp-add-dataset -o --tag-name example_fastq \
/mnt/partial_reads_1.fastq \
/mnt/partial_reads_2.fastq

In our annotation-only example, there is only one input FASTA file, which can be tagged as follows:

vp-add-dataset -o --tag-name inputfastas /mnt/clovr_acinetobacter_example.fsa

Configuration File

454 example run:

FILE: CloVR-Microbe454 configuration file

A configuration file is used when running the Microbe454 pipeline to define parameters for the various components, as well as inputs, outputs, log files and many other options that can be fine-tuned to control the pipeline. The configuration file detailed below can be found at the link above.

## Configuration file for clovr_microbe_454
#########################################################
## Input information.
## Configuration options for the pipeline.
#########################################################
[input]
# Input Tag
# The input tag for this pipeline
INPUT_SFF_TAG=acinetobacter_sff

CloVR pipelines use a tagging system that assigns unique names to uploaded and downloaded data. These unique names are used throughout the system during many steps of the pipeline process. In this pipeline, the input tag INPUT_SFF_TAG must match the tag you used with the vp-add-dataset command above.

[params]
# Output prefix for the organism
# Organisms have a prefix on them
OUTPUT_PREFIX=asmbl

# Organism
# Genus and species of the organism.  Must be two words in the form of: Genus species
ORGANISM=Acinetobacter baylyi

An OUTPUT_PREFIX should be provided that will be used in naming all intermediate and output files. An ORGANISM name must be provided for use when generating the output GenBank files.

## sff_to_CA options
##
## trim can be one of the following values:
## none, soft, hard, chop
TRIM=chop 

##
## clear can be one of the following values:
## all, 454, none, n, pair-of-n, discard-n
CLEAR=454

# Possible values: titanium flx
LINKER=titanium

# Insert size, must be two numbers separated by a space (ex '8000 1000')
INSERT_SIZE=8000 1000

The remaining parameters control the assembly of the reads in the SFF file. The TRIM and CLEAR parameters control which portions of the reads are considered biologically relevant; both denote which portions of the reads are technical and should therefore not be included in the assembly. LINKER and INSERT_SIZE are used when dealing with a 454 paired-end run: a Titanium or FLX linker can be specified alongside the insert size between mates, given as mean and standard deviation (mates are on average i ± d bp apart).

## celera assembler options
SPEC_FILE=/dev/null
SKIP_BANK=0

A valid Celera Assembler spec file containing additional configuration parameters for the assembly software can be provided to the pipeline. The SKIP_BANK flag can be set to 1 to also generate an AMOS bank file that can be viewed in the visualization software Hawkeye.

Illumina example run:

FILE: CloVR-Microbe-illumina configuration file

A configuration file is used when running the Microbe Illumina pipeline to define parameters for the various components, as well as inputs, outputs, log files and many other options that can be fine-tuned to control the pipeline.

The configuration file detailed below can be found in the link above.

## Configuration file for clovr_microbe_illumina
#########################################################
[input]
# Short Paired Input Tag
# The input tag describing any short paired end input reads (fasta or fastq)
SHORT_PAIRED_TAG=example_fastq

# Long Paired Input Tag
# The input tag describing any long paired end input reads (fasta or fastq)
LONG_PAIRED_TAG=

# Short Reads Input Tag
# The input tag describing any short non-paired end input reads (fasta or fastq)
SHORT_TAG=

# Long Reads Input Tag
# The input tag describing any long non-paired end input reads (fasta or fastq)
LONG_TAG=

The CloVR-Microbe Illumina pipeline can accept any number of input files. These files need to be in fasta or fastq format. The value for each of the four options above should be the tag name of a previously tagged dataset. For paired-end input (SHORT_PAIRED_TAG and LONG_PAIRED_TAG), each set of paired-end files should be tagged together. For example, if you had 2 sets of paired-end Illumina reads, you would have 2 tags as values (separated by commas):

SHORT_PAIRED_TAG=tag1,tag2
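For instance, tagging two hypothetical lanes of paired-end reads (the file and tag names below are illustrative) uses the same vp-add-dataset command shown earlier; these commands only run inside the CloVR VM:

```shell
# Hypothetical two-lane example; substitute your actual FASTQ file names.
vp-add-dataset -o --tag-name lane1_fastq \
    /mnt/lane1_reads_1.fastq \
    /mnt/lane1_reads_2.fastq
vp-add-dataset -o --tag-name lane2_fastq \
    /mnt/lane2_reads_1.fastq \
    /mnt/lane2_reads_2.fastq
# Then, in the config file:
#   SHORT_PAIRED_TAG=lane1_fastq,lane2_fastq
```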

[params]
# Output prefix for the organism
# Organisms have a prefix on them
OUTPUT_PREFIX=asmbl

# Organism
# Genus and species of the organism.  Must be two words in the form of: Genus species
ORGANISM=Escherichia coli

An OUTPUT_PREFIX should be provided that will be used in naming all intermediate and output files. An ORGANISM name must be provided for use when generating the output GenBank files.

# Start hash length
# The hash length velvet optimiser will start with. Must be an odd number,
# less than end hash length and 19 < x < 31
START_HASH_LENGTH=19

# End hash length
# The hash length velvet optimiser will end with. Must be an odd number,
# greater than start hash length and 19 < x < 31
END_HASH_LENGTH=31

# VelvetG Options
# Other options sent to velvetg. If using paired end reads, use AT LEAST
# -ins_length and -ins_length_sd. -min_contig_lgth is already set.
VELVETG_OPTS=-ins_length 300 -ins_length_sd 50

The remaining parameters control the assembly of the reads in the input files. Velvet uses a hash length parameter when looking for overlaps. For more information about this parameter, see the Velvet documentation. In this pipeline, we're using VelvetOptimiser, which can use a range of hash lengths between 19 and 31. In most cases, leaving START_HASH_LENGTH=19 and END_HASH_LENGTH=31 is the best choice. If you need any other velvetg options, you can also specify them in this config file. For example, if you are using paired-end data, you should include the insert length (-ins_length) and insert length standard deviation (-ins_length_sd) as above.

Annotation-only run:

FILE: CloVR-Microbe annotation-only configuration file

The configuration specifies the input data as follows:

[input]
INPUT_FSA_TAG=inputfastas

Additional configurations for the annotation-only component of CloVR-Microbe are identical to those from the full CloVR-Microbe pipelines.

General pipeline configurations:

[cluster]
# Cluster Name
# Cluster name to run this on, shouldn't need to specify manually
CLUSTER_NAME=local
# Credential
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local

The cluster info section provides information on which existing cluster to use if one is already running, or what type of cluster should be created if a new cluster is necessary. Just as with the pipeline tag and input tags, a cluster is given a unique identifier, defined by the CLUSTER_NAME option. The CLUSTER_CREDENTIAL parameter should match the Amazon EC2 credential created earlier (see Getting Started above).

[output]
# Output Directory
# Directory to download output to, this should be located
# in /mnt somewhere
OUTPUT_DIR=/mnt/output

Placement of output files from the pipeline can be controlled in the output info section of the configuration. Here the OUTPUT_DIR parameter can be set to any location within the VM to deposit files. It is recommended that this option be left as is to avoid complications that can occur if space runs out on the VM. The /mnt/ directory references the shared directory, which uses the file system of the computer the VM is running on and will most likely have more space than is allotted to the VM's file system. If the output directory is changed to a location on the VM's file system, running out of space is a possibility.

[pipeline]
# Pipeline Name
PIPELINE_NAME=ReplaceWithYourPipelineName

Each pipeline run requires a unique name, PIPELINE_NAME, so that the CloVR system can download the correct set of outputs after the pipeline has finished. This parameter is especially important to modify if multiple pipelines are running on the same cluster.

There are other options present in the config file which do not need to be changed and are used internally by the pipeline.

Running and Monitoring the pipeline

Running the CloVR-Microbe pipeline is as easy as executing the following from the command-line:

clovrMicrobe /mnt/clovr-microbe.config

This command automatically initiates a cluster, uploads the tagged data to that cluster, and starts the pipeline. It also returns a task id (such as runPipeline-1288298917.36), which we will use later on to track the progress of the pipeline.

To view a list of all available clusters you are running, execute:

vp-describe-cluster --list

Here is example output returned:

CLUSTER local
CLUSTER clovr-microbe-cluster

Identify the cluster that the pipeline is running on and use the --name option with vp-describe-cluster:

vp-describe-cluster --name clovr-microbe-cluster
MASTER  i-6459fb09      some-instance.compute-1.amazonaws.com        running
EXEC    i-94f855f9      some-exec.compute-1.amazonaws.com            running
GANGLIA http://some-instance.compute-1.amazonaws.com/ganglia
ERGATIS http://some-instance.compute-1.amazonaws.com/ergatis
SSH     ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no -o ConnectTimeout=30 \
-o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null \
-q -i /mnt/keys/devel1.pem root@some-instance.compute-1.amazonaws.com

This tells you that there is one master node and one exec node in ‘running’ status. It also gives you links to Ganglia and Ergatis. Visiting the Ergatis link will give you an overview of the pipeline status and if any elements of the pipeline have failed. The Ganglia link will display the status of the cluster including number of nodes and processes, available memory, and data transfers over the network.

Output

The output for the pipeline will automatically be downloaded onto your local VM in the directory specified in the OUTPUT_DIR parameter.

The output of the CloVR-Microbe pipeline includes the assembly scaffolds, an assembly QC file, polypeptide FASTA, CDS FASTA and annotation files (in GenBank and sqn formats).

Terminating a cluster

When utilizing a cluster on EC2, you must terminate the cluster after the pipeline and download have completed. To terminate a cluster, enter your cluster name:

vp-terminate-cluster --cluster=cluster_name 

Interrupting a pipeline

If the execution of CloVR-Microbe is not going well for some reason or you realize you have made a mistake, you can interrupt the pipeline by visiting the Ergatis link describing the running pipeline, and clicking the “kill” button at the top of the page. This will cause the pipeline to stop.  It may take a minute to effectively halt the pipeline. See below on restarting a pipeline.

Recovering from error and restarting the pipeline

If the execution of CloVR-Microbe fails and the pipeline has to be restarted, CloVR will attempt to resume the previous run if the same command is used. To start the pipeline from scratch, PIPELINE_NAME should be changed in the config file to a different name. Also, note that if you have made any changes to the input data, you will need to re-tag it using vp-add-dataset.

# Name of pipeline
PIPELINE_NAME=clovr_microbe-2
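The rename can be done by hand in an editor or scripted; a minimal sketch against a stand-in config file (point cfg at your real pipeline config instead, i.e. the file you passed to clovrMicrobe):

```shell
# Demonstrate the rename on a throwaway config file; the path is a
# placeholder for your real pipeline config.
cfg=/tmp/clovr_demo_restart/clovr-microbe.config
mkdir -p "$(dirname "$cfg")"
printf '[pipeline]\nPIPELINE_NAME=clovr_microbe-1\n' > "$cfg"   # stand-in config

# Give the pipeline a fresh name so CloVR starts from scratch
sed -i 's/^PIPELINE_NAME=.*/PIPELINE_NAME=clovr_microbe-2/' "$cfg"
grep '^PIPELINE_NAME' "$cfg"   # prints PIPELINE_NAME=clovr_microbe-2
```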