CloVR Microbe Annotation Walkthrough

Getting Started

Use of the Cloud is entirely optional. If you want to use the Amazon Cloud, be sure to have configured your Amazon EC2 credentials. Also, usage on Amazon EC2 is charged per hour, so take care to terminate instances after a protocol has completed. See the vp-terminate-cluster command under Terminating a cluster below.

Input Data Set

FILE: Example Data Set

Before starting the CloVR-Microbe Annotation pipeline run described below, download the example data set and extract it to the shared folder in the extracted VM directory so that it is easy to reach when working from within the CloVR VM. From within the VM, the shared folder is accessible as /mnt/.

At this point, you should be able to access the following file from within your VM:

/mnt/clovr_acinetobacter_example.fsa
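
Before tagging, it can be worth confirming that the file is readable from inside the VM and looks like FASTA. The following is a minimal sketch using only standard shell tools. The function name count_fasta_records is made up for this illustration and is not part of CloVR.

```shell
# Hypothetical helper (not a CloVR command): print the number of FASTA
# records in a file, or fail if the file is unreadable or has no records.
count_fasta_records() {
    fasta_file=$1
    if [ ! -r "$fasta_file" ]; then
        echo "cannot read $fasta_file" >&2
        return 1
    fi
    # Every FASTA record starts with a header line beginning with '>'.
    # grep -c exits non-zero when the count is 0, hence the '|| true'.
    record_count=$(grep -c '^>' "$fasta_file" || true)
    if [ "$record_count" -eq 0 ]; then
        echo "no FASTA headers found in $fasta_file" >&2
        return 1
    fi
    echo "$record_count"
}

# Example:
#   count_fasta_records /mnt/clovr_acinetobacter_example.fsa
```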

Tagging Input

To specify input for the pipeline, it must first be tagged. See the Configuration File section for more information on what types of input the pipeline can accept.

In our example above there is only one input FSA file, which can be tagged as follows:

vp-add-dataset -o --tag-name inputfastas /mnt/clovr_acinetobacter_example.fsa

Configuration File

FILE: CloVR-Microbe-Annotation Configuration File

A configuration file is used when running the microbe annotation pipeline to define parameters for the various components, as well as inputs, outputs, log files and many other options that can be fine-tuned to control the pipeline. The configuration file detailed below can be found at the link above.

[input]
INPUT_FSA_TAG=inputfastas

CloVR pipelines use a tagging system that assigns unique names to data being uploaded and downloaded. These unique names are used throughout the system at many steps in the pipeline process. In this pipeline, the input tag INPUT_FSA_TAG must match the tag you used with the vp-add-dataset command above.
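
A mismatch between the config tag and the tag given to vp-add-dataset is easy to overlook, so a quick check may help. The sketch below reads a value out of a KEY=value style file with sed. The helper name get_config_value is hypothetical and not part of CloVR, and the config path in the example comment is the one used in the run command later in this walkthrough.

```shell
# Hypothetical helper (not a CloVR command): print the value assigned to
# a key in a KEY=value style config file. Commented-out lines are skipped
# because only lines that begin with "KEY=" are matched.
get_config_value() {
    config_key=$1
    config_file=$2
    # Strip the "KEY=" prefix and print the rest of the first matching line.
    sed -n "s/^${config_key}=//p" "$config_file" | head -n 1
}

# Example: the printed value should equal the tag passed to vp-add-dataset
# (inputfastas in this walkthrough).
#   get_config_value INPUT_FSA_TAG /mnt/clovr_microbe_annotation.config
```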

## organism info
## Output prefix for the organism
OUTPUT_PREFIX=asmbl

## Genus and species of the organism, Must be two words in the form 'Genus species'
ORGANISM=

An OUTPUT_PREFIX should be provided; it will be used in naming all intermediate and output files. An ORGANISM name must be provided for use when generating the output GenBank files.
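
Since ORGANISM must be exactly two words, a small check before launching the pipeline can catch an empty or malformed value. This is only a sketch: is_two_word_name is a hypothetical helper, not a CloVR command, and it checks the word count only, not whether the name is a real organism.

```shell
# Hypothetical helper (not a CloVR command): succeed only if the argument
# consists of exactly two whitespace-separated words ('Genus species').
is_two_word_name() {
    # awk's NF is the number of whitespace-separated fields on the line.
    word_count=$(printf '%s\n' "$1" | awk 'NR == 1 {print NF}')
    [ "$word_count" = "2" ]
}

# Example:
#   is_two_word_name "Genus species" && echo "ORGANISM looks well-formed"
```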

[cluster]
# Cluster Name
# Cluster name to run this on, shouldn't need to specify manually
CLUSTER_NAME=local
# Credential
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local

The cluster info section specifies which cluster to use if an existing cluster is running, or what type of cluster should be created if a new cluster is necessary. Just as with the pipeline tag and input tags, a cluster is given a unique identifier, defined by the CLUSTER_NAME option. The CLUSTER_CREDENTIAL parameter should match the Amazon EC2 credential created earlier (see Getting Started above).

[output]
# Output Directory
# Directory to download output to, this should be located
# in /mnt somewhere
OUTPUT_DIR=/mnt/output

Placement of output files from the pipeline can be controlled in the output info section of the configuration. Here the OUTPUT_DIR parameter can be set to any location within the VM where files should be deposited. It is recommended that this option be left as is to avoid complications that can occur if space runs out on the VM. The /mnt/ directory references the shared directory, which should use the file system of the computer hosting the VM and will most likely have more space than is allotted to the VM’s own file system. If the output directory is changed to a location on the VM’s file system, running out of space is a possibility.
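
To see how much room is actually available before a run, the standard df utility can be pointed at the output location. The wrapper below is a sketch. The name free_kb is hypothetical, and how much space a given run needs is not stated here, so no threshold is assumed.

```shell
# Hypothetical helper (not a CloVR command): print the free space, in
# kilobytes, on the file system that holds the given directory.
free_kb() {
    # -P keeps each file system on a single line; -k reports 1024-byte units.
    # Line 1 is the header; column 4 of line 2 is the available space.
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Example:
#   free_kb /mnt
```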

[pipeline]
# Pipeline Name
# Name of pipeline
PIPELINE_NAME=ReplaceWithYourPipelineName

Each pipeline run requires a unique PIPELINE_NAME, which will help identify the output of the pipeline later on. The config file contains other options that are used internally by the pipeline and do not need to be changed.

Running and Monitoring the Microbe Annotation pipeline

Running the CloVR-Microbe Annotation pipeline is as easy as executing the following from the command-line:

clovrMicrobe /mnt/clovr_microbe_annotation.config

This command automatically initiates a cluster, uploads the tagged data to that cluster, and starts the pipeline. To view a list of all available clusters you are running, execute:

vp-describe-cluster --list

Here is example output returned:

CLUSTER local
CLUSTER clovr-microbe-annotation

Identify the cluster that the pipeline is running on and use the --name option with vp-describe-cluster:

vp-describe-cluster --name clovr-microbe-annotation
MASTER  i-6459fb09      some-instance.compute-1.amazonaws.com        running
EXEC    i-94f855f9      some-exec.compute-1.amazonaws.com            running
GANGLIA http://some-instance.compute-1.amazonaws.com/ganglia
ERGATIS http://some-instance.compute-1.amazonaws.com/ergatis
SSH     ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no -o ConnectTimeout=30 \
-o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null \
-q -i /mnt/keys/devel1.pem root@some-instance.compute-1.amazonaws.com

This tells you that there is one master node and one exec node in ‘running’ status. It also gives you links to Ganglia and Ergatis. Visiting the Ergatis link will give you an overview of the pipeline status and show whether any elements of the pipeline have failed. The Ganglia link will display the status of the cluster, including the number of nodes and processes, available memory, and data transfers over the network.
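
If you only need one of the links, it can be pulled out of the output with awk. The sketch below assumes the output layout shown above, with a label in the first column and the URL in the second. The helper name cluster_url is hypothetical and not part of CloVR.

```shell
# Hypothetical helper (not a CloVR command): read vp-describe-cluster
# output on standard input and print the second column of the line whose
# first column equals the given label (for example ERGATIS or GANGLIA).
cluster_url() {
    awk -v label="$1" '$1 == label {print $2}'
}

# Example:
#   vp-describe-cluster --name clovr-microbe-annotation | cluster_url ERGATIS
```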

Output

The output for the pipeline will automatically be downloaded onto your local VM in the directory specified in the OUTPUT_DIR parameter.

The output for the CloVR-Microbe Annotation pipeline will include the polypeptide fasta, CDS fasta and annotation files (in GenBank and sqn formats).

Terminating a cluster

When using a cluster on EC2, you must terminate the cluster after the pipeline and download have completed. To terminate a cluster, run the following command, replacing cluster_name with the name of your cluster:

vp-terminate-cluster --cluster=cluster_name
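
To find the names that may still need terminating, the cluster list can be filtered for anything other than the local cluster. This sketch assumes the "CLUSTER name" output format of vp-describe-cluster --list shown earlier. The helper name non_local_clusters is hypothetical and not part of CloVR.

```shell
# Hypothetical helper (not a CloVR command): read the output of
# "vp-describe-cluster --list" on standard input and print the name of
# every listed cluster except the one called "local".
non_local_clusters() {
    awk '$1 == "CLUSTER" && $2 != "local" {print $2}'
}

# Example:
#   vp-describe-cluster --list | non_local_clusters
```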

Interrupting a pipeline

If the execution of CloVR-Microbe is not going well for some reason or you realize you have made a mistake, you can interrupt the pipeline by visiting the Ergatis link describing the running pipeline, and clicking the “kill” button at the top of the page. This will cause the pipeline to stop.  It may take a minute to effectively halt the pipeline. See below on restarting a pipeline.

Recovering from error and restarting the pipeline

If the execution of CloVR-Microbe fails and the pipeline has to be restarted, CloVR will attempt to resume the previous run if the same command is used. To start the pipeline from scratch instead, change PIPELINE_NAME in the config file to a different name, as in the example below. Also, note that if you have made any changes to the input data, you will need to re-tag it using vp-add-dataset.

# Name of pipeline
PIPELINE_NAME=clovr_microbe_annotation-2
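
Rather than editing the file by hand, the name can be changed with sed. The sketch below writes a modified copy so the original config stays untouched. The helper name with_pipeline_name is hypothetical and not part of CloVR, and it assumes the new name contains only letters, digits, hyphens and underscores.

```shell
# Hypothetical helper (not a CloVR command): write a copy of a config file
# with PIPELINE_NAME set to a new value. All other lines are copied as-is.
# Assumes the new name has no characters special to sed (such as / or &).
with_pipeline_name() {
    source_config=$1
    new_name=$2
    dest_config=$3
    sed "s/^PIPELINE_NAME=.*/PIPELINE_NAME=${new_name}/" "$source_config" > "$dest_config"
}

# Example (the destination file name is illustrative):
#   with_pipeline_name /mnt/clovr_microbe_annotation.config \
#       clovr_microbe_annotation-2 /mnt/clovr_microbe_annotation-2.config
```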