Getting started
Although use of the Cloud is entirely optional for CloVR-Microbe, as for all other CloVR pipelines, it is recommended for this pipeline because several steps are computationally intensive and local executions can be very time-consuming. The Velvet assembly step requires RAM in excess of 4 GB. BLAST and HMMER searches of protein sequences during the annotation process benefit significantly from parallelization across multiple processors on the Cloud.
If you want to use the Cloud to run CloVR-Microbe, you must obtain credentials from your Cloud provider and configure CloVR to use them. If you want to use the Amazon Elastic Compute Cloud (EC2), be sure to have configured your Amazon EC2 credentials. Usage on Amazon EC2 is charged per hour, so take care to terminate instances after a protocol has completed. See the vp-terminate-cluster command below.
Input Data Set
Input: E.coli Illumina Paired End data
In preparation for the CloVR Microbe Illumina pipeline run that will be described below, the example data set should be downloaded and extracted to the shared folder in the extracted VM directory to allow easy access when working from within the CloVR VM.
After extracting the example data in the shared directory, you should be able to access two files on your VM:
/mnt/partial_reads_1.fastq
/mnt/partial_reads_2.fastq
If you are using Virtual Box and are having problems accessing your data in the shared folder, check Troubleshooting on Virtual Box.
Tagging Input
Input for the Microbe Illumina pipeline must be tagged before it can be specified. See the Configuration File section for more information on what types of input the pipeline can accept.
In our example above, there are two files. These files are paired-end short reads and can be tagged as follows:
vp-add-dataset -o --tag-name example_fastq \
    /mnt/partial_reads_1.fastq \
    /mnt/partial_reads_2.fastq
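When several data sets need tagging, it can help to assemble the command programmatically and confirm the files exist first. The following is a minimal sketch, not part of CloVR: the helper name build_tag_command is hypothetical, and it uses only the -o and --tag-name options shown above.

```python
import os
import shlex

def build_tag_command(tag_name, paths):
    """Return (argv, missing) for a vp-add-dataset call.

    argv is the argument list for the command shown in this section;
    missing lists any input paths that are not regular files.
    """
    missing = [p for p in paths if not os.path.isfile(p)]
    argv = ["vp-add-dataset", "-o", "--tag-name", tag_name, *paths]
    return argv, missing

argv, missing = build_tag_command(
    "example_fastq",
    ["/mnt/partial_reads_1.fastq", "/mnt/partial_reads_2.fastq"],
)
if missing:
    print("Not found (check the shared folder):", ", ".join(missing))
# Print the command so it can be reviewed before running it in the VM.
print(shlex.join(argv))
```

Inside the VM, the resulting argument list could be passed to subprocess.run once the missing list is empty.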
Configuration File
FILE: CloVR-Microbe-illumina config
A configuration file is used when running the Microbe Illumina pipeline to define parameters to the various components as well as define inputs, outputs, log files and many other options that can be fine-tuned to control the pipeline.
The configuration file detailed below can be found in the link above.
## Configuration file for clovr_microbe_illumina
#########################################################
[input]
# Short Paired Input Tag
# The input tag describing any short paired end input reads (fasta or fastq
SHORT_PAIRED_TAG=example_fastq

# Long Paired Input Tag
# The input tag describing any long paired end input reads (fasta or fastq
LONG_PAIRED_TAG=

# Short Reads Input Tag
# The input tag describing any short non-paired end input reads (fasta or fastq
SHORT_TAG=

# Long Reads Input Tag
# The input tag describing any long non-paired end input reads (fasta or fastq
LONG_TAG=
The CloVR Microbe Illumina pipeline accepts any number of input files, which must be in fasta or fastq format. The value of each of the four options above should be the tag name of a previously tagged dataset. For paired-end input (SHORT_PAIRED_TAG and LONG_PAIRED_TAG), each set of paired-end files should be tagged together. For example, if you had 2 sets of paired-end Illumina reads, you would have 2 tags as values (separated by commas):
SHORT_PAIRED_TAG=tag1,tag2
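To illustrate how comma-separated tag values in the [input] section break down into individual tags, here is a minimal sketch using Python's standard configparser module. It is an illustration only and does not claim to reflect how CloVR itself reads the file; the function name read_input_tags is hypothetical.

```python
import configparser

SAMPLE = """\
[input]
# Short Paired Input Tag
SHORT_PAIRED_TAG=tag1,tag2
LONG_PAIRED_TAG=
SHORT_TAG=
LONG_TAG=
"""

def read_input_tags(config_text):
    """Map each [input] option to its list of tag names."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names upper-case, as written
    parser.read_string(config_text)
    tags = {}
    for option, value in parser["input"].items():
        # Empty options (no tag given) become empty lists.
        tags[option] = [t.strip() for t in value.split(",") if t.strip()]
    return tags

tags = read_input_tags(SAMPLE)
for option, names in tags.items():
    print(option, names)
```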
[params]
# Output prefix for the organism
# Organisms have a prefix on them
OUTPUT_PREFIX=asmbl

# Organism
# Genus and species of the organism. Must be two words in the form of: Genus species
ORGANISM=Escherichia coli
An OUTPUT_PREFIX should be provided; it will be used in naming all intermediate and output files. An ORGANISM name must be provided for use when generating the output GenBank files.
# Start hash length
# The hash length velvet optimiser will start with. Must be an odd number,
# less than end hash length and 19 < x < 31
START_HASH_LENGTH=19

# End hash length
# The hash length velvet optimiser will end with. Must be an odd number,
# greater than start hash length and 19 < x < 31
END_HASH_LENGTH=31

# VelvetG Options
# Other options sent to velvetg. If using paired end reads, use AT LEAST
# -ins_length and -ins_length_sd. -min_contig_lgth is already set.
VELVETG_OPTS=-ins_length 300 -ins_length_sd 50
The remaining parameters control the assembly of the reads in the input files. Velvet uses a hash length parameter when looking at overlaps. For more information about this parameter, see the Velvet documentation. In this pipeline, we use VelvetOptimiser, which can try a range of hash lengths between 19 and 31. In most cases, leaving START_HASH_LENGTH=19 and END_HASH_LENGTH=31 is the best choice. If you need other velvetg options, you can also specify them in this config file. For example, if you are using paired-end data, you should include the insert length and the insert length standard deviation (-ins_length and -ins_length_sd), as above.
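The constraints on the two hash length options (odd, start less than end, within 19 to 31) can be checked before launching a run. The sketch below is an illustration with hypothetical function names. It assumes the bounds are inclusive, since the default values are 19 and 31 themselves even though the config comment writes 19 < x < 31.

```python
def check_hash_lengths(start, end, low=19, high=31):
    """Return a list of problems; an empty list means the values look valid.

    Assumption: low and high are inclusive, matching the defaults 19 and 31.
    """
    problems = []
    for name, value in (("START_HASH_LENGTH", start), ("END_HASH_LENGTH", end)):
        if value % 2 == 0:
            problems.append(f"{name}={value} is not odd")
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    if start >= end:
        problems.append("START_HASH_LENGTH must be less than END_HASH_LENGTH")
    return problems

def odd_hash_lengths(start, end):
    """List the odd hash lengths from start to end, both included."""
    return list(range(start, end + 1, 2))

print(check_hash_lengths(19, 31))
print(odd_hash_lengths(19, 31))
```

With the default settings the check reports no problems, and the range covers seven odd hash lengths.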
[cluster]
# Cluster Name
# Cluster name to run this on, shouldn't need to specify manually
CLUSTER_NAME=local

# Credential
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local
The cluster section specifies which cluster to use if one is already running, or what type of cluster should be created if a new one is necessary. Just as with the pipeline tag and input tag, a cluster is given a unique identifier, defined by the CLUSTER_NAME option. Your total cluster size is the number of exec nodes plus the one master node. The CLUSTER_CREDENTIAL parameter should match the credential (specifically the contents of cred-name used to set up your EC2 credentials) created above.
[output]
# Output Directory
# Directory to download output to, this should be located
# in /mnt somewhere
OUTPUT_DIR=/mnt/output
Placement of output files from the pipeline is controlled in the output section of the configuration. The OUTPUT_DIR parameter can be set to any location within the VM. It is recommended that this option be left as is to avoid complications that can occur if space runs out on the VM. The /mnt/ directory references the shared directory, which uses the file system of the computer hosting the VM and will most likely have more space than is allotted to the VM's own file system.
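Because running out of space is the main risk when choosing an output location, it may be worth checking free space before launching. The following is a minimal sketch using Python's standard shutil.disk_usage. The 10 GiB threshold is a hypothetical placeholder and not a documented CloVR requirement.

```python
import os
import shutil

def free_gib(path):
    """Free space on the file system that holds path, in GiB."""
    return shutil.disk_usage(path).free / 2**30

def has_room(path, needed_gib):
    """True if at least needed_gib GiB are free at path."""
    return free_gib(path) >= needed_gib

# /mnt is the shared directory inside the CloVR VM; fall back to the
# current directory so the sketch also runs outside the VM.
target = "/mnt" if os.path.isdir("/mnt") else "."
NEEDED_GIB = 10  # hypothetical threshold, adjust to your data set

if has_room(target, NEEDED_GIB):
    print(f"{target}: {free_gib(target):.1f} GiB free, looks sufficient")
else:
    print(f"{target}: only {free_gib(target):.1f} GiB free")
```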
There are other options present in the config file which do not need to be changed and are used internally to the pipeline.
Running and Monitoring the Microbe Illumina pipeline
Running the CloVR Microbe Illumina pipeline is as easy as executing the following from the command-line:
clovrMicrobe /mnt/clovr-microbe-illumina.config
This command will return a task id (such as runPipeline-1288298917.36). We will use this task id later on to track the progress of the pipeline.
The clovrMicrobe command launches a cluster as specified by the parameters in the config file and starts the CloVR-Microbe Illumina pipeline on it. If you specified a CLUSTER_NAME other than ‘local’, a cluster will be launched automatically. To list the available clusters, run:
vp-describe-cluster --list
This command will return a list of all available clusters. Here is example output returned:
CLUSTER local
CLUSTER clovr-microbe-illumina
Identify the cluster that the Illumina pipeline is running on and use the --name option with vp-describe-cluster:
vp-describe-cluster --name clovr-microbe-illumina
MASTER  i-6459fb09     some-instance.compute-1.amazonaws.com       running
EXEC    i-94f855f9     some-exec.compute-1.amazonaws.com           running
GANGLIA http://some-instance.compute-1.amazonaws.com/ganglia
ERGATIS http://some-instance.compute-1.amazonaws.com/ergatis
SSH     ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no -o ConnectTimeout=30 \
        -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null \
        -q -i /mnt/keys/devel1.pem root@some-instance.compute-1.amazonaws.com
This tells you that there is one master node and one exec node in ‘running’ status. It also gives you links to Ganglia and Ergatis.
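If you want to check node status from a script, the description output can be split into fields. The sketch below is an illustration only: it assumes the output is whitespace-separated exactly as in the example above, and the function name parse_cluster_description is hypothetical.

```python
SAMPLE = """\
MASTER i-6459fb09     some-instance.compute-1.amazonaws.com       running
EXEC   i-94f855f9     some-exec.compute-1.amazonaws.com           running
GANGLIA http://some-instance.compute-1.amazonaws.com/ganglia
ERGATIS http://some-instance.compute-1.amazonaws.com/ergatis
"""

def parse_cluster_description(text):
    """Collect node rows and monitoring links from the example format."""
    info = {"nodes": [], "links": {}}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]
        if kind in ("MASTER", "EXEC") and len(fields) >= 4:
            info["nodes"].append(
                {"role": kind, "instance": fields[1],
                 "host": fields[2], "state": fields[3]}
            )
        elif kind in ("GANGLIA", "ERGATIS") and len(fields) >= 2:
            info["links"][kind] = fields[1]
    return info

info = parse_cluster_description(SAMPLE)
all_running = all(node["state"] == "running" for node in info["nodes"])
print("nodes:", len(info["nodes"]), "all running:", all_running)
print("Ergatis:", info["links"].get("ERGATIS"))
```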
Output
The output for the pipeline will automatically be downloaded onto your local VM in the directory specified in the OUTPUT_DIR parameter.
The output for the CloVR Microbe Illumina pipeline will include the assembly scaffolds, polypeptide fasta, CDS fasta and annotation files (in GenBank and sqn formats).
Terminating a cluster
When using a cluster on EC2, you must terminate the cluster after the pipeline and download have completed. To terminate a cluster, enter your cluster name:
vp-terminate-cluster --cluster=cluster_name
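Since EC2 usage is billed per hour, it can be useful to double-check which clusters are still listed before you finish. The sketch below assumes the list output has the two-column form shown earlier (CLUSTER followed by a name) and that the cluster named local is the VM itself and needs no termination. The function name is hypothetical, and the sketch only prints the commands rather than running them.

```python
def clusters_to_terminate(list_output):
    """Return non-local cluster names from 'vp-describe-cluster --list' output."""
    names = []
    for line in list_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "CLUSTER" and fields[1] != "local":
            names.append(fields[1])
    return names

sample = "CLUSTER local\nCLUSTER clovr-microbe-illumina\n"
remaining = clusters_to_terminate(sample)
for name in remaining:
    # Printed for review; run the command yourself once downloads are done.
    print(f"vp-terminate-cluster --cluster={name}")
```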
Interrupting a pipeline
If the execution of CloVR-Microbe is not going well for some reason or you realize you have made a mistake, you can interrupt the pipeline by visiting the Ergatis link describing the running pipeline, and clicking the “kill” button at the top of the page. This will cause the pipeline to stop. It may take a minute to effectively halt the pipeline. See below on restarting a pipeline.
Recovering from error and restarting the pipeline
If the execution of CloVR-Microbe fails and the pipeline has to be restarted, CloVR will attempt to resume the previous run, if the same command is used. In order to start the pipeline from scratch, PIPELINE_NAME should be changed in the config file to a different name. Also, note that if you have made any changes to the input data, you will need to re-tag it using vp-add-dataset.
# Name of pipeline
PIPELINE_NAME=clovr_microbe_iilumina-2
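Any name that differs from the previous one forces a fresh run. One simple convention, shown here purely as an illustration, is to append or increment a numeric suffix. The helper name next_pipeline_name and the suffix scheme are assumptions, not something CloVR requires.

```python
import re

def next_pipeline_name(name):
    """Append '-2' to a plain name, or increment an existing numeric suffix."""
    match = re.fullmatch(r"(.*)-(\d+)", name)
    if match:
        return f"{match.group(1)}-{int(match.group(2)) + 1}"
    return f"{name}-2"

print(next_pipeline_name("clovr_microbe_illumina"))
print(next_pipeline_name("clovr_microbe_illumina-2"))
```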