Getting Started
The CloVR-Metagenomics pipeline provides a robust comparative metagenomics workflow, complete with cluster auto-scaling and parallelization.
Although use of the Cloud for CloVR-Metagenomics, as for all other CloVR pipelines, is entirely optional, it is recommended for this pipeline, as local executions can be very time-consuming. The BLAST search steps in particular are computationally intensive and benefit from parallelization across multiple processors in the Cloud.
If you want to use the Cloud to run CloVR-Metagenomics, you must obtain credentials from your Cloud provider, and CloVR must be configured to use these credentials. If you want to use the Amazon Elastic Compute Cloud (EC2), be sure to have configured your Amazon EC2 credentials. Usage on Amazon EC2 is charged per hour, so care must be taken to terminate instances after a protocol has completed; see the vp-terminate-cluster command below.
Inputs
Multiple fasta files (one file per sample) and a CloVR-formatted mapping file
Download a Dataset
FILE: CloVR-Metagenomics mini example dataset
In preparation for the CloVR-Metagenomics pipeline run described below, download and extract the ts*.small.fasta dataset to the shared folder in the virtual machine (VM) directory to allow easy access when working from within the CloVR VM. In the VM, the shared folder is accessible as /mnt/.
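For example, assuming the archive was saved to your host machine's shared folder (the archive file name and host path below are illustrative; use the actual name and location of your download):

# On the host, extract the archive into the folder shared with the VM
tar -xf clovr_metagenomics_mini_example.tar.gz -C /path/to/shared_folder
# Inside the VM, confirm the fasta files are visible
ls /mnt/ts*.small.fasta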
If you are using VirtualBox and are having problems accessing your data in the shared folder, check Troubleshooting on VirtualBox.
Runtime Estimation
Before running CloVR-Metagenomics on your dataset, you may want to get an idea of how many CPU hours the run will take. CloVR does this within the pipeline using cunningham, a BLAST runtime estimator. Cunningham uses k-mer frequency statistics of a given database and the query dataset to estimate the total number of CPU hours a BLAST run will take in the cloud. The CloVR-Metagenomics pipeline performs two BLAST runs: a BLASTN against the NCBI RefSeq microbial genomes database and a BLASTX against a protein database such as NCBI COGs (default), eggNOG, or KEGG. Running cunningham from the command line beforehand therefore gives you an initial idea of how many CPU hours the pipeline will require, and thus how much a run may cost. For example, for the mini dataset presented above, first concatenate all of the fasta files into a single file, e.g. all.seqs:
cat /mnt/ts*.fasta > /mnt/all.seqs
To see the available options for cunningham, just run:
cunningham
We’d like to know how long the BLAST runs will take for these sequences, so we’ll first execute:
$> cunningham -Q /mnt/all.seqs -P blastn -D clovr-refseqdb
Checking file for residues...nucleotides found...
Total query size (residues): 259045
Total number of sequences: 1800
Computing input query kmer frequency profile...
Number of seed match pairs: 727770700
Throughput (residues per hour): 259045
Runtime estimate: 1.00 cpu hours (0.04 cpu days, 0.00 cpu lifetimes)
From the output, we can see that cunningham estimates that a BLASTN run of these sequences against the RefSeq database will require about 1 CPU hour, which is not much at all and can likely be run locally. Next we'll estimate how many CPU hours a BLASTX against the COG database could require.
$> cunningham -Q /mnt/all.seqs -P blastx -D clovr-cogdb
Checking file for residues...nucleotides found...
Total query size (residues): 259045
Total number of sequences: 1800
Computing input query kmer frequency profile...
Number of seed match pairs: 3802683532
Throughput (residues per hour): 259045
Runtime estimate: 1.00 cpu hours (0.04 cpu days, 0.00 cpu lifetimes)
We see from the output that this run would also require about 1 CPU hour. You could therefore easily run this pipeline locally instead of on the cloud; if you did decide to use the cloud here, the run would be very economical.
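As a very rough back-of-the-envelope cost check (the hourly rate below is purely an assumed placeholder; consult current EC2 pricing), two CPU hours would come to something like:

# Illustrative arithmetic only: 2 cpu hours at an assumed $0.10 per instance hour
echo "2 * 0.10" | bc
# prints .20, i.e. roughly 20 cents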
Disclaimer: These are only approximations of BLAST runtime, and may not be accurate for some datasets. Additionally, other parts of the pipeline can still take a significant amount of time depending on the dataset and metadata provided.
Pipeline Execution
1. Tagging data
Before starting a pipeline, you must first tag your data. For this pipeline, you need to tag the fasta and metadata files appropriately. This first command will tag the associated fasta files:
vp-add-dataset -o --tag-name=clovr_metagenomics_fasta /mnt/ts*.fasta
This next command will tag the mapping file:
vp-add-dataset -o --tag-name=clovr_metagenomics_map /mnt/Twins.small.meta
Note that you must provide a tag name in each command. You will need these names when you edit the pipeline configuration file below.
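Because tag names are arbitrary identifiers, a different dataset can later be tagged under its own names using the same commands (the paths and tag names below are illustrative):

vp-add-dataset -o --tag-name=my_other_fasta /mnt/other_study/*.fasta
vp-add-dataset -o --tag-name=my_other_map /mnt/other_study/other.meta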
2. Editing the configuration file
FILE: CloVR-Metagenomics configuration file
Use the configuration file to define parameters for the various components of CloVR-Metagenomics, to determine input, output, and log files, and to fine-tune other options that control the pipeline. Copy the configuration file into the same shared folder as the input files so that it is accessible in the /mnt/ directory.
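From within the VM you can then open the file with any text editor, for example (nano is just one option; the file name matches the one used in the run command below):

nano /mnt/clovr_metagenomics_noorfs_example.config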
The configuration file detailed below can be found in the link above.
[input]
# Input fasta tag
FASTA_TAG=clovr_metagenomics_fasta
# Mapping tag for pipeline
MAPPING_TAG=clovr_metagenomics_map
# Functional (protein) database tag
PROTEIN_DB_TAG=clovr-cogdb
# Taxonomic (nucleotide) database tag
NUCLEOTIDE_DB_TAG=clovr-refseqdb
CloVR makes use of a tagging system to tag pipelines, data being uploaded, and data being downloaded with unique names. These unique names are used throughout the system during many steps of the pipeline process. In this pipeline, the input tags FASTA_TAG and MAPPING_TAG must match the tags you used with the vp-add-dataset commands above. The databases used for the analysis (e.g. COG, RefSeq, KEGG) are also referenced by tags; for the time being, these tags should not be altered.
[cluster]
# Cluster name
CLUSTER_NAME=local
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local
The cluster section determines the cluster used by the CloVR-Metagenomics pipeline. This can be either an existing cluster that is already running or a new cluster that will be created by the pipeline. Similarly to the pipeline and input tags, a cluster is assigned a unique identifier, defined by the CLUSTER_NAME variable. If CloVR-Metagenomics is run locally, CLUSTER_CREDENTIAL should be set to “local”. If you have set up a different credential (see the Getting Started section above), you may set it here.
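As a sketch, a [cluster] section for an EC2 run named my_cluster might look like the following (the credential name is a placeholder for whatever credential you configured):

[cluster]
# Cluster name
CLUSTER_NAME=my_cluster
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=<your_ec2_credential_name>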
[pipeline]
# Pipeline Name
PIPELINE_NAME=ReplaceThisWithYourPipelineName
# Pipeline Description
PIPELINE_DESC=
Each pipeline run requires a unique name, PIPELINE_NAME, so that the CloVR system can download the correct set of output after the pipeline has finished. This parameter is especially important if multiple pipelines are running on the same cluster. You may also optionally add a description of the pipeline by setting the PIPELINE_DESC parameter. The rest of the configuration file represents advanced settings and should not be altered for this walkthrough.
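A filled-in [pipeline] section might look like the following (the name and description are purely illustrative):

[pipeline]
# Pipeline Name
PIPELINE_NAME=twins_small_run1
# Pipeline Description
PIPELINE_DESC=CloVR-Metagenomics run on the Twins mini example dataset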
3. Running the CloVR-Metagenomics pipeline
Now that your config file is ready, running the CloVR-Metagenomics pipeline is as easy as executing the following from the command-line:
clovrMetagenomics /mnt/clovr_metagenomics_noorfs_example.config &
The clovrMetagenomics command launches a cluster as specified by parameters in the config file and starts the workflow.
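The trailing & runs the command in the background so your terminal remains usable; standard shell job control applies:

jobs    # list background jobs, including the running pipeline
fg      # bring the pipeline back to the foreground if needed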
4. Monitoring your pipeline
The pipeline status can be monitored by navigating a browser to the Ergatis web interface. This requires knowing the IP address of the CloVR EC2 master node or of the local CloVR VM, which can be obtained with the command:
vp-describe-cluster --name=<CLUSTER_NAME>
where CLUSTER_NAME is the value specified in the config file. For example, if CLUSTER_NAME=my_cluster and you’ve started an EC2 cluster, your output will resemble:
[master <clovr_ip>]$ vp-describe-cluster --name=my_cluster
MASTER   i-571c113d   ec2-72-44-39-80.compute-1.amazonaws.com   running
GANGLIA  http://ec2-72-44-39-80.compute-1.amazonaws.com/ganglia
ERGATIS  http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
SSH      ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no -o ConnectTimeout=30 -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -q -i /mnt/keys/devel1.pem root@ec2-72-44-39-80.compute-1.amazonaws.com
To monitor the status of your pipeline, navigate to the Ergatis and Ganglia links in the output. If you’re not sure about the name of your cluster, use the following command to obtain a list of all active clusters:
vp-describe-cluster --list
Downloading Output
OUTPUT TARBALL: CloVR-Metagenomics output
Once your pipeline has run to completion, the files are automatically downloaded to your local VM and can be found in the output directory specified in the pipeline configuration file:
[output]
OUTPUT_DIR=/mnt/output
Navigating to this directory, we should find a tarball containing the results of the pipeline run, which can be extracted using tar (in Unix) or a utility such as WinZip or WinRAR (in Windows).
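For example, from within the VM (the tarball name below is a placeholder; use the actual file name you find in the directory):

cd /mnt/output
tar -xf <output_tarball>.tar.gz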
The CloVR-Metagenomics pipeline outputs several different files for the user:
Output | Description |
---|---|
read_mapping | A text file displaying the one-to-one mapping of sequence names created in the pipeline. |
uclust_clusters | Raw text output from uclust runs. |
artificial_replicates | A list of read names that were found to be artificial replicates from the 454 platform. |
blast_functional | Raw output of blast hits of representative sequences to a functional DB. |
tables_functional | Summary tables of functional categories for each sample. |
piecharts_functional | Visualized piecharts for functional groups. |
skiff_functional | Output of skiff clusterings for different functional levels. |
metastats_functional | Output of Metastats analysis comparing subject groups or samples at different functional levels. |
histograms_functional | Visualized stacked histograms of functional annotations. |
blast_taxonomy | Raw output of blast hits of representative sequences to a taxonomic DB. |
tables_taxonomy | Summary tables of taxonomy groups for each sample. |
piecharts_taxonomy | Visualized piecharts for taxonomic groups. |
skiff_taxonomy | Output of skiff clusterings for different taxonomic levels. |
metastats_taxonomy | Output of Metastats analysis comparing subject groups or samples at different taxonomic levels. |
histograms_taxonomy | Visualized stacked histograms of taxonomic annotations. |
5. Terminating a cluster
When utilizing a cluster on EC2, you must terminate the cluster after the pipeline and download have completed. To terminate a cluster, run the following command with your cluster name:
vp-terminate-cluster --cluster=cluster_name
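For example, to terminate the my_cluster cluster from the monitoring example above and confirm it no longer appears among the active clusters:

vp-terminate-cluster --cluster=my_cluster
vp-describe-cluster --list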
Interrupting a pipeline
If the execution of CloVR-Metagenomics is not going as expected, or you realize you have made a mistake, you can interrupt the pipeline by visiting the Ergatis link describing the running pipeline and clicking the “kill” button at the top of the page. This will stop the pipeline, although it may take a minute to halt completely. See below on restarting a pipeline.
Recovering from error and restarting the pipeline
If the execution of CloVR-Metagenomics fails and the pipeline has to be restarted, CloVR will attempt to resume the previous run if the same command is used. To start the pipeline from scratch instead, change PIPELINE_NAME in the config file to a different name. Also note that if you have made any changes to the input data, you will need to re-tag it using vp-add-dataset.
# Name of pipeline
PIPELINE_NAME=clovr_metagenomics_pipeline-2
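If the input data changed, re-tag it under the same tag names referenced in the config file, e.g.:

vp-add-dataset -o --tag-name=clovr_metagenomics_fasta /mnt/ts*.fasta
vp-add-dataset -o --tag-name=clovr_metagenomics_map /mnt/Twins.small.meta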