CloVR-Metagenomics v1.0 Walkthrough (command-line)

Getting Started

The CloVR-Metagenomics pipeline provides a robust comparative metagenomics workflow, complete with cluster auto-scaling and parallelization.

Although use of the Cloud is entirely optional for CloVR-Metagenomics, as for all other CloVR pipelines, it is recommended for this pipeline, since local executions can be very time-consuming. The BLAST search steps of the CloVR-Metagenomics pipeline in particular are computationally intensive and benefit from parallelization across multiple processors on the Cloud.

If you want to use the Cloud to run CloVR-Metagenomics, you must obtain credentials from your Cloud provider, and CloVR must be configured to use these credentials. If you want to use the Amazon Elastic Compute Cloud (EC2), be sure to have configured your Amazon EC2 credentials. Usage on Amazon EC2 is charged per hour, so care must be taken to terminate instances after a pipeline has completed. See the vp-terminate-cluster command below.
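As a minimal sketch of the credential setup, assuming your CloVR VM provides the vp-add-credential utility (the flag names and file paths below are assumptions; run vp-add-credential --help to confirm the exact interface for your CloVR version):

# Sketch only: register an EC2 credential with CloVR. All flag names and
# paths here are assumptions -- verify against vp-add-credential --help.
vp-add-credential --cred-name=my_ec2_cred --ctype=ec2 \
    --cert=/mnt/keys/cert.pem --pkey=/mnt/keys/pk.pem

The credential name chosen here is what you would later reference as CLUSTER_CREDENTIAL in the pipeline configuration file.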

Inputs

Multiple fasta files (1 file per sample) & a CloVR-formatted mapping file

Download a Dataset

FILE: CloVR-Metagenomics mini example dataset

In preparation for the CloVR-Metagenomics pipeline run that will be described below, download and extract the ts*.small.fasta dataset to the shared folder of the virtual machine (VM) directory to allow easy access when working from within the CloVR VM. In the VM, the shared folder is accessible as /mnt/.
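For example, assuming the dataset arrived as a gzipped tarball in your shared folder (the archive name below is hypothetical), it can be unpacked and checked from within the VM as follows:

# Hypothetical archive name -- substitute the file you actually downloaded
tar -xzf /mnt/clovr_metagenomics_mini.tar.gz -C /mnt/
# Confirm the per-sample fasta files are visible inside the VM
ls /mnt/ts*.fasta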

If you are using VirtualBox and are having problems accessing your data in the shared folder, check Troubleshooting on VirtualBox.

Runtime Estimation

Before running CloVR-Metagenomics on your dataset, you may want to estimate how many CPU hours the run will take. CloVR does this within the pipeline using cunningham, a new BLAST runtime estimator. Cunningham uses k-mer frequency statistics about a given database and the query dataset to predict the total number of CPU hours a BLAST run will take in the cloud. The CloVR-Metagenomics pipeline performs two BLAST runs: a BLASTN against the NCBI RefSeq microbial genomes database and a BLASTX against a protein database such as NCBI COGs (default), eggNOG, or KEGG. Running cunningham from the command line gives you an initial idea of how many CPU hours the pipeline will require, and therefore how much a run may cost. For example, for the mini dataset presented above, suppose we first concatenate all of the fasta files into a single file, e.g. all.seqs:

cat /mnt/ts*.fasta > /mnt/all.seqs

To see the available options for cunningham, just run:

cunningham

We’d like to know how long the BLAST runs will take for these sequences, so we’ll first execute:

$> cunningham -Q /mnt/all.seqs -P blastn -D clovr-refseqdb
Checking file for residues...nucleotides found...
Total query size (residues): 259045
Total number of sequences: 1800
Computing input query kmer frequency profile...

Number of seed match pairs: 727770700
Throughput (residues per hour): 259045
Runtime estimate: 1.00 cpu hours (0.04 cpu days, 0.00 cpu lifetimes)

From the output, we can see that cunningham estimates that a BLASTN run of the sequences against the RefSeq database will require about 1 CPU hour, which is not much at all and can likely be run locally. Next we’ll estimate how many CPU hours a BLASTX against the COG database could require.

$> cunningham -Q /mnt/all.seqs -P blastx -D clovr-cogdb
Checking file for residues...nucleotides found...
Total query size (residues): 259045
Total number of sequences: 1800
Computing input query kmer frequency profile...

Number of seed match pairs: 3802683532
Throughput (residues per hour): 259045
Runtime estimate: 1.00 cpu hours (0.04 cpu days, 0.00 cpu lifetimes)

We see from the output that this run would also require ~1 CPU hour. You could therefore easily run this pipeline locally rather than on the cloud; if you did decide to use the cloud here, it would be very economical.
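As a quick back-of-the-envelope check, the CPU-hour estimates can be converted into an approximate dollar cost. The hourly rate below is a placeholder assumption; look up the current price of your chosen EC2 instance type:

# Two BLAST runs at ~1 CPU hour each, at an assumed rate of $0.10/CPU hour
echo "2 * 0.10" | bc
# .20  -> roughly 20 cents for the BLAST portion of this run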

Disclaimer: These are only approximations of BLAST runtime, and may not be accurate for some datasets. Additionally, other parts of the pipeline can still take a significant amount of time depending on the dataset and metadata provided.


Pipeline Execution

1. Tagging data

Before starting a pipeline, you must first tag your data. For this pipeline, you need to tag the fasta and metadata files appropriately. This first command will tag the associated fasta files:

vp-add-dataset -o --tag-name=clovr_metagenomics_fasta /mnt/ts*.fasta

This next command will tag the mapping file:

vp-add-dataset -o --tag-name=clovr_metagenomics_map /mnt/Twins.small.meta

Note that you must provide a tag name in each command. You will need these names when you edit the pipeline configuration file below.

2. Editing the configuration file

FILE: CloVR-Metagenomics configuration file

Use the configuration file to define parameters for the various components of CloVR-Metagenomics, to specify input, output, and log files, and to fine-tune other options that control the pipeline. Copy the configuration file into the same shared folder as the input files and access it in the /mnt/ directory.

The configuration file detailed below can be found in the link above.

[input]
# Input fasta tag
FASTA_TAG=clovr_metagenomics_fasta
# Mapping tag for pipeline
MAPPING_TAG=clovr_metagenomics_map
# Functional (protein) database tag
PROTEIN_DB_TAG=clovr-cogdb
# Taxonomic (nucleotide) database tag
NUCLEOTIDE_DB_TAG=clovr-refseqdb

CloVR makes use of a tagging system to assign unique names to pipelines, to data being uploaded, and to data being downloaded. These unique names are used throughout the whole system during many steps of the pipeline process. In this pipeline, the input tags FASTA_TAG and MAPPING_TAG must match the tags you used with the vp-add-dataset commands above. Additionally, databases to be uploaded for the analysis are also tagged (e.g. COG, RefSeq, KEGG). For the time being, these tags should not be altered.

[cluster]
# Cluster name
CLUSTER_NAME=local
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local

The cluster section determines the type of cluster used by the CloVR-Metagenomics pipeline. This can either be an existing cluster that is already running or a new cluster that the pipeline will create. Similarly to the pipeline and input tags, a cluster is assigned a unique identifier as defined by the CLUSTER_NAME variable. If CloVR-Metagenomics is run locally, CLUSTER_CREDENTIAL should be set to “local”. If you have set up a different credential (see the Getting Started section above), you may set it here.
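For example, to have the pipeline launch a fresh cluster on EC2 instead of running locally, the section might look like this (both values are placeholders for whatever name and credential you have set up):

[cluster]
# Any unique identifier for the new cluster
CLUSTER_NAME=my_cluster
# Must match the name of an EC2 credential added to CloVR
CLUSTER_CREDENTIAL=my_ec2_cred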

[pipeline]
# Pipeline Name
PIPELINE_NAME=ReplaceThisWithYourPipelineName

# Pipeline Description
PIPELINE_DESC=

Each pipeline run requires a unique name, PIPELINE_NAME, so that the CloVR system can download the correct set of output after the pipeline has finished. This parameter is especially important if multiple pipelines are running on the same cluster. You may also optionally add a description of the pipeline by setting the PIPELINE_DESC parameter. The rest of the configuration file represents advanced settings and should not be altered for this walkthrough.
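A filled-in example (both values here are arbitrary) might look like:

[pipeline]
# Pipeline Name
PIPELINE_NAME=twins_mini_run1
# Pipeline Description
PIPELINE_DESC=CloVR-Metagenomics test run on the mini Twins dataset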


3. Running the CloVR-Metagenomics pipeline

Now that your config file is ready, running the CloVR-Metagenomics pipeline is as easy as executing the following from the command-line:

clovrMetagenomics /mnt/clovr_metagenomics_noorfs_example.config &

The clovrMetagenomics command launches a cluster as specified by parameters in the config file and starts the workflow.
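Because the command is backgrounded with &, it can be convenient to redirect its output to a log file so the launch can be inspected later; a minimal sketch (the log path is arbitrary):

# Capture stdout/stderr in a log file and run in the background
clovrMetagenomics /mnt/clovr_metagenomics_noorfs_example.config > /mnt/clovr_run.log 2>&1 &
# Follow along as the cluster is launched and the workflow starts
tail -f /mnt/clovr_run.log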

4. Monitoring your pipeline

The pipeline status can be monitored by navigating a browser to the Ergatis web interface. This requires knowing the IP address of the CloVR EC2 master node or of the local CloVR VM, which can be obtained with the following command:

vp-describe-cluster --name=<CLUSTER_NAME>

where CLUSTER_NAME is specified in the config file. For example, if CLUSTER_NAME=my_cluster and you’ve started an EC2 cluster, your output will resemble:

[master <clovr_ip>]$ vp-describe-cluster --name=my_cluster
MASTER  i-571c113d      ec2-72-44-39-80.compute-1.amazonaws.com running
GANGLIA http://ec2-72-44-39-80.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
SSH     ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no
-o ConnectTimeout=30 -o StrictHostKeyChecking=no -o ServerAliveInterval=30
-o UserKnownHostsFile=/dev/null -q -i /mnt/keys/devel1.pem root@ec2-72-44-39-80.compute-1.amazonaws.com

To monitor the status of your pipeline, navigate to the Ergatis and Ganglia links in the output. If you’re not sure about the name of your cluster, use the following command to obtain a list of all active clusters:

vp-describe-cluster --list

Downloading Output

OUTPUT TARBALL: CloVR-Metagenomics output

Once your pipeline has run to completion, the files are automatically downloaded to your local VM and can be found in the output directory specified in the pipeline configuration file:

[output]
OUTPUT_DIR=/mnt/output

Navigating to this directory, we should find a tarball containing the results of the pipeline run, which can be extracted using tar (on Unix) or a utility such as WinZip or WinRAR (on Windows).
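On the VM or any other Unix system, the tarball can be unpacked in place; the archive name below is a placeholder for whatever file your run produced:

cd /mnt/output
# Placeholder name -- substitute the tarball actually present in this directory
tar -xzf clovr_metagenomics_results.tar.gz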

The CloVR-Metagenomics pipeline outputs several different files for the user:

Output                 Description
read_mapping           A text file displaying the one-to-one mapping of sequence names created in the pipeline.
uclust_clusters        Raw text output from uclust runs.
artificial_replicates  A list of read names that were found to be artificial replicates from the 454 platform.
blast_functional       Raw output of BLAST hits of representative sequences to a functional DB.
tables_functional      Summary tables of functional categories for each sample.
piecharts_functional   Visualized pie charts for functional groups.
skiff_functional       Output of skiff clusterings for different functional levels.
metastats_functional   Output of Metastats analysis comparing subject groups or samples at different functional levels.
histograms_functional  Visualized stacked histograms of functional annotations.
blast_taxonomy         Raw output of BLAST hits of representative sequences to a taxonomic DB.
tables_taxonomy        Summary tables of taxonomic groups for each sample.
piecharts_taxonomy     Visualized pie charts for taxonomic groups.
skiff_taxonomy         Output of skiff clusterings for different taxonomic levels.
metastats_taxonomy     Output of Metastats analysis comparing subject groups or samples at different taxonomic levels.
histograms_taxonomy    Visualized stacked histograms of taxonomic annotations.

5. Terminating a cluster

When utilizing a cluster on EC2, you must terminate the cluster after the pipeline and download have completed. To terminate a cluster, pass your cluster name to vp-terminate-cluster:

vp-terminate-cluster --cluster=cluster_name
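To confirm that the cluster has actually been shut down (and that you are no longer being billed for it), re-list the active clusters:

# The terminated cluster should no longer appear in this list
vp-describe-cluster --list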

Interrupting a pipeline

If the execution of CloVR-Metagenomics is not going well for some reason, or you realize you have made a mistake, you can interrupt the pipeline by visiting the Ergatis link describing the running pipeline and clicking the “kill” button at the top of the page. This will cause the pipeline to stop, although it may take a minute to halt completely. See below on restarting a pipeline.

Recovering from error and restarting the pipeline

If the execution of CloVR-Metagenomics fails and the pipeline has to be restarted, CloVR will attempt to resume the previous run if the same command is used. To start the pipeline from scratch instead, change PIPELINE_NAME in the config file to a different name. Also note that if you have made any changes to the input data, you will need to re-tag it using vp-add-dataset.

# Name of pipeline
PIPELINE_NAME=clovr_metagenomics_pipeline-2
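After changing PIPELINE_NAME (and re-tagging any modified inputs), relaunch the pipeline with the same command as before:

clovrMetagenomics /mnt/clovr_metagenomics_noorfs_example.config &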