CloVR-16S v1.0 Walkthrough

The current version of CloVR-16S can be easily run on a local computer with at least 4 GB of RAM and 15 GB of free disk space. In its present form, CloVR-16S provides limited support for parallel processing. Therefore, use of the Cloud does not always significantly reduce the pipeline duration.

Use of the Cloud is entirely optional. If you want to use the Amazon Cloud, be sure to have configured your Amazon EC2 credentials. Also, usage of Amazon EC2 is charged per hour, and care must be taken to terminate instances after a protocol has completed (see the vp-terminate-cluster command below).

Inputs

1. A single fasta file with multiplex barcodes & a QIIME-formatted mapping file

OR

2. Multiple fasta files (1 file per sample) & a CloVR-formatted mapping file

Download a Dataset

In preparation for the CloVR-16S pipeline run, download a dataset and extract it into the shared folder of the virtual machine (VM) so that it is easily accessible from within the CloVR VM. Inside the VM, the shared folder is mounted as /mnt/. Each dataset includes a mapping file as well, so be sure it is also extracted into the shared folder.
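
For example, assuming the dataset was downloaded as a gzipped tarball (the file name below is a placeholder) and that the host directory ~/clovr-shared is configured as the VM's shared folder, you could extract it as follows:

tar -xzf AMP_Lung.small.tar.gz -C ~/clovr-shared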

If you are using VirtualBox and are having problems accessing your data in the shared folder, see Troubleshooting on VirtualBox.

Pipeline Execution

1. Tagging data

Before starting a pipeline, you must first tag your data. CloVR uses a tagging system that assigns unique names to the data uploaded to and downloaded from its pipelines. These unique names are used throughout the system during many steps of the pipeline process.

For this pipeline, you need to tag one or more fasta files and a corresponding mapping file (depending on your data, the mapping file will be in either CloVR or QIIME format). The first command below tags a single fasta file as clovr_16S_input.

vp-add-dataset -o --tag-name=clovr_16S_input /mnt/AMP_Lung.small.fasta

If any secondary directories have been created within /mnt/, the file paths should be updated to reflect this. To tag multiple fasta files with the vp-add-dataset command, you can list the path to each fasta file or use a Unix wildcard such as /mnt/*.fasta, e.g.:

vp-add-dataset -o --tag-name=clovr_16S_input /mnt/Afile /mnt/Bfile /mnt/Cfile
vp-add-dataset -o --tag-name=clovr_16S_input /mnt/*.fasta

The next command will tag the mapping file:

vp-add-dataset -o --tag-name=clovr_16S_mapping /mnt/<mapping-file>

Note that you have to provide a unique tag name in each command. Also note that if the files associated with a tag name change, the vp-add-dataset command has to be repeated. You can choose your own tag names; you will need them when you edit the pipeline configuration file below.

You can use the command vp-describe-dataset to see all existing tags. CloVR-16S will use additional tags beyond those defined by the user.
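
For example, after the two tagging commands above, running vp-describe-dataset should list both clovr_16S_input and clovr_16S_mapping among the existing tags (the exact output format may vary between CloVR versions):

vp-describe-dataset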

2. Editing the configuration file

FILE: CloVR-16S configuration file

Use the configuration file to define parameters for the various components of CloVR-16S, to specify input, output, and log files, and to fine-tune other options that control the pipeline. The configuration file also determines whether CloVR-16S will be executed locally or on the Amazon Cloud.

Copy the configuration file into the same “shared” folder as the input files so that it can be accessed under the /mnt/ directory. The configuration file detailed below can be found at the link above.

[input]
# Input fasta tag, this is what you tagged your fasta files with
FASTA_TAG=clovr_16S_input

# Mapping tag for pipeline
MAPPING_TAG=clovr_16S_mapping

# Reference database, do not modify.
REF_DB_TAG=clovr-core-set-aligned-imputed-fasta

In this pipeline the input tags FASTA_TAG and MAPPING_TAG must match the tags you used with the vp-add-dataset commands above.

The CloVR-16S pipeline supports the use of a customized 16S template alignment for the QIIME component of the pipeline (by default, the greengenes core template alignment is used, as described on the QIIME project website). For the time being, this tag should not be altered.

[cluster]
# Cluster name
CLUSTER_NAME=local
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local

The cluster section determines the cluster used by the CloVR-16S pipeline. This can either be an existing cluster that is already running or a new cluster that will be created by the pipeline. A cluster is assigned a unique identifier, defined by the CLUSTER_NAME variable. If CloVR-16S is run locally, both CLUSTER_NAME and CLUSTER_CREDENTIAL should be “local”. If you have set up credentials for a Cloud service (see the Getting Started section above), you may set them here.
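
As a hypothetical example, to have the pipeline create a new cluster on Amazon EC2 using a previously configured credential (the credential name ec2_cred below is a placeholder; use whatever name you chose during Getting Started), the section might look like:

[cluster]
# Name of the new cluster to create on EC2
CLUSTER_NAME=16S_cluster
# Placeholder; must match the name of your configured EC2 credential
CLUSTER_CREDENTIAL=ec2_cred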

[pipeline]
# Pipeline Name
PIPELINE_NAME=ReplaceThisWithYourPipelineName

# Pipeline Description
PIPELINE_DESC=

Each pipeline run requires a unique name, PIPELINE_NAME, so that the CloVR system can download the correct set of outputs after the pipeline has finished. This parameter is especially important if multiple pipelines are running on the same cluster. You may also optionally add a description of the pipeline by setting the PIPELINE_DESC parameter.
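
A filled-in example (both values below are arbitrary placeholders):

[pipeline]
# Unique name for this pipeline run
PIPELINE_NAME=lung_16S_run1
# Optional free-text description
PIPELINE_DESC=First CloVR-16S test run on the AMP lung dataset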

The rest of the configuration file represents advanced settings and should not be altered for this walkthrough.

3. Running the 16S pipeline

Now that your config file is ready, the CloVR-16S pipeline can be executed from the command-line as:

clovr16S /mnt/CloVR_16S.config &

The clovr16S command launches a cluster as specified by parameters in the config file and starts the CloVR-16S pipeline.
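
Because the trailing & runs the command in the background, you may want to redirect its console output to a log file that you can inspect later, e.g. (the log file path is just a suggestion):

clovr16S /mnt/CloVR_16S.config > /mnt/clovr16S.log 2>&1 &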

4. Monitoring your pipeline

The pipeline status can be monitored by navigating to the Ergatis web interface. This requires knowing the IP address of the CloVR EC2 master node or of the local CloVR VM. The IP address is shown on the Desktop of the CloVR VM or can be obtained with the following command:

vp-describe-cluster --list

This command returns a list of all available clusters. Here is an example of the output:

*** Available Clusters ***
CLUSTER local
CLUSTER 16S_cluster

Identify the cluster that the 16S pipeline is running on and provide it to vp-describe-cluster again, but this time with the --name option:

[master <clovr_ip>]$ vp-describe-cluster --name 16S_cluster  
MASTER  i-571c113d      ec2-72-44-39-80.compute-1.amazonaws.com running
GANGLIA http://ec2-72-44-39-80.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
SSH     ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no
-o ConnectTimeout=30 -o StrictHostKeyChecking=no -o ServerAliveInterval=30
-o UserKnownHostsFile=/dev/null -q -i /mnt/keys/devel1.pem
root@ec2-72-44-39-80.compute-1.amazonaws.com

To monitor the status of your pipeline, navigate to the Ergatis and Ganglia links in the output.

Downloading Output

OUTPUT TARBALL: CloVR 16S output

Once your pipeline has run to completion, the output files are automatically downloaded to your local VM and can be found in the output directory specified in the pipeline configuration file:

[output]
OUTPUT_DIR=/mnt/output

In this directory you should find a tarball containing the results of the pipeline run, which can be extracted using tar (on Unix) or a utility such as WinZip or WinRAR (on Windows).
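
For example, assuming the tarball is gzipped and named clovr_16S_output.tar.gz (the actual name will depend on your pipeline run):

cd /mnt/output
tar -xzf clovr_16S_output.tar.gz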

The CloVR-16S pipeline outputs several different files from two parallel protocols: (i) a QIIME-based analysis and (ii) a Mothur/RDP-based analysis. Depending on the characteristics of the data, some results may not be generated due to inherent computational difficulties or poor expected results. The outputs are:

QIIME-based analysis:

qiime_otu_table: A text table describing the abundance of each OTU and its assigned taxonomy.
qiime_fasttree: A phylogenetic tree of alignable sequences constructed using the FastTree program.
qiime_heatmap: An HTML-based heatmap application summarizing the OTU table information.
qiime_summary_tables: Taxonomic summary tables at various phylogenetic levels.
qiime_summary_histograms: Visualized stacked histograms of all samples for various taxonomic groups.
qiime_skiff: Output of skiff clusterings for different taxonomic levels.
qiime_metastats: Output of Metastats analysis comparing subject groups or samples at different taxonomic levels.
qiime_beta: Visualized principal coordinate analysis results of unsupervised clustering with UniFrac (HTML/Java-based).

Mothur/RDP-based analysis:

rdp_res: Raw text output of RDP Bayesian classifier runs.
rdp_tables: Summarized RDP-based taxonomic counts for all samples.
rdp_skiff: Output of skiff clusterings for different taxonomic levels.
rdp_metastats: Output of Metastats analysis comparing subject groups or samples at different taxonomic levels.
rdp_histograms: Visualized stacked histograms of all samples for various taxonomic groups.
mothur_otu_list: OTUs created using Mothur and a range of minimum distance thresholds.
mothur_shannon: Shannon diversity indices computed for the OTUs of each sample.
mothur_chao: Chao1 diversity metrics computed for the OTUs of each sample.
mothur_ace: ACE diversity metrics computed for the OTUs of each sample.
mothur_rare: Rarefaction curves computed for the OTUs of each sample.
mothur_summary: Mothur summary information.

5. Terminating a cluster

When utilizing a cluster on EC2, you must terminate the cluster after the pipeline and download have completed. To terminate a cluster, pass your cluster name to vp-terminate-cluster:

vp-terminate-cluster --cluster=cluster_name
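
For example, to terminate the 16S_cluster shown in the vp-describe-cluster output above:

vp-terminate-cluster --cluster=16S_cluster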

Interrupting a pipeline

If the execution of CloVR-16S is not going well for some reason, or you realize you have made a mistake, you can interrupt the pipeline by visiting the Ergatis page describing the running pipeline and clicking the “kill” button at the top of the page. This will cause the pipeline to stop, although it may take a minute to halt completely. See below on restarting a pipeline.

Recovering from error and restarting the pipeline

If the execution of CloVR-16S fails and the pipeline has to be restarted, CloVR will attempt to resume the previous run if the same command is used. To start the pipeline from scratch instead, change PIPELINE_NAME in the config file to a different name. Also note that if you have made any changes to the input data, you will need to re-tag it using vp-add-dataset.

# Name of pipeline
PIPELINE_NAME=clovr_16S_pipeline-2
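
A minimal restart-from-scratch sequence might then look like the following, assuming the input fasta files were corrected and the config file was updated with the new pipeline name as above (file paths as in the tagging step):

vp-add-dataset -o --tag-name=clovr_16S_input /mnt/*.fasta
clovr16S /mnt/CloVR_16S.config &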