The current version of CloVR-16S can be easily run on a local computer with at least 4 GB of RAM and 15 GB of free disk space. In its present form, CloVR-16S provides limited support for parallel processing. Therefore, use of the Cloud does not always significantly reduce the pipeline duration.
Use of the Cloud is entirely optional. If you want to use the Amazon Cloud, be sure to have configured your Amazon EC2 credentials. Also, usage on Amazon EC2 is charged per hour and care must be taken to terminate instances after a protocol has completed. (See vp-terminate-cluster command below.)
Inputs
1. A single fasta file with multiplex barcodes & a Qiime-formatted mapping file
OR
2. Multiple fasta files (1 file per sample) & a CloVR-formatted mapping file
Download a Dataset
- Single fasta dataset: CloVR-16S mini example single fasta
- Multiple fasta dataset: CloVR-16S mini example multiple fastas
In preparation for the CloVR-16S pipeline run, download and extract a dataset to the shared folder in the virtual machine (VM) directory to allow easy access when working from within the CloVR VM. In the VM, the shared folder is accessible as /mnt/. Each dataset above has a mapping file as well, so be sure it is also extracted to the shared folder.
If you are using Virtual Box and are having problems accessing your data in the shared folder, check Troubleshooting on Virtual Box.
Pipeline Execution
1. Tagging data
Before starting a pipeline, you must first tag your data. CloVR pipelines use a tagging system that assigns unique names to the data being uploaded and downloaded. These unique names are used at many steps throughout the pipeline process.
For this pipeline, you need to tag one or more fasta files and a corresponding mapping file (depending on your data, the mapping file will be in either CloVR or Qiime format). This first command will tag a single fasta file as clovr_16S_input.
vp-add-dataset -o --tag-name=clovr_16S_input /mnt/AMP_Lung.small.fasta
If you have created subdirectories within /mnt/, update the file paths accordingly. To tag multiple fasta files with the vp-add-dataset command, you can either list the path to each fasta file or use a Unix wildcard such as /mnt/*.fasta, e.g.:
vp-add-dataset -o --tag-name=clovr_16S_input /mnt/Afile /mnt/Bfile /mnt/Cfile
vp-add-dataset -o --tag-name=clovr_16S_input /mnt/*.fasta
The next command will tag the mapping file:
vp-add-dataset -o --tag-name=clovr_16S_mapping /mnt/<mapping-file>
Note that you have to provide a unique tag-name in each command. Also note that if the files associated with a tag-name change, the vp-add-dataset command has to be repeated. You can choose your own tag-names. You will need these names when you edit the pipeline configuration file below.
You can use the command vp-describe-dataset to see all existing tags. CloVR-16S will use more tags in addition to those defined by the user.
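Before tagging, it can be worth confirming that the paths you are about to pass really point at FASTA files. The following is a minimal sketch, not part of CloVR: the helper names is_fasta and check_fasta_files are hypothetical, and the only check performed is that each file is non-empty and that its first non-blank line begins with the FASTA header marker ">".

```shell
# is_fasta FILE: succeed if FILE exists, is non-empty, and its first
# non-blank line starts with '>' (the FASTA header marker).
is_fasta() {
    [ -s "$1" ] || return 1
    # awk prints the first line that contains any non-whitespace, then stops
    first=$(awk 'NF { print; exit }' "$1")
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# check_fasta_files FILE...: report every argument that fails is_fasta and
# return non-zero if any did. An unmatched wildcard such as /mnt/*.fasta is
# passed through literally by the shell, so it is reported as a bad file too.
check_fasta_files() {
    bad=0
    for f in "$@"; do
        if ! is_fasta "$f"; then
            echo "not a readable FASTA file: $f" >&2
            bad=1
        fi
    done
    return "$bad"
}
```

One possible use is to run the check first and tag only if it passes, for example: check_fasta_files /mnt/*.fasta && vp-add-dataset -o --tag-name=clovr_16S_input /mnt/*.fasta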
2. Editing the configuration file
FILE: CloVR-16S configuration file
Use the configuration file to define parameters in various components of CloVR-16S, as well as to determine input, output, and log files or to fine-tune other options that control the pipeline. The configuration file also determines whether CloVR-16S will be executed locally or whether the Amazon Cloud will be used.
Copy the configuration file into the same “shared” folder as the input file and access it in the /mnt/ directory. The configuration file detailed below is available from the link above.
[input]
# Input fasta tag, this is what you tagged your fasta files with
FASTA_TAG=clovr_16S_input
# Mapping tag for pipeline
MAPPING_TAG=clovr_16S_mapping
# Reference database, do not modify.
REF_DB_TAG=clovr-core-set-aligned-imputed-fasta
In this pipeline the input tags FASTA_TAG and MAPPING_TAG must match the tags you used with the vp-add-dataset commands above.
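A mismatch between the tags in the configuration file and the tags used with vp-add-dataset is easy to overlook. As a minimal sketch, the hypothetical helpers below (config_value and expect_tag are illustrative names, not CloVR commands) read a KEY=value setting from the file and compare it with the tag you expect. They assume one setting per line and treat lines starting with # as comments.

```shell
# config_value KEY FILE: print the value of the first uncommented KEY=value
# line in FILE. Prints nothing if KEY is not set.
config_value() {
    awk -F= -v key="$1" '
        /^[ \t]*#/ { next }                 # skip comment lines
        NF < 2     { next }                 # skip lines without "="
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)  # trim spaces around the key
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
                exit
            }
        }' "$2"
}

# expect_tag KEY EXPECTED FILE: succeed only if KEY is set to EXPECTED.
expect_tag() {
    actual=$(config_value "$1" "$3")
    if [ "$actual" = "$2" ]; then
        echo "$1 ok: $actual"
    else
        echo "$1 mismatch: config has '$actual', expected '$2'" >&2
        return 1
    fi
}
```

With the tags from the commands above, the check could be run as expect_tag FASTA_TAG clovr_16S_input /mnt/CloVR_16S.config and expect_tag MAPPING_TAG clovr_16S_mapping /mnt/CloVR_16S.config.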
The CloVR-16S pipeline supports the use of a customized 16S template alignment for the QIIME component of the pipeline (by default, the Greengenes core template alignment is used, as described on the QIIME project website). For the time being, the REF_DB_TAG value should not be altered.
[cluster]
# Cluster name
CLUSTER_NAME=local
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local
The cluster section determines the type of cluster used by the CloVR-16S pipeline. This can be either an existing cluster that is already running or a new cluster that has to be created by the pipeline. A cluster is assigned a unique identifier as defined by the CLUSTER_NAME variable. If CloVR-16S is run locally, both CLUSTER_NAME and CLUSTER_CREDENTIAL should be “local”. If you have set up different credentials for a Cloud service (see Getting Started section above), you may set them here.
[pipeline]
# Pipeline Name
PIPELINE_NAME=ReplaceThisWithYourPipelineName
# Pipeline Description
PIPELINE_DESC=
Each pipeline run requires a unique name, set in PIPELINE_NAME, so that the CloVR system can download the correct set of output after the pipeline has finished. This parameter is especially important if multiple pipelines are running on the same cluster. You may also add a description of the pipeline by setting the optional PIPELINE_DESC parameter.
The rest of the configuration file represents advanced settings and should not be altered for this walkthrough.
3. Running the 16S pipeline
Now that your config file is ready, the CloVR-16S pipeline can be executed from the command line as:
clovr16S /mnt/CloVR_16S.config &
The clovr16S command launches a cluster as specified by parameters in the config file and starts the CloVR-16S pipeline.
4. Monitoring your pipeline
The pipeline status can be monitored by navigating to the Ergatis web interface. This requires knowing the IP address of the CloVR EC2 master node or of the local CloVR VM. The IP address is shown on the Desktop of the CloVR VM or can be obtained with the following command:
vp-describe-cluster --list
This command returns a list of all available clusters. Here is an example of the output:
*** Available Clusters ***
CLUSTER local
CLUSTER 16S_cluster
Identify the cluster that the 16S pipeline is running on and provide it to vp-describe-cluster again, but this time with the --name option:
[master <clovr_ip>]$ vp-describe-cluster --name 16S_cluster
MASTER i-571c113d ec2-72-44-39-80.compute-1.amazonaws.com running
GANGLIA http://ec2-72-44-39-80.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
SSH ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no -o ConnectTimeout=30 -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -q -i /mnt/keys/devel1.pem root@ec2-72-44-39-80.compute-1.amazonaws.com
To monitor the status of your pipeline, navigate to the Ergatis and Ganglia links in the output.
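If you prefer not to copy the link by hand, the URL can be pulled out of the command output. The sketch below assumes that the output has one record per line, with the label (ERGATIS, GANGLIA) as the first whitespace-separated field and the URL as the second, as in the example above. service_url is a hypothetical helper name, not a CloVR command.

```shell
# service_url LABEL: read `vp-describe-cluster --name ...` output on stdin and
# print the second field of the first line whose first field is LABEL.
service_url() {
    awk -v label="$1" '$1 == label { print $2; exit }'
}

# Demonstration on the example output shown above.
service_url ERGATIS <<'EOF'
MASTER i-571c113d ec2-72-44-39-80.compute-1.amazonaws.com running
GANGLIA http://ec2-72-44-39-80.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
EOF
# prints http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
```

On a running system the same helper could be fed directly, for example: vp-describe-cluster --name 16S_cluster | service_url ERGATIS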
Downloading Output
OUTPUT TARBALL: CloVR 16S output
Once your pipeline has run to completion, the output files are automatically downloaded to your local VM and can be found in the output directory specified in the pipeline configuration file:
[output]
OUTPUT_DIR=/mnt/output
In this directory you should find a tarball containing the results of the pipeline run, which can be extracted using tar (on Unix) or a utility such as WinZip or WinRAR (on Windows).
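On Unix, the extraction step might look like the sketch below. unpack_results is a hypothetical helper name, and the tarball's file name is not specified here, so it is passed as an argument. The helper relies on tar -xf detecting the compression format on its own, which GNU tar and bsdtar both do when reading an archive.

```shell
# unpack_results TARBALL DEST: list the archive contents, then extract them
# into DEST (created if it does not exist yet).
unpack_results() {
    mkdir -p "$2" || return 1
    tar -tf "$1" || return 1      # show what the archive contains
    tar -xf "$1" -C "$2"          # extract into DEST
}
```

A call would then take the form unpack_results /mnt/output/<tarball> /mnt/output/results, where <tarball> stands for the file you find in the output directory and the results directory name is only an example.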
The CloVR-16S pipeline outputs several different files from two parallel protocols: (i) a Qiime-based analysis, and (ii) a Mothur/RDP-based analysis. Depending on the characteristics of the data, some results may not be generated due to inherent computational difficulties or poor expected results. The outputs are:
Output | Description |
---|---|
Qiime-based analysis | |
qiime_otu_table | A text table describing the abundance of each OTU and its assigned taxonomy. |
qiime_fasttree | A phylogenetic tree of alignable sequences constructed using the FastTree program. |
qiime_heatmap | An html-based heatmap application summarizing the OTU table information. |
qiime_summary_tables | Taxonomic summary tables at various phylogenetic levels. |
qiime_summary_histograms | Visualized stacked histograms of all samples for various taxonomic groups. |
qiime_skiff | Output of skiff clusterings for different taxonomic levels. |
qiime_metastats | Output of Metastats analysis comparing subject groups or samples at different taxonomic levels. |
qiime_beta | Visualized principal coordinate analysis results of unsupervised clustering with UniFrac (html/java-based). |
Mothur/RDP-based analysis | |
rdp_res | Raw text output of RDP Bayesian classifier runs. |
rdp_tables | Summarized RDP-based taxonomic counts for all samples. |
rdp_skiff | Output of skiff clusterings for different taxonomic levels. |
rdp_metastats | Output of Metastats analysis comparing subject groups or samples at different taxonomic levels. |
rdp_histograms | Visualized stacked histograms of all samples for various taxonomic groups. |
mothur_otu_list | OTUs created using Mothur and a range of minimum distance thresholds. |
mothur_shannon | Shannon diversity indices computed for OTUs for each sample. |
mothur_chao | Chao1 diversity metrics computed for OTUs for each sample. |
mothur_ace | ACE diversity metrics computed for OTUs for each sample. |
mothur_rare | Rarefaction curves computed for OTUs for each sample. |
mothur_summary | Mothur summary information. |
5. Terminating a cluster
When using a cluster on EC2, you must terminate the cluster after the pipeline and the download have completed. To terminate a cluster, run the following command, replacing cluster_name with the name of your cluster:
vp-terminate-cluster --cluster=cluster_name
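Because EC2 usage is charged per hour, it may help to check which clusters other than the local one still exist before you walk away. The sketch below filters the output of vp-describe-cluster --list, assuming the one-cluster-per-line "CLUSTER name" layout of the example output shown earlier. remote_clusters is a hypothetical helper name, not a CloVR command.

```shell
# remote_clusters: read `vp-describe-cluster --list` output on stdin and print
# the name of every cluster other than "local", one per line.
remote_clusters() {
    awk '$1 == "CLUSTER" && $2 != "local" { print $2 }'
}

# Demonstration on the example listing shown earlier.
remote_clusters <<'EOF'
*** Available Clusters ***
CLUSTER local
CLUSTER 16S_cluster
EOF
# prints 16S_cluster
```

On a running system this could be used as vp-describe-cluster --list | remote_clusters, and each name it prints can then be passed to vp-terminate-cluster --cluster=<name>.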
Interrupting a pipeline
If the execution of CloVR-16S is not going well for some reason or you realize you have made a mistake, you can interrupt the pipeline by visiting the Ergatis link describing the running pipeline and clicking the “kill” button at the top of the page. This will cause the pipeline to stop. It may take a minute for the pipeline to halt completely. See below on restarting a pipeline.
Recovering from error and restarting the pipeline
If the execution of CloVR-16S fails and the pipeline has to be restarted, CloVR will attempt to resume the previous run, if the same command is used. In order to start the pipeline from scratch, PIPELINE_NAME should be changed in the config file to a different name. Also, note that if you have made any changes to the input data, you will need to re-tag it using vp-add-dataset.
# Name of pipeline
PIPELINE_NAME=clovr_16S_pipeline-2
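Changing the name can be done in any text editor. As an alternative, the sketch below rewrites the PIPELINE_NAME line with sed and keeps a backup copy of the file. set_pipeline_name is a hypothetical helper name. The restriction of names to letters, digits, "_" and "-" is an assumption made here to keep the sed expression safe, not a documented CloVR rule.

```shell
# set_pipeline_name NAME FILE: rewrite the PIPELINE_NAME= line of FILE,
# keeping a copy of the original as FILE.bak.
set_pipeline_name() {
    # Assumption: accept only letters, digits, "_" and "-" so that NAME
    # cannot interfere with the sed substitution below.
    case $1 in
        ''|*[!A-Za-z0-9_-]*)
            echo "invalid pipeline name: $1" >&2
            return 1 ;;
    esac
    cp "$2" "$2.bak" || return 1
    sed "s|^PIPELINE_NAME=.*|PIPELINE_NAME=$1|" "$2.bak" > "$2"
}
```

For example, set_pipeline_name clovr_16S_pipeline-2 /mnt/CloVR_16S.config produces the setting shown above. Appending a timestamp, as in "clovr_16S_pipeline-$(date +%Y%m%d-%H%M%S)", is one way to keep names unique across runs.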