CloVR-16S v1.1 Walkthrough (command-line)

Introduction

Runtimes of CloVR-16S version 1.1 depend largely on whether the optional chimera check of all sequences with UCHIME is performed. Without the chimera checking step, up to 500K sequences can be processed in a few hours on a local computer with a single processor, 2 GB of RAM and 15 GB of free disk space. The same analysis including the chimera check should still complete in less than 24 hours.

In contrast to older versions of CloVR-16S, version 1.1 has lower RAM requirements and parallelizes several steps of the protocol, e.g. the chimera check with UCHIME, calculation of rarefaction curves with Mothur and identification of differentially abundant OTUs with Metastats.

Input

1. A single FASTA file with multiplex barcodes & a QIIME-formatted metadata mapping file

OR

2. Multiple FASTA files (1 file per sample) & a CloVR-formatted metadata mapping file

To process multiple samples with CloVR-16S, either from a single or from multiple FASTA files, additional metadata associated with each sample needs to be provided in the form of a mapping file. This tab-delimited text file specifies, for example, information about barcodes used for multiplex sequencing or groups of related samples.
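As an illustration only, a minimal QIIME-formatted mapping file for two multiplexed samples could look like the following (columns are separated by tabs; the sample names, barcodes, primer sequence and metadata values are placeholders and must be replaced with those from your own sequencing run):

#SampleID	BarcodeSequence	LinkerPrimerSequence	TreatmentGroup	Description
SampleA	ACGCTCGACA	CATGCTGCCTCCCGTAGGAGT	Control	Healthy control sample
SampleB	AGACGCACTC	CATGCTGCCTCCCGTAGGAGT	Case	Diseased sample

In the QIIME format, the header line starts with #SampleID, the BarcodeSequence and LinkerPrimerSequence columns describe the multiplexing setup, any additional metadata columns (such as TreatmentGroup here) can be used to define groups of related samples, and Description must be the last column.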


Download Test Datasets and Output

Test datasets are *.tar archives of FASTA and mapping files and need to be extracted before they can be used with CloVR-16S.

In preparation for the CloVR-16S pipeline run, download and extract a dataset to the shared folder in the virtual machine (VM) directory to allow easy access when working from within the CloVR VM. In the VM, the shared folder is accessible as /mnt/. Each dataset above has a mapping file as well, so be sure it is also extracted to the shared folder.
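For example, assuming the test dataset archive was saved to the shared folder (the archive name below is a placeholder; substitute the file you actually downloaded), it can be extracted from within the VM as follows:

cd /mnt
tar -xvf clovr_16S_testdata.tar

After extraction, both the FASTA file(s) and the corresponding mapping file should be visible under /mnt/.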

If you are using Virtual Box and are having problems accessing your data in the shared folder, check Troubleshooting on Virtual Box.

Pipeline Execution

1. Tagging data

Before starting a pipeline, you must first tag your data. CloVR uses a tagging system to assign unique names to data that is uploaded to or downloaded from its pipelines. These unique names are used throughout the system during many steps of the pipeline process.

For this pipeline, you need to tag one or more FASTA files and a corresponding mapping file (depending on your data, the mapping file will be in either CloVR or QIIME format). This first command tags a single FASTA file as clovr_16S_input.

vp-add-dataset -o --tag-name=clovr_16S_input /mnt/AMP_Lung.small.fasta

If any secondary directories have been created within /mnt/, the file paths should be updated to reflect this. To tag multiple FASTA files with the vp-add-dataset command, you can list the paths to each FASTA file or use a Unix wildcard such as /mnt/*.fasta, e.g.:

vp-add-dataset -o --tag-name=clovr_16S_input /mnt/Afile /mnt/Bfile /mnt/Cfile

vp-add-dataset -o --tag-name=clovr_16S_input /mnt/*.fasta

The next command will tag the mapping file:

vp-add-dataset -o --tag-name=clovr_16S_mapping /mnt/<mapping-file>

Note that you have to provide a unique tag-name in each command. Also note that if the files associated with a tag-name change, the vp-add-dataset command has to be repeated. You can choose your own tag-names; you will need them when you edit the pipeline configuration file below.

You can use the command vp-describe-dataset to see all existing tags. CloVR-16S will use more tags in addition to those defined by the user.
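For example, after the two tagging commands above, running vp-describe-dataset should list clovr_16S_input and clovr_16S_mapping among the available tags (the exact output format may vary between CloVR versions):

vp-describe-dataset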

2. Editing the configuration file

FILE: CloVR-16S configuration file

Use the configuration file to define parameters in various components of CloVR-16S, as well as to determine input, output, and log files or to fine-tune other options that control the pipeline. The configuration file also determines whether CloVR-16S will be executed locally or whether the Amazon Cloud will be used.

Copy the configuration file into the same “shared” folder as the input file and access it in the /mnt/ directory. The configuration file detailed below can be found in the link above.

[input]
# Input fasta tag
FASTA_TAG=clovr_16S_input

# Input quality scores tag
QUAL_TAG=

# Mapping tag for pipeline
MAPPING_TAG=clovr_16S_mapping

In this pipeline, the input tags FASTA_TAG and MAPPING_TAG must match the tags you used with the vp-add-dataset commands above. Optionally, users may also provide quality files corresponding to the input FASTA files, which would be tagged under QUAL_TAG. We do not use quality scores in this walkthrough.

[cluster]
# Cluster name
CLUSTER_NAME=local
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local

The cluster section determines the type of cluster used by the CloVR-16S pipeline. This can either be an existing cluster that is already running or a new cluster that will be created by the pipeline. A cluster is assigned a unique identifier as defined by the CLUSTER_NAME variable. If CloVR-16S is run locally, both CLUSTER_NAME and CLUSTER_CREDENTIAL should be “local”. If you have set up different credentials for a Cloud service (see the Getting Started section above), you may set them here.
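For example, to run the same pipeline on Amazon EC2 instead of locally, the section might be edited along these lines (16S_cluster matches the cluster name used in the monitoring example below; the credential placeholder must be replaced with the name of the credential you registered during setup):

[cluster]
CLUSTER_NAME=16S_cluster
CLUSTER_CREDENTIAL=<your-cloud-credential-name>

For the local run described in this walkthrough, leave both values set to local.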

[pipeline]
# Pipeline Name
PIPELINE_NAME=ReplaceThisWithYourPipelineName

# Pipeline Description
PIPELINE_DESC=

Each pipeline run requires a unique name PIPELINE_NAME so that the CloVR system can download the correct set of outputs after the pipeline has finished. This parameter is especially important if multiple pipelines are running on the same cluster. You may also optionally add a description of the pipeline by setting the PIPELINE_DESC parameter.
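For this walkthrough any descriptive values will do, for example (both values below are arbitrary examples):

[pipeline]
PIPELINE_NAME=clovr_16S_walkthrough_run1
PIPELINE_DESC=CloVR-16S test run on AMP_Lung.small.fasta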

The rest of the configuration file represents advanced settings and should not be altered for this walkthrough.

3. Running the 16S pipeline

Now that your config file is ready, the CloVR-16S pipeline can be executed from the command-line as:

clovr16S /mnt/clovr_16S.config &

The clovr16S command launches a cluster as specified by parameters in the config file and starts the CloVR-16S pipeline.
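The trailing & runs the command in the background so that the terminal remains usable. If you also want to keep a log of the pipeline's terminal output, a standard shell variation is the following (the log file path is only an example):

nohup clovr16S /mnt/clovr_16S.config > /mnt/clovr_16S.log 2>&1 &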

4. Monitoring your pipeline

The pipeline status can be monitored by navigating to the Ergatis web interface. This requires knowing the IP address of the CloVR EC2 master node or of the local CloVR VM. The IP address is shown on the Desktop of the CloVR VM or can be obtained with the following command:

vp-describe-cluster --list

This command returns a list of all available clusters. Example output:

*** Available Clusters ***
CLUSTER local
CLUSTER 16S_cluster

Identify the cluster that the 16S pipeline is running on and provide it to vp-describe-cluster again, but this time with the --name option:

[master <clovr_ip>]$ vp-describe-cluster --name 16S_cluster 
MASTER  i-571c113d      ec2-72-44-39-80.compute-1.amazonaws.com running
GANGLIA http://ec2-72-44-39-80.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
SSH     ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no
-o ConnectTimeout=30 -o StrictHostKeyChecking=no -o ServerAliveInterval=30
-o UserKnownHostsFile=/dev/null -q -i /mnt/keys/devel1.pem
root@ec2-72-44-39-80.compute-1.amazonaws.com

To monitor the status of your pipeline, navigate to the Ergatis and Ganglia links in the output.
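If the pipeline is running on the local VM rather than on EC2, the same interfaces are served by the VM itself; assuming <clovr_ip> stands for the IP address shown on the CloVR Desktop or reported by vp-describe-cluster, the URLs follow the same pattern as in the example output above:

http://<clovr_ip>/ergatis
http://<clovr_ip>/ganglia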

Downloading Output


OUTPUT TARBALL: CloVR 16S output

Once your pipeline has run to completion, the output files are automatically downloaded to your local VM and can be found in the output directory specified in the pipeline configuration file:

[output]
OUTPUT_DIR=/mnt/output

Navigating to this directory, you should find a tarball containing the results of the pipeline run, which can be extracted using tar (in Unix) or a utility such as WinZip or WinRAR (in Windows).
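For example, from within the VM (the tarball name is a placeholder; use ls to see the file actually produced by your run, and add the -z option if the archive is gzip-compressed and your tar version does not detect this automatically):

cd /mnt/output
tar -xvf <output-tarball>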


Outputs

Depending on the characteristics of the data, some results may not be generated due to inherent computational difficulties or poor expected results. The outputs are:

Output              Description
filtered_sequences  Sequences passing the QIIME-based poor-quality filter (filename: seqs.fna)
chimeras            Sequence names from seqs.fna identified as putative chimeras (filename: allchimeraids.txt)
uclust_otus         Table showing OTU sample compositions (RDP classifier/QIIME)
summary_tables      Taxonomic summary tables at various phylogenetic levels
rarefactions        Alpha-diversity: rarefaction numerical curves separated by sample (Mothur)
rarefaction_plots   Visualized rarefaction plots separated by metadata type (Leech/CloVR)
mothur_summary      Richness and diversity estimators (Mothur)
PCoA_plots          Beta-diversity: weighted & unweighted UniFrac 3D PCoA plots (QIIME)
skiff               Taxonomic composition-based sample heatmap clustering (Skiff/CloVR)
histograms          Taxonomic composition-based stacked histograms (CloVR)
metastats           Differentially abundant taxonomic groups (Metastats)