Runtimes of CloVR-16S version 1.1 depend largely on whether the optional chimera check of all sequences with UCHIME is performed. Without the chimera check, up to 500K sequences can be processed in several hours on a local computer with a single processor, 2 GB of RAM and 15 GB of free disk space. The same analysis including the chimera check should still complete in less than 24 hours.
In contrast to older versions of CloVR-16S, version 1.1 has lower RAM requirements and parallelizes several steps of the protocol, e.g. the chimera check with UCHIME, calculation of rarefaction curves with Mothur and identification of differentially abundant OTUs with Metastats.
1. A single FASTA file with multiplex barcodes & a QIIME-formatted metadata mapping file
2. Multiple FASTA files (1 file per sample) & a CloVR-formatted metadata mapping file
To process multiple samples with CloVR-16S, either from a single or from multiple FASTA files, additional metadata associated with each sample needs to be provided in the form of a mapping file. This tab-delimited text file specifies, for example, information about barcodes used for multiplex sequencing or groups of related samples.
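Because the mapping file must be tab-delimited, a quick consistency check before tagging can catch files saved with spaces instead of tabs. The following is a minimal sketch, not part of CloVR: the function name check_mapping_columns is hypothetical, and it assumes only that the first non-blank line is a header and that every row should have the same number of tab-separated fields.

```shell
# Sketch: verify that a mapping file is consistently tab-delimited.
# check_mapping_columns is a hypothetical helper, not a CloVR command.
check_mapping_columns() {
    awk -F '\t' '
        NF == 0    { next }                       # skip blank lines
        !seen      { want = NF; seen = 1; next }  # first line sets the column count
        NF != want { printf "line %d: %d fields, expected %d\n", NR, NF, want; bad = 1 }
        END {
            if (want < 2) { print "fewer than 2 tab-separated columns"; exit 1 }
            exit bad + 0
        }
    ' "$1"
}

# Example (the path is a placeholder):
# check_mapping_columns /mnt/<mapping-file> && echo "mapping file looks consistent"
```

The check says nothing about whether the columns match the CloVR or QIIME mapping conventions; it only reports rows whose field count differs from the header.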
Download Test Datasets and Output
- Single FASTA dataset + mapping file: CloVR-16S mini example single FASTA
- Multiple FASTA dataset + mapping file: CloVR-16S mini example multiple FASTAs
- Output of single FASTA dataset run: CloVR-16S single FASTA example output
Test datasets are *.tar archives of FASTA and mapping files that need to be extracted before they can be used with CloVR-16S.
In preparation for the CloVR-16S pipeline run, download and extract a dataset to the shared folder in the virtual machine (VM) directory to allow easy access when working from within the CloVR VM. In the VM, the shared folder is accessible as /mnt/. Each dataset above has a mapping file as well, so be sure it is also extracted to the shared folder.
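Extracting an archive into the shared folder could look like the sketch below. The helper name extract_dataset, the archive name and the shared folder path are placeholders chosen for illustration; only the standard tar options -x, -f and -C are relied on.

```shell
# Sketch: unpack a downloaded test dataset into the shared folder.
# extract_dataset is a hypothetical helper; both arguments are placeholders.
extract_dataset() {
    archive=$1
    dest=$2
    # Stop early if the archive is missing.
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # -x extract, -f read the named archive, -C change into dest first.
    tar -xf "$archive" -C "$dest" || return 1
    # List the result so the FASTA and mapping files can be confirmed.
    ls -1 "$dest"
}

# Example (hypothetical file and folder names):
# extract_dataset ~/Downloads/example_dataset.tar /path/to/shared/folder
```

Listing the destination afterwards makes it easy to confirm that both the FASTA file(s) and the mapping file ended up in the shared folder.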
If you are using VirtualBox and have problems accessing your data in the shared folder, check Troubleshooting on Virtual Box.
1. Tagging data
Before starting a pipeline, you must first tag your data. CloVR pipelines use a tagging system that assigns a unique name to each dataset being uploaded or downloaded. These names are then used throughout the system at many steps of the pipeline process.
For this pipeline, you need to tag one or more FASTA files and a corresponding mapping file (depending on your data, the mapping file will be in either CloVR or QIIME format). This first command tags a single FASTA file as clovr_16S_input:
vp-add-dataset -o --tag-name=clovr_16S_input /mnt/AMP_Lung.small.fasta
If any subdirectories have been created within /mnt/, the file paths should be updated to reflect this. To tag multiple FASTA files with the vp-add-dataset command, you can list the path to each FASTA file or use a Unix wildcard such as /mnt/*.fasta, e.g.:
vp-add-dataset -o --tag-name=clovr_16S_input /mnt/Afile /mnt/Bfile /mnt/Cfile
vp-add-dataset -o --tag-name=clovr_16S_input /mnt/*.fasta
The next command will tag the mapping file:
vp-add-dataset -o --tag-name=clovr_16S_mapping /mnt/<mapping-file>
Note that you have to provide a unique tag name in each command. Also note that if the files associated with a tag name change, the vp-add-dataset command has to be repeated. You can choose your own tag names. You will need these names when you edit the pipeline configuration file below.
You can use the command vp-describe-dataset to see all existing tags. CloVR-16S will use more tags in addition to those defined by the user.
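The tagging commands above can be wrapped in a small guard that refuses to tag when a path does not exist, for instance when a wildcard such as /mnt/*.fasta matches nothing. This is a sketch under assumptions: tag_files and the VP_ADD_DATASET variable are hypothetical names introduced here, and the only vp-add-dataset options used are the -o and --tag-name options shown in this walkthrough.

```shell
# Sketch: tag one or more files only after checking that each one exists.
# VP_ADD_DATASET is a hypothetical override; set it to "echo" for a dry run.
VP_ADD_DATASET=${VP_ADD_DATASET:-vp-add-dataset}

tag_files() {
    tag=$1
    shift
    [ "$#" -gt 0 ] || { echo "no files given for tag $tag" >&2; return 1; }
    for f in "$@"; do
        # An unmatched wildcard stays literal and is rejected here.
        [ -f "$f" ] || { echo "not a file: $f" >&2; return 1; }
    done
    # Same options as in the walkthrough: -o and --tag-name.
    "$VP_ADD_DATASET" -o --tag-name="$tag" "$@"
}

# Examples, using the tag names from this walkthrough:
# tag_files clovr_16S_input /mnt/*.fasta
# tag_files clovr_16S_mapping /mnt/<mapping-file>
```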
2. Editing the configuration file
Use the configuration file to define parameters in various components of CloVR-16S, as well as to determine input, output, and log files or to fine-tune other options that control the pipeline. The configuration file also determines whether CloVR-16S will be executed locally or whether the Amazon Cloud will be used.
Copy the configuration file into the same “shared” folder as the input file and access it in the /mnt/ directory. The configuration file detailed below can be found in the link above.
[input]
# Input fasta tag
FASTA_TAG=clovr_16S_fasta
# Input quality scores tag
QUAL_TAG=
# Mapping tag for pipeline
MAPPING_TAG=clovr_16S_map
In this pipeline the input tags FASTA_TAG and MAPPING_TAG must match the tags you used with the vp-add-dataset commands above; if you tagged your data as clovr_16S_input and clovr_16S_mapping, set FASTA_TAG and MAPPING_TAG to those names. Optionally, users may also provide quality files corresponding to the input FASTA files, which would be tagged under QUAL_TAG. Quality scores are not used in this walkthrough.
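Since a mismatch between the tags in the configuration file and the tags created with vp-add-dataset is easy to overlook, the values can be read back and compared before the run. The sketch below uses hypothetical helpers (config_value, check_tag) and assumes the KEY=value layout of the excerpt above, with each key appearing only once in the file.

```shell
# Sketch: read KEY=value entries from the config and compare with a tag name.
# config_value and check_tag are hypothetical helpers, not CloVR commands.
config_value() {
    # config_value CONFIG KEY -> prints the value; fails if KEY is absent.
    awk -F '=' -v key="$2" '
        $1 == key { sub(/^[^=]*=/, ""); val = $0; found = 1 }
        END { if (found) print val; else exit 1 }
    ' "$1"
}

check_tag() {
    # check_tag CONFIG KEY EXPECTED
    actual=$(config_value "$1" "$2") || { echo "$2 not found in $1" >&2; return 1; }
    [ "$actual" = "$3" ] || { echo "$2 is '$actual', expected '$3'" >&2; return 1; }
}

# Example, assuming the tag names used in the tagging step:
# check_tag /mnt/clovr_16S.config FASTA_TAG clovr_16S_input
# check_tag /mnt/clovr_16S.config MAPPING_TAG clovr_16S_mapping
```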
[cluster]
# Cluster name
CLUSTER_NAME=local
# Credential to use to make the cluster
CLUSTER_CREDENTIAL=local
The cluster section determines the cluster used by the CloVR-16S pipeline. This can be either an existing cluster that is already running or a new cluster that the pipeline has to create. A cluster is assigned a unique identifier as defined by the CLUSTER_NAME variable. If CloVR-16S is run locally, both CLUSTER_NAME and CLUSTER_CREDENTIAL should be “local”. If you have set up different credentials for a Cloud service (see Getting Started section above), you may set them here.
[pipeline]
# Pipeline Name
PIPELINE_NAME=ReplaceThisWithYourPipelineName
# Pipeline Description
PIPELINE_DESC=
Each pipeline run requires a unique name PIPELINE_NAME so that the CloVR system can download the correct set of output, after the pipeline has finished. This parameter is especially important if multiple pipelines are running on the same cluster. You may also optionally add a description of the pipeline by setting the PIPELINE_DESC parameter.
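One way to keep pipeline names unique is to derive them from a timestamp and write them into the configuration file from the command line. This is only a sketch: set_config_value is a hypothetical helper, the naming scheme is an arbitrary choice, and the config path is the one used in the run command of this walkthrough.

```shell
# Sketch: replace the value of a KEY=value line in the config file.
# set_config_value is a hypothetical helper, not a CloVR command.
set_config_value() {
    # set_config_value CONFIG KEY VALUE
    new_file="$1.tmp.$$"
    awk -v key="$2" -v val="$3" '
        index($0, key "=") == 1 { print key "=" val; next }  # rewrite matching line
        { print }                                            # keep everything else
    ' "$1" > "$new_file" && mv "$new_file" "$1"
}

# Example with a hypothetical naming scheme (prefix plus timestamp):
# set_config_value /mnt/clovr_16S.config PIPELINE_NAME "clovr16S_$(date +%Y%m%d_%H%M%S)"
```

Writing to a temporary file and moving it into place avoids relying on in-place editing options that differ between sed implementations.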
The rest of the configuration file represents advanced settings and should not be altered for this walkthrough.
3. Running the 16S pipeline
Now that your config file is ready, the CloVR-16S pipeline can be executed from the command-line as:
clovr16S /mnt/clovr_16S.config &
The clovr16S command launches a cluster as specified by parameters in the config file and starts the CloVR-16S pipeline.
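A thin wrapper around the launch command can catch a configuration file that still contains the placeholder pipeline name and keep whatever the command prints in a log file. The sketch below is an illustration, not part of CloVR: launch_16s, the CLOVR16S override variable and the log path are hypothetical, and whether clovr16S writes anything to standard output is not stated in this walkthrough.

```shell
# Sketch: sanity-check the config, then start the pipeline in the background.
# CLOVR16S is a hypothetical override; set it to "echo" for a dry run.
CLOVR16S=${CLOVR16S:-clovr16S}

launch_16s() {
    config=$1
    log=$2
    [ -r "$config" ] || { echo "cannot read config: $config" >&2; return 1; }
    if grep -q '^PIPELINE_NAME=ReplaceThisWithYourPipelineName$' "$config"; then
        echo "PIPELINE_NAME still holds the placeholder value" >&2
        return 1
    fi
    # Run in the background, as in the walkthrough, and keep any output.
    "$CLOVR16S" "$config" > "$log" 2>&1 &
    # Print the process ID of the background job.
    echo "$!"
}

# Example (the log path is a placeholder):
# launch_16s /mnt/clovr_16S.config /mnt/clovr_16S.launch.log
```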
4. Monitoring your pipeline
The pipeline status can be monitored by navigating to the Ergatis web interface. This requires knowing the IP address of the CloVR EC2 master node or of the local CloVR VM. The IP address is shown on the desktop of the CloVR VM or can be obtained with the vp-describe-cluster command.
This script will return a list of all available clusters. Here is example output:
*** Available Clusters ***
CLUSTER local
CLUSTER 16S_cluster
Identify the cluster that the 16S pipeline is running on and provide it to vp-describe-cluster again, but this time with the --name option:
[master <clovr_ip>]$ vp-describe-cluster --name 16S_cluster
MASTER i-571c113d ec2-72-44-39-80.compute-1.amazonaws.com running
GANGLIA http://ec2-72-44-39-80.compute-1.amazonaws.com/ganglia
ERGATIS http://ec2-72-44-39-80.compute-1.amazonaws.com/ergatis
SSH ssh -oNoneSwitch=yes -oNoneEnabled=yes -o PasswordAuthentication=no -o ConnectTimeout=30 -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -q -i /mnt/keys/devel1.pem email@example.com
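To open the Ergatis interface, only the address that follows the ERGATIS label is needed. Assuming the output keeps the whitespace-separated label and value layout of the example above, a small filter can pick it out; ergatis_url is a hypothetical helper name.

```shell
# Sketch: print the value that follows the ERGATIS label in the
# vp-describe-cluster output. ergatis_url is a hypothetical helper.
ergatis_url() {
    # Scan every whitespace-separated token and print the one after "ERGATIS".
    awk '{ for (i = 1; i < NF; i++) if ($i == "ERGATIS") { print $(i + 1); exit } }'
}

# Example, using the cluster name from the output above:
# vp-describe-cluster --name 16S_cluster | ergatis_url
```

Scanning tokens rather than whole lines keeps the filter working whether each label is printed on its own line or all labels appear on a single line.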
OUTPUT TARBALL: CloVR 16S output
Once your pipeline has run to completion, the files are automatically downloaded to your local VM and can be found in the output directory specified in the pipeline configuration file.
Depending on the characteristics of the data, some results may not be generated due to inherent computational difficulties or poor expected results. The outputs are:
| Output | Description |
|---|---|
| filtered_sequences | Sequences passing the QIIME-based poor-quality filter (filename: seqs.fna) |
| chimeras | Sequence names from seqs.fna identified as putative chimeras (filename: allchimeraids.txt) |
| uclust_otus | Table showing OTU sample compositions (RDP classifier/QIIME) |
| summary_tables | Taxonomic summary tables at various phylogenetic levels |
| rarefactions | Alpha-diversity: numerical rarefaction curves separated by sample (Mothur) |
| rarefaction_plots | Visualized rarefaction plots separated by metadata type (Leech/CloVR) |
| mothur_summary | Richness and diversity estimators (Mothur) |
| PCoA_plots | Beta-diversity: weighted and unweighted UniFrac 3D PCoA plots (QIIME) |
| skiff | Taxonomic composition-based sample heatmap clustering (Skiff/CloVR) |
| histograms | Taxonomic composition-based stacked histograms (CloVR) |
| metastats | Differentially abundant taxonomic groups (Metastats) |
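As a quick sanity check on the first two outputs, the number of putative chimeras can be compared with the number of filtered sequences. This sketch assumes that seqs.fna is a regular FASTA file with one ">" header per sequence and that allchimeraids.txt lists one sequence name per line; the second point is an assumption, as the file layout is not described here. chimera_summary is a hypothetical helper and both paths are placeholders.

```shell
# Sketch: report how many filtered sequences were flagged as chimeras.
# chimera_summary is a hypothetical helper, not a CloVR command.
chimera_summary() {
    seqs=$1
    ids=$2
    [ -f "$seqs" ] || { echo "file not found: $seqs" >&2; return 1; }
    [ -f "$ids" ] || { echo "file not found: $ids" >&2; return 1; }
    # Count FASTA header lines; grep -c still prints 0 when nothing matches.
    total=$(grep -c '^>' "$seqs") || :
    # Assumption: one sequence name per non-empty line.
    chim=$(grep -c '[^[:space:]]' "$ids") || :
    echo "$chim of $total sequences flagged as putative chimeras"
}

# Example (placeholder paths inside the output directory):
# chimera_summary <output-dir>/seqs.fna <output-dir>/allchimeraids.txt
```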