HMP-DACC CloVR Metagenomics Walkthrough

Introduction

This walkthrough provides a simple example of how to set up and run the CloVR-Metagenomics pipeline using the browser-accessible CloVR dashboard, and how to analyze the resulting outputs. We use HMP whole-genome shotgun (WGS) reads representing microbial communities extracted from the mid vagina and vaginal introitus body sites.

This walkthrough uses the cloud for computational support. Specifically, we demonstrate how to run the pipeline using the academic cloud DIAG, which is free for researchers. Alternatively, you could run the pipeline on Amazon EC2 or using other cloud computing providers.

The CloVR-Metagenomics SOP provides a detailed description of the CloVR-Metagenomics pipeline.

 

Getting started with CloVR

Installing and setting up CloVR is a one-time process. If you have done this before, you may skip to the next step – Setting up input dataset.

Install CloVR

CloVR is run using a local desktop client. Visit the Getting started with CloVR page to download and install the client. Once the CloVR virtual machine is set up and launched, you should see a screen similar to Figure 1.

Figure 1. CloVR desktop client


Start the CloVR web interface

First check the CloVR desktop window for the IP address of your virtual machine (VM). Then enter this IP address in a web browser as shown in Figure 2.

Figure 2. Accessing the CloVR web interface

 

 

Add cloud credentials to the pipeline

Visit the Adding Credentials page for steps on how to add DIAG credentials. Once your DIAG credentials are set up, you should see them listed within the credentials tab as shown in Figure 3.

Figure 3. List of credentials

 

Setting up input dataset

Prepare input datasets

Two sets of files are required as input to the CloVR-Metagenomics pipeline: multiple fasta files and a CloVR-formatted mapping file describing important information about each sample. The sample dataset used for this walkthrough consists of four fasta files (one per sample) and a metadata mapping file.

>Multiple fasta files

First download an HMP dataset and uncompress it. For this walkthrough, we use four samples – two mid vagina samples (SRS014466 and SRS015072) and two vaginal introitus samples (SRS014465 and SRS015071). For each sample, the two paired-end fastq files were concatenated. Note that the read files downloaded from the HMP data page must first be converted from fastq to fasta format, because the pipeline requires fasta-formatted input files. Several file conversion tools are freely available online.
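The concatenation and conversion steps can also be scripted. The following is a minimal sketch, not the tool used to prepare the walkthrough dataset: it assumes plain, uncompressed fastq files with exactly four lines per record, and the function name fastq_to_fasta is ours.

```python
def fastq_to_fasta(fastq_paths, fasta_path):
    """Convert one or more FASTQ files into a single FASTA file.

    Assumes simple four-line records: '@' header, sequence, '+' line, qualities.
    Passing both paired-end files concatenates them into one output file.
    Returns the number of reads written.
    """
    n_reads = 0
    with open(fasta_path, "w") as out:
        for fastq_path in fastq_paths:
            with open(fastq_path) as handle:
                while True:
                    header = handle.readline().rstrip("\n")
                    if not header:
                        break  # end of file (or trailing blank line)
                    seq = handle.readline().rstrip("\n")
                    handle.readline()  # '+' separator line, ignored
                    handle.readline()  # quality string, ignored
                    if not header.startswith("@"):
                        raise ValueError(f"unexpected FASTQ header: {header!r}")
                    # FASTA keeps the read name but swaps '@' for '>'.
                    out.write(f">{header[1:]}\n{seq}\n")
                    n_reads += 1
    return n_reads
```

For example, fastq_to_fasta(["sample_1.fastq", "sample_2.fastq"], "sample.fasta") would write both mates of a sample to one fasta file; these file names are placeholders, not the names used in the HMP download.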

>Metadata mapping file

The metadata mapping file is tab-delimited, with a series of columns designed to allow comparison of groups of interest. In this case, we are comparing the mid vagina samples to the vaginal introitus samples to see whether any taxa are differentially abundant between the two body sites.

#File	SampleName	BodySubsite_p	Description
vaginal_introitus_SRS014465.fasta	SRS014465	vaginal_introitus	Vaginal_introitus_visit_1_subject_763577454
vaginal_introitus_SRS015071.fasta	SRS015071	vaginal_introitus	Vaginal_introitus_visit_2_subject_763577454
mid_vagina_SRS014466.fasta	SRS014466	mid_vagina	Mid_vagina_visit_1_subject_763577454
mid_vaginal_SRS015072.fasta	SRS015072	mid_vagina	Mid_vagina_visit_2_subject_763577454

Table 1: Sample metadata mapping file
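A mapping file like Table 1 can be written and checked with a short script. This is a sketch under the assumption that the four columns shown in Table 1 are all that is needed; the helper names are ours, and the CloVR-Metagenomics SOP remains the reference for the full format.

```python
import csv
from collections import defaultdict

# Column names copied from Table 1.
HEADER = ["#File", "SampleName", "BodySubsite_p", "Description"]


def write_mapping_file(rows, path):
    """Write sample rows (lists of four strings) as a tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_mapping_file(path):
    """Read the file back as a list of dicts, rejecting rows with missing or extra columns."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = []
        for line_no, fields in enumerate(reader, start=2):
            if len(fields) != len(header):
                raise ValueError(
                    f"line {line_no}: expected {len(header)} columns, got {len(fields)}"
                )
            rows.append(dict(zip(header, fields)))
    return rows


def samples_by_group(rows, column="BodySubsite_p"):
    """Group sample names by the value in the given comparison column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["SampleName"])
    return dict(groups)
```

Reading the file back and grouping by BodySubsite_p is a quick way to confirm that each body site has the samples you expect before uploading the file.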

 

Prepare a metadata file specific to your samples of interest by creating a subset of the HMP metagenomics metadata master file.

Next, move the entire dataset folder (in this case, HMP_metagenomics_sample_dataset) into the user_data folder located within the clovr-standard-* image directory (Figure 4). This will enable us to easily access the data through the CloVR dashboard.

Figure 4. Move input files to "user_data" folder

 

Add input datasets to the pipeline

Before starting a pipeline, you must add your datasets to the CloVR VM as “Tags”.  To add tags, click “Add” on the web interface.

Figure 5. Adding new tags

 

Then click on “Select file from image”, which will open a sub-window where you can select one or multiple FASTA files for upload into the VM. Alternatively, you can use “Browse” in the “Upload File” window to find and select files from anywhere on your local computer, but multiple files have to be uploaded in separate steps.

Select “Nucleotide FASTA” from the “File Type” drop-down menu and name your dataset, e.g. “metagenomics_fasta”. Add an optional description of your dataset. Click “Tag” to upload the data to CloVR. A “Completed Successfully” window should appear to indicate that your dataset was added to the CloVR VM, and the new dataset should be listed under “Data Sets” on the web interface.

Figure 6. Adding a fasta dataset

 

Next, repeat the same process for the corresponding metadata mapping file. This time select “Metagenomics mapping file” from the “File Type” drop-down menu and name your dataset, e.g. “metagenomics_mapping”. Click “Tag” again to upload the data to CloVR.

Figure 7. Adding a mapping dataset

 

Each tagged dataset will appear as a “Tag” on the CloVR web interface. Multiple files will be listed under the same “Tag” name.

Figure 8. Tagged Datasets

 

 

Pipeline setup and execution

To initialize a new pipeline run, select the “Tag” corresponding to your FASTA files in the “Data Sets” window and click on the “CloVR Metagenomics” icon.

This will open the pipeline configuration window. Make sure the correct “Tag” is shown as the “Select Sequencing Dataset” and select the “Tag” corresponding to the correct metadata mapping file as the “CloVR Mapping File”. Choose a protocol with or without ORF calling. By default we do not call ORFs.

Select “DIAG” credentials from the “Account” drop-down menu.

Under “Pipeline Description”, enter a name that will let you recognize your pipeline on the “Home” page of the web interface, e.g. “HMP_Metagenomics1”.

Figure 9. Configuring the metagenomics pipeline

 

Check your input by clicking “validate”. If the validation is successful, start the pipeline by clicking “submit”.

After a successful pipeline submission, the web interface will change to the “Home” page where the new pipeline will be listed as “Status: running.”

 

Monitoring the pipeline

Your pipeline should now appear in the Pipelines window in the CloVR dashboard along with its status. Occasionally, the pipeline may idle for a minute or two before running. You can click on the pipeline to get a description, input parameters, and hyperlinks to more advanced workflow interfaces like Ergatis. Clicking on the [Pipeline #] headers in the “Pipeline Information” window will open the Ergatis “Workflow creation and monitoring interface” in a separate browser window, which provides useful information for troubleshooting of failed pipeline runs.

Figure 10. Pipeline status

 

Accessing the outputs

Once the pipeline completes, the results can be downloaded from the CloVR dashboard by clicking on the Outputs tab (Figure 11). All result files are created as compressed archives (.tar.gz), which can be extracted using the Finder in Mac OS X, the tar utility in Unix, or programs such as WinZip or WinRAR in Windows.

Figure 11. Accessing output files
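The downloaded archives can also be unpacked from a script. Below is a minimal sketch using Python's standard tarfile module; the function name extract_results is ours, and nothing is assumed about the contents of the archives beyond the .tar.gz format stated above.

```python
import tarfile
from pathlib import Path


def extract_results(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        for name in names:
            # Refuse absolute paths or '..' components that would write outside dest_dir.
            if Path(name).is_absolute() or ".." in Path(name).parts:
                raise ValueError(f"unsafe path in archive: {name}")
        archive.extractall(dest)
    return names
```

Calling extract_results once per downloaded archive, each with its own destination folder, keeps the functional and taxonomic outputs separate.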

 

Examining the outputs

The CloVR-Metagenomics pipeline outputs the following files:

Output	Description
read_mapping	A text file displaying the one-to-one mapping of sequence names created in the pipeline.
uclust_clusters	Raw text output from uclust runs.
artificial_replicates	A list of read names that were found to be artificial replicates from the sequencing platform.
blast_functional	Raw output of BLAST hits of representative sequences to a functional DB.
tables_functional	Summary tables of functional categories for each sample.
piecharts_functional	Visualized piecharts for functional groups.
skiff_functional	Output of skiff clusterings for different functional levels.
metastats_functional	Output of Metastats analysis comparing subject groups or samples at different functional levels.
histograms_functional	Visualized stacked histograms of functional annotations.
blast_taxonomy	Raw output of BLAST hits of representative sequences to a taxonomic DB.
tables_taxonomy	Summary tables of taxonomy groups for each sample.
piecharts_taxonomy	Visualized piecharts for taxonomic groups.
skiff_taxonomy	Output of skiff clusterings for different taxonomic levels.
metastats_taxonomy	Output of Metastats analysis comparing subject groups or samples at different taxonomic levels.
histograms_taxonomy	Visualized stacked histograms of taxonomic annotations.

 

Let’s take a look at some of the outputs to see what we can learn.

 

Figure 12. Stacked histograms output describing functional abundances in each sample

 

Figure 12 shows a stacked histogram describing the relative abundances of specific functional annotations. We can tell immediately from this figure that some functions (such as amino acid transport/metabolism, carbohydrate transport/metabolism, translation/ribosome and replication/recombination) are more abundant than others across all samples.
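Relative abundance here means each category's share of a sample's annotated reads, so the bars for every sample sum to 1. The sketch below illustrates that arithmetic only; the counts in the usage note are invented, and the layout of the pipeline's summary tables is not assumed.

```python
def relative_abundances(counts):
    """Turn per-category read counts for one sample into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no annotated reads")
    return {category: n / total for category, n in counts.items()}
```

For instance, made-up counts of 30 reads for one category and 10 for another give fractions of 0.75 and 0.25, which is what the height of each coloured segment in a stacked histogram represents.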

 

Figure 13. Skiff clusterings at the phylum level

 

A skiff plot showing clustered samples and taxa is shown in Figure 13. Skiff plots combine several types of information to allow the user to quickly determine which samples are most similar across phylogenetic profiles. A heatmap describes the relative abundance of each taxonomic class in each sample. In this plot, we see that the samples from the two sample types separate fairly well, though not perfectly. Skiff plots are output for all taxonomic levels, including phylum, class, order, family, and genus.

Additional outputs are described in the CloVR-Metagenomics SOP.