HMP-DACC CloVR 16S Walkthrough

Introduction

This walkthrough provides a simple example of how to set up and run the CloVR-16S pipeline using the browser-accessible CloVR dashboard, and how to analyze the resulting outputs. We use 16S rRNA amplicon sequences representing microbial communities sampled from 12 hard-palate and 12 attached-keratinized gingiva oral environments.

Getting started with CloVR

CloVR is run using a local desktop client. Visit the Download and Getting started with CloVR pages to download and install the client.

Specifying input data

  1. First download an HMP dataset and uncompress it. The example data used for this walkthrough (10.9 MB) consists of 24 fasta files (1 per sample), and a single metadata mapping file describing important information about each sample. The metadata mapping file is tab-delimited with a series of columns; we have designed it to allow for comparison of groups of interest. In this case, we are comparing the hard palate samples to the attached keratinized gingiva samples to see if there are differentially abundant taxa between the two environments.
  2. Next, move the entire folder HMPtestset1 into the user_data folder located within the clovr-standard-* image directory (see Figure 4 for an example). This will enable us to easily access the data through the CloVR dashboard.

Figure 4 Move the HMPtestset1 folder into the user_data directory within the clovr-standard* directory. In this example, there are two separate datasets in the user_data folder: Cheese_632 and HMPtestset1.
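
Before tagging the data, it can be useful to confirm how many samples fall into each comparison group. The sketch below parses a tab-delimited mapping table and counts samples per group. The column names (#SampleID, BodySite) and the three-row excerpt are hypothetical stand-ins, not the actual contents of HMP_Oral_Comp1.map.txt; substitute the headers found in your own mapping file.

```python
import csv
import io
from collections import Counter


def count_samples_by_group(mapping_text, group_column):
    """Count samples per group in a tab-delimited mapping table with a header row."""
    reader = csv.DictReader(io.StringIO(mapping_text), delimiter="\t")
    return Counter(row[group_column] for row in reader)


# Hypothetical excerpt: the real mapping file has 24 samples and its own headers.
example_mapping = (
    "#SampleID\tBodySite\n"
    "S1\thard_palate\n"
    "S2\thard_palate\n"
    "S3\tattached_keratinized_gingiva\n"
)

group_counts = count_samples_by_group(example_mapping, "BodySite")
for group, n in sorted(group_counts.items()):
    print(group, n)
```

For a file on disk, the same function can be given the text returned by reading the mapping file. For the walkthrough dataset, each of the two groups should contain 12 samples.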

  1. To add (or "tag", as we say) files for input to a pipeline, first select the “Data Sets” tab in the upper left corner of the dashboard, then select “Add” at the bottom of the corresponding left panel. This will bring up a new window to add data.
  2. In this new window, first click the “Select file from image” button to access the user_data folder where our data lives. You can select all fasta files at once by clicking the checkbox next to the fastas folder within the HMPtestset1 directory. Set the file type to Nucleotide fasta, then name and describe your data. You can pick any name you like, but it may be easier to use the same name as in Figure 5. Finally, click the “tag” button to add the data.
  3. Repeat the same procedure for the single mapping file (HMP_Oral_Comp1.map.txt) as shown in Figure 6. At this point, both datasets should appear in the left side panel of the CloVR dashboard, organized by data type.

Now that the data we want to analyze has been tagged, we can set up and run the CloVR-16S pipeline.

Figure 5 Tagging fasta files in CloVR. In this example, we’ve named the dataset HMPtestset1_fastafiles. Because we’ve put the fasta files in a single separate directory, we can select the entire directory in a single click.

 

Figure 6 Tagging the metadata mapping file in CloVR. In this example, we’ve named it HMPtestset1_metamap.

Pipeline setup and execution

  1. Select the CloVR-16S button in the upper right region of the CloVR dashboard. This will bring up a form in the panel below for choosing the tagged datasets and parameters of the pipeline. Note that the standard operating procedure and a description of the pipeline are available at http://clovr.org/methods/clovr-16s/. For the next steps, you can follow along with Figure 7.
  2. Choose the HMPtestset1_fastafiles dataset (or whatever you named these fasta files) from the first menu in the form. Similarly, select the corresponding CloVR mapping file by clicking the “Change” button next to the form. In this example we do not use quality scores, but they are available as an option.
  3. The CloVR-16S pipeline can be executed with or without computationally intensive chimera checking. If you want the pipeline in this example to run very quickly, select the option without chimera checking; otherwise, select the button to employ chimera checking.

Figure 7 Setting up the CloVR-16S pipeline. This form allows the user to select the input datasets, set parameters for the pipeline, choose to run locally or on a cloud, and provide a short description for the pipeline.

  1. In the box next to the Account label, you can choose to run the pipeline locally (on your own machine) or, if you have credentials set up (see prior section), on a cloud (e.g. DIAG). In Figure 7, we have named our DIAG credentials jdiag. Also, give your pipeline a description that makes sense to you.
  2. To have CloVR check your input files for consistency, select the Validate button at the bottom of the panel. If this succeeds, then select the Submit button to execute the pipeline.
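
The checks that the Validate button performs are not described in this walkthrough. As an illustration only, one kind of consistency problem you can look for by hand is a fasta file with no matching row in the mapping file, or the reverse. The sketch below assumes that each fasta file is named after its sample ID (for example S1.fasta for sample S1); that naming convention and the sample names are hypothetical and may not match your data.

```python
from pathlib import PurePath


def find_mismatches(fasta_names, mapping_sample_ids):
    """Return (fasta files lacking a mapping row, mapping rows lacking a fasta file).

    Assumes, hypothetically, that a fasta file name minus its extension
    is the sample ID used in the mapping file.
    """
    fasta_ids = {PurePath(name).stem for name in fasta_names}
    mapped_ids = set(mapping_sample_ids)
    return sorted(fasta_ids - mapped_ids), sorted(mapped_ids - fasta_ids)


# Hypothetical inputs for illustration.
unmapped, missing = find_mismatches(
    ["S1.fasta", "S2.fasta", "S4.fasta"],
    ["S1", "S2", "S3"],
)
print("fasta without mapping row:", unmapped)
print("mapping row without fasta:", missing)
```

Both lists being empty means every sample appears in both inputs under this naming assumption.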

Monitoring the pipeline

Your pipeline should now appear in the Pipelines window of the CloVR dashboard along with its status (Figure 8). Occasionally, the pipeline may idle for a minute or two before running. You can click on the pipeline to see its description, input parameters, and hyperlinks to more advanced workflow interfaces such as Ergatis (Figure 9). Once the pipeline completes, the results can be downloaded from this window by clicking on the Outputs tab (Figure 10).

Figure 8 Running, failed, idle, and complete pipelines are shown in the major panel.

Figure 9 Clicking on a pipeline brings up a window describing the input datasets, parameters and other important information. Selecting the hyperlinks at the top of this window will take you to the Ergatis workflow monitoring interface to show where the pipeline is in its execution.

Figure 10 Once a pipeline finishes, the results are downloaded (if the pipeline was run on the cloud) and made available through the CloVR dashboard.

Examining the outputs

Let’s take a look at some of the outputs to see what information we can gather.

 

Figure 11 Rarefaction plots output by the CloVR-16S pipeline. Samples can be automatically colored by groups defined in the metadata mapping file provided by the user.

For an initial assessment of the data, it can be helpful to look at the alpha-diversity of each sample using rarefaction curves. CloVR-16S computes and visualizes these curves using information provided in the metadata mapping file. Two of the rarefaction plots output by the pipeline are shown in Figure 11. We see that some samples appear more than twice as diverse as others, and that the number of high-quality sequences per sample varies widely.
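
To make the idea of a rarefaction curve concrete, here is a minimal sketch that repeatedly subsamples a sample's reads without replacement and records the mean number of distinct taxa observed at each depth. It is not the pipeline's implementation; the function name, the read labels, and the depths are all hypothetical.

```python
import random


def rarefaction_curve(read_labels, depths, trials=10, seed=0):
    """Mean number of distinct labels seen in random subsamples of each depth."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same curve
    curve = []
    for depth in depths:
        richness = [
            len(set(rng.sample(read_labels, depth)))  # sample without replacement
            for _ in range(trials)
        ]
        curve.append(sum(richness) / trials)
    return curve


# Hypothetical sample: 100 reads assigned to 4 taxa of uneven abundance.
reads = ["otu_a"] * 50 + ["otu_b"] * 30 + ["otu_c"] * 15 + ["otu_d"] * 5
depths = [1, 10, 50, 100]
curve = rarefaction_curve(reads, depths)
for depth, richness in zip(depths, curve):
    print(depth, richness)
```

A curve that is still climbing at the largest depth suggests that deeper sequencing would reveal more taxa, which is why samples with very different read counts are compared at a common depth.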

 

Figure 12 Stacked histograms describing taxonomic group abundances in each sample.

A stacked histogram describing the relative abundances of taxonomic groups is shown in Figure 12. We can tell immediately from this figure that a few phyla dominate all samples: Proteobacteria, Actinobacteria, Bacteroidetes, and Firmicutes. We can also see that Actinobacteria tends to be less abundant in attached-keratinized gingiva samples, with the exception of one sample that may be an outlier.
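
The bars in such a plot are relative abundances: each taxon's read count divided by the sample's total. A minimal sketch of that normalization follows; the counts are invented for illustration and are not taken from the walkthrough dataset.

```python
def relative_abundances(counts):
    """Convert per-taxon read counts for one sample into fractions summing to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no classified reads")
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical phylum-level counts for a single sample.
sample_counts = {
    "Firmicutes": 600,
    "Proteobacteria": 250,
    "Bacteroidetes": 100,
    "Actinobacteria": 50,
}
fractions = relative_abundances(sample_counts)
for taxon, fraction in fractions.items():
    print(f"{taxon}: {fraction:.1%}")
```

Normalizing each sample separately is what allows samples with very different sequencing depths to be drawn side by side on the same scale.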

 

Figure 13 Skiff unsupervised clustering visualization. Skiff plots combine several types of information to allow the user to quickly determine which samples are most similar across phylogenetic profiles. A heatmap describes the relative abundance of each taxonomic class in each sample. In this case, the values in the heatmap represent the base-10 logarithm of the proportion (i.e., -1 ≈ 10%, -2 ≈ 1%, -3 ≈ 0.1%).

Finally, a skiff plot showing clustered samples and taxa is shown in Figure 13. In this plot we see that the two sample types separate fairly well, though not perfectly. Skiff plots are output for all taxonomic levels including phylum, class, order, family, and genus.
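
The heatmap scale in Figure 13 can be reproduced with a base-10 logarithm of each proportion. The sketch below shows the conversion. Because the logarithm of zero is undefined, it clips proportions at a small floor; that floor (1e-4 here) is an assumption made for this example, and the way Skiff itself handles absent taxa is not stated in this walkthrough.

```python
import math


def log10_proportion(proportion, floor=1e-4):
    """Base-10 log of a proportion, clipped at a hypothetical floor to avoid log(0)."""
    return math.log10(max(proportion, floor))


# 10%, 1%, and 0.1% correspond to the -1, -2, and -3 levels named in the caption;
# the final value of 0.0 shows the effect of the floor.
for p in (0.10, 0.01, 0.001, 0.0):
    print(f"{p:.3%} -> {log10_proportion(p):.1f}")
```

On this scale, each step of one unit corresponds to a tenfold change in relative abundance, which keeps rare taxa visible next to dominant ones.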

Additional outputs are described in the CloVR-16S SOP, available at http://clovr.org.