HMP-DACC CloVR Gene Clustering Walkthrough

Introduction

This walkthrough provides a simple example of how to set up and run the CloVR gene clustering pipeline using the CloVR dashboard, which is accessed through a web browser. The pipeline takes as input gene predictions from metagenomic whole-genome shotgun (WGS) sequences and generates a non-redundant gene set. This version of the pipeline uses USEARCH 32-bit v6.0.307. The HMP Non-Redundant Clustered Gene Index protocol provides a detailed description of the pipeline.

This walkthrough demonstrates how to run the pipeline on a local computer or using the academic cloud DIAG, which is free for researchers. Alternatively, you could run the pipeline on Amazon EC2 or using other cloud computing providers. For this walkthrough, we shall utilize the predicted gene set from the mid-vagina body site.

If you use this pipeline in published work, please cite USEARCH and CloVR. Also note that this pipeline is limited to 4 GB of memory or less. Processing a larger dataset requires a paid USEARCH 64-bit license.

 

Note: This pipeline is not yet available in the latest CloVR release (clovr-1.0-RC5, Nov. 2012). For now, you can access this pipeline by requesting an account at www.diagcomputing.org. After you log in, select “Start CloVR” from the “My Account” drop-down. This will launch a new CloVR VM. Continue with the walkthrough from the “Add datasets to the pipeline” step.

 

Getting started with CloVR

Installing and setting up CloVR is a one-time process. If you have done this before, you may skip to the next step – Setting up input dataset.

Install CloVR

CloVR is run using a local desktop client. Visit the Getting started with CloVR page to download and install the client. Once the CloVR virtual machine is set up and launched, you should see a screen similar to Figure 1.

Figure 1. CloVR desktop client

 

Start the CloVR web interface

First check the CloVR desktop window for the IP address of your virtual machine (VM). Then enter this IP address in a web browser as shown in Figure 2.

Figure 2. Accessing the CloVR web interface

 

Add cloud credentials to the pipeline

Visit the Adding Credentials page for steps on how to add DIAG credentials. Once your DIAG credentials are set up, you should see them listed within the credentials tab as shown in Figure 3.

Figure 3. Credentials

 

Setting up input dataset

Prepare input datasets

The first step is to move your input data files into the user_data folder located within the clovr-standard-* image directory. This will enable you to easily access the data through the CloVR dashboard.

Figure 4. Move input files to “user_data” folder. In this example, the input is sample mid_vagina_genes.fasta

 

Add datasets to the pipeline

Before starting a pipeline, you must add your datasets to the CloVR VM as “Tags”.  To add tags, click “Add” on the web interface.

Figure 5. Adding new tags

Then click on “Select file from image”, which will open a sub-window where you can select a FASTA file for upload into the VM.

Alternatively, you can use “Browse” in the “Upload File” window to find and select files from anywhere on your local computer.

Select “Protein FASTA” or “Nucleotide FASTA” from the “File Type” drop-down menu and name your dataset, e.g. as “midvagina_genes”. Add an optional description of your dataset. Click “Tag” to upload the data to CloVR. A “Completed Successfully” window should appear to indicate that your dataset was added to the CloVR VM.

Figure 6. Adding a FASTA dataset

 

The new dataset will be listed under the “Data Sets” tab on the CloVR web interface.

 

Figure 7. Tagged Dataset

Pipeline setup and execution

To initialize a new pipeline run, click on the “Other Protocols” drop-down as shown in Figure 8. Then select “clovr_gene_clustering”.

 

Figure 8. Starting a new pipeline

 

This will open the pipeline configuration window.

From the drop-down, select the tag corresponding to the input file: “midvagina_genes”. Select “DIAG” or “local” credentials from the “Account” drop-down menu.

Then set the following parameters:

Parameter: Description
Identity threshold: The minimum identity between the query sequence and the target sequence. In this case, the target sequence is the centroid. Ranges from 0.0 to 1.0.
Minimum sequence length: The minimum length for sequences to be kept. Shorter sequences will be discarded.
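To make the effect of the minimum sequence length parameter concrete, here is a minimal Python sketch of a length filter on FASTA records. It only illustrates the idea and is not the pipeline's own code; the function names read_fasta and filter_by_length are hypothetical, and it assumes that a sequence whose length equals the minimum is kept.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length: int):
    """Keep records whose sequence length is at least min_length.

    Assumption: a sequence exactly min_length long is kept, since only
    shorter sequences are described as discarded.
    """
    return [(h, s) for h, s in records if len(s) >= min_length]
```

For example, filter_by_length(read_fasta(open("mid_vagina_genes.fasta")), 100) would return only the records that are at least 100 residues long.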

Provide a name to recognize your pipeline in the web interface home page as “Pipeline Description”, e.g. “Clustering Mid vagina genes”.

Figure 9. Configuring a Gene Clustering pipeline

 

Check your input by clicking “validate”. If the validation is successful, start the pipeline by clicking “Run”.

After a successful pipeline submission, the web interface will change to the “Home” page where the new pipeline will be listed as “Status: running.”

 

Monitoring the pipeline

Your pipeline should now appear in the Pipelines window in the CloVR dashboard along with its status. Occasionally, the pipeline may be idle for a minute or two before running. You can click on the pipeline to get a description, input parameters, and hyperlinks to more advanced workflow interfaces like Ergatis. Clicking on the [Pipeline #] headers in the “Pipeline Information” window will open the Ergatis “Workflow creation and monitoring interface” in a separate browser window, which provides useful information for troubleshooting of failed pipeline runs.

Figure 10. Pipeline status

 

Accessing the outputs

Once the pipeline completes, the results can be downloaded from the CloVR dashboard by clicking on the Outputs tab (see Figure 11). All results files are created as compressed archives (.tar.gz), which can be extracted using the Finder in Mac OS X, the tar utility in Unix, or programs such as WinZip or WinRAR in Windows.
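If you prefer to unpack a downloaded archive from a script instead of a desktop tool, the following Python sketch uses the standard tarfile module. The function name extract_results and any file names you pass to it are hypothetical; the actual archive name depends on what you download from the Outputs tab.

```python
import tarfile
from pathlib import Path


def extract_results(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names
```

Calling extract_results("downloaded_output.tar.gz", "results") would create a results folder holding the extracted files and return the list of names stored in the archive. Only extract archives that you downloaded from your own CloVR VM.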

 

Figure 11. Accessing output files

 

Due to memory limitations of the free version of USEARCH, this pipeline only accepts input files containing 10 million sequences or fewer. For files with more sequences, you may need to split the file and run the parts through the pipeline in an iterative fashion.
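The splitting step can be sketched in Python as shown below. This is an illustration under stated assumptions, not a CloVR tool: the function name split_fasta and the part-file naming scheme are hypothetical, and the sketch covers only the splitting, not how the results of the separate runs are combined afterwards.

```python
from pathlib import Path


def split_fasta(input_path, output_dir, max_sequences=10_000_000):
    """Split a FASTA file into parts with at most max_sequences records each.

    Returns the part file paths in the order they were written.
    """
    if max_sequences < 1:
        raise ValueError("max_sequences must be at least 1")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(input_path).stem
    parts = []
    out = None
    count = 0
    try:
        with open(input_path) as src:
            for line in src:
                if line.startswith(">"):
                    # Start a new part at the first record or when the
                    # current part is full.
                    if out is None or count == max_sequences:
                        if out is not None:
                            out.close()
                        part = out_dir / f"{stem}.part{len(parts) + 1}.fasta"
                        parts.append(part)
                        out = open(part, "w")
                        count = 0
                    count += 1
                # Lines that appear before the first header are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Each part can then be copied to the user_data folder and tagged as its own dataset, following the same steps used for the original input file.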

 

Examining the outputs

The CloVR gene clustering pipeline outputs the following files:

Output: Description
nonredundant_fasta: Non-redundant gene set in FASTA format.
clusters_and_hits: USEARCH cluster format file. Tab-separated text file of clusters, hits and details.
stats: Clustering statistics such as number of sequences, number of clusters, etc.
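As a starting point for exploring the clusters_and_hits file, the Python sketch below counts how many sequences fall into each cluster. It assumes the file follows the usual USEARCH cluster (.uc) layout, in which the first tab-separated column is the record type (S for a centroid, H for a hit) and the second is the cluster number. Check this against your own output before relying on it; the function name cluster_sizes is hypothetical.

```python
from collections import Counter


def cluster_sizes(uc_lines):
    """Count member sequences per cluster from .uc-style lines.

    Assumption: column 1 is the record type and column 2 the cluster
    number. S (centroid) and H (hit) records each stand for one member
    sequence; other record types are ignored.
    """
    sizes = Counter()
    for line in uc_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        record_type, cluster_id = fields[0], fields[1]
        if record_type in ("S", "H"):
            sizes[int(cluster_id)] += 1
    return dict(sizes)
```

Passing an open file handle for clusters_and_hits to cluster_sizes returns a dictionary that maps each cluster number to its number of member sequences, which can be compared with the totals reported in the stats file.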