HMP-DACC CloVR DigiNorm Walkthrough

Introduction

This walkthrough provides a simple example of how to set up and run the CloVR DigiNorm pipeline using the browser-based CloVR dashboard. The pipeline uses the DigiNorm algorithm to normalize the dataset, substantially reducing its size without any significant impact on the assemblies that will be generated. For a detailed description of the DigiNorm algorithm, please click here.
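To give a feel for what normalization does, here is a minimal Python sketch of the general digital normalization idea: a read is kept only while the median count of its k-mers, among the reads kept so far, is still below a coverage cutoff. This is an illustration, not the code the CloVR pipeline runs. The function names, the default values of `k` and `cutoff`, and the use of an exact in-memory counter are all assumptions made for this example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count so far is below cutoff.

    Illustrative only: exact counting in a Counter will not scale to a
    full WGS dataset, and the default k and cutoff are assumptions.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, so it has no k-mers
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add coverage
    return kept
```

Reads from heavily covered regions stop being added once their k-mers have been seen often enough, which is where the size reduction comes from; reads from rarely covered regions are all retained.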

For this walkthrough, we use a sample dataset from the HMP Illumina WGS Reads – Sample SRS018671.

This pipeline can require 15 GB of memory or more, so in most cases additional computational resources are needed. This walkthrough uses the cloud for computational support. Specifically, we demonstrate how to run the pipeline using the academic cloud DIAG, which is free for researchers. Alternatively, you could run the pipeline on Amazon EC2 or with another cloud computing provider.

Note: This pipeline is not yet available in the latest CloVR release (clovr-1.0-RC5, Nov. 2012). For now, you can access this pipeline by requesting an account at www.diagcomputing.org. After you log in, select “Start CloVR” from the “My Account” drop-down. This will launch a new CloVR VM. Continue with the walkthrough from the “Add datasets to the pipeline” step.

Getting started with CloVR

Installing and setting up CloVR is a one-time process. If you have done this before, you may skip to the next step – Setting up input dataset.

Alternatively, you could run this pipeline through DIAG. Simply request an account at www.diagcomputing.org. Once you log in, click on “My Account”, then “Start CloVR”. A CloVR VM will start up, and you can skip to the next step – Setting up input dataset.

Install CloVR

CloVR is run using a local desktop client. Visit the Getting started with CloVR page to download and install the client. Once the CloVR virtual machine is set up and launched, you should see a screen similar to Figure 1.

Figure 1. CloVR desktop client

 

Start the CloVR web interface

First check the CloVR desktop window for the IP address of your virtual machine (VM). Then enter this IP address in a web browser as shown in Figure 2.

Figure 2. Accessing the CloVR web interface

 

Add cloud credentials to the pipeline

Visit the Adding Credentials page for steps on how to add DIAG credentials. Once your DIAG credentials are set up, you should see them listed in the credentials tab as shown in Figure 3.

Figure 3. Credentials

Setting up input dataset

Prepare input datasets

The input to this pipeline is a pair of paired-end read files in FASTQ format. The first step is to move your input data files into the user_data folder located within the clovr-standard-* image directory. This will enable you to easily access the data through the CloVR dashboard. You can skip this step if you’re running this pipeline through www.diagcomputing.org or if you plan to upload the input data directly from your computer (more on this below).

Figure 4. Move input files to the “user_data” folder. In this example, the input is a pair of paired-end read files: seq.1.fastq and seq.2.fastq
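Before moving or uploading the files, it can be worth confirming that the two FASTQ files really form a pair. The Python sketch below is one possible check, not part of CloVR: it assumes uncompressed FASTQ files with exactly four lines per record and no trailing blank lines, and the function names are hypothetical.

```python
def count_fastq_records(path):
    """Count records in an uncompressed, four-line-per-record FASTQ file."""
    n_lines = 0
    with open(path) as handle:
        for n_lines, line in enumerate(handle, start=1):
            # The first line of every record must be an '@' header.
            if n_lines % 4 == 1 and not line.startswith("@"):
                raise ValueError(f"{path}: line {n_lines} is not a FASTQ header")
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def check_pair(forward, reverse):
    """Return the shared record count, or raise if the two files disagree."""
    n_fwd = count_fastq_records(forward)
    n_rev = count_fastq_records(reverse)
    if n_fwd != n_rev:
        raise ValueError(f"pair mismatch: {n_fwd} vs {n_rev} records")
    return n_fwd
```

For the files in this example, the call would be `check_pair("seq.1.fastq", "seq.2.fastq")`. A mismatch usually means one of the files was truncated or filtered separately from its mate.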


Add datasets to the pipeline

Before starting a pipeline, you must add your datasets to the CloVR VM as “Tags”. To add tags, click “Add” on the web interface.

Figure 5. Adding new tags

Then click on “Select file from image”, which will open a sub-window where you can select files for upload to the VM. Select the first paired end file to tag.

Alternatively, you can use “Browse” in the “Upload a file” window to find and select files from anywhere on your local computer. If you’re running this pipeline through www.diagcomputing.org, use the “Upload a file” option.

Select “Nucleotide FASTQ” from the “File Type” drop-down menu and name your dataset, e.g. “SRS018671_1”. Optionally, add a description of your dataset. Click “Tag” to upload the data to CloVR. A “Completed Successfully” window should appear to indicate that your dataset was added to the CloVR VM. Repeat this step to tag the second paired-end file.

Now that the data we want to analyze has been tagged, we can set up and run the CloVR-diginorm pipeline.

Figure 6. Tagging dataset

 

Pipeline setup and execution

To initialize a new pipeline run, click on the “Other Protocols” drop-down as shown in the figure below. Then select “clovr_diginorm”.

Figure 7. Starting a DigiNorm pipeline

 

This will open the pipeline configuration window. Select the Tag corresponding to the input dataset file(s).

Select the appropriate credentials from the “Account” drop-down menu. In this case, we’re using DIAG. If you’re running this pipeline through www.diagcomputing.org, select “local”.

Under “Pipeline Description”, provide a name that will identify your pipeline on the web interface home page, e.g. “diginorm_test”.

Figure 8. Configuring a diginorm pipeline

 

Check your input by clicking “validate”. If the validation is successful, start the pipeline by clicking “Run”.

After a successful pipeline submission, the web interface will switch to the “Home” page, where the new pipeline will be listed as “Status: running”.

Monitoring the pipeline

Your pipeline should now appear in the Pipelines window in the CloVR dashboard along with its status (Figure 9). Occasionally, the pipeline may idle for a minute or two before running. You can click on the pipeline to get a description, input parameters, and hyperlinks to more advanced workflow interfaces like Ergatis (Figure 10). Additionally, once the pipeline completes, the results can be downloaded from this window by clicking on the Outputs tab (Figure 11). The output consists of two files representing the normalized dataset.

Figure 9.  Running, failed, idle, and complete pipelines are shown in the major panel.

Figure 10. Ergatis view of running pipeline

 

Figure 11. Download Diginorm pipeline output
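
Once the two normalized files are downloaded, a quick way to see how much the dataset shrank is to compare read counts before and after. The Python sketch below assumes uncompressed FASTQ files with four lines per read; the function names and the output file name in the usage comment are hypothetical, since the actual output names depend on your pipeline run.

```python
def count_reads(path):
    """Count reads in an uncompressed FASTQ file (four lines per read)."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


def percent_removed(reads_before, reads_after):
    """Percentage of reads dropped by normalization."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    return 100.0 * (reads_before - reads_after) / reads_before


# Example usage (the output file name is a placeholder, not a real CloVR name):
# before = count_reads("seq.1.fastq")
# after = count_reads("normalized.1.fastq")
# print(f"{percent_removed(before, after):.1f}% of reads removed")
```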