This walkthrough provides a simple example of how to set up and run the Metagenomics Assembly pipeline using the browser-accessible CloVR dashboard. The pipeline generates a “Pretty Good Assembly”: a reasonable attempt at reconstructing pieces of the organisms present in the community that are long enough to allow gene finding and other downstream analyses. This version of the pipeline uses SOAPdenovo v.1.04. The HMP Whole-Metagenome Assembly protocol provides a detailed description of the pipeline.
For this walkthrough, we use a sample from the HMP Anterior Nares body site (Sample SRS019215).
This walkthrough uses the cloud for computational support. Specifically, we demonstrate how to run the pipeline using the academic cloud DIAG, which is free for researchers. Alternatively, you could run the pipeline on Amazon EC2 or using other cloud computing providers.
Getting started with CloVR
Installing and setting up CloVR is a one-time process. If you have done this before, you may skip to the next step – Setting up input dataset.
CloVR is run using a local desktop client. Visit the Getting started with CloVR page to download and install the client. Once the CloVR virtual machine is set up and launched, you should see a screen similar to Figure 1.
Start the CloVR web interface
First check the CloVR desktop window for the IP address of your virtual machine (VM). Then enter this IP address in a web browser as shown in Figure 2.
Add cloud credentials to the pipeline
Visit the Adding Credentials page for steps on how to add DIAG credentials. Once your DIAG credentials are set up, you should see them listed within the credentials tab as shown in Figure 3.
Setting up input dataset
Prepare input datasets
The first step is to move your input data files into the user_data folder located within the clovr-standard-* image directory. This will enable you to easily access the data through the CloVR dashboard.
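The copy step above can also be scripted. The following is a minimal sketch, assuming Python 3 is available on the host machine; the function name `stage_inputs` and the paths in the usage comment are hypothetical and must be replaced with the actual location of your clovr-standard-* image directory.

```python
import shutil
from pathlib import Path


def stage_inputs(source_files, user_data_dir):
    """Copy input files into the CloVR user_data folder.

    Returns the list of destination paths. The folder must already exist,
    since it is created when the CloVR image is unpacked.
    """
    dest = Path(user_data_dir)
    if not dest.is_dir():
        raise FileNotFoundError(f"user_data folder not found: {dest}")
    staged = []
    for src in map(Path, source_files):
        target = dest / src.name
        shutil.copy2(src, target)  # copies file contents and timestamps
        staged.append(target)
    return staged


# Example call (hypothetical paths):
# stage_inputs(["reads/mate_1.fastq", "reads/mate_2.fastq"],
#              "/path/to/clovr-standard-image/user_data")
```

Copying rather than moving leaves the original files untouched in case the transfer needs to be repeated.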
Add datasets to the pipeline
Before starting a pipeline, you must add your datasets to the CloVR VM as “Tags”. To add tags, click “Add” on the web interface.
Then click on “Select file from image”, which will open a sub-window where you can select a FASTQ file for upload into the VM.
Alternatively, you can use “Browse” in the “Upload File” window to find and select files from anywhere on your local computer.
Select “Nucleotide FASTQ” from the “File Type” drop-down menu and name your dataset, e.g. “mate_1”. Add an optional description of your dataset. Click “Tag” to upload the data to CloVR. A “Completed Successfully” window should appear to indicate that your dataset was added to the CloVR VM.
Repeat this tagging process for the second mate file. If you have a singleton file, tag it using the same process.
The new datasets will be listed under the “Data Sets” tab on the CloVR web interface.
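Before tagging paired files, it can be useful to confirm that both mate files contain the same number of reads. The sketch below is not part of CloVR; it assumes the common four-lines-per-record FASTQ layout with no trailing blank lines, and the function names are illustrative.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    # Transparently handle gzip-compressed input based on the file extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def mates_match(mate1_path, mate2_path):
    """Return True when both mate files hold the same number of reads."""
    return count_fastq_reads(mate1_path) == count_fastq_reads(mate2_path)


# Example call (hypothetical file names):
# mates_match("mate_1.fastq", "mate_2.fastq")
```

A mismatch usually means one of the files was truncated or filtered independently of its mate, which is worth resolving before starting the pipeline.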
Pipeline setup and execution
To initialize a new pipeline run, click on the “Other Protocols” drop-down as shown in the figure below. Then select “clovr_metagenomics_assembly”.
This will open the pipeline configuration window. Select the tags corresponding to the input datasets. Then set the following parameters:
|Maximal read length||Read length of the sequencing data. Use 100 bp for HMP Illumina WGS reads|
|Average insert size||An estimate of the library insert size, usually obtained during data processing|
|Forward/Reverse Library||0 or 1. Select 1 if reads need to be reverse-complemented|
|ASM flags||Indicates which assembly step(s) the reads are used for: contigging and/or scaffolding|
|Pair num cutoff||Number of mates needed to scaffold across a gap|
|Map length||Minimum length of read mapping to a contig|
|K-mer size||Allows odd number between 13 and 127. Use 25 for HMP data|
|Size limit||Report scaffolds above this size. 300 was used for HMP datasets|
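The constraints stated in the table can be checked before entering values in the configuration window. The following sketch encodes only the rules given above (a 0/1 library flag and an odd k-mer size between 13 and 127) plus basic positivity checks; the function and argument names are hypothetical and do not correspond to CloVR or SOAPdenovo option names.

```python
def check_assembly_params(max_read_len, avg_insert_size, reverse_library,
                          kmer_size, size_limit):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if max_read_len <= 0:
        problems.append("Maximal read length must be positive")
    if avg_insert_size <= 0:
        problems.append("Average insert size must be positive")
    if reverse_library not in (0, 1):
        problems.append("Forward/Reverse Library must be 0 or 1")
    # The table allows odd k-mer sizes between 13 and 127.
    if not 13 <= kmer_size <= 127 or kmer_size % 2 == 0:
        problems.append("K-mer size must be an odd number between 13 and 127")
    if size_limit < 0:
        problems.append("Size limit must not be negative")
    return problems


# Example call using the HMP values from the table; the insert size of 300
# is a placeholder, not a value taken from this walkthrough:
# check_assembly_params(100, 300, 0, 25, 300)
```

The pipeline's own “validate” step remains the authoritative check; this sketch only catches obvious entry mistakes early.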
Select “DIAG” credentials from the “Account” drop-down menu.
Under “Pipeline Description”, enter a name that will identify your pipeline on the web interface home page, e.g. “assembly_SRS019215”.
Check your input by clicking “validate”. If the validation is successful, start the pipeline by clicking “Run”.
After a successful pipeline submission, the web interface will change to the “Home” page where the new pipeline will be listed as “Status: running.”
Monitoring the pipeline
Your pipeline should now appear in the Pipelines window in the CloVR dashboard along with its status. Occasionally, the pipeline may be idle for a minute or two before running. You can click on the pipeline to get a description, input parameters, and hyperlinks to more advanced workflow interfaces like Ergatis. Clicking on the [Pipeline #] headers in the “Pipeline Information” window will open the Ergatis “Workflow creation and monitoring interface” in a separate browser window, which provides useful information for troubleshooting of failed pipeline runs.
Accessing the outputs
Once the pipeline completes, the results can be downloaded from the CloVR dashboard by clicking on the Outputs tab (see Figure below). All result files are created as compressed archives (.tar.gz), which can be extracted using the Finder in Mac OS X, the Tar utility in Unix, or programs such as WinZip or WinRAR in Windows.
Examining the outputs
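As an alternative to the tools listed above, a downloaded archive can be unpacked with Python's standard `tarfile` module. This is a minimal sketch; the function name and the archive and folder names in the usage comment are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_results(archive_path, out_dir):
    """Extract a .tar.gz archive into out_dir and return the member names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives from a source you trust, such as your own
        # pipeline run: member paths are not checked here.
        archive.extractall(out)
    return names


# Example call (hypothetical names):
# extract_results("soapdenovo_assemblies.tar.gz", "assembly_results")
```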
The CloVR-Metagenomics Assembly pipeline outputs the following files:
|soapdenovo_assemblies||Fastq record of assembled sequences|
|assembly_stats||Assembly statistics such as the number of assemblies, minimum and maximum contig size, N50, etc.|
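To make the statistics in the table concrete, the sketch below recomputes them from a list of contig lengths. It is an illustration of the standard N50 definition (the length of the contig at which the running total of lengths, sorted longest first, reaches half of the total assembly size), not the code used by the pipeline; the function names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def summarize_assembly(lengths):
    """Collect basic assembly statistics into a dictionary."""
    lengths = list(lengths)
    return {
        "count": len(lengths),
        "min": min(lengths) if lengths else 0,
        "max": max(lengths) if lengths else 0,
        "n50": n50(lengths),
    }


# Four contigs totalling 1000 bp: the two longest (400 + 300) cover at
# least half of the assembly, so the N50 is 300.
print(summarize_assembly([100, 200, 300, 400]))
# prints {'count': 4, 'min': 100, 'max': 400, 'n50': 300}
```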