CloVR-Microbe Walkthrough

Introduction

Cloud support is recommended for running CloVR-Microbe version 1.0. Cloud use is optional: CloVR-Microbe, like any other CloVR protocol, can be run on a local computer. However, several steps of the protocol either have high RAM requirements (assembly) or are computationally intensive (annotation), which can make local execution practically impossible due to long runtimes (e.g. weeks for a standard E. coli genome project). The BLAST and HMMer searches that are part of the annotation component of CloVR-Microbe benefit significantly from parallelization across multiple processors on the cloud. Assembly of 454 sequence data with the Celera assembler requires 4 GB of RAM, whereas assembly of Illumina sequence data with Velvet requires RAM in excess of 4 GB.

The total runtime of CloVR-Microbe depends mostly on the number of predicted protein-coding genes that require functional annotation, which can vary significantly with the assembly output. A typical run of CloVR-Microbe (e.g. a standard E. coli genome consisting of <100 contigs with a total length of <5 Mbp) usually finishes in under 24 hours, independent of the type of input sequence data. See our publication in PLoS ONE for details.

To use the cloud to run CloVR protocols, you must obtain credentials from one of the supported cloud providers and CloVR must be configured to use these credentials. If you want to use the Amazon Elastic Compute Cloud (EC2), be sure to have configured your Amazon EC2 credentials. Usage on Amazon EC2 is charged per hour and care must be taken to terminate instances after a protocol has completed.

 

Input

454: A single SFF file.

OR

Illumina: FASTQ file(s)
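Before tagging a file, it can help to confirm that it really is the input type you expect. The following is a minimal sketch, not part of CloVR: the function name guess_input_type is hypothetical, and the check relies only on the facts that SFF files begin with the four magic bytes ".sff" and that uncompressed FASTQ records begin with an "@" header line.

```python
SFF_MAGIC = b".sff"  # SFF files start with the magic number 0x2E736666


def guess_input_type(path):
    """Return 'sff', 'fastq', or 'unknown' based on the first bytes of a file.

    Hypothetical helper for illustration; it only inspects the first four
    bytes and does not validate the rest of the file.
    """
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == SFF_MAGIC:
        return "sff"
    # Uncompressed FASTQ files start with an '@' header line.
    if head.startswith(b"@"):
        return "fastq"
    return "unknown"
```

A compressed FASTQ file would be reported as "unknown" by this sketch, because only the raw leading bytes are examined.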

 

Download Test Datasets and Output

Test datasets are .tar archives of SFF and FASTQ files and need to be extracted before use. Extracted datasets should be copied to the user_data folder in the CloVR VM. From within the VM, the shared folder is accessible as /mnt/user_data.
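The extraction step can also be scripted. The sketch below uses Python's standard tarfile module to unpack a dataset archive into a destination folder; the function name extract_dataset and the example paths are assumptions for illustration, and the location of the user_data folder on your local computer depends on where you unpacked the CloVR VM.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive_path, dest_dir):
    """Extract the regular files of a .tar archive into dest_dir.

    Hypothetical helper for illustration. Returns the extracted file paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    with tarfile.open(archive_path) as archive:
        members = [m for m in archive.getmembers() if m.isfile()]
        # Refuse members whose paths would land outside dest_dir.
        for member in members:
            target = (root / member.name).resolve()
            if root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest, members=members)
    return sorted(dest / m.name for m in members)
```

For example, extract_dataset("test_dataset.tar", "user_data") would unpack a (hypothetically named) archive into a local user_data folder, after which the files are visible inside the VM under /mnt/user_data.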

Pipeline Execution

1. Starting the CloVR web interface

(identical for all CloVR-protocols)

Check the CloVR desktop window for the IP address of your virtual machine (VM).

Start the web interface by opening that IP address in your browser.

 

2. Adding datasets to the CloVR VM

Before starting a pipeline, you must add your datasets to the CloVR VM as “Tags”. The easiest way to add data to the VM is to copy it into the user_data folder, which is shared between the VM and your local computer. If you have problems accessing files in the shared folders, see Troubleshooting CloVR on VirtualBox.

To add files, click “Add” on the web interface.

 

Add Roche/454 SFF Sequence Data

If your files are in the user_data directory, click on “Select file from image”, which will open a sub-window where you can select the SFF file for upload into the VM. Alternatively, you can use “Browse” in the “Upload File” window to find and select files from anywhere on your local computer, but multiple files have to be uploaded in separate steps.

Select “Nucleotide SFF” from the “File Type” drop-down menu and name your dataset, e.g. “CloVRMicrobe_Test_SFF”. This name will appear as a “Tag” on the CloVR web interface. Multiple files will be listed under the same “Tag” name. Add an optional description of your dataset. Click “Tag” to upload the data to CloVR.

 

 

A “Completed Successfully” window should appear to indicate that your dataset was added to the CloVR VM, and the new dataset should be listed under “Data Sets” on the web interface.

 

Add Illumina FASTQ Sequence Data

If your files are in the user_data directory, click on “Select file from image”, which will open a sub-window where you can select FASTQ files for upload into the VM. Illumina paired-end sequence data usually consist of two separate files. FASTQ files with corresponding paired-end reads can be tagged in one step.
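Before tagging two paired-end files together, a quick consistency check can catch truncated or mismatched files. The sketch below is not part of CloVR; the function names are hypothetical, and it assumes uncompressed FASTQ files with exactly four lines per record and that both files of a pair should contain the same number of reads.

```python
def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file (four lines per record assumed)."""
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4


def pairs_match(forward_path, reverse_path):
    """Return True if both files of a paired-end set hold the same number of reads."""
    return count_fastq_records(forward_path) == count_fastq_records(reverse_path)
```

A False result, or a ValueError, suggests that one of the files is incomplete and should be checked before the dataset is uploaded.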

 

In the CloVR VM, each dataset is listed with the “Tag” name that was specified as “Name” when the dataset was added. Clicking on the newly added “Tag” opens a new window with information about the associated dataset.

 

 

Compressed or uncompressed archives are uploaded in the same way.

3. Configuring and starting the pipeline

To initialize a new pipeline run, select the “Tag” corresponding to your SFF or FASTQ file(s) in the “Data Sets” window and click on the “CloVR Microbe” icon. This will open the pipeline configuration window.

 

 

 

Configure CloVR-Microbe on Roche/454 Sequence Data

Select the “Tag” corresponding to your SFF (e.g. “CloVRMicrobe_Test_SFF”) file from the list in the “Select Sequencing Dataset” drop-down menu. Choose “Assembly+Annotation” as the “CloVR Microbe Track”.

If cloud credentials have been added to the CloVR VM, they can be selected in the “Account” drop-down menu. Selecting “local” is not recommended unless your local computer can provide the CloVR VM with multiple CPUs and at least 4 GB of RAM. As “Pipeline Description”, provide a name by which to recognize your pipeline on the dashboard “Home” page, e.g. “CloVRMicrobe_454Test_1”.

Specify the “Organism” with a two-part name at the taxonomic species level, e.g. “Acinetobacter baylyi”. Modify additional assembly and annotation settings as needed. Most parameters refer to the SffToCA utility of the Celera assembler, which converts sequence reads from the SFF format to the Celera assembler FRG format. Provide a unique “Pipeline Description”.
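If pipeline settings are prepared in a script or spreadsheet before being entered, the two-part organism name can be checked ahead of time. The pattern below is an assumption about what a genus and species name typically looks like (capitalized genus, lowercase species); it is not the validation rule that CloVR itself applies, and the function name is hypothetical.

```python
import re

# Assumed shape of a binomial name: "Genus species".
BINOMIAL = re.compile(r"[A-Z][a-z]+ [a-z][a-z-]*")


def looks_like_species_name(name):
    """Return True if name has the form 'Genus species' (illustrative check only)."""
    return BINOMIAL.fullmatch(name.strip()) is not None
```

With this sketch, “Acinetobacter baylyi” passes, while a genus alone or an abbreviated form such as “E. coli” does not.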

 

 

 

Configure CloVR-Microbe on Illumina Sequence Data

Select the “Tag” corresponding to your FASTQ files and “Assembly+Annotation” as the “CloVR Microbe Track” as described for the Roche/454 sequence data above. A separate window will open that allows you to specify the characteristics of the FASTQ sequence data.

Set “Read Length” to “Short” for Illumina sequence data and to “Long” for Sanger sequence data.

 

Check your input by clicking “Validate”. If the validation is successful, start the pipeline by clicking “Submit”. After successful pipeline submission, the web interface will change to the “Home” page, where the new pipeline will be listed as “Status: running”.

 

 

 

 

4. Monitoring your pipeline

The pipeline status, i.e. the number of steps completed, is shown for each running pipeline. Further information can be accessed by clicking on the pipeline name, which opens the “Pipeline Information” window.

 

 

Clicking on the [Pipeline #] headers in the “Pipeline Information” window will open the Ergatis “Workflow creation and monitoring interface” in a separate browser window, which provides useful information for troubleshooting failed pipeline runs. Each protocol consists of an outer wrapper pipeline, which always runs locally, and an inner pipeline (shown in parentheses in the “Pipeline Information” window), which runs locally or on the cloud depending on the pipeline configuration.

Downloading Pipeline Output

Once the CloVR-Microbe run has completed, multiple results files are created in the “output” directory listed in the “Pipeline Information” window. If CloVR-Microbe is run on the cloud, all results files will be downloaded into the same folder. The path to this folder should look like this:

/clovr-standard-2011-08-25-05-13-27/shared/output/

All results files are created as compressed archives (.tar.gz), which can be extracted with the Finder in Mac OS X, the tar utility in Unix, or programs such as WinZip or WinRAR in Windows.
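As an alternative to the tools above, the archives can be unpacked in one pass with Python's standard tarfile module. This is a minimal sketch under stated assumptions: the function name extract_results is hypothetical, every .tar.gz file in the given folder is treated as a results archive, and the archives are trusted because they were produced by your own pipeline run.

```python
import tarfile
from pathlib import Path


def extract_results(output_dir, dest_dir):
    """Extract every .tar.gz in output_dir into its own subfolder of dest_dir.

    Hypothetical helper for illustration. Returns a dict mapping each archive
    name (without the .tar.gz suffix) to the list of extracted files.
    """
    extracted = {}
    for archive_path in sorted(Path(output_dir).glob("*.tar.gz")):
        name = archive_path.name[: -len(".tar.gz")]
        target = Path(dest_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        # Archives are assumed to come from your own CloVR run (trusted input).
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(target)
        extracted[name] = sorted(p for p in target.rglob("*") if p.is_file())
    return extracted
```

Giving each archive its own subfolder keeps files from different outputs apart, even if two archives contain files with the same name.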

In addition, all results files are also available for download from the web interface, using the “Output” tab from the “Pipeline Information” window, which is accessible by clicking on the completed pipeline name.

Output

The outputs are:

Output: Description
assembly_scaffold: Sequence assembly in which contigs with available paired-end read information are concatenated via linker sequence (filename: asmbl.scf.fasta)
assembly_qc: Assembly quality control file, output from Celera Assembler (filename: asmbl.qc)
annotation_sequin_input: File with scaffold annotations as input for the Sequin NCBI sequence submission tool (filenames: scf.xxx.sqn)
annotation_genbank: GenBank files of annotated scaffolds (filenames: scf.xxx.gbf)
features_CDS: Nucleotide fasta files of all coding sequences (filenames: asmbl.CDS.xxx.fsa)
features_proteins: Protein fasta files of all coding sequences (filenames: asmbl.polypeptide.xxx.fsa)
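When many extracted files need to be sorted, the output names and filenames listed above can be turned into a small lookup. The sketch below is illustrative only: it assumes that the “xxx” part of each filename is a placeholder for a varying identifier and therefore treats it as a wildcard, and the names OUTPUT_PATTERNS and classify_output are hypothetical.

```python
from fnmatch import fnmatchcase

# Filename patterns taken from the output list; "xxx" is assumed to be a
# placeholder and is replaced by the wildcard "*".
OUTPUT_PATTERNS = {
    "assembly_scaffold": "asmbl.scf.fasta",
    "assembly_qc": "asmbl.qc",
    "annotation_sequin_input": "scf.*.sqn",
    "annotation_genbank": "scf.*.gbf",
    "features_CDS": "asmbl.CDS.*.fsa",
    "features_proteins": "asmbl.polypeptide.*.fsa",
}


def classify_output(filename):
    """Return the output name whose pattern matches filename, or None."""
    for output_name, pattern in OUTPUT_PATTERNS.items():
        if fnmatchcase(filename, pattern):
            return output_name
    return None
```

For example, classify_output("scf.1.gbf") returns "annotation_genbank" under the wildcard assumption, and a filename that matches none of the patterns returns None.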