Cloud support is recommended for running CloVR-Microbe version 2.0. Cloud use is optional: CloVR-Microbe, like any other CloVR protocol, can be run on a local computer. However, several steps of the protocol either have high RAM requirements (assembly) or are computationally intensive (annotation), which can make local execution practically impossible because of long runtimes (e.g. weeks for a standard E. coli genome project). The BLAST and HMMer searches that are part of the annotation component of CloVR-Microbe benefit significantly from parallelization across multiple processors on the cloud. Assembly of 454 sequence data with the Celera assembler requires 4GB of RAM, whereas assembly of Illumina sequence data with SPAdes requires RAM in excess of 2-3GB.
The total runtime of CloVR-Microbe depends mostly on the number of predicted protein-coding genes that require functional annotation, which can vary significantly with the assembly output. A typical run of CloVR-Microbe (e.g. a standard E. coli genome consisting of <100 contigs with a total length of <5 Mbp) usually finishes in under 24 hours, regardless of the type of input sequence data. See our publication in PLoS ONE for details.
To use the cloud to run CloVR protocols, you must obtain credentials from one of the supported cloud providers and CloVR must be configured to use these credentials. If you want to use the Amazon Elastic Compute Cloud (EC2), be sure to have configured your Amazon EC2 credentials. Usage on Amazon EC2 is charged per hour and care must be taken to terminate instances after a protocol has completed.
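Because EC2 usage is charged per hour, it can help to estimate the cost of a run before starting it. The following is a minimal sketch, not part of CloVR: the function name, the instance count, and the hourly rate are hypothetical, and it assumes that every started hour is billed in full, which may not match your provider's current billing rules.

```python
import math


def estimate_ec2_cost(runtime_hours, instance_count, hourly_rate_usd):
    """Rough cost estimate, assuming each started hour is billed in full.

    All three inputs are supplied by the user; none of them is read from
    CloVR or from Amazon.
    """
    if runtime_hours < 0 or instance_count < 0 or hourly_rate_usd < 0:
        raise ValueError("inputs must be non-negative")
    billed_hours = math.ceil(runtime_hours)  # round partial hours up
    return billed_hours * instance_count * hourly_rate_usd


# Hypothetical example: a 23.5-hour run on 4 instances at a made-up
# rate of 0.50 USD per instance-hour.
print(estimate_ec2_cost(23.5, 4, 0.50))
```

Instances that are left running after the protocol has completed keep adding billed hours, which is why terminating them promptly matters.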
454: A single SFF file.
Illumina: FASTQ file(s)
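If you are unsure which of the two input types a file is, its first bytes usually tell you. The sketch below is an illustration only (the function name is hypothetical and CloVR does not require this check): it relies on SFF files starting with the four-byte magic value ".sff" and on FASTQ records starting with an "@" header line. Compressed files are reported as "unknown" because their first bytes belong to the compression format.

```python
def sniff_read_format(path):
    """Return 'sff', 'fastq', or 'unknown' based on the first bytes of a file."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == b".sff":      # SFF magic number 0x2E736666
        return "sff"
    if head[:1] == b"@":     # FASTQ records begin with an '@' header line
        return "fastq"
    return "unknown"         # includes compressed or empty files
```

A result of "sff" corresponds to the "Nucleotide SFF" file type described later in this walkthrough.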
Download Test Datasets and Output
- Single SFF file: CloVR-Microbe v1.0, 454 example input SFF file
- Paired-end FASTQ files: CloVR-Microbe v1.0 Illumina example input FASTQ files
Test datasets are .tar archives of SFF and FASTQ files and need to be extracted before use. Extracted datasets should be copied to the user_data folder in the CloVR VM. From within the VM, the shared folder is accessible as /mnt/user_data.
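The extraction and copy step can also be scripted on the local computer. This is a minimal sketch using Python's standard tarfile module; the function name and both paths in the usage comment are hypothetical and need to be replaced with the location of your downloaded archive and of your local user_data folder.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive_path, user_data_dir):
    """Extract a .tar test dataset into the shared user_data folder.

    Returns the sorted names of the regular files contained in the archive.
    """
    destination = Path(user_data_dir)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as
            # absolute paths or links that point outside the destination.
            archive.extractall(destination, filter="data")
        else:
            archive.extractall(destination)
        return sorted(m.name for m in archive.getmembers() if m.isfile())


# Hypothetical usage; both paths are placeholders:
# extract_dataset("downloads/example_dataset.tar", "clovr/user_data")
```

After extraction, the files should be visible from within the VM under /mnt/user_data.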
1. Starting the CloVR web interface
This step is identical on all platforms. Please choose the instructions for the platform on which you are running CloVR:
2. Adding datasets to the CloVR VM
Before starting a pipeline, you must add your datasets to the CloVR VM as “Tags”. The easiest way to add data to the VM is to copy the files into the user_data folder, which is shared between the VM and your local computer. See Troubleshooting CloVR on VirtualBox if you have problems accessing files in the shared folders.
To add files, click “Add” on the web interface.
Add Roche/454 SFF Sequence Data
If your files are in the user_data directory, click on “Select file from image”, which will open a sub-window where you can select the SFF file for upload into the VM. Alternatively, you can use “Browse” in the “Upload File” window to find and select files from anywhere on your local computer, but multiple files have to be uploaded in separate steps.
Select “Nucleotide SFF” from the “File Type” drop-down menu and name your dataset, e.g. as “CloVRMicrobe_Test_SFF”. This name will appear as a “Tag” on the CloVR web interface. Multiple files will be listed under the same “Tag” name. Add an optional description of your dataset. Click “Tag” to upload the data to CloVR.
A “Completed Successfully” window should appear to indicate that your datasets were added to the CloVR VM and the new dataset should be listed under “Data Sets” on the web interface.
Add Illumina FASTQ Sequence Data
If your files are in the user_data directory, click on “Select file from image”, which will open a sub-window where you can select FASTQ files for upload into the VM. Illumina paired-end sequence data usually consist of two separate files. FASTQ files with corresponding paired-end reads can be tagged in one step.
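When a folder holds many FASTQ files, it can be useful to work out which two files belong together before tagging them. The sketch below assumes a hypothetical naming convention in which mates end in "_1" and "_2" before the .fastq or .fq extension (optionally followed by .gz); your files may be named differently, and CloVR itself does not depend on this convention.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention: <sample>_1.fastq / <sample>_2.fastq
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def group_paired_fastq(filenames):
    """Map each sample name to its (mate 1, mate 2) FASTQ file names.

    Files that do not match the pattern, or that lack a partner, are skipped.
    """
    mates_by_sample = defaultdict(dict)
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:
            mates_by_sample[match.group("sample")][match.group("mate")] = name
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in mates_by_sample.items()
        if mates.keys() >= {"1", "2"}
    }
```

Each resulting pair can then be selected together and added under a single “Tag”.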
In the CloVR VM, each dataset is listed with the “Tag” name that was specified as “Name” when the dataset was added. Clicking on the newly added “Tag” opens a new window with information about the associated dataset.
Compressed or uncompressed archives are uploaded in the same way.
3. Configuring and starting the pipeline
To initialize a new pipeline run, select the “Tag” corresponding to your SFF or FASTQ file(s) in the “Data Sets” window and click on the “CloVR Microbe” icon. This will open the pipeline configuration window.
Configure CloVR-Microbe on Roche/454 Sequence Data
Select the “Tag” corresponding to your SFF file (e.g. “CloVRMicrobe_Test_SFF”) from the list in the “Select Sequencing Dataset” drop-down menu. Choose “Assembly+Annotation” as the “CloVR Microbe Track”.
If cloud credentials have been added to the CloVR VM, they can be selected in the “Account” drop-down menu. Selecting “local” is not recommended unless your local computer provides multiple CPUs and at least 4GB of RAM for the CloVR VM. As the “Pipeline Description”, provide a name by which you can recognize your pipeline on the dashboard “Home” page, e.g. “CloVRMicrobe_454Test_1”.
Specify the “Organism” with a two-part name at the taxonomic species level, e.g. “Acinetobacter baylyi”. Modify additional assembly and annotation settings as needed. Most parameters refer to the SffToCA utility of the Celera assembler, which converts sequence reads from the SFF format to the Celera assembler FRG format. Provide a unique “Pipeline Description”.
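To catch a malformed organism name before you open the configuration window, a quick local check can help. This is only a sketch of the "two-part name" rule as described above; the function name is hypothetical, and the checks that CloVR applies when you click “Validate” may be stricter or more lenient.

```python
def is_two_part_name(organism):
    """Return True if the name has exactly two alphabetic words (genus and species)."""
    parts = organism.split()
    return len(parts) == 2 and all(part.isalpha() for part in parts)


print(is_two_part_name("Acinetobacter baylyi"))   # two words: accepted
print(is_two_part_name("Acinetobacter"))          # genus only: rejected
```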
The “Database Name”, “Manatee Username”, and “Manatee Password” fields can be left unchanged if only a single CloVR Microbe pipeline needs to be run. However, if you anticipate running multiple annotations on this CloVR instance, you will need to set a unique “Database Name” for each CloVR Microbe run. Remember this information if you plan to transfer data to a Manatee virtual machine for later viewing and annotation curation.
Configure CloVR-Microbe on Illumina Sequence Data
Select the “Tag” corresponding to your FASTQ files and “Assembly+Annotation” as the “CloVR Microbe Track”, as described for the Roche/454 sequence data above. Illumina FASTQ data are assembled with the SPAdes assembler, and currently no additional options need to be adjusted. The wrapper script for the SPAdes assembler detects whether the selected tag contains single-end or paired-end reads.
Specify “Credentials” and “Organism” as described in the Roche/454 sequence data section above. The same applies to the “Manatee Username”, “Manatee Password”, and “Database Name”.
Check your input by clicking “Validate”. If the validation is successful, start the pipeline by clicking “Submit”. After successful pipeline submission, the web interface will change to the “Home” page, where the new pipeline will be listed as “Status: running”.
4. Monitoring your pipeline
The pipeline status, i.e. the number of steps completed, is shown for each running pipeline. Further information can be accessed by clicking on the pipeline name, which opens the “Pipeline Information” window.
Clicking on the [Pipeline #] headers in the “Pipeline Information” window will open the Ergatis “Workflow creation and monitoring interface” in a separate browser window, which provides useful information for troubleshooting failed pipeline runs. Each protocol consists of an outer wrapper pipeline, which always runs locally, and an inner pipeline (shown in parentheses in the “Pipeline Information” window), which runs locally or on the cloud depending on the pipeline configuration.
Downloading Pipeline Output
Once the CloVR-Microbe run has completed, multiple result files are created in the “output” directory of the “Pipeline Information” window.
All result files are available for download from the web interface, using the “Output” tab from the “Pipeline Information” window, which is accessible by clicking on the completed pipeline name.
The outputs are:
| Output | Description |
|---|---|
| assembly_scaffold | Sequence assembly with those contigs for which paired-end read information is available concatenated via linker sequence (filename: asmbl.scf.fasta) |
| assembly_qc | Assembly quality control file, output from Celera Assembler (filename: asmbl.qc) |
| annotation_sequin_input | File with scaffold annotations as input for Sequin NCBI sequence submission tool (filenames: scf.xxx.sqn) |
| annotation_genbank | GenBank files of annotated scaffolds (filenames: scf.xxx.gbf) |
| features_CDS | Nucleotide fasta files of all coding sequences (filenames: asmbl.CDS.xxx.fsa) |
| features_proteins | Protein fasta files of all coding sequences (filenames: asmbl.polypeptide.xxx.fsa) |
| CDS database via formatdb | Nucleotide database files created by formatdb used for BLASTN searches in Manatee |
| Protein database via formatdb | Polypeptide database files created by formatdb used for BLASTP searches in Manatee |
| Genome database via formatdb | Nucleotide scaffold database files created by formatdb used for TBLASTN searches in Manatee |
| Circleator files | Circular plots of genome-associated data (in PDF and JPG) |
| BER files | Output from the Blast-Extend Repraze utility for both pre- and post-overlap analysis (raw output is in .nr, btab output is in .nr.btab) |
| Annotation Database | Chado SQL database of annotations that can be uploaded to Manatee (a SQL dump file) |
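Once the output has been downloaded, a quick look at the scaffold FASTA file shows how fragmented the assembly is, which in turn gives a feel for how the run compares with the "typical" genome of <100 contigs and <5 Mbp mentioned at the top of this page. The following is a minimal sketch with a hypothetical function name; it only counts header lines and sequence characters and does not validate the file.

```python
def summarize_fasta(path):
    """Return (number_of_sequences, total_length_in_bases) for a FASTA file."""
    sequence_count = 0
    total_length = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):   # header line starts a new sequence
                sequence_count += 1
            elif line:                 # non-empty sequence line
                total_length += len(line)
    return sequence_count, total_length


# Hypothetical usage on the downloaded scaffold file:
# scaffolds, bases = summarize_fasta("asmbl.scf.fasta")
# print(scaffolds, "scaffolds,", bases / 1e6, "Mbp")
```

The same function works on the features_CDS and features_proteins files, where the first value is the number of coding sequences or proteins.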