This walkthrough provides a simple example of how to set up and run the Human Contaminant Screening pipeline using the browser-accessible CloVR dashboard. The pipeline uses the NCBI BMTagger (Best Match Tagger) tool to identify and remove human reads from metagenomic sequences. For this walkthrough, we use a mock dataset that consists of a 50:50 mix of human contaminant-screened reads from an HMP project and filtered human reads from a 1000 Genomes project.
This walkthrough uses the cloud for computational support. Specifically, we demonstrate how to run the pipeline using the academic cloud DIAG, which is free for researchers. Alternatively, you could run the pipeline on Amazon EC2 or with other cloud computing providers.
The Human Contaminant Screening SOP provides a detailed description of this pipeline.
Note: This pipeline has high memory requirements. The underlying programs require about 8.5 GB of memory and three times as much hard disk space for index data. The disk space needed for temporary files depends on the input and, for metagenomic datasets, is typically about the same size as the input. If your local machine does not have this capacity, consider running the pipeline on DIAG or Amazon EC2.
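To check whether a local machine is likely to cope, the figures in the note above can be turned into a rough estimate. The sketch below is only an illustration: the function and constant names are made up for this example, "three times as much" is read as three times the 8.5 GB memory figure, and the input files themselves are not counted in the disk total.

```python
# Rough resource estimate based on the figures quoted in the note above.
# All names here are illustrative; none of them are part of CloVR.

MEMORY_GB = 8.5          # approximate memory needed by the underlying programs
INDEX_DISK_FACTOR = 3    # assumption: index data needs ~3x the memory figure


def estimate_requirements(input_size_gb):
    """Return approximate memory and disk needs (in GB) for one run."""
    index_disk_gb = MEMORY_GB * INDEX_DISK_FACTOR
    # Temporary files are typically about the same size as the input.
    temp_disk_gb = input_size_gb
    return {
        "memory_gb": MEMORY_GB,
        "index_disk_gb": index_disk_gb,
        "temp_disk_gb": temp_disk_gb,
        "total_disk_gb": index_disk_gb + temp_disk_gb,
    }
```

For a 2 GB input, this gives 25.5 GB for index data plus 2 GB for temporary files, or 27.5 GB of disk in total.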
Getting started with CloVR
Installing and setting up CloVR is a one-time process. If you have done this before, you may skip to the next step, "Setting up input dataset".
CloVR is run using a local desktop client. Visit the Getting started with CloVR page to download and install the client. Once the CloVR virtual machine is set up and launched, you should see a screen similar to Figure 1.
Start the CloVR web interface
First check the CloVR desktop window for the IP address of your virtual machine (VM). Then enter this IP address in a web browser as shown in Figure 2.
Add cloud credentials to the pipeline
Visit the Adding Credentials page for steps on how to add DIAG credentials. Once your DIAG credentials are set up, you should see them listed in the credentials tab as shown in Figure 3.
Setting up input dataset
Prepare input datasets
The input to this pipeline can be either FASTA or FASTQ read files. The input can be either a single-end reads file or a set of paired-end reads files.
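If you are unsure which of the two formats a file is in, a quick look at its first record usually settles it: FASTA headers start with ">" and FASTQ headers start with "@". The following is a minimal sketch under that assumption, for uncompressed text files only; `guess_read_format` is a hypothetical helper, not a CloVR function.

```python
def guess_read_format(path):
    """Guess 'fasta' or 'fastq' from the first non-blank line of a text file."""
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break  # first real line is neither kind of header
    raise ValueError(f"{path} does not look like FASTA or FASTQ")
```

The returned string matches the "fasta" and "fastq" labels used later when configuring the pipeline.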
The first step is to move your input data file(s) into the user_data folder located within the clovr-standard-* image directory. This will enable you to easily access the data through the CloVR dashboard.
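Copying the files can of course be done by hand in a file manager. For repeated runs, a small script may be convenient. The sketch below assumes that exactly one `clovr-standard-*` directory sits under a parent directory that you pass in, and that it already contains `user_data`; the function name and the `vm_parent_dir` argument are hypothetical.

```python
import shutil
from pathlib import Path


def copy_to_user_data(files, vm_parent_dir):
    """Copy read files into <vm_parent_dir>/clovr-standard-*/user_data."""
    image_dirs = [p for p in sorted(Path(vm_parent_dir).glob("clovr-standard-*"))
                  if p.is_dir()]
    if len(image_dirs) != 1:
        raise RuntimeError(
            f"expected one clovr-standard-* directory, found {len(image_dirs)}")
    user_data = image_dirs[0] / "user_data"
    if not user_data.is_dir():
        raise FileNotFoundError(f"{user_data} does not exist")
    copied = []
    for source in files:
        # copy2 keeps timestamps and returns the path of the new file
        copied.append(Path(shutil.copy2(source, user_data)))
    return copied
```

The function copies rather than moves, so the original files stay where they are.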
Add datasets to the pipeline
Before starting a pipeline, you must add your datasets to the CloVR VM as “Tags”. To add tags, click “Add” on the web interface.
Then click on “Select file from image”, which will open a sub-window where you can select one or multiple FASTA or FASTQ files for upload into the VM. For a paired-end dataset, you should select exactly two files; the files will be screened as a pair. For a single-end dataset, you can select one or multiple files. If you select multiple files, each file will be screened individually.
Alternatively, you can use “Browse” in the “Upload File” window to find and select files from anywhere on your local computer, but multiple files have to be uploaded in separate steps.
Select “Nucleotide FASTA” or “Nucleotide FASTQ” (depending on your file type) from the “File Type” drop-down menu and name your dataset, e.g. “HCS_paired_1”. Add an optional description of your dataset. Click “Tag” to upload the data to CloVR. A “Completed Successfully” window should appear to indicate that your dataset was added to the CloVR VM.
The new dataset will be listed under the “Data Sets” tab on the CloVR web interface. Multiple files will be listed under the same “Tag” name.
Pipeline setup and execution
To initialize a new pipeline run, click on the “Other Protocols” drop-down as shown in the figure below. Then select “clovr_human_contaminant_screening_paired” (for a paired-end dataset) or “clovr_human_contaminant_screening_single” (for a single-end dataset).
This will open the pipeline configuration window. Select the Tag corresponding to the input dataset file(s). Select “fasta” or “fastq” as the input format.
Select “DIAG” credentials from the “Account” drop-down menu.
Provide a name to recognize your pipeline in the web interface home page as “Pipeline Description”, e.g. “hcs_paired_test1”.
Check your input by clicking “validate”. If the validation is successful, start the pipeline by clicking “Run”.
After a successful pipeline submission, the web interface will change to the “Home” page where the new pipeline will be listed as “Status: running.”
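The choices made while tagging and configuring a run follow a few simple rules: a paired-end tag holds exactly two files and is screened once as a pair, a single-end tag holds one or more files that are each screened individually, and the input format is either "fasta" or "fastq". The sketch below encodes those rules as a sanity check you could run before opening the dashboard. The protocol names are the ones shown in the drop-down; the function name and dictionary keys are hypothetical and are not CloVR configuration fields.

```python
# Protocol names as listed under "Other Protocols" in the dashboard.
PROTOCOLS = {
    True: "clovr_human_contaminant_screening_paired",
    False: "clovr_human_contaminant_screening_single",
}


def plan_run(files, paired, input_format):
    """Check a file selection against the tagging rules and summarise the run."""
    files = list(files)
    if input_format not in ("fasta", "fastq"):
        raise ValueError("input_format must be 'fasta' or 'fastq'")
    if paired:
        if len(files) != 2:
            raise ValueError("a paired-end tag needs exactly two files")
        screenings = 1            # the two files are screened as one pair
    else:
        if not files:
            raise ValueError("a single-end tag needs at least one file")
        screenings = len(files)   # each file is screened individually
    return {
        "protocol": PROTOCOLS[paired],
        "input_format": input_format,
        "files": files,
        "screenings": screenings,
    }
```

For example, `plan_run(["r1.fastq", "r2.fastq"], paired=True, input_format="fastq")` selects the paired protocol with a single screening, while the same two files with `paired=False` would be treated as two independent single-end screenings.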
Monitoring the pipeline
Your pipeline should now appear in the Pipelines window in the CloVR dashboard along with its status. Occasionally, the pipeline may idle for a minute or two before running. You can click on the pipeline to get a description, input parameters, and hyperlinks to more advanced workflow interfaces like Ergatis. Clicking on the [Pipeline #] headers in the “Pipeline Information” window will open the Ergatis “Workflow creation and monitoring interface” in a separate browser window, which provides useful information for troubleshooting failed pipeline runs.
Accessing the outputs
Once the pipeline completes, the results can be downloaded from the CloVR dashboard by clicking on the Outputs tab (see figure below). All result files are created as compressed archives (.tar.gz), which can be extracted using the Finder in Mac OS X, the tar utility in Unix, or programs such as WinZip or WinRAR in Windows.
Examining the outputs
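The downloaded archives can also be unpacked from a script, which is handy when processing several runs. This is a minimal sketch using Python's standard `tarfile` module; the function name is hypothetical, and the archive and destination paths are whatever you chose when downloading. As with any archive, only extract files that come from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_results(archive_path, dest_dir):
    """Extract a .tar.gz results archive and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support safer extraction filters.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return names
```

The returned list of member names gives a quick view of what the archive contained without having to browse the destination folder.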
The CloVR-Human Contaminant Screening pipeline outputs the following files:
| File | Description |
|---|---|
| screened_files | Output FASTA or FASTQ file(s) remaining after human sequences have been removed. |
| screened_ids | A list of the IDs of the sequences that were removed. |
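A quick way to sanity-check a run is to compare how many reads were removed with how many remain. The sketch below assumes that the screened_ids file is plain text with one sequence ID per line and that the screened reads file is uncompressed; neither detail is confirmed by this walkthrough, and all function names are hypothetical. How IDs are reported for paired-end data is also not specified here, so treat the result as approximate in that case.

```python
def count_removed_ids(ids_path):
    """Count non-blank lines, assuming one removed sequence ID per line."""
    with open(ids_path) as handle:
        return sum(1 for line in handle if line.strip())


def count_reads(reads_path, input_format):
    """Count records in an uncompressed FASTA or FASTQ file."""
    with open(reads_path) as handle:
        if input_format == "fasta":
            return sum(1 for line in handle if line.startswith(">"))
        if input_format == "fastq":
            # Assumes the common four-line FASTQ layout (no wrapped sequences).
            return sum(1 for line in handle if line.strip()) // 4
    raise ValueError("input_format must be 'fasta' or 'fastq'")


def fraction_removed(removed, kept):
    """Fraction of all input reads that were flagged as human and removed."""
    total = removed + kept
    return removed / total if total else 0.0
```

For the mock dataset used in this walkthrough, which is described as a 50:50 mix of screened and human reads, a fraction near 0.5 would be the expected outcome if the mix is by read count.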