Downloading Specific Sets of Output from a Pipeline

Introduction

In past CloVR images, downloading output from a pipeline run was an all-or-nothing affair. The entire output_repository folder was packed into a tarball and shipped to the local VM, usually including intermediate and extraneous data that was not needed. To address this, a tag_data Ergatis component has been created. It can be executed as the last component in a pipeline and tags specific files, file lists, or directories for download. This targeted approach avoids downloading data that has no value for analysis.

Creating a Download Map

The tag_data component requires a download map as its input file: a tab-delimited text file in which each line contains a desired tag name followed by one or more files, list files, or directories.

#TAG_NAME                   FILES
uclust_polypeptide_fasta    $;REPO_ROOT$;/output_repository/uclust/$;PIPELINE_ID$;_polypeptide/uclust.fsa.list
blastp_btab                 $;REPO_ROOT$;/output_repository/ncbi-blastp/$;PIPELINE_ID$;_default/ncbi-blastp.btab.list

The example above is the download map for the CloVR Total Metagenomics pipeline, which downloads output from the uclust and ncbi-blastp components. The first column contains the desired tag name, while the second column contains the path to the file lists produced by the corresponding components. The FILES column can contain two variables that will be replaced by the component: $;REPO_ROOT$; and $;PIPELINE_ID$;. Both values can be set in the component configuration and are meant to mimic their Ergatis counterparts.

[parameters]
;; This component tags data using the CloVR tagData.py script
;; Input is a hard-coded template file generated prior to the
;; execution of this component.
$;REPO_ROOT$; = $;REPOSITORY_ROOT$;
$;PIPELINE_ID$; = $;PIPELINEID$;

When building a pipeline template, the REPO_ROOT and PIPELINE_ID parameters can simply be set from the built-in Ergatis variables, as shown above. A single download map entry can also cover more than one file, file list, or directory by using a comma-delimited list, as seen below:

#TAG_NAME        FILES
test_file_tag    /path/to/file1,/path/to/file2,/path/to/file3
test_dir_tag     /path/to/dir1/,/path/to/dir2
test_mix_tag     /path/to/dir3,/path/to/file4,/path/to/file_list1
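
To make the map format concrete, here is a minimal Python sketch of how such a file could be read. It is not the actual tagData.py logic, which is not shown in this document; the function name parse_download_map and the sample values /mnt/projects/clovr and 1234 are hypothetical and chosen only for illustration. The sketch assumes one tab between the tag name and the FILES column and treats lines starting with # as comments.

```python
def parse_download_map(lines, repo_root, pipeline_id):
    """Return {tag_name: [path, ...]} from tab-delimited download map lines.

    Illustrative only: mirrors the format described above, not tagData.py.
    """
    tags = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip blank lines and the '#TAG_NAME' header / comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        fields = line.split("\t", 1)
        if len(fields) != 2:
            raise ValueError(f"expected TAG<TAB>FILES, got: {line!r}")
        tag, files = fields[0].strip(), fields[1]
        # Substitute the two variables the component is described as replacing.
        files = files.replace("$;REPO_ROOT$;", repo_root)
        files = files.replace("$;PIPELINE_ID$;", str(pipeline_id))
        # A FILES entry may be a comma-delimited list of paths.
        paths = [p.strip() for p in files.split(",") if p.strip()]
        tags.setdefault(tag, []).extend(paths)
    return tags


if __name__ == "__main__":
    example = [
        "#TAG_NAME\tFILES",
        "blastp_btab\t$;REPO_ROOT$;/output_repository/ncbi-blastp/"
        "$;PIPELINE_ID$;_default/ncbi-blastp.btab.list",
        "test_file_tag\t/path/to/file1,/path/to/file2,/path/to/file3",
    ]
    # Hypothetical repository root and pipeline id.
    for tag, paths in parse_download_map(example, "/mnt/projects/clovr", 1234).items():
        print(tag, paths)
```

Running a check like this before launching a pipeline can catch a map line that uses spaces instead of a tab, which would otherwise only surface when the component runs.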

Configuring your pipeline to use the tag_data component

Once a download map has been created, adding the tag_data component to an existing pipeline requires two changes to the template:

  • Add tag_data component to pipeline.layout file
  • Copy tag_data.config file to template folder

Updating pipeline.layout

The tag_data component should be added as the last step in your pipeline.layout file to ensure that any output to be tagged has already been generated by the time the component is executed:

        <commandSet type="serial">
            <state>incomplete</state>
            <name>split_multifasta.multi</name>
        </commandSet>
        <commandSet type="serial">
            <state>incomplete</state>
            <name>ncbi-blastp.default</name>
        </commandSet>
        <commandSet type="serial">
            <state>incomplete</state>
            <name>tag_data.default</name>
        </commandSet>
    </commandSet>
</commandSetRoot>
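
As a quick sanity check of the ordering requirement, the following Python sketch reads a layout and reports the last component name in document order. It is only an illustration under stated assumptions: the helper last_component_name is hypothetical, it expects a complete, well-formed layout rather than the fragment shown above, and "last in document order" matches "last executed" only when the enclosing command sets are serial.

```python
import xml.etree.ElementTree as ET


def last_component_name(layout_xml):
    """Return the text of the last <name> element in document order, or None."""
    root = ET.fromstring(layout_xml)
    names = [el.text.strip() for el in root.iter("name") if el.text and el.text.strip()]
    return names[-1] if names else None


if __name__ == "__main__":
    # Hypothetical minimal layout modelled on the excerpt above.
    layout = """<commandSetRoot>
      <commandSet type="serial">
        <state>incomplete</state>
        <name>start pipeline:</name>
        <commandSet type="serial">
          <state>incomplete</state>
          <name>ncbi-blastp.default</name>
        </commandSet>
        <commandSet type="serial">
          <state>incomplete</state>
          <name>tag_data.default</name>
        </commandSet>
      </commandSet>
    </commandSetRoot>"""
    last = last_component_name(layout)
    if last is None or not last.startswith("tag_data."):
        print("warning: tag_data is not the last component:", last)
    else:
        print("ok:", last)
```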

Copying tag_data configuration file to template folder

The tag_data configuration file should be copied from the ‘docs’ folder of the Ergatis install into your pipeline template. Once copied, it should be configured to make use of the REPO_ROOT and PIPELINE_ID variables if necessary. Setting them from the Ergatis built-in $;REPOSITORY_ROOT$; and $;PIPELINEID$; variables ensures that the component's configuration variables are always set up correctly. The $;INPUT_FILE$; parameter should also be configured to point at the download map file created in the previous steps of this walkthrough.

[parameters]
;; This component tags data using the CloVR tagData.py script
;; Input is a hard-coded template file generated prior to the
;; execution of this component.
$;REPO_ROOT$; = $;REPOSITORY_ROOT$;
$;PIPELINE_ID$; = $;PIPELINEID$;

[INPUT]
$;INPUT_FILE$; = /opt/clovr/clovr_pipelines/workflow/project_saved_templates/clovr_total_metagenomics/clovr_total_metagenomics.download.map

Configuring the download_tag component in the postrun step

Once data has been tagged, it must be downloaded using the downloadData.py vappio script. This can be accomplished within a pipeline by making use of the clovrdownload XML template. The template must be included in the postrun XML and configured properly:

<?xml version="1.0" encoding="UTF-8"?>
<commandSetRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                xsi:schemaLocation='commandSet.xsd'>
  <commandSet type="serial">
      <state>incomplete</state>
      <name>clovr_total_metagenomics.postrun</name>
      <INCLUDE file="$;DOCS_DIR$;/clovrdownload_tag_iterator_template.xml"/>
  </commandSet>
</commandSetRoot>

To support downloading only a subset of the data tagged in the tag_data step, a $;TAGS_TO_DOWNLOAD$; parameter must be added to the pipeline configuration file:

#########################################################
## Input information.
## Configuration options for the pipeline.
#########################################################
[input]
DB_PATH=ncbi-nr
SEQS_PER_FILE=5
## Should be set to the number of nodes available in cluster
TOTAL_FILES=1
NUM_SEQS=100
GROUP_COUNT=1

INPUT_FILES=/path/to/your/input

PIPELINE_NAME=clovr_total_metagenomics_run
INPUT_TAG=clovr_total_metagenomics_input

## Tags to download from pipeline
TAGS_TO_DOWNLOAD=uclust_polypeptide_fasta,blastp_btab
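
Because TAGS_TO_DOWNLOAD is a plain comma-separated string, a typo in a tag name is easy to miss. The Python sketch below shows one way to compare the requested tags against the tag names defined in the download map. The helper names requested_tags and unknown_tags are hypothetical, and parsing the pipeline configuration with Python's configparser is an assumption that holds for the excerpt above but is not something the pipeline itself is documented to do.

```python
import configparser


def requested_tags(config_text):
    """Return the list of tag names in [input] TAGS_TO_DOWNLOAD (may be empty)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("input", "TAGS_TO_DOWNLOAD", fallback="")
    return [t.strip() for t in raw.split(",") if t.strip()]


def unknown_tags(requested, map_tags):
    """Return requested tags that are not defined in the download map."""
    return [t for t in requested if t not in map_tags]


if __name__ == "__main__":
    config_text = (
        "[input]\n"
        "## Tags to download from pipeline\n"
        "TAGS_TO_DOWNLOAD=uclust_polypeptide_fasta,blastp_btab\n"
    )
    # Tag names taken from the download map shown earlier in this document.
    map_tags = {"uclust_polypeptide_fasta", "blastp_btab"}
    missing = unknown_tags(requested_tags(config_text), map_tags)
    print("unknown tags:", missing)
```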

While all tags are downloaded in this example, it is perfectly valid to download only a subset of the tags defined in the download map. A custom output directory can also be specified by configuring the OUTPUT_DIRECTORY parameter:

#########################################################
## Output info.
## Specifies where locally the data will end up and also
## logging information
#########################################################
[output]
OUTPUT_DIRECTORY=/mnt/output
log_file=/mnt/clovr_total_metagenomics.log
## the higher, the more output (3 = most verbose)
debug_level=3

Finally, the clovrdownload iterator XML file must be referenced in the pipeline config file as well:

#prestart,prerun,postrun are all run locally. Use noop.xml for no operation
#Prestart is run before cluster start
#Possible actions: tag input data and do QC metrics
PRESTART_TEMPLATE_XML=/opt/clovr_pipelines/workflow/project_saved_templates/clovr_total_metagenomics/clovr_total_metagenomics.prestart.xml
#Prerun is run after cluster start but before pipeline start
#Possible actions: tag and upload data sets to the cluster
PRERUN_TEMPLATE_XML=/opt/clovr_pipelines/workflow/project_saved_templates/clovr_total_metagenomics/clovr_total_metagenomics.prerun.xml
#Postrun is run after pipeline completion and after download data
#Possible actions: load a local database, web browser, reorganize data for local ergatis
POSTRUN_TEMPLATE_XML=/opt/clovr_pipelines/workflow/project_saved_templates/clovr_total_metagenomics/clovr_total_metagenomics.postrun.xml
DOWNLOAD_TAG_ITERATOR_XML=/opt/ergatis/docs/clovrdownload_tag.iterator.xml

With these changes in place, a targeted download of specific data sets is possible in any CloVR pipeline.