Monitoring Pipelines with Ergatis & Ganglia

Ergatis

Ergatis is a web-based workflow management system designed to run and monitor pipelines, defined by XML-formatted templates, on a compute grid. Once a pipeline is running on a cluster, its status can be monitored by navigating to the associated Ergatis link.

From here, clicking the “clovr” link on the left-hand side (as seen above) brings you to a list of running pipelines:

Clicking the number in the id column, or the view button (not pictured) at the far right, takes you to a screen with more detail about your pipeline. This screen breaks the pipeline down into its steps (components) and lets you view each component’s progress, what it is writing to standard out and standard error, and its configuration.

Clicking the “view” button of any component takes you to a detailed screen showing the progress of each individual step within that component:

Clicking any of the “show info” links displays the command that was executed and gives access to any content written to standard out or standard error. The Ergatis interface shows precisely which component of the analysis pipeline is currently active and provides detailed information on each analysis step, which can be used to tweak parameters in a subsequent run. If any step in the pipeline fails, you can retrieve its error messages and the exact command that was executed from this screen.
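The idea of scanning components for a failed step can be sketched in a few lines of Python. This is only an illustration: the XML layout, the element and attribute names, and the component names below are hypothetical stand-ins, not the actual Ergatis pipeline file format, and `summarize_components` is a made-up helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified status document (NOT the real Ergatis XML layout).
SAMPLE_PIPELINE_XML = """
<pipeline>
  <component name="split_fasta" state="complete"/>
  <component name="blastn" state="failed"/>
  <component name="summarize" state="incomplete"/>
</pipeline>
"""

def summarize_components(xml_text):
    """Return (count of components per state, names of failed components)."""
    root = ET.fromstring(xml_text.strip())
    states = [(c.get("name"), c.get("state")) for c in root.iter("component")]
    counts = Counter(state for _, state in states)
    failed = [name for name, state in states if state == "failed"]
    return counts, failed

counts, failed = summarize_components(SAMPLE_PIPELINE_XML)
print(dict(counts))
print("failed components:", failed)
```

In practice the failed component is the one whose “show info” links you would open first to read its standard error output.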

Ganglia

Ganglia is another web-based tool, designed to monitor a cluster’s size and use over time. It is useful for finding out how much of the cluster your pipeline is actually using. A typical window in Ganglia looks like this:

On the left side of the window we see the number of hosts up and running in the cluster, along with the total number of CPUs. Ganglia also shows any hosts that have failed and the average load across the cluster.
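The same summary figures can be derived from the XML that Ganglia's monitoring daemon publishes. The sketch below parses an embedded sample rather than a live cluster; the sample values are invented, the metric names (`cpu_num`, `load_one`) are assumed to match Ganglia's defaults, and the rule used to mark a host as down (`TN` greater than four times `TMAX`) is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general shape of Ganglia's XML output (values are illustrative).
SAMPLE_GANGLIA_XML = """
<GANGLIA_XML>
  <CLUSTER NAME="example">
    <HOST NAME="node1" TN="5" TMAX="20">
      <METRIC NAME="cpu_num" VAL="4"/>
      <METRIC NAME="load_one" VAL="3.0"/>
    </HOST>
    <HOST NAME="node2" TN="10" TMAX="20">
      <METRIC NAME="cpu_num" VAL="8"/>
      <METRIC NAME="load_one" VAL="5.0"/>
    </HOST>
    <HOST NAME="node3" TN="500" TMAX="20">
      <METRIC NAME="cpu_num" VAL="4"/>
      <METRIC NAME="load_one" VAL="0.0"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""

def summarize_cluster(xml_text):
    """Count hosts up/down, total CPUs on live hosts, and load per CPU."""
    root = ET.fromstring(xml_text.strip())
    hosts_up = hosts_down = cpus = 0
    load = 0.0
    for host in root.iter("HOST"):
        # Assumption: a host is down if it has been silent for more than 4 * TMAX seconds.
        if int(host.get("TN")) > 4 * int(host.get("TMAX")):
            hosts_down += 1
            continue
        hosts_up += 1
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        cpus += int(metrics["cpu_num"])
        load += float(metrics["load_one"])
    return {
        "hosts_up": hosts_up,
        "hosts_down": hosts_down,
        "cpus": cpus,
        "load_per_cpu": load / cpus if cpus else 0.0,
    }

summary = summarize_cluster(SAMPLE_GANGLIA_XML)
print(summary)
```

A load per CPU well below 1.0 suggests the pipeline is leaving part of the cluster idle.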

Four plots in Ganglia describe different aspects of the cluster:

1. The upper left plot shows the cluster size in CPUs (red), the number of hosts (green), and the number of processes running on the cluster (blue) over time.

2. The upper right plot shows the percentage of the cluster’s CPUs that are in use.

3. The lower left plot shows how much of the cluster’s memory is in use at any time.

4. The final plot, in the lower right, displays transfer rates in and out of the cluster. For example, when a dataset is being uploaded to the cluster there will be a spike in the green signal. When a set of results is being downloaded to the local master node, the purple signal will increase until the transfer is complete.
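The quantities behind these plots are simple aggregates over the hosts. As a rough sketch, the snippet below computes cluster-wide CPU utilization, memory use, and network rates from per-host readings. The host names and numbers are invented, the key names are assumed to follow Ganglia's usual metric names, and treating "used" memory as total minus free (ignoring buffers and cache) is a simplification.

```python
# Invented per-host readings; key names assumed to follow Ganglia metric names.
hosts = {
    "node1": {"cpu_num": 4, "cpu_idle": 25.0, "mem_total": 8000000,
              "mem_free": 2000000, "bytes_in": 1500.0, "bytes_out": 500.0},
    "node2": {"cpu_num": 4, "cpu_idle": 75.0, "mem_total": 8000000,
              "mem_free": 6000000, "bytes_in": 500.0, "bytes_out": 1500.0},
}

def cluster_summary(hosts):
    """Aggregate per-host readings into cluster-wide figures."""
    total_cpus = sum(h["cpu_num"] for h in hosts.values())
    # Weight each host's busy percentage by its CPU count.
    busy = sum(h["cpu_num"] * (100.0 - h["cpu_idle"]) for h in hosts.values())
    mem_total = sum(h["mem_total"] for h in hosts.values())
    mem_free = sum(h["mem_free"] for h in hosts.values())
    return {
        "cpu_percent": busy / total_cpus,
        # Simplification: "used" = total - free, ignoring buffers and cache.
        "mem_used_percent": 100.0 * (mem_total - mem_free) / mem_total,
        "bytes_in": sum(h["bytes_in"] for h in hosts.values()),
        "bytes_out": sum(h["bytes_out"] for h in hosts.values()),
    }

totals = cluster_summary(hosts)
print(totals)
```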