Skiff

Skiff is a visualization tool for unsupervised clustering of samples based on a set of features with associated counts. The input data depend on the CloVR track used; a few examples include:

  • Reads with COG-based annotations from multiple metagenomic samples
  • 16S rRNA sequences from a multiplexed 454 run and phylum-level annotations from the RDP Bayesian classifier
  • Reads from several metagenomic samples with species-level annotations based on BLASTN searches against the NCBI RefSeq database

Skiff takes as input a tab-delimited matrix (samples as columns, features as rows) and clusters the data to reveal any natural groupings. Specifically, Skiff normalizes the data within each sample and computes Euclidean distances between row vectors (features) and between column vectors (samples). It then uses furthest-neighbor (complete-linkage) clustering to construct dendrograms of all features and samples. A corresponding heatmap displays enriched or depleted features within each sample. Skiff generates these plots with the R packages gplots and RColorBrewer.

The result is a dense summary that can be quickly assessed by eye. Each sample and feature is labeled in the heatmap for inspection:

[Figure: Skiff heatmap with labeled samples and features]
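The distance and clustering steps described above can be sketched in a few lines. This is a minimal illustration in Python, not Skiff's own code (Skiff is built on R packages); the function names, labels, and the small matrix of proportions are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(vectors, labels):
    """Agglomerative furthest-neighbor (complete-linkage) clustering.

    Returns the merges in the order performed, each as
    (labels_in_first_cluster, labels_in_second_cluster, distance).
    """
    # Each cluster is a tuple holding the indices of its member vectors.
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Furthest-neighbor rule: the distance between two clusters
            # is the largest pairwise distance between their members.
            d = max(euclidean(vectors[i], vectors[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append(([labels[i] for i in a], [labels[i] for i in b], d))
    return merges


# Hypothetical within-sample proportions: features as rows, samples as columns.
features = ["featA", "featB", "featC"]
samples = ["s1", "s2", "s3"]
matrix = [
    [0.50, 0.45, 0.10],
    [0.30, 0.35, 0.10],
    [0.20, 0.20, 0.80],
]

# Clustering samples compares column vectors; clustering features compares rows.
columns = [list(col) for col in zip(*matrix)]
sample_merges = complete_linkage(columns, samples)
feature_merges = complete_linkage(matrix, features)

for left, right, dist in sample_merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The merge order and merge distances are what a dendrogram draws: each merge becomes a branch point whose height is the distance at which the two clusters were joined. This naive version recomputes all pairwise distances at every step, which is fine for a small illustration but would be slow on large matrices.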

Normalization

Skiff normalizes an input matrix in two different ways. First, all values are transformed to proportions within each sample. Then, to provide an alternative weighting scheme, the logarithm of each proportion is taken. Clustering is performed on both transformed datasets, and a PDF file is output for each (*.proportions.pdf and *.lognormalized.pdf). As shown below, the choice of normalization can change the clustering dramatically:

[Figure: Skiff clusterings of proportions versus logarithms of proportions]
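The two transformations can be illustrated with a short Python sketch. It is not Skiff's implementation: the function names and counts are hypothetical, and two details that the description above does not specify are filled in as assumptions, namely the natural logarithm as the log base and a small pseudocount to avoid taking the logarithm of zero.

```python
import math


def to_proportions(matrix):
    """Divide each count by the total of its sample (column).

    Assumes every sample has a non-zero total.
    """
    totals = [sum(col) for col in zip(*matrix)]
    return [[value / total for value, total in zip(row, totals)] for row in matrix]


def to_log_proportions(proportions, pseudocount=1e-6):
    """Natural log of each proportion.

    The pseudocount and the log base are assumptions made for this sketch;
    they are not taken from Skiff.
    """
    return [[math.log(p + pseudocount) for p in row] for row in proportions]


# Hypothetical counts: three features (rows) in two samples (columns).
counts = [
    [80, 10],
    [15, 10],
    [5, 80],
]

props = to_proportions(counts)
log_props = to_log_proportions(props)

for row in props:
    print([round(p, 3) for p in row])
for row in log_props:
    print([round(x, 3) for x in row])
```

On the proportion scale, Euclidean distances are dominated by the most abundant features. On the log scale, equal fold changes contribute equally, so differences among low-abundance features carry more weight, which is one reason the two clusterings can look so different.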

High-density clusterings

Finally, Skiff can rapidly produce large clusterings with hundreds or even thousands of features. Although labels are often illegible in the image because of the high density of cells in the heatmap, these clusterings can still reveal broad differences between populations, which can then be analyzed in more detail using the corresponding output text tables.