Data Transformations

Choice depends on data set!

  • Center and standardize
    1. Center: subtract from each value the mean of the corresponding vector
    2. Standardize: divide by the standard deviation
      • Result: Mean = 0 and STDEV = 1
  • Center and scale with the scale() function
    1. Center: subtract from each value the mean of the corresponding vector
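    2. Scale: divide the centered vector by its root mean square (rms): $\mathrm{rms} = \sqrt{\sum_{i=1}^{n} x_i^2 / (n - 1)}$, where the $x_i$ are the centered values (for a centered vector this equals the standard deviation)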
      • Result: Mean = 0 and STDEV = 1
  • Log transformation
  • Rank transformation: replace measured values by ranks
  • No transformation
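
The transformations listed above can be sketched in R, the language of the scale() function mentioned in the list. This is only an illustration: the vector x and its values are made up, and the log base (2) is an arbitrary choice.

```r
# Hypothetical measurement vector (illustrative values only)
x <- c(2, 4, 4, 8, 16)

# Center and standardize by hand
centered     <- x - mean(x)
standardized <- centered / sd(x)

# Center and scale with scale(); for a centered vector the root mean
# square (with n - 1 in the denominator) equals the standard deviation
scaled <- as.numeric(scale(x))

# Both routes should agree, giving mean 0 and standard deviation 1
all.equal(standardized, scaled)
c(mean = mean(scaled), sd = sd(scaled))

# Log transformation (base 2 chosen here; requires positive values)
logged <- log2(x)

# Rank transformation: replace measured values by their ranks
# (ties receive the average rank by default)
ranked <- rank(x)
```

Note that scale() applied to a matrix works column by column, so each column is centered and scaled separately.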

Distance Methods

List of most common ones!

  • Euclidean distance for two profiles X and Y: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
    • Disadvantages: not scale invariant; does not recognize negatively correlated profiles as related
  • Maximum, Manhattan, Canberra, binary, Minkowski, …
  • Correlation-based distance: 1-r
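    • Pearson correlation coefficient (PCC): $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$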
      • Disadvantage: outlier sensitive
    • Spearman correlation coefficient (SCC)
      • Same calculation as PCC but with ranked values!
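
A minimal sketch of these distance measures in base R, using dist() and cor(). The two profiles X and Y are hypothetical and were chosen so that they are perfectly negatively correlated.

```r
# Two hypothetical profiles (illustrative values only)
X <- c(1, 2, 3, 4, 5)
Y <- c(10, 8, 6, 4, 2)   # Y = 12 - 2 * X, so r = -1

# Euclidean distance, by hand and with dist()
euclid_manual <- sqrt(sum((X - Y)^2))
euclid_dist   <- as.numeric(dist(rbind(X, Y), method = "euclidean"))

# Two of the other methods offered by dist()
manhattan <- as.numeric(dist(rbind(X, Y), method = "manhattan"))
maximum   <- as.numeric(dist(rbind(X, Y), method = "maximum"))

# Correlation-based distance: 1 - r
d_pearson  <- 1 - cor(X, Y, method = "pearson")
d_spearman <- 1 - cor(X, Y, method = "spearman")

# Spearman is the Pearson calculation applied to ranked values
spearman_check <- all.equal(cor(X, Y, method = "spearman"),
                            cor(rank(X), rank(Y)))
```

With 1 - r, perfectly negatively correlated profiles get the largest possible distance (2). If negatively correlated profiles should count as similar, 1 - |r| is one possible alternative; that variant is an assumption here and not part of the list above.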

There are many more distance measures

  • If the distances among items are quantifiable, then clustering is possible.
  • Choose the most accurate and meaningful distance measure for a given field of application.
  • If uncertain, choose several distance measures and compare the results.
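
One way to compare several distance measures, sketched in base R. The matrix m, its values, and the helper closest_pair() are hypothetical and exist only for this illustration; the sketch simply reports which pair of items each measure considers closest.

```r
# Hypothetical data: 4 items (rows) measured under 5 conditions (columns)
m <- rbind(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),   # same shape as a, different scale
  c = c(5, 4, 3, 2, 1),
  d = c(1, 3, 2, 5, 4)
)

# Distance matrices from several measures
d_euclid  <- dist(m, method = "euclidean")
d_manhat  <- dist(m, method = "manhattan")
d_pearson <- as.dist(1 - cor(t(m), method = "pearson"))

# Hypothetical helper: names of the closest pair of items in a dist object
closest_pair <- function(d) {
  dm <- as.matrix(d)
  diag(dm) <- Inf                                   # ignore self-distances
  idx <- which(dm == min(dm), arr.ind = TRUE)[1, ]  # first minimal entry
  sort(rownames(dm)[idx])
}

pair_euclid  <- closest_pair(d_euclid)
pair_manhat  <- closest_pair(d_manhat)
pair_pearson <- closest_pair(d_pearson)
```

In this made-up example the Euclidean and Manhattan distances pair a with d, while the correlation-based distance pairs a with b, because b is a scaled copy of a. Disagreements of this kind are what comparing measures is meant to surface.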

Cluster Linkage
