Data Transformations
Choice depends on data set!
- Center and standardize
  - Center: subtract the mean of the corresponding vector from each value
  - Standardize: divide by the standard deviation
  - Result: Mean = 0 and STDEV = 1
- Center and scale with the scale() function
  - Center: subtract the mean of the corresponding vector from each value
  - Scale: divide the centered vector by its root mean square (rms): $x_{rms} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} x_i^2}$
  - Result: Mean = 0 and STDEV = 1 (for a centered vector, this rms equals the standard deviation)
- Log transformation
- Rank transformation: replace measured values by ranks
- No transformation
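The transformations above can be sketched in a few lines of R. This is a minimal illustration on a made-up matrix `y` (rows as items, columns as samples); the object names and values are assumptions for demonstration only. Note that `scale()` operates on columns, so the matrix is transposed to transform each row.

```r
# Toy data: 4 items (rows) measured in 5 samples (columns); values are made up
y <- matrix(c( 2,  4,  8, 16, 32,
               5,  3,  9,  1,  7,
              10, 20, 15, 30, 25,
               1,  2,  3,  5,  8),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("g", 1:4), paste0("s", 1:5)))

# Center and standardize each row by hand
y_manual <- t(apply(y, 1, function(x) (x - mean(x)) / sd(x)))

# Same operation with scale(), which works column-wise (hence the transposes)
y_scaled <- t(scale(t(y)))

# Both approaches should agree; row means are ~0 and row SDs are 1
all.equal(y_manual, y_scaled, check.attributes = FALSE)
rowMeans(y_scaled)
apply(y_scaled, 1, sd)

# Log transformation (only defined for positive values)
y_log <- log2(y)

# Rank transformation: replace the values in each row by their ranks
y_rank <- t(apply(y, 1, rank))
```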
Distance Methods
List of most common ones!
- Euclidean distance for two profiles X and Y: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
  - Disadvantages: not scale invariant, not suitable for detecting negative correlations
- Maximum, Manhattan, Canberra, binary, Minkowski, …
- Correlation-based distance: 1-r
  - Pearson correlation coefficient (PCC): $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$
    - Disadvantage: outlier sensitive
  - Spearman correlation coefficient (SCC)
    - Same calculation as PCC but with ranked values!
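As a rough illustration of these distance measures, the following R sketch uses `dist()` and `cor()` on three made-up profiles. The vectors `x`, `z` and `w` are assumptions chosen to show the effect of an outlier and of a negative correlation; they do not come from real data.

```r
# Three toy profiles (made-up values)
x <- c(1, 2, 3, 4, 5)
z <- c(2, 4, 6, 8, 30)  # rises with x, but the last value is an outlier
w <- c(5, 4, 3, 2, 1)   # perfectly anti-correlated with x
y <- rbind(x = x, z = z, w = w)

# Euclidean distance between all pairs of rows
d_euc <- dist(y, method = "euclidean")
sqrt(sum((x - w)^2))    # the x-w entry computed by hand

# Correlation-based distance 1 - r; cor() compares columns, so transpose
d_pcc <- as.dist(1 - cor(t(y), method = "pearson"))
d_scc <- as.dist(1 - cor(t(y), method = "spearman"))

d_euc
d_pcc  # x-z distance is inflated by the outlier
d_scc  # x-z distance is 0 because the ranks are identical
```

With 1-r, anti-correlated profiles such as `x` and `w` end up at the maximum distance of 2.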
There are many more distance measures
- If the distances among items are quantifiable, then clustering is possible.
- Choose the most accurate and meaningful distance measure for a given field of application.
- If uncertain, try several distance measures and compare the results.
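One possible way to compare several distance measures is to correlate the resulting distance vectors with each other. The sketch below is only an assumption about how such a comparison could look; the random matrix `y`, the seed and the choice of four measures are arbitrary.

```r
# Random toy data: 10 items (rows), 5 samples (columns); seed is arbitrary
set.seed(42)
y <- matrix(rnorm(50), nrow = 10,
            dimnames = list(paste0("g", 1:10), paste0("s", 1:5)))

# Compute the same pairwise distances with several measures
d_list <- list(
  euclidean = dist(y, method = "euclidean"),
  manhattan = dist(y, method = "manhattan"),
  pearson   = as.dist(1 - cor(t(y), method = "pearson")),
  spearman  = as.dist(1 - cor(t(y), method = "spearman"))
)

# Rank correlation between every pair of distance vectors:
# values near 1 mean two measures order the item pairs similarly
agreement <- sapply(d_list, function(a)
  sapply(d_list, function(b)
    cor(as.vector(a), as.vector(b), method = "spearman")))

round(agreement, 2)
```

Measures that agree poorly on the same data are likely to produce different clusterings, so the downstream results are worth inspecting for each of them.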