Ragged RR

Greg Hunt

2020-05-25

RR overview

The robust re-scaling transformation (RR) is a transformation that helps reveal latent structure in data. It transforms the data in three steps:

  1. Gaussianize the data with a consensus Box-Cox-like transformation
  2. Z-score transform the data using robust estimates of the mean and s.d.
  3. Remove extreme outliers from the data by setting them to ‘NA’
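The three steps above can be sketched in base R. This is an illustration only, not the rrscale implementation: the helper names (box_cox, robust_z, drop_outliers), the fixed lambda of 0.25, the use of the median and MAD as the robust estimates, and the cutoff of 4 are all assumptions chosen for the example.

```r
# Illustrative sketch only: NOT the rrscale implementation.
# lambda and cutoff are hypothetical, hand-picked values.
box_cox = function(x, lambda) {
  # classic Box-Cox power transform for positive x
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
robust_z = function(x) {
  # centre by the median and scale by the MAD instead of the mean and s.d.
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
}
drop_outliers = function(z, cutoff = 4) {
  # set values whose absolute z-score exceeds the cutoff to NA
  z[abs(z) > cutoff] = NA
  z
}

set.seed(1)
x = c(rexp(50), 1e6)             # skewed data plus one wild value
g = box_cox(x, lambda = 0.25)    # step 1: Gaussianize
z = robust_z(g)                  # step 2: robust z-score
o = drop_outliers(z)             # step 3: remove extreme outliers
which(is.na(o))                  # positions that were set to NA
```

Because the centre and scale in step 2 are robust, the single wild value does not inflate the scale estimate, so it still stands out clearly when step 3 is applied.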

The sequence of these transformations helps focus classic statistical analyses on consequential variance in the data rather than having the analyses be dominated by variation resulting from measurement scale or outliers.

If you have not already read the basic vignette “Rescaling Data”, we recommend reading it first.

Typically, the input to RR is a matrix or data.frame, and the output is a matrix or data.frame of the same size but with re-scaled values. However, in this vignette we will explore how RR may also be used for ragged matrices, data frames, or lists.

RR For Ragged Data

First let’s create some ragged data. We will generate data that cannot be put into a matrix because the observations differ in length:

set.seed(12345)
N = rpois(10,20)
data = lapply(N,rexp)
str(data)
## List of 10
##  $ : num [1:22] 0.3971 0.0881 1.7222 5.3849 1.888 ...
##  $ : num [1:23] 0.152 0.104 0.35 1.185 1.036 ...
##  $ : num [1:19] 1.284 0.871 0.235 0.361 0.275 ...
##  $ : num [1:17] 0.6 0.129 0.549 2.843 0.344 ...
##  $ : num [1:30] 0.591 1.101 2.055 0.57 0.366 ...
##  $ : num [1:22] 0.112 0.303 0.877 0.391 1.204 ...
##  $ : num [1:18] 0.3513 0.2334 0.5813 2.102 0.0117 ...
##  $ : num [1:15] 2.646 3.692 0.797 1.338 2.147 ...
##  $ : num [1:17] 0.417 0.307 0.155 0.357 2.066 ...
##  $ : num [1:21] 1.656 0.473 0.35 2.121 0.459 ...
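As a quick check of why this list does not fit a matrix, the sketch below inspects the element lengths and shows the NA padding that a rectangular layout would require. The names len, is_ragged and padded are introduced here for illustration only.

```r
# Recreate the ragged list from above
set.seed(12345)
N = rpois(10, 20)
data = lapply(N, rexp)

# lengths() reports the size of each list element
len = lengths(data)
is_ragged = length(unique(len)) > 1

# Forcing a matrix layout means padding the shorter elements with NA
padded = t(sapply(data, function(v) c(v, rep(NA, max(len) - length(v)))))
dim(padded)          # one row per observation, max(len) columns
sum(is.na(padded))   # number of padding cells that hold no data
```

Padding like this mixes real missing values with layout filler, which is one reason to keep the data as a list.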

We can still pass this to RR and have it transformed:

library('rrscale')
rr.out = rrscale(data)

Notice that the output of rrscale takes the same form as the input data. In this case it is a list of 10 sets of numbers:

str(rr.out$RR)
## List of 10
##  $ : num [1:22] -0.388 -1.383 0.996 2.46 1.099 ...
##  $ : num [1:23] -1.064 -1.293 -0.485 0.596 0.461 ...
##  $ : num [1:19] 0.679 0.293 -0.777 -0.461 -0.665 ...
##  $ : num [1:17] -0.0454 -1.1622 -0.1232 1.5906 -0.499 ...
##  $ : num [1:30] -0.0587 0.5217 1.197 -0.091 -0.4509 ...
##  $ : num [1:22] -1.247 -0.595 0.299 -0.4 0.612 ...
##  $ : num [1:18] -0.4827 -0.7804 -0.0734 1.2237 -2.2642 ...
##  $ : num [1:15] 1.501 1.93 0.209 0.722 1.248 ...
##  $ : num [1:17] -0.35 -0.584 -1.052 -0.471 1.204 ...
##  $ : num [1:21] 0.952 -0.246 -0.486 1.234 -0.271 ...
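One way to see how a transformation can return output in the same ragged form is the flatten, transform, relist pattern sketched below. This is not necessarily how rrscale works internally: robust_z is a hypothetical stand-in transformation, and the data are regenerated so that the example runs without the package.

```r
# Stand-in ragged data, generated as in this vignette
set.seed(12345)
data = lapply(rpois(10, 20), rexp)

# Hypothetical stand-in transformation (NOT rrscale): a pooled robust z-score
robust_z = function(x) (x - median(x)) / mad(x)

flat = unlist(data)                             # pool all values into one vector
out = relist(robust_z(flat), skeleton = data)   # restore the ragged layout

# The result mirrors the input: same number of elements, same lengths
identical(lengths(out), lengths(data))
```

The same check, identical(lengths(rr.out$RR), lengths(data)), can be run on the rrscale output above to confirm that the shape is preserved.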

We can compare the untransformed and the transformed data:

library('ggplot2')
library('reshape2')
df = data.frame(untrans=unlist(data),rr=unlist(rr.out$RR))
df = melt(df,measure.vars = 1:2)
ggplot(data=df,mapping=aes(x=value,fill=variable))+geom_histogram()+facet_wrap(~variable)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also still use the returned transformation function to transform data, as before. For example, if we only want to apply the “G”-step, we can call:

g_only = rr.out$rr(data,G=TRUE,Z=FALSE,O=FALSE)
str(g_only)
## List of 10
##  $ : num [1:22] -0.829 -1.843 0.58 2.072 0.686 ...
##  $ : num [1:23] -1.5182 -1.7509 -0.9277 0.1731 0.0359 ...
##  $ : num [1:19] 0.258 -0.136 -1.226 -0.904 -1.111 ...
##  $ : num [1:17] -0.48 -1.618 -0.559 1.187 -0.942 ...
##  $ : num [1:30] -0.4938 0.0976 0.7856 -0.5267 -0.8934 ...
##  $ : num [1:22] -1.704 -1.04 -0.129 -0.841 0.19 ...
##  $ : num [1:18] -0.926 -1.229 -0.509 0.813 -2.741 ...
##  $ : num [1:15] 1.095 1.533 -0.221 0.301 0.838 ...
##  $ : num [1:17] -0.79 -1.029 -1.505 -0.914 0.792 ...
##  $ : num [1:21] 0.536 -0.685 -0.929 0.823 -0.71 ...