This vignette will go over the steps required to implement a custom
user-defined function within the bestNormalize
framework.
There are 3 steps.
Create transformation function
Create predict method for transformation function (that can be applied to new data)
Pass through new function and predict method to bestNormalize
Here, we start by defining a new function that we’ll call
cuberoot_x
, which will take an argument a
(as
does the sqrt_x
function) which will try to add a constant
if it sees any negative numbers in x
. It will also take the
argument standardize
which will center and scale the
transformed data so that it’s centered at 0 with SD = 1.
## Define user-function
cuberoot_x <- function(x, a = NULL, standardize = TRUE, ...) {
stopifnot(is.numeric(x))
min_a <- max(0, -(min(x, na.rm = TRUE)))
if(!length(a))
a <- min_a
if(a < min_a) {
warning("Setting a < max(0, -(min(x))) can lead to transformation issues",
"Standardize set to FALSE")
standardize <- FALSE
}
x.t <- (x + a)^(1/3)
mu <- mean(x.t, na.rm = TRUE)
sigma <- sd(x.t, na.rm = TRUE)
if (standardize) x.t <- (x.t - mu) / sigma
# Get in-sample normality statistic results
ptest <- nortest::pearson.test(x.t)
val <- list(
x.t = x.t,
x = x,
mean = mu,
sd = sigma,
a = a,
n = length(x.t) - sum(is.na(x)),
norm_stat = unname(ptest$statistic / ptest$df),
standardize = standardize
)
# Assign class
class(val) <- c('cuberoot_x', class(val))
val
}
Note that we assigned a class to the object this returns of the same
name; this is necessary for successful implementation within
bestNormalize
. We’ll also need an associated
predict
method that is used to apply the transformation to
newly observed data. `
predict.cuberoot_x <- function(object, newdata = NULL, inverse = FALSE, ...) {
# If no data supplied and not inverse
if (is.null(newdata) & !inverse)
newdata <- object$x
# If no data supplied and inverse
if (is.null(newdata) & inverse)
newdata <- object$x.t
# Actually performing transformations
# Perform inverse transformation as estimated
if (inverse) {
# Reverse-standardize
if (object$standardize)
newdata <- newdata * object$sd + object$mean
# Reverse-cube-root (cube)
newdata <- newdata^3 - object$a
# Otherwise, perform transformation as estimated
} else if (!inverse) {
# Take cube root
newdata <- (newdata + object$a)^(1/3)
# Standardize to mean 0, sd 1
if (object$standardize)
newdata <- (newdata - object$mean) / object$sd
}
# Return transformed data
unname(newdata)
}
This will be printed when bestNormalize selects your custom method or when you print an object returned by your new custom function.
print.cuberoot_x <- function(x, ...) {
cat(ifelse(x$standardize, "Standardized", "Non-Standardized"),
'cuberoot(x + a) Transformation with', x$n, 'nonmissing obs.:\n',
'Relevant statistics:\n',
'- a =', x$a, '\n',
'- mean (before standardization) =', x$mean, '\n',
'- sd (before standardization) =', x$sd, '\n')
}
Note: if you can find a similar transformation in the source code,
it’s easy to model your code after it. For instance, for
cuberoot_x
and predict.cuberoot_x
, I used
sqrt_x.R
as a template file.
# Store custom functions into list
custom_transform <- list(
cuberoot_x = cuberoot_x,
predict.cuberoot_x = predict.cuberoot_x,
print.cuberoot_x = print.cuberoot_x
)
set.seed(123129)
x <- rgamma(100, 1, 1)
(b <- bestNormalize(x = x, new_transforms = custom_transform, standardize = FALSE))
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 1.2347
## - Box-Cox: 1.0267
## - cuberoot_x: 0.9787
## - Double Reversed Log_b(x+a): 2.4507
## - Exp(x): 4.7947
## - Log_b(x+a): 1.3547
## - No transform: 2.0027
## - orderNorm (ORQ): 1.1627
## - sqrt(x + a): 1.0907
## - Yeo-Johnson: 1.0987
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Non-Standardized cuberoot(x + a) Transformation with 100 nonmissing obs.:
## Relevant statistics:
## - a = 0
## - mean (before standardization) = 0.9588261
## - sd (before standardization) = 0.3298665
Evidently, the cube-rooting was the best normalizing transformation!
The bestNormalize package can estimate any univariate statistic using
its CV framework. A user-defined function can be passed in through the
norm_stat_fn
argument, and this function will then be
applied in lieu of the Pearson test statistic divided by its degree of
freedom.
The user-defined function must take an argument x
, which
indicates the data on which a user wants to evaluate the statistic.
Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test statistic:
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (using custom normalization statistic)
## - arcsinh(x): 0.1958
## - Box-Cox: 0.1785
## - Center+scale: 0.2219
## - Double Reversed Log_b(x+a): 0.2624
## - Exp(x): 0.3299
## - Log_b(x+a): 0.1959
## - orderNorm (ORQ): 0.186
## - sqrt(x + a): 0.1829
## - Yeo-Johnson: 0.1872
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
## Estimated statistics:
## - lambda = 0.3281193
## - mean (before standardization) = -0.1263882
## - sd (before standardization) = 0.9913552
Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test’s p-value:
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (using custom normalization statistic)
## - arcsinh(x): 0.4327
## - Box-Cox: 0.4831
## - Center+scale: 0.2958
## - Double Reversed Log_b(x+a): 0.2028
## - Exp(x): 0.0675
## - Log_b(x+a): 0.3589
## - orderNorm (ORQ): 0.4492
## - sqrt(x + a): 0.4899
## - Yeo-Johnson: 0.4531
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized exp(x) Transformation with 100 nonmissing obs.:
## Relevant statistics:
## - mean (before standardization) = 6.885396
## - sd (before standardization) = 13.66084
Note: bestNormalize
will attempt to minimize this
statistic by default, which is definitely not what you want to do when
calculating the p-value. This is seen in the example above, as the WORST
normalization transformation is chosen.
In this case, a user is advised to either manually select the best one:
best_transform <- names(which.max(dont_do_this$norm_stats))
(do_this <- dont_do_this$other_transforms[[best_transform]])
## Standardized sqrt(x + a) Transformation with 100 nonmissing obs.:
## Relevant statistics:
## - a = 0
## - mean (before standardization) = 0.9811849
## - sd (before standardization) = 0.4779252
Or, the user can reverse their defined statistic (in this case by subtracting it from 1):
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (using custom normalization statistic)
## - arcsinh(x): 0.5166
## - Box-Cox: 0.4191
## - Center+scale: 0.6521
## - Double Reversed Log_b(x+a): 0.8005
## - Exp(x): 0.9601
## - Log_b(x+a): 0.5338
## - orderNorm (ORQ): 0.4646
## - sqrt(x + a): 0.4475
## - Yeo-Johnson: 0.4773
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
## Estimated statistics:
## - lambda = 0.3281193
## - mean (before standardization) = -0.1263882
## - sd (before standardization) = 0.9913552