--- title: "Customization within bestNormalize" author: "Ryan A Peterson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Customization within bestNormalize} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.height = 5, fig.width = 7) library(bestNormalize) ``` # Custom functions with bestNormalize This vignette will go over the steps required to implement a custom user-defined function within the `bestNormalize` framework. There are 3 steps. 1) Create transformation function 2) Create predict method for transformation function (that can be applied to new data) 3) Pass through new function and predict method to bestNormalize ## Example: cube-root ## S3 methods Here, we start by defining a new function that we'll call `cuberoot_x`, which will take an argument `a` (as does the `sqrt_x` function) which will try to add a constant if it sees any negative numbers in `x`. It will also take the argument `standardize` which will center and scale the transformed data so that it's centered at 0 with SD = 1. ```{r} ## Define user-function cuberoot_x <- function(x, a = NULL, standardize = TRUE, ...) { stopifnot(is.numeric(x)) min_a <- max(0, -(min(x, na.rm = TRUE))) if(!length(a)) a <- min_a if(a < min_a) { warning("Setting a < max(0, -(min(x))) can lead to transformation issues", "Standardize set to FALSE") standardize <- FALSE } x.t <- (x + a)^(1/3) mu <- mean(x.t, na.rm = TRUE) sigma <- sd(x.t, na.rm = TRUE) if (standardize) x.t <- (x.t - mu) / sigma # Get in-sample normality statistic results ptest <- nortest::pearson.test(x.t) val <- list( x.t = x.t, x = x, mean = mu, sd = sigma, a = a, n = length(x.t) - sum(is.na(x)), norm_stat = unname(ptest$statistic / ptest$df), standardize = standardize ) # Assign class class(val) <- c('cuberoot_x', class(val)) val } ``` Note that we assigned a class to the object this returns of the same name; this is necessary for successful implementation within `bestNormalize`. We'll also need an associated `predict` method that is used to apply the transformation to newly observed data. ` ```{r} predict.cuberoot_x <- function(object, newdata = NULL, inverse = FALSE, ...) { # If no data supplied and not inverse if (is.null(newdata) & !inverse) newdata <- object$x # If no data supplied and inverse if (is.null(newdata) & inverse) newdata <- object$x.t # Actually performing transformations # Perform inverse transformation as estimated if (inverse) { # Reverse-standardize if (object$standardize) newdata <- newdata * object$sd + object$mean # Reverse-cube-root (cube) newdata <- newdata^3 - object$a # Otherwise, perform transformation as estimated } else if (!inverse) { # Take cube root newdata <- (newdata + object$a)^(1/3) # Standardize to mean 0, sd 1 if (object$standardize) newdata <- (newdata - object$mean) / object$sd } # Return transformed data unname(newdata) } ``` ## Optional: print method This will be printed when bestNormalize selects your custom method or when you print an object returned by your new custom function. ```{r} print.cuberoot_x <- function(x, ...) { cat(ifelse(x$standardize, "Standardized", "Non-Standardized"), 'cuberoot(x + a) Transformation with', x$n, 'nonmissing obs.:\n', 'Relevant statistics:\n', '- a =', x$a, '\n', '- mean (before standardization) =', x$mean, '\n', '- sd (before standardization) =', x$sd, '\n') } ``` Note: if you can find a similar transformation in the source code, it's easy to model your code after it. For instance, for `cuberoot_x` and `predict.cuberoot_x`, I used `sqrt_x.R` as a template file. ## Implementing with bestNormalize ```{r} # Store custom functions into list custom_transform <- list( cuberoot_x = cuberoot_x, predict.cuberoot_x = predict.cuberoot_x, print.cuberoot_x = print.cuberoot_x ) set.seed(123129) x <- rgamma(100, 1, 1) (b <- bestNormalize(x = x, new_transforms = custom_transform, standardize = FALSE)) ``` Evidently, the cube-rooting was the best normalizing transformation! ## Sanity check Is this code actually performing the cube-rooting? ```{r} all.equal(x^(1/3), b$chosen_transform$x.t) all.equal(x^(1/3), predict(b)) ``` It does indeed. # Using custom normalization statistics The bestNormalize package can estimate any univariate statistic using its CV framework. A user-defined function can be passed in through the `norm_stat_fn` argument, and this function will then be applied in lieu of the Pearson test statistic divided by its degree of freedom. The user-defined function must take an argument `x`, which indicates the data on which a user wants to evaluate the statistic. Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test statistic: ```{r} bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$stat) ``` Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test's p-value: ```{r} (dont_do_this <- bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$p)) ``` Note: `bestNormalize` will attempt to minimize this statistic by default, which is definitely not what you want to do when calculating the p-value. This is seen in the example above, as the WORST normalization transformation is chosen. In this case, a user is advised to either manually select the best one: ```{r} best_transform <- names(which.max(dont_do_this$norm_stats)) (do_this <- dont_do_this$other_transforms[[best_transform]]) ``` Or, the user can reverse their defined statistic (in this case by subtracting it from 1): ```{r} (do_this <- bestNormalize(x, norm_stat_fn = function(x) 1-nortest::lillie.test(x)$p)) ```