Calculates leverage to identify outliers in high-dimensional data. Can compute multiple kinds of leverage at once.
pscrub_multi(
  X,
  projection = "ICA_kurt",
  nuisance = "DCT4",
  center = TRUE,
  scale = TRUE,
  comps_mean_dt = FALSE,
  comps_var_dt = FALSE,
  kurt_quantile = 0.99,
  fusedPCA_kwargs = NULL,
  get_dirs = FALSE,
  full_PCA = FALSE,
  get_outliers = TRUE,
  cutoff = 4,
  seed = 0,
  verbose = FALSE
)
Argument | Description |
---|---|
X | Wide numeric data matrix (\(T\) observations by \(V\) variables, \(T \ll V\)). If |
projection | Leverage works by projecting the data onto directions likely to contain outlier information. Choose at least one of the following:
where \(k\) is the number of components determined by PESEL, and \(k2\) is the number of principal components with above-average variance. Use |
nuisance | Nuisance signals to regress from each column of `X`. Detrending is highly recommended for time-series data, especially if there are many time points or evolving circumstances affecting the data. Additionally, if kurtosis is being used to select the projection directions, trends can induce positive or negative kurtosis, contaminating the connection between high kurtosis and outlier presence. Detrending should not be used with non-time-series data because the observations are not temporally related. Additional nuisance regressors can be specified like so: |
center, scale | Center the columns of the data by their medians, and scale the columns of the data by their median absolute deviations (MADs)? Default: `TRUE` for both. Note that centering and scaling occur after nuisance regression, so even if |
comps_mean_dt, comps_var_dt | Stabilize the mean and variance of each projection component's timecourse prior to computing kurtosis and leverage? These arguments should be Slow-moving mean and variance patterns in the components will interfere with the roles of kurtosis and leverage in identifying outliers. While Overall, we recommend enabling |
kurt_quantile | What quantile cutoff should be used to select the components? Default: `0.99`. We model each component as a length \(T\) vector of iid Normal random variables, for which the distribution of kurtosis values can be approximated. The quantile is estimated from this distribution. |
fusedPCA_kwargs | Arguments to
|
get_dirs | Do the projection directions need to be returned? This is the \(V\) matrix in PCA and the \(S\) matrix in ICA. The default is `FALSE`. |
full_PCA | Only applies to the PCA projection. Return the full SVD? Default: `FALSE`. |
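get_outliers | Should outliers be flagged based on `cutoff`? Default: `TRUE`. |
cutoff | Outlier cutoff, expressed as a multiple of the median leverage: a time point is flagged when its leverage exceeds `cutoff` times the median. Default: `4`. |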
seed | Set a seed right before the call to |
verbose | Should occasional updates be printed? Default: `FALSE`. |
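
The `nuisance`, `center`, and `scale` rows describe a fixed order: nuisance regression first, then median centering and MAD scaling. The base-R sketch below illustrates that order on simulated data. It assumes that `"DCT4"` stands for an intercept plus four discrete cosine bases; `dct_design`, `regress_out`, and `robust_scale` are hypothetical helper names introduced here for illustration, not functions provided by the package, and the internals of `pscrub_multi` may differ.

```r
# Hypothetical helpers, base R only. Not the package's implementation.

# Build an intercept plus n_bases cosine (DCT-II) regressors of length T_.
dct_design <- function(T_, n_bases = 4) {
  t_idx <- seq_len(T_)
  bases <- sapply(
    seq_len(n_bases),
    function(k) cos(pi * k * (2 * t_idx - 1) / (2 * T_))
  )
  cbind(1, bases)
}

# Regress the design out of every column of X and keep the residuals.
regress_out <- function(X, design) {
  qr.resid(qr(design), X)
}

# Center each column on its median and scale it by its MAD.
# Constant columns (MAD of zero) are not handled here.
robust_scale <- function(X) {
  med <- apply(X, 2, median)
  mad_ <- apply(X, 2, mad)  # stats::mad applies the 1.4826 consistency constant
  sweep(sweep(X, 2, med, "-"), 2, mad_, "/")
}

set.seed(0)
T_ <- 100
V <- 20
trend <- seq(-1, 1, length.out = T_)^2          # slow quadratic drift
X <- matrix(rnorm(T_ * V), T_, V) + 5 * trend   # simulated data with a shared trend

# Nuisance regression first, robust centering and scaling second.
X_clean <- robust_scale(regress_out(X, dct_design(T_, 4)))
```

Because the residuals are re-centered on their medians afterwards, each column of `X_clean` has median zero and unit MAD regardless of what the nuisance design removed.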
A "pscrub_multi" object, i.e. a list with components:
A \(T\) by \(P\) data.frame of numeric leverage values, each column being the leverage values for a projection method in `projection`.
A data.frame with \(P\) rows listing information about each projection used.
A \(1\) by \(P\) data.frame of numeric outlier cutoff values for each projection (`cutoff` times the median leverage).
A \(T\) by \(P\) data.frame of logical values where `TRUE` indicates where leverage exceeds the cutoff, signaling suspected outlier presence.
A length \(P\) numeric vector corresponding to the data locations in `X`. Each value indicates whether the location was masked:
The data location was not masked out.
The data location was masked out, because it had at least one `NA` or `NaN` value.
The data location was masked out, because it was constant.
This will be a list with components:
The \(T\) by \(Q\) PC score matrix.
The standard deviation of each PC.
The \(P\) by \(Q\) PC directions matrix. Included only if `get_dirs`.
The length \(Q\) logical vector indicating scores of high kurtosis.
Detrended components of `U`. Included only if components were mean- or variance-detrended.
The length \(Q\) logical vector indicating detrended scores of high kurtosis.
The number of PCs selected by PESEL.
The number of above-average variance PCs.
\(Q\) is the number of PCs selected by PESEL or of above-average variance (or the greater of the two if both were used).
If PCA was not used, all entries except `nPCs_PESEL` and/or `nPCs_avgvar` will not be included, depending on which method(s) was used to select the number of PCs.
If fusedPCA was used, this will be a list with components:
The \(T\) by \(Q\) PC score matrix.
The standard deviation of each PC.
The \(P\) by \(Q\) PC directions matrix. Included only if `get_dirs`.
The length \(Q\) logical vector indicating scores of high kurtosis.
Detrended components of `U`. Included only if components were mean- or variance-detrended.
The length \(Q\) logical vector indicating detrended scores of high kurtosis. Included only if components were mean- or variance-detrended.
If ICA was used, this will be a list with components:
The \(P\) by \(Q\) source signals matrix. Included only if `get_dirs`.
The \(T\) by \(Q\) mixing matrix.
The length \(Q\) logical vector indicating mixing scores of high kurtosis.
Detrended components of `M`. Included only if components were mean- or variance-detrended.
The length \(Q\) logical vector indicating detrended mixing scores of high kurtosis. Included only if components were mean- or variance-detrended.
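
The returned leverage values, high-kurtosis indicators, cutoffs, and flags can be illustrated with a small base-R sketch. It is a rough stand-in under stated assumptions, not the package's implementation: the number of components `Q` is fixed by hand rather than chosen by PESEL, plain mean-centering replaces the robust preprocessing, the kurtosis quantile comes from a Monte Carlo simulation rather than the approximation mentioned for `kurt_quantile`, and leverage is taken as the diagonal of the hat matrix of the selected score columns. The helpers `excess_kurtosis` and `sim_kurt_cutoff` are hypothetical names.

```r
# Excess kurtosis of a numeric vector (zero for a Normal distribution).
excess_kurtosis <- function(x) {
  x <- x - mean(x)
  mean(x^4) / mean(x^2)^2 - 3
}

# Monte Carlo estimate of the q-th quantile of excess kurtosis for
# length-T_ vectors of iid Normal draws (an assumption of this sketch).
sim_kurt_cutoff <- function(T_, q = 0.99, n_sim = 2000) {
  sims <- replicate(n_sim, excess_kurtosis(rnorm(T_)))
  quantile(sims, q, names = FALSE)
}

set.seed(0)
T_ <- 60
V <- 200
X <- matrix(rnorm(T_ * V), T_, V)
X[17, ] <- X[17, ] + 6                        # plant an artifact at time point 17
X <- scale(X, center = TRUE, scale = FALSE)   # plain mean-centering, for brevity

Q <- 5                                        # hypothetical number of components
U <- svd(X, nu = Q, nv = 0)$u                 # T_ x Q scores, orthonormal columns

# Keep the components whose kurtosis exceeds the simulated cutoff.
highkurt <- apply(U, 2, excess_kurtosis) > sim_kurt_cutoff(T_, q = 0.99)

# Leverage: diagonal of the hat matrix of the selected score columns.
U_sel <- U[, highkurt, drop = FALSE]
leverage <- rowSums(U_sel^2)

# Flag time points whose leverage exceeds cutoff times the median leverage.
cutoff <- 4
outlier_cutoff <- cutoff * median(leverage)
outlier_flag <- leverage > outlier_cutoff
```

Since the selected score columns are orthonormal, the leverage values sum to the number of selected components, which is why a multiple of the median is a natural scale-free threshold.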