Compare multiple leverage measures with pscrub_multi

Calculates leverage to identify outliers in high-dimensional data. Can compute multiple kinds of leverage at once.

pscrub_multi(
  X,
  projection = "ICA_kurt",
  nuisance = "DCT4",
  center = TRUE,
  scale = TRUE,
  comps_mean_dt = FALSE,
  comps_var_dt = FALSE,
  kurt_quantile = 0.99,
  fusedPCA_kwargs = NULL,
  get_dirs = FALSE,
  full_PCA = FALSE,
  get_outliers = TRUE,
  cutoff = 4,
  seed = 0,
  verbose = FALSE
)

Arguments

X	Wide numeric data matrix ($T$ observations by $V$ variables, $T << V$). If `X` represents an fMRI run, $T$ should be the number of timepoints and $V$ should be the number of brainordinate vertices/voxels. Projection scrubbing will measure the outlyingness of each row in `X`.
projection	Leverage works by projecting the data onto directions likely to contain outlier information. Choose at least one of the following: `"PCA"` PCA using the top $k$ PCs. `"PCA_kurt"` PCA using the high-kurtosis PCs among the top $k$. `"PCA2"` PCA using the top $k2$ PCs. `"PCA2_kurt"` PCA using the high-kurtosis PCs among the top $k2$. `"fusedPCA"` fusedPCA using the top $k$ fused PCs. `"fusedPCA_kurt"` fusedPCA using the high-kurtosis fused PCs among the top $k$. `"fusedPCA2"` fusedPCA using the top $k2$ fused PCs. `"fusedPCA2_kurt"` fusedPCA using the high-kurtosis fused PCs among the top $k2$. `"ICA"` ICA using the top $k$ ICs. `"ICA_kurt"` ICA using the high-kurtosis ICs among the top $k$. `"ICA2"` ICA using the top $k2$ ICs. `"ICA2_kurt"` ICA using the high-kurtosis ICs among the top $k2$. where $k$ is the number of components determined by PESEL, and $k2$ is the number of principal components with above-average variance. Use `"all"` to use all projection methods. Default: `"ICA_kurt"`.
nuisance	Nuisance signals to regress from each column of `X`. Should be specified as a design matrix: a $T$ by $N$ numeric matrix where $N$ represents the number of nuisance signals. Or can be "DCT4" (default), which will create a matrix with a constant column (the intercept term) and four DCT bases. This default nuisance regression will have the effect of demeaning and detrending the data by removing low-frequency components. To not perform any nuisance regression set this argument to `NULL`, `0`, or `FALSE`. Detrending is highly recommended for time-series data, especially if there are many time points or evolving circumstances affecting the data. Additionally, if kurtosis is being used to select the projection directions, trends can induce positive or negative kurtosis, contaminating the connection between high kurtosis and outlier presence. Detrending should not be used with non-time-series data because the observations are not temporally related. Additional nuisance regressors can be specified like so: `cbind(1, dct_bases(nrow(x), 4), more_nuisance)`.
center	Center the columns of the data by their medians, and scale the columns of the data by their median absolute deviations (MADs)? Default: `TRUE`. Centering is necessary for computing the projections, so if `center` is `FALSE`, the data must already be centered. Note that centering and scaling occur after nuisance regression, so even if `center` is `FALSE`, the data will be centered on the means if the nuisance regression included an intercept term, as it does by default.
scale	Center the columns of the data by their medians, and scale the columns of the data by their median absolute deviations (MADs)? Default: `TRUE`. Centering is necessary for computing the projections, so if `center` is `FALSE`, the data must already be centered. Note that centering and scaling occur after nuisance regression, so even if `center` is `FALSE`, the data will be centered on the means if the nuisance regression included an intercept term, as it does by default.
comps_mean_dt	Stabilize the mean and variance of each projection component's timecourse prior to computing kurtosis and leverage? These arguments should be `TRUE`, `FALSE` (default), or the number of DCT bases to use for detrending (`TRUE` will use 4). Note that these arguments affect the projection components and not the data itself. Also, if variance-stabilizing but not mean-stabilizing, the components must already be expected to be mean-stabilized, for example if the data was rigorously detrended; otherwise, the results will be invalid. Slow-moving mean and variance patterns in the components will interfere with the roles of kurtosis and leverage in identifying outliers. While `nuisance` can be used to detrend the data, this nuisance regression is estimated non-robustly, since a robust model takes too long to estimate at each data location. On the other hand, `comps_mean_dt` and `comps_var_dt` can be used to apply a robust nuisance regression at each component, since there are much fewer components than original data locations. Thus, even if the data has been detrended with `nuisance` it may be helpful to detrend the components with `comps_mean_dt`; furthermore, the data nuisance regression does not address the potential existence of variance patterns in the components. Overall, we recommend enabling `comps_mean_dt` and `comps_var_dt` unless the data has been cleaned not only with a low-pass filter like DCT nuisance regression, but also with anatomical CompCor, ICA-FIX, or a similar data-driven strategy that takes into account common sources of artifactual trends such as respiration and heartbeat.
comps_var_dt	Stabilize the mean and variance of each projection component's timecourse prior to computing kurtosis and leverage? These arguments should be `TRUE`, `FALSE` (default), or the number of DCT bases to use for detrending (`TRUE` will use 4). Note that these arguments affect the projection components and not the data itself. Also, if variance-stabilizing but not mean-stabilizing, the components must already be expected to be mean-stabilized, for example if the data was rigorously detrended; otherwise, the results will be invalid. Slow-moving mean and variance patterns in the components will interfere with the roles of kurtosis and leverage in identifying outliers. While `nuisance` can be used to detrend the data, this nuisance regression is estimated non-robustly, since a robust model takes too long to estimate at each data location. On the other hand, `comps_mean_dt` and `comps_var_dt` can be used to apply a robust nuisance regression at each component, since there are much fewer components than original data locations. Thus, even if the data has been detrended with `nuisance` it may be helpful to detrend the components with `comps_mean_dt`; furthermore, the data nuisance regression does not address the potential existence of variance patterns in the components. Overall, we recommend enabling `comps_mean_dt` and `comps_var_dt` unless the data has been cleaned not only with a low-pass filter like DCT nuisance regression, but also with anatomical CompCor, ICA-FIX, or a similar data-driven strategy that takes into account common sources of artifactual trends such as respiration and heartbeat.
kurt_quantile	What quantile cutoff should be used to select the components? Default: `0.99`. Use `0` to select all high-variance components regardless of kurtosis value. We model each component as a length $T$ vector of Normal iid random variables, for which the distribution of kurtosis values can be approximated. The quantile is estimated based on this distribution.
fusedPCA_kwargs	Arguments to `fusedPCA` in list form. Valid entries are: lambda Trend-filtering parameter. Default: `5`. niter_max Maximum number of iterations. Default: `1000`. TOL Convergence tolerance parameter. Default: `1e-8`. verbose Print updates? Default: `FALSE`.
get_dirs	Do the projection directions need to be returned? This is the $V$ matrix in PCA and $S$ matrix in ICA. The default is `FALSE` to save memory. However, `get_dirs==TRUE` is required for `artifact_images`.
full_PCA	Only applies to the PCA projection. Return the full SVD? Default: `FALSE` (return only the high-variance components).
get_outliers	Should outliers be flagged based on `cutoff`? Default: `TRUE`.
cutoff	Median leverage cutoff value. Default: `4`.
seed	Set a seed right before the call to `pesel::pesel` or `ica::icaimax`? If `NULL`, do not set a seed. If numeric (default: `0`), will use as the seed.
verbose	Should occasional updates be printed? Default: `FALSE`.

Value

A "pscrub_multi" object, i.e. a list with components

measure

A $T$ by $P$ data.frame of numeric leverage values, each column being the leverage values for a projection method in projection.

measure_info

A data.frame with $P$ rows listing information about each projection used.

outlier_cutoff

A $1$ by $P$ data.frame of numeric outlier cutoff values for each projection (cutoff times the median leverage).

outlier_flag

A $T$ by $P$ data.frame of logical values where TRUE indicates where leverage exceeds the cutoff, signaling suspected outlier presence.

mask

A length $P$ numeric vector corresponding to the data locations in X. Each value indicates whether the location was masked:

1: The data location was not masked out.
-1: The data location was masked out, because it had at least one NA or NaN value.
-2: The data location was masked out, because it was constant.

PCA

This will be a list with components:

U: The $T$ by $Q$ PC score matrix.
D: The standard deviation of each PC.
V: The $P$ by $Q$ PC directions matrix. Included only if get_dirs.
highkurt: The length Q logical vector indicating scores of high kurtosis.
U_dt: Detrended components of U. Included only if components were mean- or variance-detrended.
highkurt: The length Q logical vector indicating detrended scores of high kurtosis.
nPCs_PESEL: The number of PCs selected by PESEL.
nPCs_avgvar: The number of above-average variance PCs.

where Q is the number of PCs selected by PESEL or of above-average variance (or the greater of the two if both were used). If PCA was not used, all entries except nPCs_PESEL and/or nPCs_avgvar will not be included, depending on which method(s) was used to select the number of PCs.

fusedPCA

If fusedPCA was used, this will be a list with components:

U: The $T$ by $Q$ PC score matrix.
D: The standard deviation of each PC.
V: The $P$ by $Q$ PC directions matrix. Included only if get_dirs
highkurt: The length Q logical vector indicating scores of high kurtosis.
U_dt: Detrended components of U. Included only if components were mean- or variance-detrended.
highkurt: The length Q logical vector indicating detrended scores of high kurtosis. Included only if components were mean- or variance-detrended.

ICA

If ICA was used, this will be a list with components:

S: The $P$ by $Q$ source signals matrix. Included only if get_dirs
M: The $T$ by $Q$ mixing matrix.
highkurt: The length Q logical vector indicating mixing scores of high kurtosis.
M_dt: Detrended components of M. Included only if components were mean- or variance-detrended.
highkurt: The length Q logical vector indicating detrended mixing scores of high kurtosis. Included only if components were mean- or variance-detrended.

X	Wide numeric data matrix (\(T\) observations by \(V\) variables, \(T << V\)). If `X` represents an fMRI run, \(T\) should be the number of timepoints and \(V\) should be the number of brainordinate vertices/voxels. Projection scrubbing will measure the outlyingness of each row in `X`.
projection	Leverage works by projecting the data onto directions likely to contain outlier information. Choose at least one of the following: `"PCA"` PCA using the top \(k\) PCs. `"PCA_kurt"` PCA using the high-kurtosis PCs among the top \(k\). `"PCA2"` PCA using the top \(k2\) PCs. `"PCA2_kurt"` PCA using the high-kurtosis PCs among the top \(k2\). `"fusedPCA"` fusedPCA using the top \(k\) fused PCs. `"fusedPCA_kurt"` fusedPCA using the high-kurtosis fused PCs among the top \(k\). `"fusedPCA2"` fusedPCA using the top \(k2\) fused PCs. `"fusedPCA2_kurt"` fusedPCA using the high-kurtosis fused PCs among the top \(k2\). `"ICA"` ICA using the top \(k\) ICs. `"ICA_kurt"` ICA using the high-kurtosis ICs among the top \(k\). `"ICA2"` ICA using the top \(k2\) ICs. `"ICA2_kurt"` ICA using the high-kurtosis ICs among the top \(k2\). where \(k\) is the number of components determined by PESEL, and \(k2\) is the number of principal components with above-average variance. Use `"all"` to use all projection methods. Default: `"ICA_kurt"`.
nuisance	Nuisance signals to regress from each column of `X`. Should be specified as a design matrix: a \(T\) by \(N\) numeric matrix where \(N\) represents the number of nuisance signals. Or can be "DCT4" (default), which will create a matrix with a constant column (the intercept term) and four DCT bases. This default nuisance regression will have the effect of demeaning and detrending the data by removing low-frequency components. To not perform any nuisance regression set this argument to `NULL`, `0`, or `FALSE`. Detrending is highly recommended for time-series data, especially if there are many time points or evolving circumstances affecting the data. Additionally, if kurtosis is being used to select the projection directions, trends can induce positive or negative kurtosis, contaminating the connection between high kurtosis and outlier presence. Detrending should not be used with non-time-series data because the observations are not temporally related. Additional nuisance regressors can be specified like so: `cbind(1, dct_bases(nrow(x), 4), more_nuisance)`.
center	Center the columns of the data by their medians, and scale the columns of the data by their median absolute deviations (MADs)? Default: `TRUE`. Centering is necessary for computing the projections, so if `center` is `FALSE`, the data must already be centered. Note that centering and scaling occur after nuisance regression, so even if `center` is `FALSE`, the data will be centered on the means if the nuisance regression included an intercept term, as it does by default.
scale	Center the columns of the data by their medians, and scale the columns of the data by their median absolute deviations (MADs)? Default: `TRUE`. Centering is necessary for computing the projections, so if `center` is `FALSE`, the data must already be centered. Note that centering and scaling occur after nuisance regression, so even if `center` is `FALSE`, the data will be centered on the means if the nuisance regression included an intercept term, as it does by default.
comps_mean_dt	Stabilize the mean and variance of each projection component's timecourse prior to computing kurtosis and leverage? These arguments should be `TRUE`, `FALSE` (default), or the number of DCT bases to use for detrending (`TRUE` will use 4). Note that these arguments affect the projection components and not the data itself. Also, if variance-stabilizing but not mean-stabilizing, the components must already be expected to be mean-stabilized, for example if the data was rigorously detrended; otherwise, the results will be invalid. Slow-moving mean and variance patterns in the components will interfere with the roles of kurtosis and leverage in identifying outliers. While `nuisance` can be used to detrend the data, this nuisance regression is estimated non-robustly, since a robust model takes too long to estimate at each data location. On the other hand, `comps_mean_dt` and `comps_var_dt` can be used to apply a robust nuisance regression at each component, since there are much fewer components than original data locations. Thus, even if the data has been detrended with `nuisance` it may be helpful to detrend the components with `comps_mean_dt`; furthermore, the data nuisance regression does not address the potential existence of variance patterns in the components. Overall, we recommend enabling `comps_mean_dt` and `comps_var_dt` unless the data has been cleaned not only with a low-pass filter like DCT nuisance regression, but also with anatomical CompCor, ICA-FIX, or a similar data-driven strategy that takes into account common sources of artifactual trends such as respiration and heartbeat.
comps_var_dt	Stabilize the mean and variance of each projection component's timecourse prior to computing kurtosis and leverage? These arguments should be `TRUE`, `FALSE` (default), or the number of DCT bases to use for detrending (`TRUE` will use 4). Note that these arguments affect the projection components and not the data itself. Also, if variance-stabilizing but not mean-stabilizing, the components must already be expected to be mean-stabilized, for example if the data was rigorously detrended; otherwise, the results will be invalid. Slow-moving mean and variance patterns in the components will interfere with the roles of kurtosis and leverage in identifying outliers. While `nuisance` can be used to detrend the data, this nuisance regression is estimated non-robustly, since a robust model takes too long to estimate at each data location. On the other hand, `comps_mean_dt` and `comps_var_dt` can be used to apply a robust nuisance regression at each component, since there are much fewer components than original data locations. Thus, even if the data has been detrended with `nuisance` it may be helpful to detrend the components with `comps_mean_dt`; furthermore, the data nuisance regression does not address the potential existence of variance patterns in the components. Overall, we recommend enabling `comps_mean_dt` and `comps_var_dt` unless the data has been cleaned not only with a low-pass filter like DCT nuisance regression, but also with anatomical CompCor, ICA-FIX, or a similar data-driven strategy that takes into account common sources of artifactual trends such as respiration and heartbeat.
kurt_quantile	What quantile cutoff should be used to select the components? Default: `0.99`. Use `0` to select all high-variance components regardless of kurtosis value. We model each component as a length $T$ vector of Normal iid random variables, for which the distribution of kurtosis values can be approximated. The quantile is estimated based on this distribution.
fusedPCA_kwargs	Arguments to `fusedPCA` in list form. Valid entries are: lambda Trend-filtering parameter. Default: `5`. niter_max Maximum number of iterations. Default: `1000`. TOL Convergence tolerance parameter. Default: `1e-8`. verbose Print updates? Default: `FALSE`.
get_dirs	Do the projection directions need to be returned? This is the \(V\) matrix in PCA and \(S\) matrix in ICA. The default is `FALSE` to save memory. However, `get_dirs==TRUE` is required for `artifact_images`.
full_PCA	Only applies to the PCA projection. Return the full SVD? Default: `FALSE` (return only the high-variance components).
get_outliers	Should outliers be flagged based on `cutoff`? Default: `TRUE`.
cutoff	Median leverage cutoff value. Default: `4`.
seed	Set a seed right before the call to `pesel::pesel` or `ica::icaimax`? If `NULL`, do not set a seed. If numeric (default: `0`), will use as the seed.
verbose	Should occasional updates be printed? Default: `FALSE`.