Calculates PCA leverage or robust distance and identifies outliers.

clever(
  X,
  projection = "PCA_kurt",
  out_meas = "leverage",
  DVARS = TRUE,
  detrend = TRUE,
  PCATF_kwargs = NULL,
  kurt_quantile = 0.95,
  id_outliers = TRUE,
  lev_cutoff = 4,
  rbd_cutoff = 0.9999,
  R_true = NULL,
  lev_images = FALSE,
  verbose = FALSE
)

Arguments

X

Numerical data matrix. Should be wide (N observations x P variables, \(N << P\)).

projection

Character vector indicating the projection methods to use. Choose at least one of the following: "PCA_var" for PCA + variance, "PCA_kurt" for PCA + kurtosis, and "PCATF" for PCA Trend Filtering + variance. Or, use "all" to use all projection methods. Default: c("PCA_kurt").

out_meas

Character vector indicating the outlyingness measures to compute. Choose at least one of the following: "leverage" for leverage, "robdist" for robust distance, or "robdist_bootstrap" for robust distance bootstrap (not implemented yet). Or, use "all" to use all methods. Default: c("leverage").

DVARS

Should DVARS (Afyouni and Nichols, 2017) be computed too? Default is TRUE.

detrend

Detrend the PCs before measuring kurtosis or computing leverage or robust distance? Default: TRUE.

Detrending is highly recommended for time-series data, especially if there are many time points or evolving circumstances affecting the data. There are two reasons. First, temporal trends induce positive or negative kurtosis, contaminating the connection between high kurtosis and outlier presence. Second, trends tend to reduce the size of the in-MCD subset for the robust distance method, causing many false positives.

Detrending should not be used with non-time-series data because the observations are not temporally related.

In addition to TRUE and FALSE, a third option, "kurtosis", can be used to detrend the PCs only for the purpose of measuring kurtosis, and not for the actual outlyingness measurement.

This option will not affect the PCATF projection, which is never detrended.
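The effect of a trend on kurtosis can be checked with a minimal base-R sketch. It is an illustration only: the kurtosis helper, the linear drift, and the lm-based detrending are assumptions made for this example, not the detrending procedure clever itself uses. It shows the negative-kurtosis case, where a slow drift flattens the distribution of a series.

```r
# Sample kurtosis (non-excess): equals 3 for a normal distribution.
kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2
}

set.seed(1)
n <- 1000
noise <- rnorm(n)                      # stationary series with no outliers
trend <- seq(-10, 10, length.out = n)  # hypothetical slow linear drift
y <- noise + trend

k_noise <- kurtosis(noise)   # close to 3
k_trended <- kurtosis(y)     # pulled toward about 1.8, the value for a linear ramp

# Removing a fitted linear trend (an assumed detrending method) restores
# a kurtosis near 3.
t_idx <- seq_len(n)
k_detrended <- kurtosis(residuals(lm(y ~ t_idx)))
```

Because the trend dominates the variance here, the kurtosis of the trended series reflects the drift rather than the presence or absence of outliers.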

PCATF_kwargs

Named list of arguments for PCATF projection method. Only applies if ("PCATF" %in% projection).

Valid entries are:

K

maximum number of PCs to compute (Default: 1000)

lambda

trend filtering parameter (Default: 0.05)

niter_max

maximum number of iterations (Default: 1000)

verbose

Print updates? (Default: FALSE)

kurt_quantile

What cutoff quantile for kurtosis should be used? Only applies if ("PCA_kurt" %in% projection). Default: 0.95.

id_outliers

Should the outliers be identified? Default: TRUE.

lev_cutoff

The outlier cutoff value for leverage, as a multiple of the median leverage. Only used if "leverage" %in% out_meas and id_outliers. Default: 4, or \(4 * median\).
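As a rough sketch of the median-based rule, the following base-R code computes leverage from a plain SVD and flags observations above lev_cutoff times the median. The simulated data, the artifact at time point 42, the column centering, and the choice of 5 PCs are all assumptions for illustration; clever's own preprocessing and PC selection may differ.

```r
set.seed(1)
n_timepoints <- 100
n_voxels <- 500
X <- matrix(rnorm(n_timepoints * n_voxels), ncol = n_voxels)
X[42, ] <- X[42, ] + 3   # hypothetical artifact: shift one time point

# Center each column, then take the SVD. The columns of U are the
# orthonormal left singular vectors (unit-norm PC scores).
Xc <- scale(X, center = TRUE, scale = FALSE)
U <- svd(Xc)$u[, 1:5]    # keep 5 PCs (arbitrary choice for this sketch)

# Leverage of observation i is the i-th diagonal entry of U %*% t(U).
leverage <- rowSums(U^2)

# Flag observations whose leverage exceeds lev_cutoff times the median.
lev_cutoff <- 4
cutoff <- lev_cutoff * median(leverage)
outliers <- which(leverage > cutoff)
```

The shifted time point should appear in outliers. A few clean time points may also cross the cutoff by chance, which is why the multiple is left as a tunable argument.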

rbd_cutoff

The outlier cutoff quantile for MCD distance. Only used if "robdist" %in% out_meas and id_outliers. Default: 0.9999, for the \(0.9999\) quantile.

The quantile is computed from the estimated F distribution.

R_true

The N x N correlation matrix, if known. Used for the bootstrap robust distance method.

lev_images

Should leverage images be computed? If FALSE, memory is conserved. Default: FALSE.

verbose

Should occasional updates be printed? Default: FALSE.

Value

A clever object, i.e. a list with components

params

A list of all the arguments used.

projections

PC_var

indices

The indices retained from the original SVD projection to make the variance-based PC projection.

PCs

The PC projection.

PC_kurt

indices

The indices retained from the original SVD projection to make the kurtosis-based PC projection. They are ordered from highest kurtosis to lowest kurtosis.

PCs

The PC projection. PCs are ordered in the standard way, from highest variance to lowest variance, instead of by kurtosis.

PCATF

indices

The indices of the trend-filtered PCs used to make the projection.

PCs

The PCATF result.

outlier_measures

PC_var__lev

The leverage values for the PC_var projection.

PC_kurt__lev

The leverage values for the PC_kurt projection.

PCATF__lev

The leverage values for the PCATF projection.

PC_var__rbd

The robust MCD distance values for the PC_var projection.

PC_kurt__rbd

The robust MCD distance values for the PC_kurt projection.

DVARS_DPD

The Delta percent DVARS values.

DVARS_ZD

The DVARS z-scores.

outlier_cutoffs

lev

The leverage cutoff for outlier detection: lev_cutoff times the median leverage.

MCD

The robust distance cutoff for outlier detection: the rbd_cutoff quantile of the estimated F distribution.

DVARS_DPD

The Delta percent DVARS cutoff: +/- 5 percent.

DVARS_ZD

The DVARS z-score cutoff: the one-sided 5 percent significance level with Bonferroni FWER correction.
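For a sense of scale, a one-sided 5 percent level with Bonferroni correction can be worked out in base R. This sketch assumes the correction is applied across the number of time points and that the z-scores are compared against a standard normal quantile; the variable names are illustrative.

```r
n_timepoints <- 100
alpha <- 0.05

# One-sided test at level alpha, Bonferroni-corrected over all time points.
z_cutoff <- qnorm(1 - alpha / n_timepoints)
```

With 100 time points this gives a cutoff of about 3.29, and the cutoff grows slowly as the series gets longer.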

outlier_flags

PC_var__leverage

Logical vector indicating whether each observation surpasses the outlier cutoff.

PC_kurt__leverage

Logical vector indicating whether each observation surpasses the outlier cutoff.

PCATF__leverage

Logical vector indicating whether each observation surpasses the outlier cutoff.

PC_var__robdist

Logical vector indicating whether each observation surpasses the outlier cutoff.

PC_kurt__robdist

Logical vector indicating whether each observation surpasses the outlier cutoff.

DVARS_DPD

Logical vector indicating whether each observation surpasses the outlier cutoff.

DVARS_ZD

Logical vector indicating whether each observation surpasses the outlier cutoff.

robdist_info

PC_var__robdist

inMCD

Logical vector indicating whether each observation was in the MCD estimate.

outMCD_scale

The scale for out-of-MCD observations.

Fparam

Named numeric vector: c, m, df1, and df2.

PC_kurt__robdist

inMCD

Logical vector indicating whether each observation was in the MCD estimate.

outMCD_scale

The scale for out-of-MCD observations.

Fparam

Named numeric vector: c, m, df1, and df2.

MCD_scale

The scale value for out-of-MCD observations, and NA for in-MCD observations. NULL if method is not robust distance.

lev_images

mean

The average of the PC directions, weighted by the unscaled PC scores at each outlying time point (U[i,] * V^T). Row names are the corresponding time points.

top

The PC direction with the highest PC score at each outlying time point. Row names are the corresponding time points.

top_dir

The index of the PC direction with the highest PC score at each outlying time point. Named by time point.

Details

clever will use all combinations of the requested projection and out_meas methods that make sense. For example, if projection=c("PCATF", "PCA_var", "PCA_kurt") and out_meas=c("leverage", "robdist") then these five combinations will be used: PCATF with leverage, PCA + variance with leverage, PCA + variance with robust distance, PCA + kurtosis with leverage, and PCA + kurtosis with robust distance. Each method combination will yield its own out_meas time series.
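The pairing described above can be enumerated with a short base-R sketch. The exclusion rule (PCATF is paired with leverage only) is inferred from the five combinations listed in the example, and the combos data frame is purely illustrative; it is not part of the clever object.

```r
projection <- c("PCATF", "PCA_var", "PCA_kurt")
out_meas <- c("leverage", "robdist")

# Start from every projection/measure pair.
combos <- expand.grid(projection = projection, out_meas = out_meas,
                      stringsAsFactors = FALSE)

# Assumption inferred from the example: PCATF is not combined with robdist.
combos <- combos[!(combos$projection == "PCATF" &
                   combos$out_meas == "robdist"), ]
n_combos <- nrow(combos)
```

Here n_combos matches the five combinations named in the example.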

Examples

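library(clever)

# Simulate a wide data matrix: 100 time points (rows) by 10,000 voxels (columns).
n_voxels <- 1e4
n_timepoints <- 100
X <- matrix(rnorm(n_timepoints * n_voxels), ncol = n_voxels)

# Run clever with the default settings (PCA_kurt projection, leverage).
clev <- clever(X)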