Clever: Using PCA Leverage for Outlier Detection in High-Dimensional Data

Amanda Mejia, Preya Shah & Damon Pham

2020-05-03

library(knitr)
opts_chunk$set(autodep = TRUE, cache = TRUE)

Overview

The Clever package implements the PCA leverage outlier detection method for high-dimensional (HD) data, as detailed in this manuscript:

Citation: Mejia, Amanda F., Mary Beth Nebel, Ani Eloyan, Brian Caffo, and Martin A. Lindquist. “PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data.” Biostatistics 18, no. 3 (2017): 521-536.

In summary, the manuscript proposes a method to detect outlier observations in HD data by drawing on the traditional statistical ideas of PCA, leverage, and outlier detection. While the primary application is for detecting outlying time points in an fMRI scan, the method can also be applied to other forms of HD data, such as gene expression data.

Method Outline

As input, the algorithm takes a T x V matrix, Y. In our case, Y represents an fMRI run, where each row of Y is the vectorized volume at one time point and each column is the time series of one voxel. The algorithm then performs the following steps:

  1. Normalize the Y matrix.

  2. Perform PCA on the normalized Y matrix using singular value decomposition (SVD), in order to obtain the PC score matrix, U (of dimension T x T).

  3. To reduce the dimensions, retain the first Q columns of the U matrix, corresponding to the first Q < T principal components. We will refer to this submatrix as A (of dimension T x Q). Note: To choose the model order Q, we retain only components with a greater-than-average eigenvalue for the “mean” method (selected with choose_PCs = 'variance' in the examples below), or components with kurtosis greater than 2 for the “kurtosis” method. The number kept is further restricted when robust distance is used, since estimating the covariance matrix requires T to be appropriately large relative to Q.

  4. Now we can apply outlier detection on A. The primary method is PCA leverage, though we also propose an alternative called robust distance (see paper for further details). The output of either of these outlier detection methods is a T x 1 vector representing the “outlyingness” of each time point.

  5. Thresholds are used to identify the outliers. We use three thresholds of increasing stringency, and the function reports the outliers flagged at each one.
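
The five steps above can be sketched in base R on simulated data. This is a minimal illustration, not the package's implementation: the simulated data, the mean/SD normalization, and the variable names are all assumptions made for the example, and only the 3 x median threshold is confirmed by the package output shown in this vignette (the 4 x and 5 x levels are assumed).

```r
set.seed(1)
n_time <- 100   # T: number of time points (rows)
n_vox  <- 300   # V: number of voxels (columns)
tt <- seq_len(n_time)

# Simulated data (hypothetical): three slow sinusoidal signals plus noise.
signals <- sapply(1:3, function(k) sin(2 * pi * k * tt / n_time))  # T x 3
maps    <- matrix(rnorm(3 * n_vox, sd = 2), nrow = 3)              # 3 x V
Y <- signals %*% maps + matrix(rnorm(n_time * n_vox), nrow = n_time)

# Inject one artifact: a strong spatial pattern at a single time point.
spike <- 50
Y[spike, ] <- Y[spike, ] + 10 * rnorm(n_vox)

# Step 1: normalize each column (stand-in: mean-center and scale by SD).
Yn <- scale(Y)

# Step 2: PCA via SVD; the columns of U hold the PC scores (T x T here).
sv <- svd(Yn)
U  <- sv$u

# Step 3: keep the PCs with greater-than-average variance ("mean" method).
eig  <- sv$d^2
keep <- which(eig > mean(eig))
A    <- U[, keep, drop = FALSE]   # T x Q
Q    <- ncol(A)

# Step 4: leverage is the diagonal of the hat matrix A (A'A)^{-1} A'.
# The columns of A are orthonormal, so it reduces to row sums of squares.
lev <- rowSums(A^2)

# Step 5: thresholds as multiples of the median leverage.
# 3 x median matches "level 1" in the clever output; 4 and 5 are assumed.
thresholds <- c(3, 4, 5) * median(lev)
outliers   <- lapply(thresholds, function(th) which(lev > th))
names(outliers) <- paste0("level_", 1:3)
outliers$level_1
```

Because the columns of A are orthonormal, the leverages sum to Q, so a typical time point has leverage near Q/T. Comparing each value to a multiple of the median is therefore a comparison against that typical level.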

Installation

Install the package from GitHub using devtools:

library(devtools)
devtools::install_github('mandymejia/clever')

Tutorial Data

ABIDE is a publicly available resource of neuroimaging and phenotypic information from 1112 subjects, consisting of 20 datasets collected at 16 sites (Di Martino and others, 2014). Our simulated dataset is based on resting-state fMRI scans from two subjects collected as part of the ABIDE dataset. The first dataset contains artifacts near the beginning of the scan; the second is relatively artifact-free. Axial slices are used instead of entire volumes to keep the clever package’s download time short.

A Simple Example

Here, we will run through a simple example. First let’s pull the data, as follows:

library(clever)
data(Dat1)
data(Dat2)

The fMRI data for both subjects consist of a single slice from a volume. A brain mask has been applied to vectorize the data, forming a T x V (time by voxels or vertices) data matrix.

dim(Dat1)
## [1]  193 4675
dim(Dat2)
## [1]  145 4679
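
To make the vectorization concrete, here is a small sketch of how a mask turns a stack of 2D slices into a T x V matrix. The array sizes, the mask, and the variable names are hypothetical; this is not how Dat1 and Dat2 were actually produced.

```r
set.seed(2)
nx <- 4; ny <- 5; n_time <- 8

# Hypothetical slice time series: an nx x ny image at each of n_time points.
img <- array(rnorm(nx * ny * n_time), dim = c(nx, ny, n_time))

# Hypothetical brain mask: TRUE for in-brain voxels.
mask <- matrix(FALSE, nrow = nx, ncol = ny)
mask[2:3, 2:4] <- TRUE

# For each time point, keep only the in-mask voxels as a vector.
# apply() returns a V x T matrix, so transpose to get T x V.
Y <- t(apply(img, 3, function(slice) slice[mask]))
dim(Y)
```

Each row of Y is one time point and each column is one in-mask voxel, which matches the layout that clever expects.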

We next run clever on both datasets using every combination of the PC selection method (choose_PCs) and the outlier detection method (method).

clever.Dat1.kurt.lev = clever(Dat1, choose_PCs='kurtosis', 
                              PCA_trend_filtering=FALSE, verbose=TRUE)
## [1] "Centering and scaling the data matrix."
## Warning in clever(Dat1, choose_PCs = "kurtosis", PCA_trend_filtering = FALSE, : Warning: 283 zero-variance voxels (out of 4675). These will be set to zero for estimation of the covariance.
## [1] "Computing the PC scores and directions."
## [1] "Choosing PCs with high kurtosis."
## [1] "Computing leverage (the outlyingness measurement)."
## [1] "Identifying outliers."
## [1] "Outliers detected at level 1 (3 x median). Computing leverage images."
## [1] "Done! Organizing results."
clever.Dat2.kurt.lev = clever(Dat2, choose_PCs='kurtosis', 
                              PCA_trend_filtering=FALSE, verbose=TRUE)
## [1] "Centering and scaling the data matrix."
## Warning in clever(Dat2, choose_PCs = "kurtosis", PCA_trend_filtering = FALSE, : Warning: 68 zero-variance voxels (out of 4679). These will be set to zero for estimation of the covariance.
## [1] "Computing the PC scores and directions."
## [1] "Choosing PCs with high kurtosis."
## [1] "Computing leverage (the outlyingness measurement)."
## [1] "Identifying outliers."
## [1] "No leverage images: clever did not find any outliers at level 1 (3 x median)."
## [1] "Done! Organizing results."
clever.Dat1.vari.lev = clever(Dat1, choose_PCs = 'variance', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat1, choose_PCs = "variance", PCA_trend_filtering = FALSE): Warning: 283 zero-variance voxels (out of 4675). These will be set to zero for estimation of the covariance.
clever.Dat1.kurt.rds = clever(Dat1, choose_PCs='kurtosis', method = 'robdist_subset', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat1, choose_PCs = "kurtosis", method = "robdist_subset", : Warning: 283 zero-variance voxels (out of 4675). These will be set to zero for estimation of the covariance.
clever.Dat1.vari.rds = clever(Dat1, choose_PCs = 'variance', method = 'robdist_subset', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat1, choose_PCs = "variance", method = "robdist_subset", : Warning: 283 zero-variance voxels (out of 4675). These will be set to zero for estimation of the covariance.
clever.Dat1.kurt.rbd = clever(Dat1, choose_PCs='kurtosis', method = 'robdist', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat1, choose_PCs = "kurtosis", method = "robdist", PCA_trend_filtering = FALSE): Warning: 283 zero-variance voxels (out of 4675). These will be set to zero for estimation of the covariance.
clever.Dat1.vari.rbd = clever(Dat1, choose_PCs = 'variance', method = 'robdist', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat1, choose_PCs = "variance", method = "robdist", PCA_trend_filtering = FALSE): Warning: 283 zero-variance voxels (out of 4675). These will be set to zero for estimation of the covariance.
clever.Dat2.vari.lev = clever(Dat2, choose_PCs = 'variance', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat2, choose_PCs = "variance", PCA_trend_filtering = FALSE): Warning: 68 zero-variance voxels (out of 4679). These will be set to zero for estimation of the covariance.
clever.Dat2.kurt.rds = clever(Dat2, choose_PCs='kurtosis', method = 'robdist_subset', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat2, choose_PCs = "kurtosis", method = "robdist_subset", : Warning: 68 zero-variance voxels (out of 4679). These will be set to zero for estimation of the covariance.
clever.Dat2.vari.rds = clever(Dat2, choose_PCs = 'variance', method = 'robdist_subset', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat2, choose_PCs = "variance", method = "robdist_subset", : Warning: 68 zero-variance voxels (out of 4679). These will be set to zero for estimation of the covariance.
clever.Dat2.kurt.rbd = clever(Dat2, choose_PCs='kurtosis', method = 'robdist', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat2, choose_PCs = "kurtosis", method = "robdist", PCA_trend_filtering = FALSE): Warning: 68 zero-variance voxels (out of 4679). These will be set to zero for estimation of the covariance.
clever.Dat2.vari.rbd = clever(Dat2, choose_PCs = 'variance', method = 'robdist', 
                              PCA_trend_filtering=FALSE)
## Warning in clever(Dat2, choose_PCs = "variance", method = "robdist", PCA_trend_filtering = FALSE): Warning: 68 zero-variance voxels (out of 4679). These will be set to zero for estimation of the covariance.
clevers.Dat1 = list(clever.Dat1.kurt.lev, clever.Dat1.vari.lev, clever.Dat1.kurt.rds, clever.Dat1.vari.rds, clever.Dat1.kurt.rbd, clever.Dat1.vari.rbd)

clevers.Dat2 = list(clever.Dat2.kurt.lev, clever.Dat2.vari.lev, clever.Dat2.kurt.rds, clever.Dat2.vari.rds, clever.Dat2.kurt.rbd, clever.Dat2.vari.rbd)

Here are the outliers for the first dataset:

library(ggpubr)
library(gridExtra)
ggplot2::theme_set(ggpubr::theme_pubr())
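plt_cell = vector('list', length=6)
plt_row = vector('list', length=3)
# One row of plots per outlier detection method (j), one column per PC selection method (i).
for(j in 0:2){
  for(i in 1:2){
    plt_cell[[j*2 + i]] = plot(clevers.Dat1[[j*2 + i]], type='n')
  }
  # Arrange the two plots in this row side by side with a shared legend.
  r = c(plt_cell[(j*2+1):(j*2+2)], common.legend=TRUE, legend='bottom', align='hv', ncol=2)
  plt_row[[j + 1]] = do.call(ggarrange, r)
}

# Stack the three rows vertically.
grid.arrange(grobs=plt_row, ncol=1)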

And for the second:

plt_cell = vector('list', length=6)
plt_row = vector('list', length=3)
# Same layout as above: rows are outlier detection methods, columns are PC selection methods.
for(j in 0:2){
  for(i in 1:2){
    plt_cell[[j*2 + i]] = plot(clevers.Dat2[[j*2 + i]], type='n')
  }
  r = c(plt_cell[(j*2+1):(j*2+2)], common.legend=TRUE, legend='bottom', align='hv', ncol=2)
  plt_row[[j + 1]] = do.call(ggarrange, r)
}

grid.arrange(grobs=plt_row, ncol=1)

For the first dataset, clever consistently identifies time points 59-61 and 150-151 across all parameter settings. Other time points are identified under some settings, with the mean method of choosing PCs (choose_PCs = 'variance') tending to identify more outliers than the kurtosis method. For the second dataset, few or no outliers are identified. Overall, these results are consistent with our prior knowledge of both datasets.

Trend filtering

New in version 1.5 is an option to use the scores from PCATF (PCA trend filtering) instead of regular PCA (citation forthcoming: Dr. Daniel McDonald and Lei Ding). Notice how the leverage trends are “filtered” to have flat lines in local neighborhoods. This result is not only easier to analyze visually, but also aligns with the idea that outliers should occur in clusters if the artifact-causing phenomenon lasts longer than the repetition time (TR).

## Warning in clever(Dat1, PCA_trend_filtering = TRUE): Warning: 283 zero-variance voxels (out of 4675). These will be set to zero for estimation of the covariance.
## Warning in clever(Dat2, PCA_trend_filtering = TRUE): Warning: 68 zero-variance voxels (out of 4679). These will be set to zero for estimation of the covariance.
## Warning in clever(Dat2, PCA_trend_filtering = TRUE): Warning: 10 PCs are zero-
## variance. Removing these.

Image reconstruction

We can reconstruct the original fMRI images with the mask used for vectorizing the data. See Matrix_to_VolumeTimeSeries in clever/R/visualize.R for a helper function to do this.
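
The idea behind the reconstruction can be sketched in a few lines of base R. The function unmask_row below is a hypothetical stand-in written for this example; it is not the package's Matrix_to_VolumeTimeSeries helper and makes no claim about that helper's arguments.

```r
set.seed(4)

# Hypothetical 2D brain mask and a T x V data matrix that matches it.
mask <- matrix(FALSE, nrow = 4, ncol = 5)
mask[2:3, 2:4] <- TRUE
Y <- matrix(rnorm(8 * sum(mask)), nrow = 8)

# Place one vectorized time point back into image space.
# Out-of-mask voxels are filled with `fill` (NA by default).
unmask_row <- function(y_row, mask, fill = NA_real_) {
  stopifnot(length(y_row) == sum(mask))
  img <- array(fill, dim = dim(mask))
  img[mask] <- y_row
  img
}

img_3 <- unmask_row(Y[3, ], mask)   # image for the third time point
dim(img_3)
```

Logical indexing with the mask fills voxels in the same column-major order used when the data were vectorized, so the round trip recovers the original values.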

## oro.nifti 0.10.2

Below, we compare the timepoint of median leverage (first) to the timepoint of maximum leverage (second), using the results from the kurtosis PC selection and the leverage method.

The median time point appears normal, whereas the most outlying time point clearly has banding artifacts.

Leverage images

New in version 1.5, clever can also display the “leverage images” for each outlying observation. There are two types: the composite of the selected PC directions, weighted by the scores for that observation (without scaling by variance), and the single PC direction with the highest score at that observation. Here are the leverage images at the 60th timepoint for the first dataset.

## [1] "The timepoints meeting the first outlier level threshold:"
## [1] "60"  "61"  "134" "150" "151" "153"

Banding artifacts are clearly visible.
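
The two kinds of leverage image described in this section can be sketched from an SVD in base R. This is an illustration under stated assumptions, not the package's code: the simulated data, the artifact pattern, the choice to retain the first three PCs, and all variable names are hypothetical.

```r
set.seed(3)
n_time <- 60
n_vox  <- 200

# Simulated noise with one artifact: a spatial pattern added at time point 20.
pattern <- rnorm(n_vox)                       # hypothetical artifact map
Y <- matrix(rnorm(n_time * n_vox), nrow = n_time)
t_out <- 20
Y[t_out, ] <- Y[t_out, ] + 8 * pattern

# PCA via SVD of the column-centered data.
Yc <- scale(Y, center = TRUE, scale = FALSE)
sv <- svd(Yc)
keep <- 1:3                                   # assumed set of retained PCs

# Scores of the retained PCs at the outlying time point.
# These come from U, so they are not scaled by the singular values.
scores_t <- sv$u[t_out, keep]

# Type 1: composite of the retained PC directions, weighted by the scores.
composite <- as.vector(sv$v[, keep, drop = FALSE] %*% scores_t)

# Type 2: the single PC direction with the largest absolute score.
top_pc  <- keep[which.max(abs(scores_t))]
top_dir <- sv$v[, top_pc]

cor(composite, pattern)   # close to 1 when the artifact dominates
```

Both vectors have length V, so each can be mapped back into image space with the same mask that was used to vectorize the data.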