neuroCombat can be installed in R by typing the following commands:
library(devtools)
install_github("jfortin1/CombatHarmonization/R/neuroCombat")
ComBat estimates scanner-specific location and scale parameters, for each feature separately, and pools information across features using empirical Bayes to improve the estimation of those parameters for small sample size studies.
The neuroCombat
function is the main function. It requires two mandatory arguments: - a data matrix (p x n) dat
for which the p rows are features, and the n columns are participants. - a numeric or character vector batch
of length n indicating the site/scanner/study id.
For illustration purpose, let’s simulate an imaging dataset with n=10 participants, acquired on 2 scanners, with 5 participants each, with p=10000 voxels per scan.
library(neuroCombat)
p=10000
n=10
batch = c(1,1,1,1,1,2,2,2,2,2) #Batch variable for the scanner id
dat = matrix(runif(p*n), p, n) #Random Data matrix
We use the function neuroCombat
to harmonize the data across the 2 scanners:
data.harmonized <- neuroCombat(dat=dat, batch=batch)
By default, this uses parametric adjustments. To following command must be used for non-parametric adjustments:
data.harmonized <- neuroCombat(dat=dat, batch=batch, parametric=FALSE)
The harmonized matrix is stored in data.harmonized$dat.combat
. The data.harmonized
object also contains the different parameters estimated by ComBat: - gamma.hat
and delta.hat
: Estimated location and shift (L/S) parameters before empirical Bayes. - gamma.star
and delta.star
: Empirical Bayes estimated L/S parameters. - gamma.bar
, t2
, a.prior
and b.prior
: esimated prior distributions parameters.
neuroCombat
also accepts an optional argument, mod
, which is a matrix containing biological covariates, including the outcome of interest. This is recommended to ensure that biological variability is preserved in the harmonization process. For instance, for a study with age and disease covariates,
age <- c(82,70,68,66,80,69,72,76,74,80) # Continuous variable
disease <- as.factor(c(1,2,1,2,1,2,1,2,1,2)) # Categorical variable
we first create a model matrix for these two biological covariates using the model.matrix
function:
mod <- model.matrix(~age+disease)
mod
> mod
(Intercept) age disease2
1 1 82 0
2 1 70 1
3 1 68 0
4 1 66 1
5 1 80 0
6 1 69 1
7 1 72 0
8 1 76 1
9 1 74 0
10 1 80 1
The matrix mod
is a n x 3 matrix, containing an intercept, age and a dummy variable for the second level of the disease variable (the first level is taken as the baseline group). Note that including an intercept in the model matrix will not change the results of the algorithm; ComBat automatically removes the intercept from the model matrix when fitting the models. We now harmonize the data:
combat.harmonized <- neuroCombat(dat=dat, batch=batch, mod=mod)
Sometimes, it is preferable not to pool information across features, for instance if: - (1) The number of features is substantially smaller than the number of participants (p << n) or - (2) The prior distributions used in ComBat do not fit well the data - (3) The site effects are only present for a small subset of features
An example of (2) is studies with site/scanner effects that are highly heteregenous across features, for instance differential scanner effects between white matter (WM) or grey matter (GM) voxels exist. To run the ComBat model without empirical Bayes, which boils down to fitting a location/shift (L/S) model for each feature separately, the option eb=FALSE
can be used:
data.harmonized <- neuroCombat(dat=dat, batch=batch, eb=FALSE)
Coming soon.