neuroCombat can be installed in R by typing the following commands:
library(devtools)
install_github("jfortin1/CombatHarmonization/R/neuroCombat")
ComBat estimates scanner-specific location and scale parameters, for each feature separately, and pools information across features using empirical Bayes to improve the estimation of those parameters for small sample size studies.
The neuroCombat function is the main function. It requires two mandatory arguments: - a data matrix (p x n) dat for which the p rows are features, and the n columns are participants. - a numeric or character vector batch of length n indicating the site/scanner/study id.
For illustration purpose, let’s simulate an imaging dataset with n=10 participants, acquired on 2 scanners, with 5 participants each, with p=10000 voxels per scan.
library(neuroCombat)
p=10000
n=10
batch = c(1,1,1,1,1,2,2,2,2,2) #Batch variable for the scanner id
dat = matrix(runif(p*n), p, n) #Random Data matrixWe use the function neuroCombat to harmonize the data across the 2 scanners:
data.harmonized <- neuroCombat(dat=dat, batch=batch)By default, this uses parametric adjustments. To following command must be used for non-parametric adjustments:
data.harmonized <- neuroCombat(dat=dat, batch=batch, parametric=FALSE)The harmonized matrix is stored in data.harmonized$dat.combat. The data.harmonized object also contains the different parameters estimated by ComBat: - gamma.hat and delta.hat: Estimated location and shift (L/S) parameters before empirical Bayes. - gamma.star and delta.star: Empirical Bayes estimated L/S parameters. - gamma.bar, t2, a.prior and b.prior: esimated prior distributions parameters.
neuroCombat also accepts an optional argument, mod, which is a matrix containing biological covariates, including the outcome of interest. This is recommended to ensure that biological variability is preserved in the harmonization process. For instance, for a study with age and disease covariates,
age <- c(82,70,68,66,80,69,72,76,74,80) # Continuous variable
disease <- as.factor(c(1,2,1,2,1,2,1,2,1,2)) # Categorical variablewe first create a model matrix for these two biological covariates using the model.matrix function:
mod <- model.matrix(~age+disease)
mod
> mod
(Intercept) age disease2
1 1 82 0
2 1 70 1
3 1 68 0
4 1 66 1
5 1 80 0
6 1 69 1
7 1 72 0
8 1 76 1
9 1 74 0
10 1 80 1The matrix mod is a n x 3 matrix, containing an intercept, age and a dummy variable for the second level of the disease variable (the first level is taken as the baseline group). Note that including an intercept in the model matrix will not change the results of the algorithm; ComBat automatically removes the intercept from the model matrix when fitting the models. We now harmonize the data:
combat.harmonized <- neuroCombat(dat=dat, batch=batch, mod=mod)Sometimes, it is preferable not to pool information across features, for instance if: - (1) The number of features is substantially smaller than the number of participants (p << n) or - (2) The prior distributions used in ComBat do not fit well the data - (3) The site effects are only present for a small subset of features
An example of (2) is studies with site/scanner effects that are highly heteregenous across features, for instance differential scanner effects between white matter (WM) or grey matter (GM) voxels exist. To run the ComBat model without empirical Bayes, which boils down to fitting a location/shift (L/S) model for each feature separately, the option eb=FALSE can be used:
data.harmonized <- neuroCombat(dat=dat, batch=batch, eb=FALSE)Coming soon.