This package provides an R implementation of ROMA Link to the new article
A Java implementation developed by Andrei Zynovyev and co-authors is also available.
The rRoma package relies on the scater
and
biomaRt
packages, which are available only on BioConductor.
These packages can be installed with the following command. It also
requires ggplot2
for later descriptive plots.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
BiocManager::install("scater")
if(!requireNamespace("ggplot2")){
install.packages("ggplot2")
}
rRoma can then be installed using devtools
.
if(!requireNamespace("devtools")){
install.packages("devtools")
}
if(!requireNamespace("rRoma")){
devtools::install_github("sysbio-curie/rRoma")
}
rRoma uses the mice
package to fill missing values. This
package needs to be installed if analysed datasets contain missing
values. However, we recommend using on datasets for which missing values
have already been imputed.
if(!requireNamespace("mice")){
install.packages("mice")
}
The package can be loaded with the usual syntax, i.e. by typing
library(rRoma)
rRoma requires a gene expression matrix, with column names indicating samples and row names indicating gene names.
It also requires a module file containing information on the genesets that need to be evaluated. The module file can be loaded from a GMT file (see example).
Various functions are then available to explore the analysis.
Let us begin by getting the description of the dataset. This is a RNA sequencing dataset of human airway epithelial cultures from cystic fibrosis (CF) and non-CF donors from GSE176121, PMID: 34166230. rROMA was applied to compare between disease and healthy conditions (CF vs. non-CF). Warning: rRoma works with normalized data. Here cpm from edgeR was used for normalisation, but you could also use vst from DeSeq2 or any other method you deem interesting.
summary(MatData)
## GSM5356205 GSM5356206 GSM5356207 GSM5356208
## Min. :-2.2767 Min. :-2.2767 Min. :-2.2767 Min. :-2.2767
## 1st Qu.:-2.2767 1st Qu.:-2.2767 1st Qu.:-2.2767 1st Qu.:-2.2767
## Median : 0.7358 Median : 0.7857 Median : 0.6825 Median : 0.6317
## Mean : 1.3248 Mean : 1.3373 Mean : 1.3145 Mean : 1.2787
## 3rd Qu.: 4.4071 3rd Qu.: 4.4094 3rd Qu.: 4.4518 3rd Qu.: 4.2986
## Max. :16.6047 Max. :16.5216 Max. :16.2624 Max. :16.5667
## GSM5356209 GSM5356210 GSM5356217 GSM5356218
## Min. :-2.2767 Min. :-2.277 Min. :-2.2767 Min. :-2.2767
## 1st Qu.:-2.2767 1st Qu.:-2.277 1st Qu.:-2.2767 1st Qu.:-2.2767
## Median : 0.7868 Median : 1.003 Median : 0.9984 Median : 0.9614
## Mean : 1.3254 Mean : 1.469 Mean : 1.4643 Mean : 1.4752
## 3rd Qu.: 4.3817 3rd Qu.: 4.637 3rd Qu.: 4.6141 3rd Qu.: 4.7112
## Max. :16.5640 Max. :15.881 Max. :15.8263 Max. :15.5290
## GSM5356219 GSM5356220 GSM5356221 GSM5356222
## Min. :-2.2767 Min. :-2.277 Min. :-2.2767 Min. :-2.277
## 1st Qu.:-2.2767 1st Qu.:-2.277 1st Qu.:-2.2767 1st Qu.:-2.277
## Median : 0.9005 Median : 1.035 Median : 0.7106 Median : 1.067
## Mean : 1.4019 Mean : 1.483 Mean : 1.3196 Mean : 1.471
## 3rd Qu.: 4.4961 3rd Qu.: 4.597 3rd Qu.: 4.4072 3rd Qu.: 4.651
## Max. :16.0042 Max. :16.102 Max. :16.6007 Max. :15.507
We can then load sample labels:
summary(sample_labels)
## sample_id Type
## Length:12 Length:12
## Class :character Class :character
## Mode :character Mode :character
Type <- sample_labels$Type
names(Type) <- sample_labels$sample_id
table(Type)
## Type
## CF non-CF
## 6 6
Now that we have a normalized expression matrix with genes as rows and samples as columns, we need to define the pathway we want to test in our analysis. We can do that by loading a GMT file. Here, we extract all the “HALLMARK” genesets from MSigDB (Liberzon A et al, Cell Syst 2015). We also remove “HALLMARK_” from the names to simplify graphical representations.
Note that we chose the Hallmark database for explanatory purpose. To have a more complete view of biological systems at stake in your study, we recommend using rRoma with multiple databases combined together.
AllHall <- SelectFromMSIGdb("HALLMARK")
## [1] "Searching in MsigDB v6.2"
## [1] "The following genesets have been selected:"
## [1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB (200 genes)"
## [2] "HALLMARK_HYPOXIA (200 genes)"
## [3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS (74 genes)"
## [4] "HALLMARK_MITOTIC_SPINDLE (200 genes)"
## [5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING (42 genes)"
## [6] "HALLMARK_TGF_BETA_SIGNALING (54 genes)"
## [7] "HALLMARK_IL6_JAK_STAT3_SIGNALING (87 genes)"
## [8] "HALLMARK_DNA_REPAIR (150 genes)"
## [9] "HALLMARK_G2M_CHECKPOINT (200 genes)"
## [10] "HALLMARK_APOPTOSIS (161 genes)"
## [11] "HALLMARK_NOTCH_SIGNALING (32 genes)"
## [12] "HALLMARK_ADIPOGENESIS (200 genes)"
## [13] "HALLMARK_ESTROGEN_RESPONSE_EARLY (200 genes)"
## [14] "HALLMARK_ESTROGEN_RESPONSE_LATE (200 genes)"
## [15] "HALLMARK_ANDROGEN_RESPONSE (101 genes)"
## [16] "HALLMARK_MYOGENESIS (200 genes)"
## [17] "HALLMARK_PROTEIN_SECRETION (96 genes)"
## [18] "HALLMARK_INTERFERON_ALPHA_RESPONSE (97 genes)"
## [19] "HALLMARK_INTERFERON_GAMMA_RESPONSE (200 genes)"
## [20] "HALLMARK_APICAL_JUNCTION (200 genes)"
## [21] "HALLMARK_APICAL_SURFACE (44 genes)"
## [22] "HALLMARK_HEDGEHOG_SIGNALING (36 genes)"
## [23] "HALLMARK_COMPLEMENT (200 genes)"
## [24] "HALLMARK_UNFOLDED_PROTEIN_RESPONSE (113 genes)"
## [25] "HALLMARK_PI3K_AKT_MTOR_SIGNALING (105 genes)"
## [26] "HALLMARK_MTORC1_SIGNALING (200 genes)"
## [27] "HALLMARK_E2F_TARGETS (200 genes)"
## [28] "HALLMARK_MYC_TARGETS_V1 (200 genes)"
## [29] "HALLMARK_MYC_TARGETS_V2 (58 genes)"
## [30] "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION (200 genes)"
## [31] "HALLMARK_INFLAMMATORY_RESPONSE (200 genes)"
## [32] "HALLMARK_XENOBIOTIC_METABOLISM (200 genes)"
## [33] "HALLMARK_FATTY_ACID_METABOLISM (158 genes)"
## [34] "HALLMARK_OXIDATIVE_PHOSPHORYLATION (200 genes)"
## [35] "HALLMARK_GLYCOLYSIS (200 genes)"
## [36] "HALLMARK_REACTIVE_OXIGEN_SPECIES_PATHWAY (49 genes)"
## [37] "HALLMARK_P53_PATHWAY (200 genes)"
## [38] "HALLMARK_UV_RESPONSE_UP (158 genes)"
## [39] "HALLMARK_UV_RESPONSE_DN (144 genes)"
## [40] "HALLMARK_ANGIOGENESIS (36 genes)"
## [41] "HALLMARK_HEME_METABOLISM (200 genes)"
## [42] "HALLMARK_COAGULATION (138 genes)"
## [43] "HALLMARK_IL2_STAT5_SIGNALING (200 genes)"
## [44] "HALLMARK_BILE_ACID_METABOLISM (112 genes)"
## [45] "HALLMARK_PEROXISOME (104 genes)"
## [46] "HALLMARK_ALLOGRAFT_REJECTION (200 genes)"
## [47] "HALLMARK_SPERMATOGENESIS (135 genes)"
## [48] "HALLMARK_KRAS_SIGNALING_UP (200 genes)"
## [49] "HALLMARK_KRAS_SIGNALING_DN (200 genes)"
## [50] "HALLMARK_PANCREAS_BETA_CELLS (40 genes)"
AllHall <- lapply(AllHall, function(x){
x$Name <- sub("HALLMARK_", "", x$Name)
x
})
In case you already have a GMT file with modules you want to test,
you can also load them by using the ReadGMTFile
function.
You can now run rRoma on our dataset by simply specifying the expression dataset and the modules you want to test. Here we are fixing the seed for used for random permutations.
# set.seed(69)
rRoma.output <- rRoma.R(MatData, AllHall)
The most important information can be found in the module matrix,
here in rRoma.output$ModuleMatrix
. It contains p and q
values for overdispersion (L1) and shift (Median Exp) for all tested
modules.
head(rRoma.output$ModuleMatrix)
## L1 Median L1 ppv L1 L1/L2 Median L1/L2
## TNFA_SIGNALING_VIA_NFKB 0.1809479 0.2178158 0.95 1.103873 1.368943
## HYPOXIA 0.1747907 0.2192118 0.99 1.137466 1.369228
## CHOLESTEROL_HOMEOSTASIS 0.2270088 0.2937958 0.94 1.458432 1.568515
## MITOTIC_SPINDLE 0.1737191 0.2178158 0.97 1.167411 1.368943
## WNT_BETA_CATENIN_SIGNALING 0.3222136 0.3597925 0.73 1.529207 1.886010
## TGF_BETA_SIGNALING 0.2287674 0.3307323 0.99 1.194487 1.817202
## ppv L1/L2 Median Exp ppv Median Exp q L1 q L1/L2
## TNFA_SIGNALING_VIA_NFKB 0.97 0.142511317 0.34 1 1
## HYPOXIA 0.91 -0.012509054 0.96 1 1
## CHOLESTEROL_HOMEOSTASIS 0.61 -0.008574073 0.97 1 1
## MITOTIC_SPINDLE 0.85 -0.128220680 0.51 1 1
## WNT_BETA_CATENIN_SIGNALING 0.72 0.151428338 0.17 1 1
## TGF_BETA_SIGNALING 0.94 -0.024187817 0.89 1 1
## q Median Exp
## TNFA_SIGNALING_VIA_NFKB 0.9847059
## HYPOXIA 0.9847059
## CHOLESTEROL_HOMEOSTASIS 0.9847059
## MITOTIC_SPINDLE 0.9847059
## WNT_BETA_CATENIN_SIGNALING 0.9847059
## TGF_BETA_SIGNALING 0.9847059
We are interested in two different types of modules:
Shifted modules, whose genes behave differently
from the rest of the genes in at least one sample. Corresponding p value
is in ppv Median Exp
. A q value is also calculated and
given in q Median Exp
. Here we will consider a module as
shifted if p < 0.05. We don’t look at q values here as both the
number of samples and the number of tested modules is small. Consider
looking at q values for larger data sets.
Overdispersed modules, for which the
approximation to one PC is correct. Corresponding p value is in
ppv ML1
. A q value is also calculated and given in
q L1
. Here we will consider a module as overdispersed if p
< 0.05. We don’t look at q values for the same reason as before, but
considered looking at it for larger datasets.
We are first interested in shifted modules.
shifted.modules <- which(rRoma.output$ModuleMatrix[, "ppv Median Exp"] <= 0.05)
We want to see which samples are responsible for the shift. This can be done by looking at Sample Scores. The function plots a heatmap of these score:
Plot.Genesets.Samples(rRoma.output, Selected = shifted.modules, GroupInfo = Type, cluster_cols = TRUE)
## [1] "2 genesets selected"
## [1] "12 samples have an associated group"
In case we didn’t know the groups before the analysis, this representation can help us define groups of samples that behave similarly on a pathway level. Here we already knew the groups. We can see that GSM5356221 behaves as an outlier. It could be interesting to look at metadata to understand why he behaves differently. Here we see that genes related to Apical surface, fatty acid metabolism and heme metabolism tend to be more expressed in cf patients.
We can also visualize how much the module is shifted compared to random gene sets. In particular, we’re interested in PC1 median expression since we focus on shifted pathways:
Plot.Genesets.vs.Sampled(rRoma.output, Selected = shifted.modules, Plot = "PC1Median")
## [1] "2 genesets selected"
The boxplot corresponds to the distribution of median PC1 expression for random gene sets (null distribution), while the red dot corresponds to the value obtained for the tested module. Here we can see that APICAL_SURFACE is the most shifted pathway.
We can then focus on Overdispersed modules. Note that overdispersion is calculated for all modules, including the shifted one. However interpretation of overdispersion for shifted pathways might be misleading. Focusing on pathways that are overdispersed but not shifted will lead to more interpretable results.
overdispersed.modules <- which(rRoma.output$ModuleMatrix[, "ppv L1"] <= 0.05 & rRoma.output$ModuleMatrix[, "ppv Median Exp"] > 0.05)
For such pathways we are interested in both sample scores and gene scores. The former tells us in which sample the effect of the factor is the biggest, while the later tells us which genes are the most important for this module. Overdispersed representation is particularly interesting in modules containing genes that are both activated and inhibited by the pathway.
We first plot the same heatmap as before to visualize sample scores:
Plot.Genesets.Samples(rRoma.output, Selected = overdispersed.modules, GroupInfo = Type, cluster_cols = TRUE)
## [1] "Heatmap is not clustering module activities when less than 2 modules are selected"
## [1] "1 genesets selected"
## [1] "12 samples have an associated group"
It appears that this module is also interesting to compare CF versus control samples.
Then we look at gene weights. To do this we use the
GetTopContrib
function.
When using this function, the number of selected genes can be specified via the ratio of genes of the module (nGenes larger than 0 and smaller than 1) or via the absolute number of genes to return (nGenes integer and larger than 1). Additionally, top contributing genes can be obtained via their correlation with the module score (Mode = “Cor”) of via their weight (Mode = “Wei”) (recommended). To use correlation the expression matrix needs to be specified (via the ExpressionMatrix parameter).
The function will return a table (Table) with the computed information and the matrix used to produce the heatmap (VizMat). The summary heatmap will be plotted only if Plot = TRUE. Here, we will see the 10% of genes in the module that are the most important for this module.
GeneMat <- GetTopContrib(rRoma.output,
Selected = overdispersed.modules,
nGenes = 0.1, OrderType = "Abs", Mode = "Wei", Plot = TRUE)
GeneMat$Table
## Gene Module Weight
## GSN GSN COAGULATION 15.603510
## SERPINA1 SERPINA1 COAGULATION -6.156917
## FN1 FN1 COAGULATION -5.663831
## ITGB3 ITGB3 COAGULATION -3.560363
## PLAT PLAT COAGULATION -3.431828
## PLAU PLAU COAGULATION -3.420021
## MMP9 MMP9 COAGULATION -3.366812
## PRSS23 PRSS23 COAGULATION 3.134257
## F12 F12 COAGULATION -3.001901
## SERPINE1 SERPINE1 COAGULATION -2.698196
## KLKB1 KLKB1 COAGULATION -2.675976
## TIMP3 TIMP3 COAGULATION 2.585418
## C1R C1R COAGULATION -2.497833
## RAPGEF3 RAPGEF3 COAGULATION -2.281372
Similarly to shifted pathways, we can visualize by how much the
pathway is overdispersed using Plot.Genesets.vs.Sampled
,
this time looking at the % of variance explained by the PC1:
Plot.Genesets.vs.Sampled(rRoma.output, Selected = overdispersed.modules, Plot = "L1")
## [1] "1 genesets selected"
In cases where groups are known, we can quantify differences between groups in terms of modules, by comparing sample scores, i.e the activity of tested modules in the different groups.
It is possible to first look at a global scale:
GlobalCompareAcrossSamples(RomaData = rRoma.output,
Selected = c(overdispersed.modules, shifted.modules),
Plot = "both",
Groups = Type)
## [1] "3 geneset(s) selected"
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## [1] "Performing Type III AOV (R default)"
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 0.1528 0.15283 1.825 0.186
## Residuals 34 2.8472 0.08374
Then at a module scale. For all modules an ANOVA is performed first, followed by wilcoxon test for each 2 by 2 group comparison.
GlobalCompareAcrossSamplesGenesets(RomaData = rRoma.output,
Selected = c(overdispersed.modules, shifted.modules),
Groups = Type)
## [1] "3 geneset(s) selected"
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## [1] "Performing Type III AOV (R default) for each geneset"
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 0.1528 0.1528 2.929 0.097318 .
## Group:GeneSet 4 1.2819 0.3205 6.142 0.000982 ***
## Residuals 30 1.5653 0.0522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The computation of the PC1 can be affected even by a single outlier in the data set. In order to increase robustness of the PC1 computation, we apply here the “leave-one-out” cross-validation approach (see article for details). This way we can remove outliers from our dataset before calculating PC1. However this kind of outliers removal might lead to removing too many genes in a given module, changing the interpretation of the results. Several filtering steps are thus performed in order to limit the amount of genes considered as outliers:
-Filtering based on Fisher exact test: is a given gene considered outlier in the same proportion as a random gene ? (see article for details)
-Filtering to keep genes characteristic of a small number of modules
-Restricting the number of outliers to a maximum of X% of the size of the module
These filtering steps can be visualized on plots and used as a
baseline to potentially adapt the filtering hyperparameters (see
Performing Roma: changing hyperparameters) by using the
PlotOutliers
function.
Three plots are available:
Plot_L1 helps visualize the leave-one-out approach. For each gene, the % of variance explained by PC1 upon removal of the given gene is plotted.
rRoma::PlotOutliers(rRoma.output, MatData,
Selected = c(overdispersed.modules, shifted.modules),
Plot_L1 = TRUE,
Plot_Weights = FALSE,
Plot_Projection = FALSE)
Plot weights shows the weights of the genes in case no outliers were removed (i.e projetion on PC1).
rRoma::PlotOutliers(rRoma.output, MatData,
Selected = c(overdispersed.modules, shifted.modules),
Plot_L1 = FALSE,
Plot_Weights = TRUE,
Plot_Projection = FALSE)
Plot_Projections plots genes in the PC1/PC2 space.
rRoma::PlotOutliers(rRoma.output, MatData,
Selected = c(overdispersed.modules, shifted.modules),
Plot_L1 = FALSE,
Plot_Weights = FALSE,
Plot_Projection = TRUE)
## Warning: Duplicated aesthetics after name standardisation: colour
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: Duplicated aesthetics after name standardisation: colour
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## Duplicated aesthetics after name standardisation: colour
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
Here we can see that almost all genes that were initially considered as outliers were actually retained, as they were all characteristic for a very small number of modules. It would be interesting to see what changes if we consider these as outliers anyway (see Performing Roma: changing hyperparameters).
In the case of overdispersed modules that are not shifted, all components are computed with undefined orientation sign. Several methods exist to try to orient the PCs (see Choosing the PC orientation procedure). It is possible to visualize both gene weights and expressions for all genes in a given module in order to assess whether PC oroentation seems correct.
PlotGeneWeight(RomaData = rRoma.output, PlotGenes = 30,
ExpressionMatrix = MatData, LogExpression = FALSE,
Selected = c(overdispersed.modules, shifted.modules))
## [1] "3 geneset(s) selected"
## Using Gene as id variables
## Using Gene as id variables
## Using Gene as id variables
The same kind of plot can be obtained with samples:
PlotSampleProjections(RomaData = rRoma.output, PlotSamples = 30,
ExpressionMatrix = MatData, LogExpression = FALSE,
Selected = c(overdispersed.modules, shifted.modules))
## [1] "3 geneset(s) selected"
## Using as id variables
## Using as id variables
## Using as id variables
Amongst the most interesting plots to understand shift and
overdispersion is the PC1/PC2 projection of the genes of a given module
against the background for null distribution. Such a plot is available
with the PlotPCProjections
function:
PlotPCProjections(RomaData = rRoma.output, Selected = c(overdispersed.modules, shifted.modules),
PlotPCProj = c('Points'))
## [1] "3 geneset(s) selected"
## Warning: Removed 35 rows containing missing values (geom_point).
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: Removed 66 rows containing missing values (geom_point).
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## Warning: Removed 76 rows containing missing values (geom_point).
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
Here we can clearly the shift for the modules we investigated previously, althought we can see it is only a small effect.
It is also possible to plot multiple information on a specific gene
accross all studied modules. Here we’ll take SERPINE1 as a case study,
since it is present in 9 Hallmark pathways, including Coagulation, which
we found to be overdispersed. Several plots are available within the
ExploreGeneProperties
function:
-PlotExpVar plots the expression variance of the studied gene compared to all other genes in the dataset
-PlotExpMean plots the expression mean of the studied gene compared to all other genes in the dataset
-PlotExpDist plots the distribution of the expression of the studied gene in all samples, and in groups (with differential expression testing) if specified
-PlotWeight plots the gene weight in each of the considered gene sets, compared to all other gene weights. % shows the position of the gene in terms of absolute value
-PlotExpSampleScore plots gene expression against sample score for all considered gene sets and tries to fit a line
Here, we plot all of them (Selected=NULL
is equivalent
to selecting all modules).
ExploreGeneProperties(
rRoma.output,
"SERPINE1",
MatData,
Selected = NULL,
GroupInfo = Type,
PlotExpVar = TRUE,
PlotExpMean= TRUE,
PlotExpDist = TRUE,
PlotWeight = TRUE,
PlotExpSampleScore = TRUE
)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute
## exact p-value with ties
## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute
## exact p-value with ties
## [1] "Gene found in 9 modules"
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here we can see that Serpine1 is an important gene in most of the modules it is included in. This gene is also differentially expressed between CF and non-CF.
Even thought you can run Roma with nothing more than an expression matrix and a list of modules to analyse, the algorithm accepts many hyperpatameters that can be changed to best adapt to your needs. Here we will explain how to use all of them.
Roma’s interpretation is correct only if the data is already centered
at the global scale, i.e all genes have a mean of 0 accross samples. By
default Roma starts by centering the data, however it is possible to
skip this step by specifying centerData=FALSE
. We recommend
doing so only if your data is already centered, to fasten the
analysis.
Counts detection for some samples might have been incorrectly
performed in some samples, and keeping them might lead to incorrect
results from Roma. By specifying ExpFilter=TRUE
, 2
filtering steps are performed. We first make sure that a similar number
of genes is detected in all samples, the threshold being defined by
OutGeneNumber
. Then samples are projected in the gene space
and make sure no sample lies to far from the rest. The number of PCs
used to perform filtering at this step is defined by NComp
,
and the threshold is defined by OutGeneSpace
(note that in
case this parameter is set to NULL, no filtering based on projecting
samples in the gene space is performed) By default some filtering is not
performed, as the input matrix is supposed to be already clean.
It is possible to input prior knowledge about your modules in Roma by
changing prior gene weights. By defaut all prior gene weights are
initialized to 1 (this default value can be changed with the
DefaultWeight
argument). These weights can be increased for
some genes known to be more relevant than the others for a given module,
or the sign of the weight can be changed to account for
activators/inhibitors. Gene weights must be changed in the modules’ GMT
file, and will be taken into acount only if
UseWeights = TRUE
.
The parameters MinGenes
and MaxGenes
determines the minimum and the maximum number of genes from a module
that can be in the expression dataset in order for the module to be
considered in Roma’s analysis.
In order to check whether modules are overdispersed/shifted, their L1
and Median Exp (for PC1) value are compared to L1s and Median Exp values
calculated for random subset of genes of similar size. This step is very
time consuming and a tradeof must be found between precision and speed.
In order not to redefine new random gene sets for each module, modules
of similar sizes share the same background gene sets. The % of
difference needed in terms of numbers of genes to generate new random
gene sets can be changed using the ApproxSamples
parameter.
The number of random gene sets generated at each step is defined by
nSamples
. It is at 100 by default. A higher number will
result in more precision in later p value calculation, but will also be
more time consuming. We don’t advise going under 100. It is also
possible to decide whether or not randomly selected gene sets can
include genes from the module they are compared to, using
GeneSelMode
. Finally, it is possible to choose whether or
not outliers should be filtered out in those sets using
SampleFilter
.
Note: The number of PCs calculated for both modules and random gene
sets can be changed with PCADims
(2 by default). However we
don’t recommend changing it as 2 is a necessary and sufficient number of
dimension to compute all features of Roma.
Several methods exist to filter out outliers. Choice is made with the
GeneOutDetection
parameter (see function documentation for
details). You can be more or less stringent by changing
GeneOutThr
. As explained in Undestanding outliers, there
are then later steps to reduce the number of outliers.
OutlierRarelyFoundThr
lets you choose the minimum number of
modules a gene has to belong to in order to be potentially considered as
an outlier. OutlierFisherThr
lets you choose the p value
threshold for the fisher test. OutliersPerc
lets you choose
the maximum ratio of genes that can be outliers in a module.
Several PC orientation procedures are available and can be chosen by
changing the PCSignMode
parameters. See function
documentation for details about available options. Some methods require
a threshold, which can be set up using the PCSignThr
parameter. Also, some methods can use advantage of information on
groups, which can be provided with the Grouping
parameter,
and specifying GroupPCSign=FALSE
. Finally, some methods
rely on correlation. The type of correlation to perform can be choosen
using the CorMethod
parameter. We recommend using
PCSignMode=UseMeanExpressionAllWeights
with
PCSignThr=0.9
. However, note that none of the proposed
methods is perfect, and it is recommend to manually check the sign for
overdispersed modules.
A parallel environment can be used to speed up calculation (however
this will increase memory usage). It can be activated with the
UseParallel
parameter. The number of cores is defined by
nCores
, and the type of cluster ty use by
ClusType
.
In case the data contains missing values, Roma starts by imputing
them using mice. More parameters can be passed to the mice function
using FillNAMethod
.We however recommend you perform your
own imputation beforehands.
It is possible to speed up calculation by not saving results for
randomized genesets, by specifying FullSampleInfo=FALSE
.
However in that case some plots can no longer be drawn. It is also
possible not to perform PC sign orientation on randomized genesets by
specifying SampleSign=FALSE
, again to speed up
calculation.
If MoreInfo=TRUE
, more detailed information on the
computation be printed, while SuppressWarning=TRUE
will
limit the amount of information outputed.
We recommend the following parameters for a first use of Roma, to
have a good tradeoff between speed and memory usage. However in case it
doesn’t bother you to le the analysis run for a longer time we recommend
increasing nSamples
(to 1000 or 10000) as it will result in
more precise p values. OutlierRarelyFoundThr
should be
placed at about 5% of the total number of modules you are testing.
rRoma.output.2 <- rRoma.R(ExpressionMatrix = MatData, ModuleList = AllHall, UseWeights = TRUE, DefaultWeight = 1, GeneOutDetection = 'L1OutExpOut', PCSignMode = "UseMeanExpressionAllWeights", OutGeneSpace = 5, UseParallel = TRUE, nCores = 3, ClusType = "FORK", OutlierRarelyFoundThr = 2)