
Getting Started with PLAID
BigOmics Analytics
Source:vignettes/01_plaid-vignette.Rmd
01_plaid-vignette.RmdIntroduction
PLAID (Pathway Level Average Intensity Detection) is a novel, ultrafast and memory optimized gene set enrichment scoring algorithm. PLAID demonstrates accurate gene set scoring and outperforms all currently available gene set scoring methods in large bulk and single-cell RNA-seq datasets.
Motivation
In recent years, computational methods have emerged that calculate enrichment of gene signatures within individual samples, rather than across pooled samples. These signatures offer critical insights into the coordinated activity of functionally related genes, proteins or metabolites, enabling identification of unique molecular profiles based on gene set and pathway activity in individual cells and patients. This strategy is pivotal for patient stratification and advancement of personalized medicine. However, the rise of large-scale datasets, including single-cell profiles and population biobanks, has exposed significant computational inefficiencies in existing methods. Current methods often demand excessive runtime and memory resources, becoming impractical for large datasets. Overcoming these limitations is a focus of current efforts by bioinformatics teams in academia and the pharmaceutical industry, as essential to support basic and clinical biomedical research.
Example: Single-cell RNA-seq hallmark scoring
Preparing data
For this vignette, our package includes a small subset of the the pmbc3k single-cell dataset of just 50 cells. Please install the Seurat and SeuratData packages if you want to run this vignette against the full dataset.
library("plaid")
#> Warning: replacing previous import 'matrixStats::colRanks' by
#> 'sparseMatrixStats::colRanks' when loading 'plaid'
load(system.file("extdata", "pbmc3k-50cells.rda", package = "plaid"),verbose=TRUE)
#> Loading objects:
#> X
#> celltype
dim(X)
#> [1] 7728 50Note that X is the normalized expression matrix from the Seurat object, not the raw counts matrix. We recommend to run plaid on the log transformed expression matrix, not on the counts, as the average in the logarithmic space is more robust and is in concordance to calculating the geometric mean.
It is not necessary to normalize your expression matrix before running plaid because plaid normalizes the enrichment scores afterwards. However, again, log transformation is recommended.
It is recommended to keep the expression matrix sparse as much as possible because plaid extensively take advantage of sparse matrix computations. But even for dense matrices plaid is fast.
Preparing gene sets
For convenience we have included the 50 Hallmark genesets in our package. But we encourage you to download larger geneset collections as plaid’s speed advantage will be more apparent for larger datasets and large geneset collections.
Plaid needs the gene sets as sparse matrix. If you have your collection of gene sets a a list, we need first to convert the gmt list to matrix format.
hallmarks <- system.file("extdata", "hallmarks.gmt", package = "plaid")
gmt <- read.gmt(hallmarks)
matG <- gmt2mat(gmt)
dim(matG)
#> [1] 4386 50If you have your own gene sets stored as gmt files, you can
conveniently use the included read.gmt() function to read
the gmt file.
Calculating the score
The main function to run plaid is plaid(). We run plaid
on our expression matrix X and gene set matrix
matG.
The resulting matrix gsetX contains the single-sample
enrichment scores for the specified gene sets and samples.
Notice that by default plaid performs median normalization of the final results. That also means that it is not necessary to normalize your expression matrix before running plaid. However, generally, log transformation is recommended.
Plaid can also be run on the ranked matrix, we will see later that this corresponds to the singscore (Fouratan et al., 2018). Or plaid could be run on the (non-logarithmic) counts which can be used to calculate the scSE score (Pont et al., 2019).
Very large matrices
Plaid is fast and memory efficient because it uses very efficient
sparse matrix computation in the back. For very large X,
plaid uses chunked computation by splitting the matrix in chunks to
avoid index overflow. Should you encounter errors, please compute your
dataset by subsetting manually the expression matrix and/or gene
sets.
Although X and matG are generally very
sparse, be aware that the result matrix gsetX generally is
dense and therefore can become very large. If you would want to compute
the score of 10.000 gene sets on a million of cells this would create a
large 10.000 x 1.000.000 dense matrix which requires about 75GB of
memory.
Performing a differential expression test
Once we have the gene sets scores we can use these scores for
statistical analysis. We could compute the differential gene set
expression between two groups using a general t-test or limma directly
on the score matrix gsetX.
Another way to test whether a gene set is statistically significant
would be to test whether the fold-change of the genes in the gene sets
are statistically different than zero. That is, we can perform a one
sample t-test on the logFC of the genes of each gene sets and test
whether they are significantly different from zero. The logFC is
computed from the original (log) expression matrix X and
group vector y.
The function dual_test() does both tests: the one-sample
t-test on the logFC and the two-group t-test on the gene set matrix
gsetX.
y <- 1*(celltype == "B")
res <- dual_test(X, y, matG, gsetX=gsetX)The top significant genesets can be shown with
res <- res[order(res[,"p.dual"]),]
head(res)
#> gsetFC size p.fc p.ss
#> HALLMARK_INTERFERON_GAMMA_RESPONSE 0.09923298 155 5.614052e-06 1.067381e-05
#> HALLMARK_ALLOGRAFT_REJECTION 0.08322962 134 5.162037e-04 1.311508e-05
#> HALLMARK_P53_PATHWAY -0.04702449 130 1.378624e-02 2.900893e-04
#> HALLMARK_INTERFERON_ALPHA_RESPONSE 0.09705679 78 1.653989e-03 9.406415e-03
#> HALLMARK_ANGIOGENESIS -0.14291289 4 5.680258e-02 6.098317e-02
#> HALLMARK_IL2_STAT5_SIGNALING 0.02905553 107 3.781802e-01 3.704478e-03
#> p.dual q.dual
#> HALLMARK_INTERFERON_GAMMA_RESPONSE 4.948006e-10 2.474003e-08
#> HALLMARK_ALLOGRAFT_REJECTION 6.014977e-08 1.503744e-06
#> HALLMARK_P53_PATHWAY 3.290696e-05 5.484494e-04
#> HALLMARK_INTERFERON_ALPHA_RESPONSE 9.262597e-05 1.157825e-03
#> HALLMARK_ANGIOGENESIS 1.347047e-02 1.157300e-01
#> HALLMARK_IL2_STAT5_SIGNALING 1.730306e-02 1.157300e-01The column gsetFC corresponds to the difference in gene
set score and also corresponds to the average foldchange of the genes in
the gene set. The column ‘p.fc corresponds to the test on the preranked
logFC, the column ’p.ss’ corresponds to the two-group t-test on the
geneset scores gsetX. The two p-values are then combined
using Stouffer’s method in the column ‘p.dual’ and adjusted for multiple
testing in column q.dual.
We can also show the results as a volcano plot:
fc <- res[,"gsetFC"]
pv <- res[,"p.dual"]
plot( fc, -log10(pv), xlab="logFC", ylab="-log10p", pch=19)
abline(h=0, v=0, lty=2)
text( fc[1:5], -log10(pv[1:5]), rownames(res)[1:5],pos=2)
Replicating ssGSEA, singscore and scSE
Plaid can be used to replicate other enrichment score such as ssGSEA, GSVA, AUCell, Singscore, scSE and UCell. But using plaid, the computation is much faster than the original code.
Replicating singscore
Computing the singscore requires to compute the ranks of the expression matrix. We have wrapped this in a single convenience function:
sing <- replaid.sing(X, matG)We have extensively compared the results of replaid.sing
and from the original singscore R package and we showed
identical result in the score, logFC and p-values.
Replicating ssGSEA
Plaid can also be used to compute the ssGSEA score (Barbie et al., 2009). Using plaid, we can calculate the score upto 100x faster. We have wrapped this in a single convenience function:
ssgsea <- replaid.ssgsea(X, matG, alpha=0)We have extensively compared the results of
replaid.ssgsea() and from the original GSVA R package. Note the
rank weight parameter alpha. For alpha=0 we obtained
identical result for the score, logFC and p-values. For non-zero values
for alpha the results are close but not exactly the same. The default
value in the original publication and in GSVA is
alpha=0.25.
Replicating the scSE score
Single-cell Signature Explorer (scSE) is a fast enrichment algorithm implemented in GO. Computing the scSE score requires running plaid on the linear (not logarithmic) score and perform additional normalization by the total UMI per sample. We have wrapped this in a single convenience function:
scse <- replaid.scse(X, matG, removeLog2=TRUE, scoreMean=FALSE)
#> [replaid.scse] Converting data to linear scale (removing log2)...To replicate the original “sum-of-UMI” scSE score, set
removeLog2=TRUE and scoreMean=FALSE. scSE and
plaid scores become more similar for removeLog2=FALSE and
scoreMean=TRUE.
We have extensively compared the results from
replaid.scse and from the original scSE (implemented in GO
lang) and we showed almost identical results in the score, logFC and
p-values.
Session info
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] plaid_0.99.12 BiocStyle_2.38.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.40.0 KEGGREST_1.50.0
#> [3] xfun_0.54 bslib_0.9.0
#> [5] sparsesvd_0.2-3 qlcMatrix_0.9.9
#> [7] Biobase_2.70.0 lattice_0.22-7
#> [9] vctrs_0.6.5 tools_4.5.2
#> [11] generics_0.1.4 parallel_4.5.2
#> [13] stats4_4.5.2 tibble_3.3.0
#> [15] AnnotationDbi_1.72.0 RSQLite_2.4.4
#> [17] blob_1.2.4 pkgconfig_2.0.3
#> [19] Matrix_1.7-4 sparseMatrixStats_1.22.0
#> [21] desc_1.4.3 S4Vectors_0.48.0
#> [23] RcppParallel_5.1.11-1 lifecycle_1.0.4
#> [25] compiler_4.5.2 textshaping_1.0.4
#> [27] Biostrings_2.78.0 Seqinfo_1.0.0
#> [29] htmltools_0.5.8.1 sass_0.4.10
#> [31] yaml_2.3.10 BiocSet_1.24.0
#> [33] pkgdown_2.2.0 pillar_1.11.1
#> [35] crayon_1.5.3 jquerylib_0.1.4
#> [37] tidyr_1.3.1 DelayedArray_0.36.0
#> [39] cachem_1.1.0 abind_1.4-8
#> [41] tidyselect_1.2.1 digest_0.6.38
#> [43] slam_0.1-55 dplyr_1.1.4
#> [45] purrr_1.2.0 bookdown_0.45
#> [47] fastmap_1.2.0 grid_4.5.2
#> [49] SparseArray_1.10.1 cli_3.6.5
#> [51] magrittr_2.0.4 S4Arrays_1.10.0
#> [53] Rfast_2.1.5.2 bit64_4.6.0-1
#> [55] rmarkdown_2.30 XVector_0.50.0
#> [57] httr_1.4.7 matrixStats_1.5.0
#> [59] bit_4.6.0 ragg_1.5.0
#> [61] png_0.1-8 memoise_2.0.1
#> [63] evaluate_1.0.5 knitr_1.50
#> [65] GenomicRanges_1.62.0 IRanges_2.44.0
#> [67] BiocIO_1.20.0 zigg_0.0.2
#> [69] rlang_1.1.6 docopt_0.7.2
#> [71] Rcpp_1.1.0 ontologyIndex_2.12
#> [73] glue_1.8.0 DBI_1.2.3
#> [75] BiocManager_1.30.26 BiocGenerics_0.56.0
#> [77] jsonlite_2.0.0 R6_2.6.1
#> [79] plyr_1.8.9 MatrixGenerics_1.22.0
#> [81] systemfonts_1.3.1 fs_1.6.6