Compute single-sample geneset expression as the
average log-expression f genes in the geneset. Requires log-expression
matrix X and (sparse) geneset matrix matG. If you have gene sets
as a gmt list, please convert it first using the function gmt2mat().
Usage
plaid(
X,
matG,
stats = c("mean", "sum"),
chunk = NULL,
normalize = TRUE,
nsmooth = 3,
assay = "logcounts",
min.genes = 5,
max.genes = 500
)Arguments
- X
Log-transformed expr. matrix. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects.
- matG
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists (named list of gene vectors).
- stats
Score computation stats: mean or sum of intensity. Default 'mean'.
- chunk
Logical: use chunks for large matrices. Default 'NULL' for autodetect.
- normalize
Logical: median normalize results or not. Default 'TRUE'.
- nsmooth
Smoothing parameter for more stable average when stats="mean". Default 3.
- assay
Character: assay name to extract from SummarizedExperiment/SingleCellExperiment. Default "logcounts".
- min.genes
Integer: minimum genes per gene set (for BiocSet/GMT input). Default 5.
- max.genes
Integer: maximum genes per gene set (for BiocSet/GMT input). Default 500.
Details
PLAID needs the gene sets as sparse matrix. If you have your collection of gene sets a a list, we need first to convert the gmt list to matrix format.
We recommend to run PLAID on the log transformed expression matrix, not on the counts, as the average in the logarithmic space is more robust and is in concordance to calculating the geometric mean.
It is not necessary to normalize your expression matrix before running PLAID because PLAID performs median normalization of the enrichment scores afterwards.
It is recommended to use sparse matrix as PLAID relies on sparse matrix computations. But, PLAID is also fast for dense matrices.
PLAID can also be run on the ranked matrix. This corresponds to the singscore (Fouratan et al., 2018). PLAID can also be run on the (non-logarithmic) counts which can be used to calculate the scSE score (Pont et al., 2019).
PLAID is fast and memery efficient because it uses efficient sparse matrix computation. When input matrix is very large, PLAID performs 'chunked' computation by splitting the matrix in chunks.
Although X and matG are generally sparse, the result
matrix gsetX generally is dense and can thus be very large.
Example: computing gene set scores for 10K gene sets on 1M cells
will create a 10K x 1M dense matrix which requires ~75GB memory.
PLAID now automatically detects and handles Bioconductor objects. If X is a SummarizedExperiment or SingleCellExperiment, it will extract the appropriate assay. If matG is a BiocSet object or GMT list, it will be converted to sparse matrix format automatically.
Examples
library(plaid)
# Create example expression matrix
set.seed(123)
X <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(X) <- paste0("GENE", 1:100)
colnames(X) <- paste0("Sample", 1:10)
# Create example gene sets
gmt <- list(
"Pathway1" = paste0("GENE", 1:20),
"Pathway2" = paste0("GENE", 15:35),
"Pathway3" = paste0("GENE", 30:50)
)
matG <- gmt2mat(gmt)
# Compute PLAID scores
gsetX <- plaid(X, matG)
print(dim(gsetX))
#> [1] 3 10
print(gsetX[1:3, 1:5])
#> Sample1 Sample2 Sample3 Sample4 Sample5
#> Pathway2 -12.01495 -11.85949 -12.04643 -11.73753 -11.82438
#> Pathway3 -11.76847 -11.91129 -12.06332 -12.04601 -12.11773
#> Pathway1 -11.82811 -12.03737 -11.90176 -11.99814 -12.07793
# Use sum statistics instead of mean
gsetX_sum <- plaid(X, matG, stats = "sum")
# \donttest{
# Using real data (if available in package)
extdata_path <- system.file("extdata", "pbmc3k-50cells.rda", package = "plaid")
if (file.exists(extdata_path)) {
load(extdata_path)
hallmarks <- system.file("extdata", "hallmarks.gmt", package = "plaid")
gmt <- read.gmt(hallmarks)
matG <- gmt2mat(gmt)
gsetX <- plaid(X, matG)
}
# }
