A collection of utility functions used within the anglemania package for manipulating FBMs, calculating statistics, and selecting genes.
Replaces all NaN and Inf values in a numeric vector with NA.
sparse_to_fbm(s_mat)
get_dstat(corr_matrix)
big_mat_list_mean(angl)
get_list_stats(angl)
extract_rows_for_unique_genes(dt, max_n_genes)
prefilter_angl(angl, zscore_mean_threshold = 1, zscore_sn_threshold = 1)
select_genes(
angl,
zscore_mean_threshold = 2,
zscore_sn_threshold = 2,
max_n_genes = NULL,
direction = "both",
adjust_thresholds = TRUE
)
replace_with_na(v)
variable_genes_overlap(
seu,
angl,
zscore_mean_thresholds = NULL,
zscore_sn_thresholds = NULL,
adjust_thresholds = FALSE,
layer = "data",
length_out = 6
)
normalize_matrix(x_mat, normalization_method = "divide_by_total_counts")A sparse matrix.
An FBM object.
An anglemania analysis object. This is used by select_genes
and get_anglemania_genes to select genes based on the provided thresholds.
A data frame containing gene pairs, with columns geneA
and geneB.
Integer specifying the maximum number of genes to select.
If NULL, all genes that pass the thresholds are used. Default is
NULL.
Numeric value specifying the threshold for the absolute mean z-score. Default is 2.
Numeric value specifying the threshold for the SNR z-score. Default is 2.
whether to select genes with positive, negative or both mean z-scores. Default is "both"
Logical. Whether to use adaptive thresholding in
select_genes. Default is FALSE.
A numeric vector.
A Seurat object. If variable features have not been identified,
they will be computed using FindVariableFeatures.
A numeric vector of zscore mean thresholds used
for gene selection from ang_obj. Default is c(0.5, 1:15).
A numeric vector of zscore signal to noise ratio
thresholds used for gene selection from ang_obj. Default is
c(0.5, 1:15).
Character. The assay layer in the Seurat object to use when
calculating variable features (e.g., "counts", "data", or "logcounts").
Default is "data".
A bigstatsr::FBM object containing the matrix to normalize
(typically genes x cells).
A character string specifying the normalization
method to use. One of "divide_by_total_counts" (default) or
"find_residuals".
"divide_by_total_counts" normalizes each cell by its total expression
count and applies log1p.
"find_residuals" computes log1p-transformed residuals after regressing
out total expression.
An FBM object from the bigstatsr
package.
A list with statistical measures including mean, sd,
var, min, and max.
A new FBM object containing the mean values.
A list containing three matrices: mean_zscore,
sds_zscore, and sn_zscore.
A vector of unique gene identifiers.
The input anglemania_object with the
integration_genes slot updated to include the selected genes and
their statistical information.
The input anglemaniaObject with the
integration_genes slot updated to include the selected genes and
their statistical information.
A numeric vector with NaN and Inf values replaced with NA.
A data frame with the following columns:
The threshold value used for gene selection.
The number of genes that are both variable in
seu and selected by ang_obj.
The proportion of variable genes in seu that
are also selected by ang_obj.
The total number of genes selected by
ang_obj at the given threshold.
The input FBM object with normalized values written back in place.
This function modifies the input x_mat by reference.
The function combines the geneA and geneB columns, extracts
unique gene names, and returns the first max_n_genes genes. If
max_n_genes exceeds the number of unique genes available, all unique
genes are returned.
The function performs the following steps:
Checks if the input object is of class anglemania_object-class.
Identifies gene pairs where both the mean z-score and SNR z-score exceed the specified thresholds.
The function performs the following steps:
Checks if the input object is of class anglemania_object-class.
If max_n_genes is not specified, it uses all genes that pass
the thresholds.
Identifies gene pairs where both the mean z-score and SNR z-score exceed the specified thresholds.
If no gene pairs meet the criteria, it adjusts the thresholds to the 99th percentile values of the corresponding statistics and re-selects.
Extracts unique genes from the selected gene pairs using
extract_rows_for_unique_genes.
Updates the integration_genes slot of the
anglemania_object
with the selected genes and their statistics.
sparse_to_fbm(): Convert a sparse matrix into a file-backed matrix
(FBM) with efficient memory usage.
get_dstat(): Compute mean, standard deviation, variance,
min, and max of a correlation matrix stored as an
FBM.
big_mat_list_mean(): Calculates the element-wise mean from a list
of FBMs stored in an anglemania_object.
get_list_stats(): Calculate mean, standard deviation, and SNR
across a list of FBMs stored in an anglemania_object.
extract_rows_for_unique_genes(): Extract unique gene identifiers from gene pairs,
returning up to a specified maximum number.
prefilter_angl(): Preselect genes from an
anglemania_object
based on z-score thresholds, to simplify downstream filtering.
select_genes(): Select genes from an anglemania_object
based on z-score thresholds for mean and signal-to-noise ratio (SNR).
Selects genes from an anglemania_object-class based on specified
thresholds for the absolute mean z-score and signal-to-noise ratio
(SNR) z-score. It updates the integration_genes slot of the
anglemania_object with the selected genes and associated
information.
replace_with_na(): replace Nan and Inf values with NA
variable_genes_overlap(): Overlap of Variable Genes Across Thresholds.
This function computes the overlap between variable genes identified in a
Seurat object and the gene sets selected from an "anglemania" analysis
object (ang_obj) at multiple threshold levels. For each threshold, genes
are selected from ang_obj by applying select_genes, and the intersection
of these genes with the variable features of the Seurat object is calculated.
The result is a summary data frame that includes the number and percentage of
variable genes overlapping with the anglemania-selected genes, as well as the
number of genes selected at each threshold.
normalize_matrix(): normalize matrix
Normalize a Filebacked Big Matrix (FBM) using either total-count
scaling or residuals from a linear model. Intended for use with
single-cell RNA-seq gene expression data.
select_genes
extract_rows_for_unique_genes,
intersect_genes, list_stats
s_mat <- Matrix::rsparsematrix(nrow = 10, ncol = 5, density = 0.3)
fbm_mat <- sparse_to_fbm(s_mat)
fbm_mat
#> A Filebacked Big Matrix of type 'double' with 10 rows and 5 columns.
s_mat <- Matrix::rsparsematrix(nrow = 10, ncol = 5, density = 0.3)
fbm_mat <- sparse_to_fbm(s_mat)
result <- get_dstat(fbm_mat)
str(result)
#> List of 5
#> $ mean: num [1:5] -0.033 -0.211 0.209 0 0.083
#> $ var : num [1:5] 0.1126 0.2017 0.2051 0.0827 0.5553
#> $ min : num [1:5] -0.56 -1.4 -0.27 -0.61 -1.2
#> $ max : num [1:5] 0.7 0 1.2 0.61 1.9
#> $ sd : num [1:5] 0.336 0.449 0.453 0.288 0.745
result
#> $mean
#> [1] -0.033 -0.211 0.209 0.000 0.083
#>
#> $var
#> [1] 0.11262333 0.20165444 0.20507667 0.08268889 0.55533444
#>
#> $min
#> [1] -0.56 -1.40 -0.27 -0.61 -1.20
#>
#> $max
#> [1] 0.70 0.00 1.20 0.61 1.90
#>
#> $sd
#> [1] 0.3355940 0.4490595 0.4528539 0.2875568 0.7452077
#>
# Create FBMs
mat1 <- matrix(1:9, nrow = 3)
mat2 <- matrix(1:9, nrow = 3)
fbm1 <- bigstatsr::FBM(nrow = nrow(mat1), ncol = ncol(mat1), init = mat1)
fbm2 <- bigstatsr::FBM(nrow = nrow(mat2), ncol = ncol(mat2), init = mat2)
# Create weights
weights <- c(batch1 = 0.5, batch2 = 0.5)
# Create the list of FBMs
fbm_list <- list(batch1 = fbm1, batch2 = fbm2)
# Construct the anglemania_object
angl <- new(
"anglemania_object",
weights = weights,
matrix_list = fbm_list
)
big_mat_list_mean(angl)
#> A Filebacked Big Matrix of type 'double' with 3 rows and 3 columns.
sce <- sce_example()
angl <- create_anglemania_object(sce, batch_key = "batch")
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#>
| | 0 % elapsed=00s
|==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#>
| | 0 % elapsed=00s
#> Creating directory "/tmp/RtmpmiLilQ/file1d9894f525a" which didn't exist..
#>
|========================= | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/RtmpmiLilQ/file1d98637c4f28" which didn't exist..
#>
|==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
list_stats(angl) <- get_list_stats(angl)
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
str(list_stats(angl))
#> List of 3
#> $ mean_zscore:Reference class 'FBM' [package "bigstatsr"] with 15 fields
#> ..$ extptr :<externalptr>
#> ..$ extptr_rw :<externalptr>
#> ..$ nrow : int 300
#> ..$ ncol : int 300
#> ..$ type : Named int 8
#> .. ..- attr(*, "names")= chr "double"
#> ..$ backingfile : chr "/tmp/RtmpmiLilQ/file1d987b45c3ef.bk"
#> ..$ is_read_only: logi FALSE
#> ..$ address :<externalptr>
#> ..$ address_rw :<externalptr>
#> ..$ bk : chr "/tmp/RtmpmiLilQ/file1d987b45c3ef.bk"
#> ..$ rds : chr "/tmp/RtmpmiLilQ/file1d987b45c3ef.rds"
#> ..$ is_saved : logi FALSE
#> ..$ type_chr : chr "double"
#> ..$ type_size : int 8
#> ..$ file_size : num 720000
#> ..and 22 methods, of which 8 are possibly relevant:
#> .. add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#> .. initialize, save, show#envRefClass
#> $ sds_zscore :Reference class 'FBM' [package "bigstatsr"] with 15 fields
#> ..$ extptr :<externalptr>
#> ..$ extptr_rw :<externalptr>
#> ..$ nrow : int 300
#> ..$ ncol : int 300
#> ..$ type : Named int 8
#> .. ..- attr(*, "names")= chr "double"
#> ..$ backingfile : chr "/tmp/RtmpmiLilQ/file1d98595a5843.bk"
#> ..$ is_read_only: logi FALSE
#> ..$ address :<externalptr>
#> ..$ address_rw :<externalptr>
#> ..$ bk : chr "/tmp/RtmpmiLilQ/file1d98595a5843.bk"
#> ..$ rds : chr "/tmp/RtmpmiLilQ/file1d98595a5843.rds"
#> ..$ is_saved : logi FALSE
#> ..$ type_chr : chr "double"
#> ..$ type_size : int 8
#> ..$ file_size : num 720000
#> ..and 22 methods, of which 8 are possibly relevant:
#> .. add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#> .. initialize, save, show#envRefClass
#> $ sn_zscore :Reference class 'FBM' [package "bigstatsr"] with 15 fields
#> ..$ extptr :<externalptr>
#> ..$ extptr_rw :<externalptr>
#> ..$ nrow : int 300
#> ..$ ncol : int 300
#> ..$ type : Named int 8
#> .. ..- attr(*, "names")= chr "double"
#> ..$ backingfile : chr "/tmp/RtmpmiLilQ/file1d9826e4b9be.bk"
#> ..$ is_read_only: logi FALSE
#> ..$ address :<externalptr>
#> ..$ address_rw :<externalptr>
#> ..$ bk : chr "/tmp/RtmpmiLilQ/file1d9826e4b9be.bk"
#> ..$ rds : chr "/tmp/RtmpmiLilQ/file1d9826e4b9be.rds"
#> ..$ is_saved : logi FALSE
#> ..$ type_chr : chr "double"
#> ..$ type_size : int 8
#> ..$ file_size : num 720000
#> ..and 22 methods, of which 8 are possibly relevant:
#> .. add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#> .. initialize, save, show#envRefClass
gene_pairs <- data.frame(
geneA = c("Gene1", "Gene2", "Gene3", "Gene4"),
geneB = c("Gene3", "Gene4", "Gene5", "Gene6")
)
unique_genes <- extract_rows_for_unique_genes(
gene_pairs,
max_n_genes = 3
)
print(unique_genes)
#> [1] "Gene1" "Gene3" "Gene2"
sce <- sce_example()
angl <- create_anglemania_object(sce, batch_key = "batch")
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#>
| | 0 % elapsed=00s
|==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#>
| | 0 % elapsed=00s
#> Creating directory "/tmp/RtmpmiLilQ/file1d982a69473d" which didn't exist..
#>
|========================= | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/RtmpmiLilQ/file1d985148dd7" which didn't exist..
#>
|==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
angl <- select_genes(angl,
zscore_mean_threshold = 2.5,
zscore_sn_threshold = 2.5,
max_n_genes = 2000)
#> [1] "Selected 25 genes for integration."
anglemania_genes <- get_anglemania_genes(angl)
# View the selected genes and use for integration
v <- c(1, 2, 3, 4, 5, 6, 7, Inf, 9, NA)
v <- replace_with_na(v)
v
#> [1] 1 2 3 4 5 6 7 NA 9 NA
sce <- sce_example()
seu <- Seurat::CreateSeuratObject(counts = SingleCellExperiment::counts(sce))
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
seu <- Seurat::FindVariableFeatures(seu)
#> Finding variable features for layer counts
angl <- create_anglemania_object(
sce,
batch_key = "batch",
min_cells_per_gene = 1
)
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#>
| | 0 % elapsed=00s
|==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#>
| | 0 % elapsed=00s
#> Creating directory "/tmp/RtmpmiLilQ/file1d9879a6c6b8" which didn't exist..
#>
|========================= | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/RtmpmiLilQ/file1d9843f27569" which didn't exist..
#>
|==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
head(variable_genes_overlap(seu, angl))
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 119 genes for integration."
#> [1] "Selected 11 genes for integration."
#> [1] "Selected 2 genes for integration."
#> [1] "Selected 4 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 4 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> zscore_mean_thresholds zscore_sn_thresholds intersecting_genes perc_var_genes
#> 1 0.500000 0.5 300 1.000000000
#> 2 1.101441 0.5 300 1.000000000
#> 3 1.702882 0.5 300 1.000000000
#> 4 2.304323 0.5 119 0.396666667
#> 5 2.905764 0.5 11 0.036666667
#> 6 3.507205 0.5 2 0.006666667
#> number_angl_genes perc_ang_genes
#> 1 300 1
#> 2 300 1
#> 3 300 1
#> 4 119 1
#> 5 11 1
#> 6 2 1
library(bigstatsr)
set.seed(42)
mat <- matrix(rpois(1000, lambda = 5), nrow = 100, ncol = 10)
fbm <- as_FBM(mat)
normalize_matrix(fbm, normalization_method = "divide_by_total_counts")[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 5.050104 4.789858 5.149273 4.605269 3.107610
#> [2,] 5.167175 4.104990 4.682746 4.605269 4.680626
#> [3,] 4.363345 4.104990 5.016571 3.703866 4.861401
#> [4,] 4.917488 4.388541 4.682746 4.384622 4.459799
#> [5,] 4.764556 5.192548 4.178070 5.071516 5.147147
normalize_matrix(fbm, normalization_method = "find_residuals")[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.11434410 0.05550312 0.10903721 0.01912041 -0.2880443664
#> [2,] 0.09742030 -0.06396060 0.05569234 0.03692842 0.0434047229
#> [3,] 0.01356898 -0.05693503 0.09785461 -0.14442728 0.0808357808
#> [4,] 0.04243604 -0.01456288 0.05508198 -0.00548117 -0.0007333372
#> [5,] 0.04158057 0.08936337 -0.10027542 0.06324354 0.0815543115