A collection of utility functions used within the anglemania package for manipulating FBMs, calculating statistics, and selecting genes.
Replaces all NaN and Inf values in a numeric vector with NA.
sparse_to_fbm(s_mat)
get_dstat(corr_matrix)
big_mat_list_mean(angl)
get_list_stats(angl)
extract_rows_for_unique_genes(dt, max_n_genes)
prefilter_angl(angl, zscore_mean_threshold = 1, zscore_sn_threshold = 1)
select_genes(
angl,
zscore_mean_threshold = 2,
zscore_sn_threshold = 2,
max_n_genes = NULL,
direction = "both",
adjust_thresholds = TRUE
)
replace_with_na(v)
variable_genes_overlap(
seu,
angl,
zscore_mean_thresholds = NULL,
zscore_sn_thresholds = NULL,
adjust_thresholds = FALSE,
layer = "data",
length_out = 6
)
normalize_matrix(x_mat, normalization_method = "divide_by_total_counts")
A sparse matrix.
An FBM
object.
An anglemania analysis object. This is used by select_genes
and get_anglemania_genes
to select genes based on the provided thresholds.
A data frame containing gene pairs, with columns geneA
and geneB
.
Integer specifying the maximum number of genes to select.
If NULL
, all genes that pass the thresholds are used. Default is
NULL
.
Numeric value specifying the threshold for the absolute mean z-score. Default is 2.
Numeric value specifying the threshold for the SNR z-score. Default is 2.
whether to select genes with positive, negative or both mean z-scores. Default is "both"
Logical. Whether to use adaptive thresholding in
select_genes
. Default is FALSE.
A numeric vector.
A Seurat object. If variable features have not been identified,
they will be computed using FindVariableFeatures
.
A numeric vector of zscore mean thresholds used
for gene selection from ang_obj
. Default is c(0.5, 1:15)
.
A numeric vector of zscore signal to noise ratio
thresholds used for gene selection from ang_obj
. Default is
c(0.5, 1:15)
.
Character. The assay layer in the Seurat object to use when
calculating variable features (e.g., "counts", "data", or "logcounts").
Default is "data"
.
A bigstatsr::FBM
object containing the matrix to normalize
(typically genes x cells).
A character string specifying the normalization
method to use. One of "divide_by_total_counts"
(default) or
"find_residuals"
.
"divide_by_total_counts"
normalizes each cell by its total expression
count and applies log1p.
"find_residuals"
computes log1p-transformed residuals after regressing
out total expression.
An FBM
object from the bigstatsr
package.
A list with statistical measures including mean
, sd
,
var
, min
, and max
.
A new FBM
object containing the mean values.
A list containing three matrices: mean_zscore
,
sds_zscore
, and sn_zscore
.
A vector of unique gene identifiers.
The input anglemania_object
with the
integration_genes
slot updated to include the selected genes and
their statistical information.
The input anglemaniaObject
with the
integration_genes
slot updated to include the selected genes and
their statistical information.
A numeric vector with NaN and Inf values replaced with NA.
A data frame with the following columns:
The threshold value used for gene selection.
The number of genes that are both variable in
seu
and selected by ang_obj
.
The proportion of variable genes in seu
that
are also selected by ang_obj
.
The total number of genes selected by
ang_obj
at the given threshold.
The input FBM
object with normalized values written back in place.
This function modifies the input x_mat
by reference.
The function combines the geneA
and geneB
columns, extracts
unique gene names, and returns the first max_n_genes
genes. If
max_n_genes
exceeds the number of unique genes available, all unique
genes are returned.
The function performs the following steps:
Checks if the input object is of class anglemania_object-class
.
Identifies gene pairs where both the mean z-score and SNR z-score exceed the specified thresholds.
The function performs the following steps:
Checks if the input object is of class anglemania_object-class
.
If max_n_genes
is not specified, it uses all genes that pass
the thresholds.
Identifies gene pairs where both the mean z-score and SNR z-score exceed the specified thresholds.
If no gene pairs meet the criteria, it adjusts the thresholds to the 99th percentile values of the corresponding statistics and re-selects.
Extracts unique genes from the selected gene pairs using
extract_rows_for_unique_genes
.
Updates the integration_genes
slot of the
anglemania_object
with the selected genes and their statistics.
sparse_to_fbm()
: Convert a sparse matrix into a file-backed matrix
(FBM
) with efficient memory usage.
get_dstat()
: Compute mean, standard deviation, variance,
min, and max of a correlation matrix stored as an
FBM
.
big_mat_list_mean()
: Calculates the element-wise mean from a list
of FBM
s stored in an anglemania_object
.
get_list_stats()
: Calculate mean, standard deviation, and SNR
across a list of FBMs stored in an anglemania_object
.
extract_rows_for_unique_genes()
: Extract unique gene identifiers from gene pairs,
returning up to a specified maximum number.
prefilter_angl()
: Preselect genes from an
anglemania_object
based on z-score thresholds, to simplify downstream filtering.
select_genes()
: Select genes from an anglemania_object
based on z-score thresholds for mean and signal-to-noise ratio (SNR).
Selects genes from an anglemania_object-class
based on specified
thresholds for the absolute mean z-score and signal-to-noise ratio
(SNR) z-score. It updates the integration_genes
slot of the
anglemania_object
with the selected genes and associated
information.
replace_with_na()
: replace Nan and Inf values with NA
variable_genes_overlap()
: Overlap of Variable Genes Across Thresholds.
This function computes the overlap between variable genes identified in a
Seurat object and the gene sets selected from an "anglemania" analysis
object (ang_obj
) at multiple threshold levels. For each threshold, genes
are selected from ang_obj
by applying select_genes
, and the intersection
of these genes with the variable features of the Seurat object is calculated.
The result is a summary data frame that includes the number and percentage of
variable genes overlapping with the anglemania-selected genes, as well as the
number of genes selected at each threshold.
normalize_matrix()
: normalize matrix
Normalize a Filebacked Big Matrix (FBM
) using either total-count
scaling or residuals from a linear model. Intended for use with
single-cell RNA-seq gene expression data.
select_genes
extract_rows_for_unique_genes
,
intersect_genes
, list_stats
s_mat <- Matrix::rsparsematrix(nrow = 10, ncol = 5, density = 0.3)
fbm_mat <- sparse_to_fbm(s_mat)
fbm_mat
#> A Filebacked Big Matrix of type 'double' with 10 rows and 5 columns.
s_mat <- Matrix::rsparsematrix(nrow = 10, ncol = 5, density = 0.3)
fbm_mat <- sparse_to_fbm(s_mat)
result <- get_dstat(fbm_mat)
str(result)
#> List of 5
#> $ mean: num [1:5] -0.033 -0.211 0.209 0 0.083
#> $ var : num [1:5] 0.1126 0.2017 0.2051 0.0827 0.5553
#> $ min : num [1:5] -0.56 -1.4 -0.27 -0.61 -1.2
#> $ max : num [1:5] 0.7 0 1.2 0.61 1.9
#> $ sd : num [1:5] 0.336 0.449 0.453 0.288 0.745
result
#> $mean
#> [1] -0.033 -0.211 0.209 0.000 0.083
#>
#> $var
#> [1] 0.11262333 0.20165444 0.20507667 0.08268889 0.55533444
#>
#> $min
#> [1] -0.56 -1.40 -0.27 -0.61 -1.20
#>
#> $max
#> [1] 0.70 0.00 1.20 0.61 1.90
#>
#> $sd
#> [1] 0.3355940 0.4490595 0.4528539 0.2875568 0.7452077
#>
# Create FBMs
mat1 <- matrix(1:9, nrow = 3)
mat2 <- matrix(1:9, nrow = 3)
fbm1 <- bigstatsr::FBM(nrow = nrow(mat1), ncol = ncol(mat1), init = mat1)
fbm2 <- bigstatsr::FBM(nrow = nrow(mat2), ncol = ncol(mat2), init = mat2)
# Create weights
weights <- c(batch1 = 0.5, batch2 = 0.5)
# Create the list of FBMs
fbm_list <- list(batch1 = fbm1, batch2 = fbm2)
# Construct the anglemania_object
angl <- new(
"anglemania_object",
weights = weights,
matrix_list = fbm_list
)
big_mat_list_mean(angl)
#> A Filebacked Big Matrix of type 'double' with 3 rows and 3 columns.
sce <- sce_example()
angl <- create_anglemania_object(sce, batch_key = "batch")
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#>
| | 0 % elapsed=00s
|==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#>
| | 0 % elapsed=00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d431655a85d" which didn't exist..
#>
|========================= | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d435231e3eb" which didn't exist..
#>
|==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
list_stats(angl) <- get_list_stats(angl)
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
str(list_stats(angl))
#> List of 3
#> $ mean_zscore:Reference class 'FBM' [package "bigstatsr"] with 15 fields
#> ..$ extptr :<externalptr>
#> ..$ extptr_rw :<externalptr>
#> ..$ nrow : int 300
#> ..$ ncol : int 300
#> ..$ type : Named int 8
#> .. ..- attr(*, "names")= chr "double"
#> ..$ backingfile : chr "/tmp/Rtmpt97azZ/file1d4350de1e8f.bk"
#> ..$ is_read_only: logi FALSE
#> ..$ address :<externalptr>
#> ..$ address_rw :<externalptr>
#> ..$ bk : chr "/tmp/Rtmpt97azZ/file1d4350de1e8f.bk"
#> ..$ rds : chr "/tmp/Rtmpt97azZ/file1d4350de1e8f.rds"
#> ..$ is_saved : logi FALSE
#> ..$ type_chr : chr "double"
#> ..$ type_size : int 8
#> ..$ file_size : num 720000
#> ..and 22 methods, of which 8 are possibly relevant:
#> .. add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#> .. initialize, save, show#envRefClass
#> $ sds_zscore :Reference class 'FBM' [package "bigstatsr"] with 15 fields
#> ..$ extptr :<externalptr>
#> ..$ extptr_rw :<externalptr>
#> ..$ nrow : int 300
#> ..$ ncol : int 300
#> ..$ type : Named int 8
#> .. ..- attr(*, "names")= chr "double"
#> ..$ backingfile : chr "/tmp/Rtmpt97azZ/file1d435dd0b5d1.bk"
#> ..$ is_read_only: logi FALSE
#> ..$ address :<externalptr>
#> ..$ address_rw :<externalptr>
#> ..$ bk : chr "/tmp/Rtmpt97azZ/file1d435dd0b5d1.bk"
#> ..$ rds : chr "/tmp/Rtmpt97azZ/file1d435dd0b5d1.rds"
#> ..$ is_saved : logi FALSE
#> ..$ type_chr : chr "double"
#> ..$ type_size : int 8
#> ..$ file_size : num 720000
#> ..and 22 methods, of which 8 are possibly relevant:
#> .. add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#> .. initialize, save, show#envRefClass
#> $ sn_zscore :Reference class 'FBM' [package "bigstatsr"] with 15 fields
#> ..$ extptr :<externalptr>
#> ..$ extptr_rw :<externalptr>
#> ..$ nrow : int 300
#> ..$ ncol : int 300
#> ..$ type : Named int 8
#> .. ..- attr(*, "names")= chr "double"
#> ..$ backingfile : chr "/tmp/Rtmpt97azZ/file1d437e621b0b.bk"
#> ..$ is_read_only: logi FALSE
#> ..$ address :<externalptr>
#> ..$ address_rw :<externalptr>
#> ..$ bk : chr "/tmp/Rtmpt97azZ/file1d437e621b0b.bk"
#> ..$ rds : chr "/tmp/Rtmpt97azZ/file1d437e621b0b.rds"
#> ..$ is_saved : logi FALSE
#> ..$ type_chr : chr "double"
#> ..$ type_size : int 8
#> ..$ file_size : num 720000
#> ..and 22 methods, of which 8 are possibly relevant:
#> .. add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#> .. initialize, save, show#envRefClass
gene_pairs <- data.frame(
geneA = c("Gene1", "Gene2", "Gene3", "Gene4"),
geneB = c("Gene3", "Gene4", "Gene5", "Gene6")
)
unique_genes <- extract_rows_for_unique_genes(
gene_pairs,
max_n_genes = 3
)
print(unique_genes)
#> [1] "Gene1" "Gene3" "Gene2"
sce <- sce_example()
angl <- create_anglemania_object(sce, batch_key = "batch")
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#>
| | 0 % elapsed=00s
|==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#>
| | 0 % elapsed=00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d432429725b" which didn't exist..
#>
|========================= | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d4362276466" which didn't exist..
#>
|==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
angl <- select_genes(angl,
zscore_mean_threshold = 2.5,
zscore_sn_threshold = 2.5,
max_n_genes = 2000)
#> [1] "Selected 25 genes for integration."
anglemania_genes <- get_anglemania_genes(angl)
# View the selected genes and use for integration
v <- c(1, 2, 3, 4, 5, 6, 7, Inf, 9, NA)
v <- replace_with_na(v)
v
#> [1] 1 2 3 4 5 6 7 NA 9 NA
sce <- sce_example()
seu <- Seurat::CreateSeuratObject(counts = SingleCellExperiment::counts(sce))
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
seu <- Seurat::FindVariableFeatures(seu)
#> Finding variable features for layer counts
angl <- create_anglemania_object(
sce,
batch_key = "batch",
min_cells_per_gene = 1
)
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#>
| | 0 % elapsed=00s
|==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#>
| | 0 % elapsed=00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d4324a68ec4" which didn't exist..
#>
|========================= | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d4362cd1494" which didn't exist..
#>
|==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
head(variable_genes_overlap(seu, angl))
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 119 genes for integration."
#> [1] "Selected 11 genes for integration."
#> [1] "Selected 2 genes for integration."
#> [1] "Selected 4 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 4 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> zscore_mean_thresholds zscore_sn_thresholds intersecting_genes perc_var_genes
#> 1 0.500000 0.5 300 1.000000000
#> 2 1.101441 0.5 300 1.000000000
#> 3 1.702882 0.5 300 1.000000000
#> 4 2.304323 0.5 119 0.396666667
#> 5 2.905764 0.5 11 0.036666667
#> 6 3.507205 0.5 2 0.006666667
#> number_angl_genes perc_ang_genes
#> 1 300 1
#> 2 300 1
#> 3 300 1
#> 4 119 1
#> 5 11 1
#> 6 2 1
library(bigstatsr)
set.seed(42)
mat <- matrix(rpois(1000, lambda = 5), nrow = 100, ncol = 10)
fbm <- as_FBM(mat)
normalize_matrix(fbm, normalization_method = "divide_by_total_counts")[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 5.050104 4.789858 5.149273 4.605269 3.107610
#> [2,] 5.167175 4.104990 4.682746 4.605269 4.680626
#> [3,] 4.363345 4.104990 5.016571 3.703866 4.861401
#> [4,] 4.917488 4.388541 4.682746 4.384622 4.459799
#> [5,] 4.764556 5.192548 4.178070 5.071516 5.147147
normalize_matrix(fbm, normalization_method = "find_residuals")[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.11434410 0.05550312 0.10903721 0.01912041 -0.2880443664
#> [2,] 0.09742030 -0.06396060 0.05569234 0.03692842 0.0434047229
#> [3,] 0.01356898 -0.05693503 0.09785461 -0.14442728 0.0808357808
#> [4,] 0.04243604 -0.01456288 0.05508198 -0.00548117 -0.0007333372
#> [5,] 0.04158057 0.08936337 -0.10027542 0.06324354 0.0815543115