A collection of utility functions used within the anglemania package for manipulating FBMs, calculating statistics, and selecting genes.

Replaces all NaN and Inf values in a numeric vector with NA.

sparse_to_fbm(s_mat)

get_dstat(corr_matrix)

big_mat_list_mean(angl)

get_list_stats(angl)

extract_rows_for_unique_genes(dt, max_n_genes)

prefilter_angl(angl, zscore_mean_threshold = 1, zscore_sn_threshold = 1)

select_genes(
  angl,
  zscore_mean_threshold = 2,
  zscore_sn_threshold = 2,
  max_n_genes = NULL,
  direction = "both",
  adjust_thresholds = TRUE
)

replace_with_na(v)

variable_genes_overlap(
  seu,
  angl,
  zscore_mean_thresholds = NULL,
  zscore_sn_thresholds = NULL,
  adjust_thresholds = FALSE,
  layer = "data",
  length_out = 6
)

normalize_matrix(x_mat, normalization_method = "divide_by_total_counts")

Arguments

s_mat

A sparse matrix.

corr_matrix

An FBM object.

angl

An anglemania analysis object. This is used by select_genes and get_anglemania_genes to select genes based on the provided thresholds.

dt

A data frame containing gene pairs, with columns geneA and geneB.

max_n_genes

Integer specifying the maximum number of genes to select. If NULL, all genes that pass the thresholds are used. Default is NULL.

zscore_mean_threshold

Numeric value specifying the threshold for the absolute mean z-score. Default is 2.

zscore_sn_threshold

Numeric value specifying the threshold for the SNR z-score. Default is 2.

direction

whether to select genes with positive, negative or both mean z-scores. Default is "both"

adjust_thresholds

Logical. Whether to use adaptive thresholding in select_genes. Default is FALSE.

v

A numeric vector.

seu

A Seurat object. If variable features have not been identified, they will be computed using FindVariableFeatures.

zscore_mean_thresholds

A numeric vector of zscore mean thresholds used for gene selection from ang_obj. Default is c(0.5, 1:15).

zscore_sn_thresholds

A numeric vector of zscore signal to noise ratio thresholds used for gene selection from ang_obj. Default is c(0.5, 1:15).

layer

Character. The assay layer in the Seurat object to use when calculating variable features (e.g., "counts", "data", or "logcounts"). Default is "data".

x_mat

A bigstatsr::FBM object containing the matrix to normalize (typically genes x cells).

normalization_method

A character string specifying the normalization method to use. One of "divide_by_total_counts" (default) or "find_residuals".

  • "divide_by_total_counts" normalizes each cell by its total expression count and applies log1p.

  • "find_residuals" computes log1p-transformed residuals after regressing out total expression.

Value

An FBM object from the bigstatsr package.

A list with statistical measures including mean, sd, var, min, and max.

A new FBM object containing the mean values.

A list containing three matrices: mean_zscore, sds_zscore, and sn_zscore.

A vector of unique gene identifiers.

The input anglemania_object with the integration_genes slot updated to include the selected genes and their statistical information.

The input anglemaniaObject with the integration_genes slot updated to include the selected genes and their statistical information.

A numeric vector with NaN and Inf values replaced with NA.

A data frame with the following columns:

threshold

The threshold value used for gene selection.

intersecting_genes

The number of genes that are both variable in seu and selected by ang_obj.

perc_var_genes

The proportion of variable genes in seu that are also selected by ang_obj.

number_angl_genes

The total number of genes selected by ang_obj at the given threshold.

The input FBM object with normalized values written back in place. This function modifies the input x_mat by reference.

Details

The function combines the geneA and geneB columns, extracts unique gene names, and returns the first max_n_genes genes. If max_n_genes exceeds the number of unique genes available, all unique genes are returned.

The function performs the following steps:

  1. Checks if the input object is of class anglemania_object-class.

  2. Identifies gene pairs where both the mean z-score and SNR z-score exceed the specified thresholds.

The function performs the following steps:

  1. Checks if the input object is of class anglemania_object-class.

  2. If max_n_genes is not specified, it uses all genes that pass the thresholds.

  3. Identifies gene pairs where both the mean z-score and SNR z-score exceed the specified thresholds.

  4. If no gene pairs meet the criteria, it adjusts the thresholds to the 99th percentile values of the corresponding statistics and re-selects.

  5. Extracts unique genes from the selected gene pairs using extract_rows_for_unique_genes.

  6. Updates the integration_genes slot of the anglemania_object with the selected genes and their statistics.

Functions

  • sparse_to_fbm(): Convert a sparse matrix into a file-backed matrix (FBM) with efficient memory usage.

  • get_dstat(): Compute mean, standard deviation, variance, min, and max of a correlation matrix stored as an FBM.

  • big_mat_list_mean(): Calculates the element-wise mean from a list of FBMs stored in an anglemania_object.

  • get_list_stats(): Calculate mean, standard deviation, and SNR across a list of FBMs stored in an anglemania_object.

  • extract_rows_for_unique_genes(): Extract unique gene identifiers from gene pairs, returning up to a specified maximum number.

  • prefilter_angl(): Preselect genes from an anglemania_object based on z-score thresholds, to simplify downstream filtering.

  • select_genes(): Select genes from an anglemania_object based on z-score thresholds for mean and signal-to-noise ratio (SNR). Selects genes from an anglemania_object-class based on specified thresholds for the absolute mean z-score and signal-to-noise ratio (SNR) z-score. It updates the integration_genes slot of the anglemania_object with the selected genes and associated information.

  • replace_with_na(): replace Nan and Inf values with NA

  • variable_genes_overlap(): Overlap of Variable Genes Across Thresholds. This function computes the overlap between variable genes identified in a Seurat object and the gene sets selected from an "anglemania" analysis object (ang_obj) at multiple threshold levels. For each threshold, genes are selected from ang_obj by applying select_genes, and the intersection of these genes with the variable features of the Seurat object is calculated. The result is a summary data frame that includes the number and percentage of variable genes overlapping with the anglemania-selected genes, as well as the number of genes selected at each threshold.

  • normalize_matrix(): normalize matrix Normalize a Filebacked Big Matrix (FBM) using either total-count scaling or residuals from a linear model. Intended for use with single-cell RNA-seq gene expression data.

See also

big_apply, FBM

big_apply, FBM

select_genes

extract_rows_for_unique_genes, intersect_genes, list_stats

Examples

s_mat <- Matrix::rsparsematrix(nrow = 10, ncol = 5, density = 0.3)
fbm_mat <- sparse_to_fbm(s_mat)
fbm_mat
#> A Filebacked Big Matrix of type 'double' with 10 rows and 5 columns.
s_mat <- Matrix::rsparsematrix(nrow = 10, ncol = 5, density = 0.3)
fbm_mat <- sparse_to_fbm(s_mat)
result <- get_dstat(fbm_mat)
str(result)
#> List of 5
#>  $ mean: num [1:5] -0.033 -0.211 0.209 0 0.083
#>  $ var : num [1:5] 0.1126 0.2017 0.2051 0.0827 0.5553
#>  $ min : num [1:5] -0.56 -1.4 -0.27 -0.61 -1.2
#>  $ max : num [1:5] 0.7 0 1.2 0.61 1.9
#>  $ sd  : num [1:5] 0.336 0.449 0.453 0.288 0.745
result
#> $mean
#> [1] -0.033 -0.211  0.209  0.000  0.083
#> 
#> $var
#> [1] 0.11262333 0.20165444 0.20507667 0.08268889 0.55533444
#> 
#> $min
#> [1] -0.56 -1.40 -0.27 -0.61 -1.20
#> 
#> $max
#> [1] 0.70 0.00 1.20 0.61 1.90
#> 
#> $sd
#> [1] 0.3355940 0.4490595 0.4528539 0.2875568 0.7452077
#> 
# Create FBMs
mat1 <- matrix(1:9, nrow = 3)
mat2 <- matrix(1:9, nrow = 3)

fbm1 <- bigstatsr::FBM(nrow = nrow(mat1), ncol = ncol(mat1), init = mat1)
fbm2 <- bigstatsr::FBM(nrow = nrow(mat2), ncol = ncol(mat2), init = mat2)

# Create weights
weights <- c(batch1 = 0.5, batch2 = 0.5)

# Create the list of FBMs
fbm_list <- list(batch1 = fbm1, batch2 = fbm2)

# Construct the anglemania_object
angl <- new(
  "anglemania_object",
  weights = weights,
  matrix_list = fbm_list
)
big_mat_list_mean(angl)
#> A Filebacked Big Matrix of type 'double' with 3 rows and 3 columns.
sce <- sce_example()
angl <- create_anglemania_object(sce, batch_key = "batch")
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#> 
  |                                                  | 0 % elapsed=00s   
  |==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#> 
  |                                                  | 0 % elapsed=00s   
#> Creating directory "/tmp/Rtmpt97azZ/file1d431655a85d" which didn't exist..
#> 
  |=========================                         | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d435231e3eb" which didn't exist..
#> 
  |==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
list_stats(angl) <- get_list_stats(angl)
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
str(list_stats(angl))
#> List of 3
#>  $ mean_zscore:Reference class 'FBM' [package "bigstatsr"] with 15 fields
#>   ..$ extptr      :<externalptr> 
#>   ..$ extptr_rw   :<externalptr> 
#>   ..$ nrow        : int 300
#>   ..$ ncol        : int 300
#>   ..$ type        : Named int 8
#>   .. ..- attr(*, "names")= chr "double"
#>   ..$ backingfile : chr "/tmp/Rtmpt97azZ/file1d4350de1e8f.bk"
#>   ..$ is_read_only: logi FALSE
#>   ..$ address     :<externalptr> 
#>   ..$ address_rw  :<externalptr> 
#>   ..$ bk          : chr "/tmp/Rtmpt97azZ/file1d4350de1e8f.bk"
#>   ..$ rds         : chr "/tmp/Rtmpt97azZ/file1d4350de1e8f.rds"
#>   ..$ is_saved    : logi FALSE
#>   ..$ type_chr    : chr "double"
#>   ..$ type_size   : int 8
#>   ..$ file_size   : num 720000
#>   ..and 22 methods, of which 8 are  possibly relevant:
#>   ..  add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#>   ..  initialize, save, show#envRefClass
#>  $ sds_zscore :Reference class 'FBM' [package "bigstatsr"] with 15 fields
#>   ..$ extptr      :<externalptr> 
#>   ..$ extptr_rw   :<externalptr> 
#>   ..$ nrow        : int 300
#>   ..$ ncol        : int 300
#>   ..$ type        : Named int 8
#>   .. ..- attr(*, "names")= chr "double"
#>   ..$ backingfile : chr "/tmp/Rtmpt97azZ/file1d435dd0b5d1.bk"
#>   ..$ is_read_only: logi FALSE
#>   ..$ address     :<externalptr> 
#>   ..$ address_rw  :<externalptr> 
#>   ..$ bk          : chr "/tmp/Rtmpt97azZ/file1d435dd0b5d1.bk"
#>   ..$ rds         : chr "/tmp/Rtmpt97azZ/file1d435dd0b5d1.rds"
#>   ..$ is_saved    : logi FALSE
#>   ..$ type_chr    : chr "double"
#>   ..$ type_size   : int 8
#>   ..$ file_size   : num 720000
#>   ..and 22 methods, of which 8 are  possibly relevant:
#>   ..  add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#>   ..  initialize, save, show#envRefClass
#>  $ sn_zscore  :Reference class 'FBM' [package "bigstatsr"] with 15 fields
#>   ..$ extptr      :<externalptr> 
#>   ..$ extptr_rw   :<externalptr> 
#>   ..$ nrow        : int 300
#>   ..$ ncol        : int 300
#>   ..$ type        : Named int 8
#>   .. ..- attr(*, "names")= chr "double"
#>   ..$ backingfile : chr "/tmp/Rtmpt97azZ/file1d437e621b0b.bk"
#>   ..$ is_read_only: logi FALSE
#>   ..$ address     :<externalptr> 
#>   ..$ address_rw  :<externalptr> 
#>   ..$ bk          : chr "/tmp/Rtmpt97azZ/file1d437e621b0b.bk"
#>   ..$ rds         : chr "/tmp/Rtmpt97azZ/file1d437e621b0b.rds"
#>   ..$ is_saved    : logi FALSE
#>   ..$ type_chr    : chr "double"
#>   ..$ type_size   : int 8
#>   ..$ file_size   : num 720000
#>   ..and 22 methods, of which 8 are  possibly relevant:
#>   ..  add_columns, bm, bm.desc, check_dimensions, check_write_permissions,
#>   ..  initialize, save, show#envRefClass
gene_pairs <- data.frame(
  geneA = c("Gene1", "Gene2", "Gene3", "Gene4"),
  geneB = c("Gene3", "Gene4", "Gene5", "Gene6")
)
unique_genes <- extract_rows_for_unique_genes(
  gene_pairs,
  max_n_genes = 3
)
print(unique_genes)
#> [1] "Gene1" "Gene3" "Gene2"
sce <- sce_example()
angl <- create_anglemania_object(sce, batch_key = "batch")
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#> 
  |                                                  | 0 % elapsed=00s   
  |==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#> 
  |                                                  | 0 % elapsed=00s   
#> Creating directory "/tmp/Rtmpt97azZ/file1d432429725b" which didn't exist..
#> 
  |=========================                         | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d4362276466" which didn't exist..
#> 
  |==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
angl <- select_genes(angl,
                      zscore_mean_threshold = 2.5,
                     zscore_sn_threshold = 2.5,
                     max_n_genes = 2000)
#> [1] "Selected 25 genes for integration."
anglemania_genes <- get_anglemania_genes(angl)
# View the selected genes and use for integration
v <- c(1, 2, 3, 4, 5, 6, 7, Inf, 9, NA)
v <- replace_with_na(v)
v
#>  [1]  1  2  3  4  5  6  7 NA  9 NA
sce <- sce_example()
seu <- Seurat::CreateSeuratObject(counts = SingleCellExperiment::counts(sce))
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
seu <- Seurat::FindVariableFeatures(seu)
#> Finding variable features for layer counts
angl <- create_anglemania_object(
  sce,
  batch_key = "batch",
  min_cells_per_gene = 1
)
#> No dataset_key specified.
#> Assuming that all samples belong to the same dataset and are separated by batch_key: batch
#> Using the intersection of filtered genes from all batches...
#> Number of genes in intersected set: 300
#> Extracting count matrices...
#> Filtering each batch to at least 1 cells per gene...
#> 
  |                                                  | 0 % elapsed=00s   
  |==================================================| 100% elapsed=00s, remaining~00s
angl <- anglemania(angl)
#> Computing angles and transforming to z-scores...
#> 
  |                                                  | 0 % elapsed=00s   
#> Creating directory "/tmp/Rtmpt97azZ/file1d4324a68ec4" which didn't exist..
#> 
  |=========================                         | 50% elapsed=00s, remaining~00s
#> Creating directory "/tmp/Rtmpt97azZ/file1d4362cd1494" which didn't exist..
#> 
  |==================================================| 100% elapsed=00s, remaining~00s
#> Computing statistics...
#> Weighting matrix_list...
#> Calculating mean...
#> Calculating sds...
#> Filtering features...
#> Selecting features...
#> [1] "Selected 25 genes for integration."
head(variable_genes_overlap(seu, angl))
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 300 genes for integration."
#> [1] "Selected 119 genes for integration."
#> [1] "Selected 11 genes for integration."
#> [1] "Selected 2 genes for integration."
#> [1] "Selected 4 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 4 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#> [1] "Selected 1 genes for integration."
#>   zscore_mean_thresholds zscore_sn_thresholds intersecting_genes perc_var_genes
#> 1               0.500000                  0.5                300    1.000000000
#> 2               1.101441                  0.5                300    1.000000000
#> 3               1.702882                  0.5                300    1.000000000
#> 4               2.304323                  0.5                119    0.396666667
#> 5               2.905764                  0.5                 11    0.036666667
#> 6               3.507205                  0.5                  2    0.006666667
#>   number_angl_genes perc_ang_genes
#> 1               300              1
#> 2               300              1
#> 3               300              1
#> 4               119              1
#> 5                11              1
#> 6                 2              1
library(bigstatsr)
set.seed(42)
mat <- matrix(rpois(1000, lambda = 5), nrow = 100, ncol = 10)
fbm <- as_FBM(mat)

normalize_matrix(fbm, normalization_method = "divide_by_total_counts")[1:5, 1:5]
#>          [,1]     [,2]     [,3]     [,4]     [,5]
#> [1,] 5.050104 4.789858 5.149273 4.605269 3.107610
#> [2,] 5.167175 4.104990 4.682746 4.605269 4.680626
#> [3,] 4.363345 4.104990 5.016571 3.703866 4.861401
#> [4,] 4.917488 4.388541 4.682746 4.384622 4.459799
#> [5,] 4.764556 5.192548 4.178070 5.071516 5.147147
normalize_matrix(fbm, normalization_method = "find_residuals")[1:5, 1:5]
#>            [,1]        [,2]        [,3]        [,4]          [,5]
#> [1,] 0.11434410  0.05550312  0.10903721  0.01912041 -0.2880443664
#> [2,] 0.09742030 -0.06396060  0.05569234  0.03692842  0.0434047229
#> [3,] 0.01356898 -0.05693503  0.09785461 -0.14442728  0.0808357808
#> [4,] 0.04243604 -0.01456288  0.05508198 -0.00548117 -0.0007333372
#> [5,] 0.04158057  0.08936337 -0.10027542  0.06324354  0.0815543115