vignettes/anglemania_tutorial.Rmd
anglemania_tutorial.Rmd
anglemania is a feature selection package that extracts genes from multi-batch scRNA-seq experiments for downstream dataset integration. The goal is to select genes that carry high biological information and low technical noise between the batches. Those genes are extracted from gene pairs that have an invariant and extremely narrow or wide angle between their expression vectors. Conventionally, highly-variable genes (HVGs) or sometimes all genes are used for integration tasks (https://www.nature.com/articles/s41592-021-01336-8). While HVGs are a great and easy way to reduce the noise and dimensionality of the data, they are not optimal for integration tasks. HVGs are sensitive to batch effects because the variance is a function of both the technical and biological variance. anglemania improves conventional usage of HVGs for integration tasks, especially when the transcriptional difference between cell types or cell states is subtle (showcased here with de.facLoc set to 0.1 (= mild differences between “Groups”)). The package can be used on top of SingleCellExperiment or Seurat objects.
Under the hood, anglemania works with file-backed big matrices (FBMs) from the bigstatsr package (https://github.com/privefl/bigstatsr/) for fast and memory efficient computation.
install.packages("RcppPlanc", repos = "https://welch-lab.r-universe.dev")
## Installing package into '/home/runner/work/_temp/Library'
## (as 'lib' is unspecified)
batch.facLoc <- 0.1
de.facLoc <- 0.1
nBatches <- 4
nGroups <- 3
nGenes <- 5000
groupCells <- 300
sce <- splatSimulate(
batchCells = rep(groupCells * nGroups, nBatches),
batch.facLoc = batch.facLoc,
group.prob = rep(1/nGroups, nGroups),
nGenes = nGenes,
batch.facScale = 0.1,
method = "groups",
verbose = FALSE,
out.prob = 0.001,
de.prob = 0.1, # mild
de.facLoc = de.facLoc,
de.facScale = 0.1,
bcv.common = 0.1,
seed = 42
)
sce
## class: SingleCellExperiment
## dim: 5000 3600
## metadata(1): Params
## assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
## rownames(5000): Gene1 Gene2 ... Gene4999 Gene5000
## rowData names(11): Gene BaseGeneMean ... DEFacGroup2 DEFacGroup3
## colnames(3600): Cell1 Cell2 ... Cell3599 Cell3600
## colData names(4): Cell Batch Group ExpLibSize
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
sce_unintegrated <- sce
# Normalization.
sce_unintegrated <- logNormCounts(sce_unintegrated)
# Feature selection.
dec <- modelGeneVar(sce_unintegrated)
hvg <- getTopHVGs(dec, prop = 0.1)
# PCA.
set.seed(1234)
sce_unintegrated <- scater::runPCA(sce_unintegrated, ncomponents = 50, subset_row = hvg)
# Clustering.
colLabels(sce_unintegrated) <- clusterCells(sce_unintegrated,
use.dimred = "PCA",
BLUSPARAM = NNGraphParam(cluster.fun = "louvain")
)
# Visualization.
sce_unintegrated <- scater::runUMAP(sce_unintegrated, dimred = "PCA")
plotUMAP(sce_unintegrated, colour_by = "Batch")
plotUMAP(sce_unintegrated, colour_by = "Group")
## DataFrame with 6 rows and 4 columns
## Cell Batch Group ExpLibSize
## <character> <character> <factor> <numeric>
## Cell1 Cell1 Batch1 Group1 46898.1
## Cell2 Cell2 Batch1 Group1 54688.2
## Cell3 Cell3 Batch1 Group2 52027.9
## Cell4 Cell4 Batch1 Group1 52319.5
## Cell5 Cell5 Batch1 Group3 37774.7
## Cell6 Cell6 Batch1 Group3 58112.3
batch_key <- "Batch"
angl <- create_anglemania_object(
sce,
batch_key = batch_key
)
## No dataset_key specified.
## Assuming that all samples belong to the same dataset and are separated by batch_key: Batch
## Extracting count matrices...
## Filtering each batch to at least 1 cells per gene...
## Using the intersection of filtered genes from all batches...
## Number of genes in intersected set: 4970
angl
## anglemania_object
## --------------
## Dataset key: NA
## Batch key: Batch
## Number of datasets: 1
## Total number of batches: 4
## Batches (showing first 5):
## Batch1, Batch2, Batch3, Batch4
## Number of intersected genes: 4970
## Intersected genes (showing first 10):
## Gene1, Gene2, Gene3, Gene4, Gene5, Gene6, Gene7, Gene8, Gene9, Gene10 , ...
## Min cells per gene: 1
angl <- anglemania(angl,
zscore_mean_threshold = 2,
zscore_sn_threshold = 2,
max_n_genes = 2000 # optionally define a max number of genes. default is 2000
)
## Computing angles and transforming to z-scores...
## Computing statistics...
## Weighting matrix_list...
## Calculating mean...
## Calculating sds...
## Filtering features...
## [1] "Selected 1698 genes for integration."
# If you think the number of selected genes is too high or low you can adjust the thresholds:
angl <- select_genes(angl,
zscore_mean_threshold = 2.5,
zscore_sn_threshold = 2.5)
## [1] "Selected 348 genes for integration."
# Inspect the anglemania genes
integration_genes <- get_anglemania_genes(angl)
head(integration_genes)
## [1] "Gene71" "Gene3527" "Gene2470" "Gene4030" "Gene355" "Gene394"
length(integration_genes)
## [1] 348
## ! No human mitochondrial gene found in the union of dataset "Batch1"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch1"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch1"✔ calculating QC for dataset "Batch1" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch2"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch2"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch2"✔ calculating QC for dataset "Batch2" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch3"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch3"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch3"✔ calculating QC for dataset "Batch3" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch4"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch4"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch4"✔ calculating QC for dataset "Batch4" ... done
##
## ℹ Normalizing datasets "Batch1"
## ℹ Normalizing datasets "Batch2"
## ℹ Normalizing datasets "Batch3"
## ℹ Normalizing datasets "Batch4"
## ✔ Normalizing datasets "Batch4" ... done
##
## ℹ Normalizing datasets "Batch3"✔ Normalizing datasets "Batch3" ... done
##
## ℹ Normalizing datasets "Batch2"✔ Normalizing datasets "Batch2" ... done
##
## ℹ Normalizing datasets "Batch1"✔ Normalizing datasets "Batch1" ... done
##
## ℹ Selecting variable features for dataset "Batch1"
## ✔ ... 320 features selected out of 5000 shared features.
## ℹ Selecting variable features for dataset "Batch2"
## ✔ ... 320 features selected out of 5000 shared features.
## ℹ Selecting variable features for dataset "Batch3"
## ✔ ... 320 features selected out of 5000 shared features.
## ℹ Selecting variable features for dataset "Batch4"
## ✔ ... 320 features selected out of 5000 shared features.
## ✔ Finally 490 shared variable features are selected.
## ℹ Scaling dataset "Batch1"
## ✔ Scaling dataset "Batch1" ... done
##
## ℹ Scaling dataset "Batch2"
## ✔ Scaling dataset "Batch2" ... done
##
## ℹ Scaling dataset "Batch3"
## ✔ Scaling dataset "Batch3" ... done
##
## ℹ Scaling dataset "Batch4"
## ✔ Scaling dataset "Batch4" ... done
## ℹ Generating UMAP on unaligned cell factor loadings
## ✔ Generating UMAP on unaligned cell factor loadings ... done
##
## ℹ DimRed "UMAP" is now set as default.
## Warning in .check_reddim_names(x, value, withDimnames): non-NULL 'rownames(value)' should be the same as 'colnames(x)' for
## 'reducedDim<-'. This will be an error in the next release of
## Bioconductor.
## ! No human mitochondrial gene found in the union of dataset "Batch1"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch1"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch1"✔ calculating QC for dataset "Batch1" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch2"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch2"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch2"✔ calculating QC for dataset "Batch2" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch3"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch3"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch3"✔ calculating QC for dataset "Batch3" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch4"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch4"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch4"✔ calculating QC for dataset "Batch4" ... done
##
## ℹ Normalizing datasets "Batch1"
## ℹ Normalizing datasets "Batch2"
## ℹ Normalizing datasets "Batch3"
## ℹ Normalizing datasets "Batch4"
## ✔ Normalizing datasets "Batch4" ... done
##
## ℹ Normalizing datasets "Batch3"✔ Normalizing datasets "Batch3" ... done
##
## ℹ Normalizing datasets "Batch2"✔ Normalizing datasets "Batch2" ... done
##
## ℹ Normalizing datasets "Batch1"✔ Normalizing datasets "Batch1" ... done
##
## ℹ Selecting variable features for dataset "Batch1"
## ✔ ... 2000 features selected out of 5000 shared features.
## ℹ Selecting variable features for dataset "Batch2"
## ✔ ... 2000 features selected out of 5000 shared features.
## ℹ Selecting variable features for dataset "Batch3"
## ✔ ... 2000 features selected out of 5000 shared features.
## ℹ Selecting variable features for dataset "Batch4"
## ✔ ... 2000 features selected out of 5000 shared features.
## ✔ Finally 2598 shared variable features are selected.
## ℹ Scaling dataset "Batch1"
## ✔ Scaling dataset "Batch1" ... done
##
## ℹ Scaling dataset "Batch2"
## ✔ Scaling dataset "Batch2" ... done
##
## ℹ Scaling dataset "Batch3"
## ✔ Scaling dataset "Batch3" ... done
##
## ℹ Scaling dataset "Batch4"
## ✔ Scaling dataset "Batch4" ... done
## ℹ Generating UMAP on unaligned cell factor loadings
## ✔ Generating UMAP on unaligned cell factor loadings ... done
##
## ℹ DimRed "UMAP" is now set as default.
## Warning in .check_reddim_names(x, value, withDimnames): non-NULL 'rownames(value)' should be the same as 'colnames(x)' for
## 'reducedDim<-'. This will be an error in the next release of
## Bioconductor.
## ! No human mitochondrial gene found in the union of dataset "Batch1"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch1"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch1"✔ calculating QC for dataset "Batch1" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch2"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch2"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch2"✔ calculating QC for dataset "Batch2" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch3"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch3"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch3"✔ calculating QC for dataset "Batch3" ... done
##
## ! No human mitochondrial gene found in the union of dataset "Batch4"
## ! No human ribosomal gene found in the union of dataset NA
## ! No human hemoglobin gene found in the union of dataset NA
## ℹ calculating QC for dataset "Batch4"
## ℹ Updated QC variables: "nUMI", "nGene", "mito", "ribo", and "hemo"
## ℹ calculating QC for dataset "Batch4"✔ calculating QC for dataset "Batch4" ... done
##
## ℹ Normalizing datasets "Batch1"
## ℹ Normalizing datasets "Batch2"
## ℹ Normalizing datasets "Batch3"
## ℹ Normalizing datasets "Batch4"
## ✔ Normalizing datasets "Batch4" ... done
##
## ℹ Normalizing datasets "Batch3"✔ Normalizing datasets "Batch3" ... done
##
## ℹ Normalizing datasets "Batch2"✔ Normalizing datasets "Batch2" ... done
##
## ℹ Normalizing datasets "Batch1"✔ Normalizing datasets "Batch1" ... done
##
## ℹ Scaling dataset "Batch1"
## ✔ Scaling dataset "Batch1" ... done
##
## ℹ Scaling dataset "Batch2"
## ✔ Scaling dataset "Batch2" ... done
##
## ℹ Scaling dataset "Batch3"
## ✔ Scaling dataset "Batch3" ... done
##
## ℹ Scaling dataset "Batch4"
## ✔ Scaling dataset "Batch4" ... done
## ℹ Generating UMAP on unaligned cell factor loadings
## ✔ Generating UMAP on unaligned cell factor loadings ... done
##
## ℹ DimRed "UMAP" is now set as default.
## Warning in .check_reddim_names(x, value, withDimnames): non-NULL 'rownames(value)' should be the same as 'colnames(x)' for
## 'reducedDim<-'. This will be an error in the next release of
## Bioconductor.
plotReducedDim(sce, colour_by = "Batch", dimred = "umap_liger_hvg_320") +
ggtitle("Integration using top 320 HVGs, colored by Batch")
plotReducedDim(sce, colour_by = "Group", dimred = "umap_liger_hvg_320") +
ggtitle("Integration using top 320 HVGs, colored by Group")
plotReducedDim(sce, colour_by = "Batch", dimred = "umap_liger_hvg_2000") +
ggtitle("Integration using top 2000 HVGs, colored by Batch")
plotReducedDim(sce, colour_by = "Group", dimred = "umap_liger_hvg_2000") +
ggtitle("Integration using top 2000 HVGs, colored by Group")
plotReducedDim(sce, colour_by = "Batch", dimred = "umap_liger_anglemania") +
ggtitle("Integration using anglemania genes, colored by Batch")
plotReducedDim(sce, colour_by = "Group", dimred = "umap_liger_anglemania") +
ggtitle("Integration using anglemania genes, colored by Group")
plotReducedDim(sce, colour_by = "Batch", dimred = "umap_MNN_hvg_320") +
ggtitle("MNN integration using top 320 HVGs, colored by Batch")
plotReducedDim(sce, colour_by = "Group", dimred = "umap_MNN_hvg_320") +
ggtitle("MNN integration using top 320 HVGs, colored by Group")
plotReducedDim(sce, colour_by = "Batch", dimred = "umap_MNN_hvg_2000") +
ggtitle("MNN integration using top 2000 HVGs, colored by Batch")
plotReducedDim(sce, colour_by = "Group", dimred = "umap_MNN_hvg_2000") +
ggtitle("MNN integration using top 2000 HVGs, colored by Group")
plotReducedDim(sce, colour_by = "Batch", dimred = "umap_MNN_anglemania") +
ggtitle("MNN integration using anglemania genes, colored by Batch")
plotReducedDim(sce, colour_by = "Group", dimred = "umap_MNN_anglemania") +
ggtitle("MNN integration using anglemania genes, colored by Group")
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] batchelor_1.22.0 rliger_2.1.0
## [3] bluster_1.16.0 scran_1.34.0
## [5] scater_1.34.0 ggplot2_3.5.1
## [7] scuttle_1.16.0 splatter_1.30.0
## [9] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
## [11] Biobase_2.66.0 GenomicRanges_1.58.0
## [13] GenomeInfoDb_1.42.3 IRanges_2.40.1
## [15] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [17] MatrixGenerics_1.18.1 matrixStats_1.5.0
## [19] Seurat_5.2.1 SeuratObject_5.0.2
## [21] sp_2.2-0 dplyr_1.1.4
## [23] anglemania_0.99.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.2
## [3] later_1.4.1 tibble_3.2.1
## [5] polyclip_1.10-7 fastDummies_1.7.5
## [7] lifecycle_1.0.4 edgeR_4.4.2
## [9] doParallel_1.0.17 globals_0.16.3
## [11] lattice_0.22-6 MASS_7.3-61
## [13] backports_1.5.0 magrittr_2.0.3
## [15] limma_3.62.2 plotly_4.10.4
## [17] sass_0.4.9 rmarkdown_2.29
## [19] jquerylib_0.1.4 yaml_2.3.10
## [21] bigparallelr_0.3.2 metapod_1.14.0
## [23] httpuv_1.6.15 sctransform_0.4.1
## [25] spam_2.11-1 spatstat.sparse_3.1-0
## [27] reticulate_1.40.0 cowplot_1.1.3
## [29] pbapply_1.7-2 RColorBrewer_1.1-3
## [31] ResidualMatrix_1.16.0 abind_1.4-8
## [33] zlibbioc_1.52.0 Rtsne_0.17
## [35] purrr_1.0.4 bigassertr_0.1.6
## [37] GenomeInfoDbData_1.2.13 ggrepel_0.9.6
## [39] irlba_2.3.5.1 listenv_0.9.1
## [41] spatstat.utils_3.1-2 goftest_1.2-3
## [43] RSpectra_0.16-2 dqrng_0.4.1
## [45] spatstat.random_3.3-2 fitdistrplus_1.2-2
## [47] parallelly_1.42.0 DelayedMatrixStats_1.28.1
## [49] pkgdown_2.1.1 codetools_0.2-20
## [51] DelayedArray_0.32.0 tidyselect_1.2.1
## [53] UCSC.utils_1.2.0 farver_2.1.2
## [55] viridis_0.6.5 ScaledMatrix_1.14.0
## [57] bigstatsr_1.6.1 spatstat.explore_3.3-4
## [59] flock_0.7 jsonlite_1.8.9
## [61] BiocNeighbors_2.0.1 progressr_0.15.1
## [63] ggridges_0.5.6 survival_3.7-0
## [65] iterators_1.0.14 systemfonts_1.2.1
## [67] foreach_1.5.2 tools_4.4.2
## [69] ragg_1.3.3 ica_1.0-3
## [71] Rcpp_1.0.14 glue_1.8.0
## [73] gridExtra_2.3 SparseArray_1.6.1
## [75] xfun_0.50 withr_3.0.2
## [77] fastmap_1.2.0 rsvd_1.0.5
## [79] digest_0.6.37 R6_2.5.1
## [81] mime_0.12 textshaping_1.0.0
## [83] colorspace_2.1-1 scattermore_1.2
## [85] tensor_1.5 spatstat.data_3.1-4
## [87] tidyr_1.3.1 generics_0.1.3
## [89] data.table_1.16.4 FNN_1.1.4.1
## [91] httr_1.4.7 htmlwidgets_1.6.4
## [93] S4Arrays_1.6.0 uwot_0.2.2
## [95] pkgconfig_2.0.3 gtable_0.3.6
## [97] lmtest_0.9-40 XVector_0.46.0
## [99] htmltools_0.5.8.1 dotCall64_1.2
## [101] scales_1.3.0 png_0.1-8
## [103] spatstat.univar_3.1-1 knitr_1.49
## [105] reshape2_1.4.4 checkmate_2.3.2
## [107] nlme_3.1-166 cachem_1.1.0
## [109] zoo_1.8-12 stringr_1.5.1
## [111] rmio_0.4.0 KernSmooth_2.23-24
## [113] vipor_0.4.7 parallel_4.4.2
## [115] miniUI_0.1.1.1 desc_1.4.3
## [117] pillar_1.10.1 grid_4.4.2
## [119] vctrs_0.6.5 RANN_2.6.2
## [121] promises_1.3.2 BiocSingular_1.22.0
## [123] ff_4.5.2 beachmat_2.22.0
## [125] xtable_1.8-4 cluster_2.1.6
## [127] beeswarm_0.4.0 evaluate_1.0.3
## [129] cli_3.6.3 locfit_1.5-9.11
## [131] compiler_4.4.2 rlang_1.1.5
## [133] crayon_1.5.3 future.apply_1.11.3
## [135] labeling_0.4.3 ps_1.8.1
## [137] ggbeeswarm_0.7.2 plyr_1.8.9
## [139] fs_1.6.5 stringi_1.8.4
## [141] viridisLite_0.4.2 deldir_2.0-4
## [143] BiocParallel_1.40.0 munsell_0.5.1
## [145] lazyeval_0.2.2 spatstat.geom_3.3-5
## [147] Matrix_1.7-1 RcppHNSW_0.6.0
## [149] patchwork_1.3.0 sparseMatrixStats_1.18.0
## [151] future_1.34.0 statmod_1.5.0
## [153] shiny_1.10.0 ROCR_1.0-11
## [155] igraph_2.1.4 bslib_0.9.0
## [157] RcppPlanc_1.0.0 bit_4.5.0.1