Reproducible and Fast RNA-seq Analysis with cacheR
Source: vignettes/cacheR-rnaseq.Rmd
This document describes how to use the cacheR package to cache expensive steps in an RNA-seq analysis pipeline. It focuses on:

- Importing RNA-seq count files from disk
- Building a `DESeqDataSet`
- Fitting a DESeq model
- Automatically invalidating caches when:
  - Function arguments change
  - Function code changes
  - The number of files in key directories changes
  - Hardcoded and global-variable paths change

The examples assume you are working inside an R package or project that includes the `cacheFile()` decorator from cacheR.
1. Setup
1.1 Required packages
You will need:

- cacheR for caching and dependency tracking
- DESeq2 for differential expression analysis
- dplyr and readr for data handling
1.2 Working directories
We use a temporary working directory for demonstration. In a real project, you would use a project-specific path.
work_dir <- file.path(tempdir(), "cacheR_rnaseq_vignette")
dir.create(work_dir, recursive = TRUE, showWarnings = FALSE)
counts_dir <- file.path(work_dir, "counts")
dir.create(counts_dir, recursive = TRUE, showWarnings = FALSE)
cache_dir <- file.path(work_dir, "cache")
cache_dir <- file.path(work_dir, "cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

2. Simulated RNA-seq data
To keep the example self-contained, we simulate small count tables as if they were generated by tools such as featureCounts or htseq-count.
set.seed(1)
genes <- paste0("Gene", sprintf("%03d", 1:500))
simulate_sample <- function(sample_id, n_genes = length(genes)) {
  # Each sample gets a single Poisson rate drawn at random
  data.frame(
    gene_id = genes,
    count = rpois(n_genes, lambda = sample(10:1000, 1)),
    stringsAsFactors = FALSE
  )
}
sample_ids <- paste0("sample", 1:4)
for (sid in sample_ids) {
tab <- simulate_sample(sid)
readr::write_tsv(tab, file.path(counts_dir, paste0(sid, "_counts.tsv")))
}
sample_table <- data.frame(
  sample_id = sample_ids,
  condition = rep(c("A", "B"), each = 2),
  stringsAsFactors = FALSE
)

At this point:

- counts_dir contains 4 tab-separated count files, one per sample.
- sample_table describes the experimental design (two conditions: A and B).
3. Reading count files from a directory
3.1 Uncached helper function
The following function reads all *_counts.tsv files from
a directory, merges them on gene_id, and orders columns to
match the sample table.
read_counts_from_dir <- function(count_dir, sample_table) {
  files <- list.files(count_dir, pattern = "_counts\\.tsv$", full.names = TRUE)
  if (length(files) == 0L) {
    stop("No count files found in: ", count_dir)
  }
  sample_ids <- gsub("_counts\\.tsv$", "", basename(files))
  counts_list <- lapply(files, function(f) {
    readr::read_tsv(f, show_col_types = FALSE)
  })
  merged <- counts_list[[1]][, "gene_id", drop = FALSE]
  for (i in seq_along(counts_list)) {
    # Align each table on gene_id rather than assuming identical row order
    idx <- match(merged$gene_id, counts_list[[i]]$gene_id)
    merged[[sample_ids[i]]] <- counts_list[[i]]$count[idx]
  }
  merged <- merged[, c("gene_id", sample_table$sample_id), drop = FALSE]
  merged
}

3.2 Caching the file import step
We now wrap read_counts_from_dir() with cacheFile(). The key point is the argument file_args = "count_dir":

- cacheR will include the number of files in count_dir as part of the cache key.
- If new files are added or removed, the hash changes and the function is re-run.
# Wrap the helper; file_args = "count_dir" tracks the directory's file count
cached_read_counts <- cacheR:::cacheFile(
  cache_dir = cache_dir,
  file_args = "count_dir"
) %@% read_counts_from_dir

# First run: computes and caches
counts1 <- cached_read_counts(counts_dir, sample_table)

# Second run: same arguments and same files → uses cache
counts2 <- cached_read_counts(counts_dir, sample_table)
identical(counts1, counts2)  # should be TRUE
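How cacheR derives its keys is internal to the package; as a rough base-R illustration of the idea only, a cache key that incorporates a directory's file count changes whenever files are added or removed. `make_cache_key` is a hypothetical helper, not part of cacheR:

```r
# Illustrative only: a cache key that combines a directory's file count
# with the function's other (deparsed) arguments.
make_cache_key <- function(count_dir, ...) {
  n_files <- length(list.files(count_dir))
  other <- paste(
    vapply(list(...), function(x) paste(deparse(x), collapse = ""), character(1)),
    collapse = "|"
  )
  paste0("nfiles=", n_files, "|args=", other)
}

d <- tempfile("key_demo")
dir.create(d)
file.create(file.path(d, "a.tsv"))

key1 <- make_cache_key(d, "sample_table_v1")
file.create(file.path(d, "b.tsv"))   # adding a file changes the key
key2 <- make_cache_key(d, "sample_table_v1")
identical(key1, key2)  # FALSE: the cache entry would be invalidated
```

Any scheme with this property gives the invalidation behavior described above: a changed key means a cache miss, so the wrapped function is re-run.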
4. Automatic invalidation when new samples appear
If you add a new sample and its counts file to
counts_dir, the cache should be invalidated because the
number of files in count_dir changed.
new_id <- "sample5"
sample_table2 <- dplyr::bind_rows(
  sample_table,
  data.frame(sample_id = new_id, condition = "A")
)
tab_new <- simulate_sample(new_id)
readr::write_tsv(tab_new, file.path(counts_dir, paste0(new_id, "_counts.tsv")))
counts3 <- cached_read_counts(counts_dir, sample_table2)
dim(counts1)  # dimensions before adding sample5
dim(counts3)  # dimensions after adding sample5
Key behavior:

- cached_read_counts() detects that counts_dir now has 5 files instead of 4.
- The cache key changes, so the function is re-run.
- counts3 now contains the additional sample5 column.
5. Building a DESeqDataSet with caching
Constructing a DESeqDataSet can be computationally
expensive for large matrices. We can cache this step as well.
5.1 Cached DESeqDataSet construction
dds_from_counts <- cacheR:::cacheFile(
  cache_dir = cache_dir
) %@% function(counts_tbl, sample_table) {
  mat <- as.matrix(counts_tbl[, sample_table$sample_id])
  rownames(mat) <- counts_tbl$gene_id
  coldata <- as.data.frame(sample_table)
  rownames(coldata) <- coldata$sample_id
  DESeq2::DESeqDataSetFromMatrix(
    countData = mat,
    colData = coldata,
    design = ~condition
  )
}

6. Caching the DESeq model fit
Fitting the DESeq model is often the slowest step in the pipeline. Following the same pattern as above, it can be wrapped too, for example:

fit_deseq <- cacheR:::cacheFile(
  cache_dir = cache_dir
) %@% function(dds) {
  DESeq2::DESeq(dds)
}
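cacheFile() automates the store-and-reuse step; to make the mechanics concrete, here is a minimal hand-rolled disk cache in base R, with a cheap stand-in for the real DESeq2::DESeq() call. `cache_rds` and `slow_fit` are illustrative names, not part of cacheR:

```r
# A minimal disk cache: the idea that cacheFile() automates.
cache_rds <- function(key, cache_dir, compute) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))            # cache hit: skip the computation
  }
  result <- compute()                # cache miss: compute and store
  saveRDS(result, path)
  result
}

cdir <- tempfile("fit_cache")
dir.create(cdir)

n_calls <- 0L
slow_fit <- function() {
  n_calls <<- n_calls + 1L           # count how often we really compute
  list(coef = c(intercept = 1.2, conditionB = -0.4))
}

fit1 <- cache_rds("deseq_fit", cdir, slow_fit)
fit2 <- cache_rds("deseq_fit", cdir, slow_fit)  # served from disk
n_calls  # 1: the second call never ran slow_fit()
```

What cacheR adds on top of this sketch is the automatic key derivation (arguments, function code, tracked directories) so you never invalidate the cache by hand.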
7. Tracking hardcoded directories and global variables
cacheR can also track directories that are not passed as function arguments, but appear in:

- Hardcoded calls like list.files("/path/to/data")
- Global-variable calls like dir(GLOBAL_DIR)
7.1 Global directory example
GLOBAL_DIR <- counts_dir

hardcoded_fun <- cacheR:::cacheFile(
  cache_dir = cache_dir
) %@% function() {
# Using a global variable as directory path
n1 <- length(list.files(GLOBAL_DIR))
n2 <- length(dir(GLOBAL_DIR))
c(literal = n1, global = n2)
}
# First call: computes and caches
hardcoded_fun()
# Add another file to the directory
extra <- simulate_sample("extra")
readr::write_tsv(extra, file.path(GLOBAL_DIR, "extra_counts.tsv"))
# Second call: cache is invalidated and recomputed
hardcoded_fun()

Here:

- The function body uses GLOBAL_DIR inside list.files() and dir().
- cacheR inspects the function body, detects these directory usages, and includes the corresponding file counts in the cache key.
- When a new file is added to GLOBAL_DIR, the cache entry becomes invalid, and hardcoded_fun() is re-run.
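cacheR's actual detection logic is internal and more sophisticated; as a simplified sketch of the idea, a function body can be walked recursively to collect the first argument of any list.files() or dir() call. `find_dir_calls` is a hypothetical helper, not cacheR's API:

```r
# Simplified static inspection: walk a function body and collect the first
# argument of every list.files() / dir() call found in it.
find_dir_calls <- function(fn) {
  hits <- character(0)
  walk <- function(e) {
    if (is.call(e)) {
      head <- as.character(e[[1]])[1]
      if (head %in% c("list.files", "dir") && length(e) >= 2) {
        hits <<- c(hits, paste(deparse(e[[2]]), collapse = ""))
      }
      for (i in seq_along(e)) walk(e[[i]])  # recurse into sub-expressions
    }
  }
  walk(body(fn))
  unique(hits)
}

f <- function() {
  n1 <- length(list.files(GLOBAL_DIR))
  n2 <- length(dir(GLOBAL_DIR))
  c(literal = n1, global = n2)
}
find_dir_calls(f)  # "GLOBAL_DIR"
```

Once the referenced directory expressions are known, their file counts can be evaluated and folded into the cache key, which is exactly the invalidation behavior shown above.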
8. Practical notes
Where to store cache files
Use a project-specific directory, for example cache/ in
your project root, or a configurable location via
options(cacheR.dir = ...).
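For example, a small helper can resolve the cache directory from that option, falling back to a project-local default. `get_cache_dir` is illustrative; cacheR itself may resolve the option differently:

```r
# Resolve the cache directory from an option, with a project-local fallback.
get_cache_dir <- function() {
  getOption("cacheR.dir", default = file.path("cache"))
}

get_cache_dir()                                   # "cache" while the option is unset
options(cacheR.dir = file.path(tempdir(), "my_cache"))
get_cache_dir()                                   # now the configured location
```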
Choosing what to cache
Good candidates are:

- Disk I/O-heavy operations (reading many files)
- Large object construction (DESeqDataSet, SummarizedExperiment, etc.)
- Expensive model-fitting steps
What not to cache
Very fast operations (simple filtering, small transforms) rarely benefit from caching and can add overhead.
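To see why, compare a trivial filter executed directly with the same filter pushed through an RDS round-trip, which here stands in for the fixed cost of any disk-backed cache (timings vary by machine; this is only a rough demonstration):

```r
# A tiny filter done directly vs. through a disk round-trip.
x <- data.frame(gene_id = sprintf("Gene%03d", 1:500), count = rpois(500, 50))

direct <- system.time(for (i in 1:200) y1 <- x[x$count > 50, ])["elapsed"]

path <- tempfile(fileext = ".rds")
cached <- system.time(for (i in 1:200) {
  saveRDS(x[x$count > 50, ], path)   # write the "cached" result
  y2 <- readRDS(path)                # read it back
})["elapsed"]

c(direct = direct, via_cache = cached)  # the round-trip is typically slower
```

For operations this cheap, the serialization cost dominates, so caching buys nothing; reserve it for the expensive steps listed above.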