Reproducible and Fast RNA-seq Analysis with cacheR
Source: vignettes/cacheR-rnaseq.Rmd
This document describes how to use the cacheR package to cache expensive steps in an RNA-seq analysis pipeline. It focuses on:

- Importing RNA-seq count files from disk
- Building a `DESeqDataSet`
- Fitting a DESeq model
- Automatically invalidating caches when:
  - Function arguments change
  - Function code changes
  - The number of files in key directories changes
  - Hardcoded and global-variable paths change

The examples assume you are working inside an R package or project that includes the `cacheFile()` decorator from cacheR.
1. Setup
1.1 Required packages
You will need:

- cacheR for caching and dependency tracking
- DESeq2 for differential expression analysis
- dplyr and readr for data handling
1.2 Working directories
We use a temporary working directory for demonstration. In a real project, you would use a project-specific path.
work_dir <- file.path(tempdir(), "cacheR_rnaseq_vignette")
dir.create(work_dir, recursive = TRUE, showWarnings = FALSE)
counts_dir <- file.path(work_dir, "counts")
dir.create(counts_dir, recursive = TRUE, showWarnings = FALSE)
cache_dir <- file.path(work_dir, "cache")
cache_dir <- file.path(work_dir, "cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

2. Simulated RNA-seq data
To keep the example self-contained, we simulate small count tables as if they were generated by tools such as featureCounts or htseq-count.
set.seed(1)
genes <- paste0("Gene", sprintf("%03d", 1:500))
simulate_sample <- function(sample_id, n_genes = length(genes)) {
  # Each sample gets a single Poisson rate drawn at random
  data.frame(
    gene_id = genes,
    count = rpois(n_genes, lambda = sample(10:1000, 1)),
    stringsAsFactors = FALSE
  )
}
sample_ids <- paste0("sample", 1:4)
for (sid in sample_ids) {
tab <- simulate_sample(sid)
readr::write_tsv(tab, file.path(counts_dir, paste0(sid, "_counts.tsv")))
}
sample_table <- data.frame(
  sample_id = sample_ids,
  condition = rep(c("A", "B"), each = 2),
  stringsAsFactors = FALSE
)

At this point:

- counts_dir contains 4 tab-separated count files, one per sample.
- sample_table describes the experimental design (two conditions: A and B).
3. Reading count files from a directory
3.1 Uncached helper function
The following function reads all *_counts.tsv files from
a directory, merges them on gene_id, and orders columns to
match the sample table.
read_counts_from_dir <- function(count_dir, sample_table) {
  files <- list.files(count_dir, pattern = "_counts\\.tsv$", full.names = TRUE)
  if (length(files) == 0L) {
    stop("No count files found in: ", count_dir)
  }
  sample_ids <- gsub("_counts\\.tsv$", "", basename(files))
  counts_list <- lapply(files, function(f) {
    readr::read_tsv(f, show_col_types = FALSE)
  })
  merged <- counts_list[[1]][, "gene_id", drop = FALSE]
  for (i in seq_along(counts_list)) {
    # Align each table on gene_id rather than assuming identical row order
    idx <- match(merged$gene_id, counts_list[[i]]$gene_id)
    merged[[sample_ids[i]]] <- counts_list[[i]]$count[idx]
  }
  merged <- merged[, c("gene_id", sample_table$sample_id), drop = FALSE]
  merged
}

3.2 Caching the file import step
We now wrap read_counts_from_dir() with cacheFile(). The key point is the argument file_args = "count_dir":

- cacheR will include the number of files in count_dir as part of the cache key.
- If new files are added or removed, the hash changes and the function is re-run.
# Wrap the helper; file_args = "count_dir" tracks the directory's file count
cached_read_counts <- cacheR:::cacheFile(
  cache_dir = cache_dir,
  file_args = "count_dir"
) %@% read_counts_from_dir

# First run: computes and caches
counts1 <- cached_read_counts(counts_dir, sample_table)

# Second run: same arguments and same files → uses cache
counts2 <- cached_read_counts(counts_dir, sample_table)
identical(counts1, counts2)  # should be TRUE
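How cacheR derives its keys is internal to the package; as a rough base-R illustration of the idea only, a cache key that incorporates a directory's file count changes whenever files are added or removed. `make_cache_key` is a hypothetical helper, not part of cacheR:

```r
# Illustrative only: a cache key that combines a directory's file count
# with the function's other (deparsed) arguments.
make_cache_key <- function(count_dir, ...) {
  n_files <- length(list.files(count_dir))
  other <- paste(
    vapply(list(...), function(x) paste(deparse(x), collapse = ""), character(1)),
    collapse = "|"
  )
  paste0("nfiles=", n_files, "|args=", other)
}

d <- tempfile("key_demo")
dir.create(d)
file.create(file.path(d, "a.tsv"))

key1 <- make_cache_key(d, "sample_table_v1")
file.create(file.path(d, "b.tsv"))   # adding a file changes the key
key2 <- make_cache_key(d, "sample_table_v1")
identical(key1, key2)  # FALSE: the cache entry would be invalidated
```

Any scheme with this property gives the invalidation behavior described above: a changed key means a cache miss, so the wrapped function is re-run.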
4. Automatic invalidation when new samples appear
If you add a new sample and its counts file to
counts_dir, the cache should be invalidated because the
number of files in count_dir changed.
new_id <- "sample5"
sample_table2 <- dplyr::bind_rows(
  sample_table,
  data.frame(sample_id = new_id, condition = "A")
)
tab_new <- simulate_sample(new_id)
readr::write_tsv(tab_new, file.path(counts_dir, paste0(new_id, "_counts.tsv")))
counts3 <- cached_read_counts(counts_dir, sample_table2)
dim(counts1)  # dimensions before adding sample5
dim(counts3)  # dimensions after adding sample5
Key behavior:

- cached_read_counts() detects that counts_dir now has 5 files instead of 4.
- The cache key changes, so the function is re-run.
- counts3 now contains the additional sample5 column.
5. Building a DESeqDataSet with caching
Constructing a DESeqDataSet can be computationally
expensive for large matrices. We can cache this step as well.
5.1 Cached DESeqDataSet construction
dds_from_counts <- cacheR:::cacheFile(
  cache_dir = cache_dir
) %@% function(counts_tbl, sample_table) {
  mat <- as.matrix(counts_tbl[, sample_table$sample_id])
  rownames(mat) <- counts_tbl$gene_id
  coldata <- as.data.frame(sample_table)
  rownames(coldata) <- coldata$sample_id
  DESeq2::DESeqDataSetFromMatrix(
    countData = mat,
    colData = coldata,
    design = ~condition
  )
}

6. Caching the DESeq model fit
Fitting the DESeq model is often the slowest step in the pipeline. Following the same pattern as above, it can be wrapped too, for example:

fit_deseq <- cacheR:::cacheFile(
  cache_dir = cache_dir
) %@% function(dds) {
  DESeq2::DESeq(dds)
}
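cacheFile() automates the store-and-reuse step; to make the mechanics concrete, here is a minimal hand-rolled disk cache in base R, with a cheap stand-in for the real DESeq2::DESeq() call. `cache_rds` and `slow_fit` are illustrative names, not part of cacheR:

```r
# A minimal disk cache: the idea that cacheFile() automates.
cache_rds <- function(key, cache_dir, compute) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))            # cache hit: skip the computation
  }
  result <- compute()                # cache miss: compute and store
  saveRDS(result, path)
  result
}

cdir <- tempfile("fit_cache")
dir.create(cdir)

n_calls <- 0L
slow_fit <- function() {
  n_calls <<- n_calls + 1L           # count how often we really compute
  list(coef = c(intercept = 1.2, conditionB = -0.4))
}

fit1 <- cache_rds("deseq_fit", cdir, slow_fit)
fit2 <- cache_rds("deseq_fit", cdir, slow_fit)  # served from disk
n_calls  # 1: the second call never ran slow_fit()
```

What cacheR adds on top of this sketch is the automatic key derivation (arguments, function code, tracked directories) so you never invalidate the cache by hand.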
7. Tracking hardcoded directories and global variables
cacheR can also track directories that are not passed as function arguments, but appear in:

- Hardcoded calls like list.files("/path/to/data")
- Global-variable calls like dir(GLOBAL_DIR)
7.1 Global directory example
GLOBAL_DIR <- counts_dir

hardcoded_fun <- cacheR:::cacheFile(
  cache_dir = cache_dir
) %@% function() {
# Using a global variable as directory path
n1 <- length(list.files(GLOBAL_DIR))
n2 <- length(dir(GLOBAL_DIR))
c(literal = n1, global = n2)
}
# First call: computes and caches
hardcoded_fun()
# Add another file to the directory
extra <- simulate_sample("extra")
readr::write_tsv(extra, file.path(GLOBAL_DIR, "extra_counts.tsv"))
# Second call: cache is invalidated and recomputed
hardcoded_fun()

Here:

- The function body uses GLOBAL_DIR inside list.files() and dir().
- cacheR inspects the function body, detects these directory usages, and includes the corresponding file counts in the cache key.
- When a new file is added to GLOBAL_DIR, the cache entry becomes invalid, and hardcoded_fun() is re-run.
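cacheR's actual detection logic is internal and more sophisticated; as a simplified sketch of the idea, a function body can be walked recursively to collect the first argument of any list.files() or dir() call. `find_dir_calls` is a hypothetical helper, not cacheR's API:

```r
# Simplified static inspection: walk a function body and collect the first
# argument of every list.files() / dir() call found in it.
find_dir_calls <- function(fn) {
  hits <- character(0)
  walk <- function(e) {
    if (is.call(e)) {
      head <- as.character(e[[1]])[1]
      if (head %in% c("list.files", "dir") && length(e) >= 2) {
        hits <<- c(hits, paste(deparse(e[[2]]), collapse = ""))
      }
      for (i in seq_along(e)) walk(e[[i]])  # recurse into sub-expressions
    }
  }
  walk(body(fn))
  unique(hits)
}

f <- function() {
  n1 <- length(list.files(GLOBAL_DIR))
  n2 <- length(dir(GLOBAL_DIR))
  c(literal = n1, global = n2)
}
find_dir_calls(f)  # "GLOBAL_DIR"
```

Once the referenced directory expressions are known, their file counts can be evaluated and folded into the cache key, which is exactly the invalidation behavior shown above.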
8. Practical notes
Where to store cache files
Use a project-specific directory, for example cache/ in
your project root, or a configurable location via
options(cacheR.dir = ...).
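For example, a small helper can resolve the cache directory from that option, falling back to a project-local default. `get_cache_dir` is illustrative; cacheR itself may resolve the option differently:

```r
# Resolve the cache directory from an option, with a project-local fallback.
get_cache_dir <- function() {
  getOption("cacheR.dir", default = file.path("cache"))
}

get_cache_dir()                                   # "cache" while the option is unset
options(cacheR.dir = file.path(tempdir(), "my_cache"))
get_cache_dir()                                   # now the configured location
```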
Choosing what to cache
Good candidates are:

- Disk I/O-heavy operations (reading many files)
- Large object construction (DESeqDataSet, SummarizedExperiment, etc.)
- Expensive model-fitting steps
What not to cache
Very fast operations (simple filtering, small transforms) rarely benefit from caching and can add overhead.
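To see why, compare a trivial filter executed directly with the same filter pushed through an RDS round-trip, which here stands in for the fixed cost of any disk-backed cache (timings vary by machine; this is only a rough demonstration):

```r
# A tiny filter done directly vs. through a disk round-trip.
x <- data.frame(gene_id = sprintf("Gene%03d", 1:500), count = rpois(500, 50))

direct <- system.time(for (i in 1:200) y1 <- x[x$count > 50, ])["elapsed"]

path <- tempfile(fileext = ".rds")
cached <- system.time(for (i in 1:200) {
  saveRDS(x[x$count > 50, ], path)   # write the "cached" result
  y2 <- readRDS(path)                # read it back
})["elapsed"]

c(direct = direct, via_cache = cached)  # the round-trip is typically slower
```

For operations this cheap, the serialization cost dominates, so caching buys nothing; reserve it for the expensive steps listed above.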