Overview
cacheR provides function-level caching
for R workflows.
Typical use cases in basic statistics:
- Expensive summary statistics on large datasets (mean, variance, quantiles).
- Repeated computation of correlation matrices.
- Multi-step pipelines, e.g. preprocess → summarize → correlate.
The main idea:
- Wrap an existing function with
cacheFile(). - Call the new function as usual.
-
cacheRsaves results to disk and reuses them when inputs and function body have not changed.
1. Setup
For this vignette we’ll use a temporary cache directory:
cache_dir <- file.path(tempdir(), "cacheR-demo")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
cache_dir
#> [1] "/tmp/RtmprLIVVd/cacheR-demo"2. Caching basic summary statistics
Assume we have a function that computes several statistics at once:
stat_summary <- function(x) {
message("Computing stats ...")
Sys.sleep(1) # pretend it's expensive
x <- na.omit(x)
list(
mean = mean(x),
variance = var(x),
sd = sd(x),
median = median(x)
)
}We create a cached version:
cached_stat_summary <- cacheFile(
cache_dir = cache_dir,
backend = "rds"
) %@% function(x) {
stat_summary(x)
}Now, compare first vs second call:
set.seed(1)
x <- rnorm(1e5)
system.time(res1 <- cached_stat_summary(x))
#> Computing stats ...
#> user system elapsed
#> 0.082 0.006 1.089
system.time(res2 <- cached_stat_summary(x))
#> user system elapsed
#> 0.011 0.000 0.011
res1
#> $mean
#> [1] -0.002244083
#>
#> $variance
#> [1] 1.007059
#>
#> $sd
#> [1] 1.003523
#>
#> $median
#> [1] 0.0008011329
identical(res1, res2)
#> [1] TRUE- First call: prints
"Computing stats ..."and takes ~1 second. - Second call with the same input: should be almost instant, reusing the cached result.
3. Caching a correlation matrix
Computing a full correlation matrix can be expensive for many
variables.
Let’s define a function that:
- standardizes the columns,
- computes the correlation matrix,
- optionally returns means and variances as well.
cor_with_moments <- function(mat) {
message("Computing correlations and moments ...")
Sys.sleep(1) # simulate expensive work
# Basic checks
stopifnot(is.matrix(mat) || is.data.frame(mat))
mat <- as.matrix(mat)
means <- colMeans(mat, na.rm = TRUE)
variances <- apply(mat, 2, var, na.rm = TRUE)
corr_mat <- stats::cor(mat, use = "pairwise.complete.obs")
list(
mean = means,
variance = variances,
cor = corr_mat
)
}Wrap it with cacheFile():
cached_cor_with_moments <- cacheFile(
cache_dir = cache_dir,
backend = "rds"
) %@% function(mat) {
cor_with_moments(mat)
}Test it:
set.seed(123)
mat <- matrix(rnorm(200 * 10), ncol = 10)
system.time(res1 <- cached_cor_with_moments(mat))
#> Computing correlations and moments ...
#> user system elapsed
#> 0.005 0.001 1.007
system.time(res2 <- cached_cor_with_moments(mat))
#> user system elapsed
#> 0.004 0.000 0.004
str(res1$cor)
#> num [1:10, 1:10] 1 -0.0277 -0.0302 -0.0464 -0.0513 ...
identical(res1$cor, res2$cor)
#> [1] TRUEAgain:
- First call: actually computes mean, variance and correlation.
- Second call: returns the result from disk, skipping the expensive computation.
4. Nested caching: reuse pre-processing
Suppose you often apply the same pre-processing (centering & scaling) before computing summary statistics or correlations.
Define a preprocessing function and a downstream function that uses it:
preprocess_matrix <- function(mat) {
message("Preprocessing matrix (center & scale) ...")
Sys.sleep(1) # simulate expensive step
scale(mat)
}
# Cached preprocessing
cached_preprocess_matrix <- cacheFile(
cache_dir = cache_dir
) %@% function(mat) {
preprocess_matrix(mat)
}Now a downstream function that uses the preprocessed data to compute means, variances and correlations:
analyze_scaled <- cacheFile(
cache_dir = cache_dir
) %@% function(mat) {
message("Analyzing scaled data ...")
mat_scaled <- cached_preprocess_matrix(mat)
means <- colMeans(mat_scaled)
variances <- apply(mat_scaled, 2, var)
corr_mat <- stats::cor(mat_scaled)
list(
mean = means,
variance = variances,
cor = corr_mat
)
}Run it twice:
set.seed(321)
mat2 <- matrix(rnorm(300 * 8), ncol = 8)
system.time(a1 <- analyze_scaled(mat2))
#> Analyzing scaled data ...
#> Preprocessing matrix (center & scale) ...
#> user system elapsed
#> 0.009 0.001 1.010
system.time(a2 <- analyze_scaled(mat2))
#> user system elapsed
#> 0.004 0.000 0.004
identical(a1$cor, a2$cor)
#> [1] TRUEWhat happens:
-
First run
-
cached_preprocess_matrix()is executed and cached. -
analyze_scaled()is executed and cached.
-
-
Second run
- Both the preprocessing step and the full analysis should be loaded from cache (assuming the function bodies and input matrix are unchanged).
If your version of cacheR supports a cache
tree, you can inspect the relationships between cached
calls:
cacheTree_reset()
a <- analyze_scaled(mat2)
nodes <- cacheTree_nodes()
nodes
#> $analyze_scaled_efb90c085b419959
#> $analyze_scaled_efb90c085b419959$id
#> [1] "analyze_scaled_efb90c085b419959"
#>
#> $analyze_scaled_efb90c085b419959$type
#> [1] "function"
#>
#> $analyze_scaled_efb90c085b419959$label
#> [1] "analyze_scaled"
#>
#> $analyze_scaled_efb90c085b419959$code
#> [1] "analyze_scaled(mat2)"
#>
#> $analyze_scaled_efb90c085b419959$files
#> character(0)
#>
#> $analyze_scaled_efb90c085b419959$file_hashes
#> list()
#>
#> $analyze_scaled_efb90c085b419959$children
#> character(0)
#>
#> $analyze_scaled_efb90c085b419959$parents
#> character(0)
#>
#> $analyze_scaled_efb90c085b419959$time
#> [1] "2026-04-15 13:25:50 UTC"You should see at least two nodes:
- one for
analyze_scaled(mat2) - one for
cached_preprocess_matrix(mat2)
Visualizing the cache graph
If you have the igraph package installed, you can
visualize the dependency tree with plot_cache_graph().
Nodes are coloured by state: navy = cached,
amber = stale, blue = file
dependency.
if (requireNamespace("igraph", quietly = TRUE)) {
plot_cache_graph()
}
You can also save the plot to a file:
plot_cache_graph(output = "cache_graph.png")5. Using the qs2 backend for faster serialization
For large matrices, the qs2 backend can be
faster than base RDS.
Install it once:
install.packages("qs2")Then create a cached function with backend = "qs2":
cached_cor_qs2 <- cacheFile(
cache_dir = cache_dir,
backend = "qs2"
) %@% function(mat) {
cor_with_moments(mat)
}Compare timings:
set.seed(999)
mat_big <- matrix(rnorm(500 * 50), ncol = 50)
system.time(r1 <- cached_cor_qs2(mat_big))
#> Computing correlations and moments ...
#> user system elapsed
#> 0.015 0.002 1.018
system.time(r2 <- cached_cor_qs2(mat_big))
#> user system elapsed
#> 0.003 0.000 0.003
identical(r1$cor, r2$cor)
#> [1] TRUE
list.files(cache_dir, recursive = TRUE)
#> [1] "analyze_scaled.efb90c085b419959.qs2"
#> [2] "analyze_scaled.efb90c085b419959.qs2.lock"
#> [3] "cached_cor_qs2.035df2a873384780.qs2"
#> [4] "cached_cor_qs2.035df2a873384780.qs2.lock"
#> [5] "cached_cor_with_moments.6d2577311e2c809f.rds"
#> [6] "cached_cor_with_moments.6d2577311e2c809f.rds.lock"
#> [7] "cached_preprocess_matrix.fc04643ef08d65ac.qs2"
#> [8] "cached_preprocess_matrix.fc04643ef08d65ac.qs2.lock"
#> [9] "cached_stat_summary.f1865b7918927ae3.rds"
#> [10] "cached_stat_summary.f1865b7918927ae3.rds.lock"
#> [11] "graph.rds"
#> [12] "graph.rds.lock"You should see some .qs2 files in the cache
directory.
6. Combining file I/O and statistics
Create an example CSV file
# We use tempdir() so that the vignette can be built anywhere
# without needing any external data files.
set.seed(1)
# Small numeric toy dataset
my_data <- data.frame(
x = rnorm(100, mean = 0, sd = 1),
y = rnorm(100, mean = 5, sd = 2),
z = rnorm(100, mean = -2, sd = 0.5)
)
# Path where we'll save the CSV
example_csv <- file.path(tempdir(), "my_data.csv")
# Write the CSV (no row names to keep it clean)
write.csv(my_data, example_csv, row.names = FALSE)In practice, you often:
- Read data from a file (CSV, RDS, etc.),
- Compute summary statistics or correlations.
You can cache both steps.
read_numeric_csv <- function(path) {
message("Reading CSV: ", path)
Sys.sleep(1) # simulate slow disk / network
read.csv(path)
}
cached_read_numeric_csv <- cacheFile(
cache_dir = cache_dir
) %@% function(path) {
read_numeric_csv(path)
}
cached_file_stats <- cacheFile(
cache_dir = cache_dir
) %@% function(path) {
df <- cached_read_numeric_csv(path)
cor_with_moments(df)
}Usage:
res <- cached_file_stats(example_csv)
#> Reading CSV: /tmp/RtmprLIVVd/my_data.csv
#> Computing correlations and moments ...
res$mean
#> x y z
#> 0.1088874 4.9243838 -1.9851632
res$cor
#> x y z
#> x 1.0000000000 -0.0009943199 0.01838219
#> y -0.0009943199 1.0000000000 -0.04953621
#> z 0.0183821868 -0.0495362135 1.00000000- First run: reads the CSV and computes stats.
- Subsequent runs with the same file and same code: reuse cached results.
(Depending on your cacheR version, there may be options
such as file_args to incorporate directory or file contents
into the cache key more explicitly.)
7. Practical tips
-
Use a project-local cache directory, e.g.:
cache_dir <- ".cacheR" dir.create(cache_dir, showWarnings = FALSE) -
Add it to
.gitignore:.cacheR/ -
Wrap expensive operations:
- reading large files
- computing big correlation matrices
- heavy preprocessing
-
For pure statistical pipelines, a common pattern is:
-
cached_read()– load data -
cached_preprocess()– clean / transform -
cached_stats()– compute means, variances, correlations
Each step can be shared across many analyses.
-