
cacheR tracks your data and code so you don’t have to

Also available as a Python package: pycacheR (pip install pycacheR)

What does cacheR do?

cacheR automatically checks for changes in code and input data and re-runs the code only when necessary.

It works like snakemake/nextflow, but on the fly.

[Figure: cacheR cache graph animation]

What is it useful for?

  • Keeping the analysis up to date

  • Saving time

  • Not using obsolete results

  • Reusing heavy computations safely and transparently


Installation

# From GitHub
remotes::install_github("BIMSBbioinfo/cacheR")
# From Guix
guix install -f https://raw.githubusercontent.com/BIMSBbioinfo/cacheR/main/guix.scm

Basic usage

The package introduces:

  • cacheFile(): a caching decorator
  • %@%: an operator for applying decorators
  • cacheTree_*(): functions for inspecting the cache tree

library(cacheR)

cache_dir <- file.path(tempdir(), "cache_test")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# Define cached functions
inner <- cacheFile(cache_dir) %@% function(x) x + 1
outer <- cacheFile(cache_dir) %@% function(x) inner(x) * 2

# Execute
outer(3)
#> 8
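
To illustrate the decorator pattern used above, here is a minimal base-R sketch of a memoizing decorator and an infix operator that applies it. The names memoise_in_memory() and %decorate% are hypothetical stand-ins invented for this example. This is not how cacheR is implemented: the sketch keeps results in memory only and tracks nothing but the arguments.

```r
# Hypothetical in-memory decorator; NOT cacheR's implementation.
memoise_in_memory <- function() {
  store <- new.env(parent = emptyenv())
  function(f) {
    function(...) {
      # Build a simple text key from the arguments
      key <- paste(deparse(list(...)), collapse = "")
      if (!exists(key, envir = store, inherits = FALSE)) {
        assign(key, f(...), envir = store)
      }
      get(key, envir = store, inherits = FALSE)
    }
  }
}

# Hypothetical operator: apply the decorator on the left to the function on the right
`%decorate%` <- function(decorator, f) decorator(f)

# A call counter, used only to show that the second call is served from the store
calls <- 0
slow_square <- memoise_in_memory() %decorate% function(x) {
  calls <<- calls + 1
  x^2
}

slow_square(4)
#> 16
slow_square(4)
#> 16
calls
#> 1
```

The second call returns the stored value without running the function body again, which is why the counter stays at 1.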

How does cacheR decide to recompute?

A cached call is reused only if all of the following are unchanged:

  • The function body (including inline code changes)
  • The arguments (up to hashing / comparison rules)
  • The tracked files / directories, where relevant
  • The package versions of any non-base functions used
  • The environment variables used by the function

If any of these change, cacheR invalidates the old entry and recomputes.
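
The invalidation rule above can be pictured as a single key computed from all of the listed ingredients: if any ingredient changes, the key changes and the old entry no longer matches. The following is a rough base-R sketch of that idea, assuming MD5 hashing via tools::md5sum(). The helpers hash_object() and cache_key() are hypothetical names made up for this example, and cacheR's actual hashing and comparison rules may differ.

```r
# Hash any R object by serializing it to a temporary file (base R only).
hash_object <- function(obj) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(obj, connection = NULL), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: combine everything that should invalidate a cached call.
cache_key <- function(f, args, files = character(), env_vars = character(),
                      pkgs = character()) {
  parts <- list(
    code    = deparse(body(f)),        # function body
    formals = deparse(formals(f)),     # argument names and defaults
    args    = args,                    # call arguments
    files   = if (length(files)) unname(tools::md5sum(files)) else character(),
    # Guard against an empty vector, which would otherwise list all variables
    env     = if (length(env_vars)) Sys.getenv(env_vars, unset = NA) else character(),
    pkgs    = vapply(pkgs, function(p) as.character(utils::packageVersion(p)),
                     character(1))
  )
  hash_object(parts)
}

f1 <- function(x) x + 1
f2 <- function(x) x + 2
input <- tempfile(fileext = ".txt")
writeLines("a", input)

k1 <- cache_key(f1, list(3), files = input, pkgs = "stats")

# Nothing changed: same key
identical(k1, cache_key(f1, list(3), files = input, pkgs = "stats"))
#> TRUE

# Function body changed: different key
identical(k1, cache_key(f2, list(3), files = input, pkgs = "stats"))
#> FALSE

# Tracked file content changed: different key
writeLines("b", input)
identical(k1, cache_key(f1, list(3), files = input, pkgs = "stats"))
#> FALSE
```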

Limitations & caveats

  • Package boundaries:
    cacheR stops tracking when it hits a function imported from a package.
    Instead, it records the package name and version. It does not inspect the internals of those functions.

  • Native code / C / external tools:
    C/C++ code and external tools (e.g. system("bwa mem ...")) are not tracked. If they change, cacheR will not notice unless their inputs / outputs change in a tracked place.

  • Side effects:
    Functions with side effects (writing to global variables, changing the random seed, writing to databases, etc.) are not fully “safe” to cache. Prefer pure, data-in/data-out functions.
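
As an illustration of the side-effects caveat, the sketch below shows one way to make a random simulation behave like a pure function: the seed is passed as an explicit argument, and the global RNG state is restored on exit. The function simulate_mean() is a hypothetical example written in base R and is not part of cacheR. With this pattern, the same arguments always produce the same result, which is the property a cache relies on.

```r
# Hypothetical example: a random computation made deterministic in its arguments.
simulate_mean <- function(n, seed) {
  global <- globalenv()
  had_seed <- exists(".Random.seed", envir = global, inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = global) else NULL
  # Restore the caller's RNG state so the function leaves no trace behind
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = global)
    } else if (exists(".Random.seed", envir = global, inherits = FALSE)) {
      rm(".Random.seed", envir = global)
    }
  })
  set.seed(seed)
  mean(rnorm(n))
}

# Same arguments, same result
identical(simulate_mean(10, seed = 42), simulate_mean(10, seed = 42))
#> TRUE
```

Because the seed is an ordinary argument, changing it changes the arguments of the call, so a cache keyed on arguments would treat it as a new computation.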

When you probably shouldn’t use cacheR

  • Highly stateful / interactive code where caching would confuse you more than it helps
  • Situations where you need full workflow orchestration, scheduling, and cluster execution (use snakemake/nextflow/targets/etc. instead)