cachepy
cachepy tracks your data and code so you don't have to.
Python port of cacheR.
What does cachepy do?
It automatically checks for changes in code and input data and re-runs the function only if necessary. Results are cached to disk as pickle files.
It's like Snakemake/Nextflow, but on the fly.
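To make the idea of caching results to disk as pickle files concrete, here is a minimal sketch of a disk-backed cache decorator. It is not cachepy's implementation: the names `disk_cache` and `_code_fingerprint` are hypothetical, and the key covers only the function's code and its arguments.

```python
import functools
import hashlib
import inspect
import pickle
import tempfile
from pathlib import Path


def _code_fingerprint(func):
    """Return bytes that change when the function's code changes (hypothetical helper)."""
    try:
        return inspect.getsource(func).encode()
    except (OSError, TypeError):
        # Source text unavailable (e.g. interactive session): fall back to bytecode.
        code = func.__code__
        return code.co_code + repr(code.co_consts).encode()


def disk_cache(cache_dir):
    """Hypothetical decorator: store each result as a pickle file in cache_dir."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The key combines the function's code with the pickled arguments.
            payload = _code_fingerprint(func) + pickle.dumps((args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()
            entry = cache_path / f"{func.__name__}-{key}.pkl"
            if entry.exists():
                # Cache hit: load the stored result instead of recomputing.
                with entry.open("rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            with entry.open("wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


calls = []  # records every real execution of the function body


@disk_cache(tempfile.mkdtemp())
def slow_square(x):
    calls.append(x)
    return x * x


first = slow_square(4)   # computed and written to disk
second = slow_square(4)  # read back from the pickle file
```

The second call never enters the function body, which is the behaviour the `calls` list is there to show.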
What is it useful for?
- Keeping analysis results up to date
- Saving time on expensive computations
- Not using obsolete results
- Reusing heavy computations safely and transparently
Quick example
from cachepy import cache_file

cache_dir = "/tmp/my_cache"

@cache_file(cache_dir)
def inner(x):
    return x + 1

@cache_file(cache_dir)
def outer(x):
    return inner(x) * 2

outer(3)
#> 8
How does cachepy decide to recompute?
A cached call is reused only if all of the following are unchanged:
- The function body (source code hash, including inline changes)
- The arguments (normalized and hashed — positional, named, and default-filled forms are equivalent)
- The tracked files / directories, where relevant (file_args, depends_on_files)
- The package versions of imported modules used by the function
- The environment variables specified via env_vars
- The version string, if provided
- Any external variables specified via depends_on_vars
If any of these change, cachepy invalidates the old entry and recomputes.
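The invalidation rules above amount to building one cache key from several inputs. The sketch below illustrates that idea under stated assumptions and does not reflect cachepy's internals. The helpers `normalize_args` and `cache_key` are hypothetical, bytecode stands in for a source code hash, and package versions and external variables are left out for brevity.

```python
import hashlib
import inspect
import os
import pickle
from pathlib import Path


def normalize_args(func, args, kwargs):
    """Map positional, named, and default-filled calls onto one canonical form."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return sorted(bound.arguments.items())


def cache_key(func, args, kwargs, version=None, env_vars=(), files=()):
    """Hypothetical key: a digest over every input that should invalidate an entry."""
    code = func.__code__
    parts = [
        ("name", func.__qualname__),
        ("code", code.co_code),              # stand-in for a source code hash
        ("consts", repr(code.co_consts)),
        ("args", normalize_args(func, args, kwargs)),
        ("version", version),
        ("env", [(n, os.environ.get(n)) for n in sorted(env_vars)]),
        ("files", [(p, hashlib.sha256(Path(p).read_bytes()).hexdigest())
                   for p in sorted(str(f) for f in files)]),
    ]
    return hashlib.sha256(pickle.dumps(parts)).hexdigest()


def scale(x, factor=3):
    return x * factor


# Three spellings of the same call produce the same key.
k_positional = cache_key(scale, (2,), {})
k_named = cache_key(scale, (), {"x": 2})
k_explicit = cache_key(scale, (2,), {"factor": 3})

# A different argument or a new version string produces a different key.
k_other_arg = cache_key(scale, (5,), {})
k_new_version = cache_key(scale, (2,), {}, version="v2")
```

Binding the call to the function signature and applying defaults is what makes `scale(2)`, `scale(x=2)`, and `scale(2, factor=3)` equivalent here.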
Limitations & caveats
- Package boundaries: cachepy stops tracking when it hits a function imported from an installed package. Instead, it records the package name and version.
- Native code / C extensions: C/C++ extensions and external tools are not tracked.
- Side effects: Functions with side effects are not fully safe to cache. Prefer pure, data-in / data-out functions.
- Pickle limitations: Results must be picklable.
- Argument hashing: Objects with non-deterministic pickling may produce unstable hashes. NumPy arrays, pandas DataFrames, and PyTorch tensors are handled correctly.
- No distributed execution: cachepy is a single-machine, single-process cache.
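The argument hashing caveat can be seen with plain dictionaries: two equal dicts built in a different insertion order pickle to different bytes, so a hash of the raw pickle differs. The sketch below shows one possible way to canonicalize containers before hashing. The helpers `naive_hash`, `canonicalize`, and `stable_hash` are hypothetical and are not part of cachepy.

```python
import hashlib
import pickle


def naive_hash(obj):
    """Hash the raw pickle bytes; sensitive to container ordering."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


def canonicalize(obj):
    """Recursively rewrite containers into an order-independent form."""
    if isinstance(obj, dict):
        items = [(repr(k), canonicalize(v)) for k, v in obj.items()]
        return ("dict", sorted(items, key=lambda kv: kv[0]))
    if isinstance(obj, (set, frozenset)):
        return ("set", sorted(repr(canonicalize(v)) for v in obj))
    if isinstance(obj, (list, tuple)):
        # Sequence order is meaningful, so it is preserved.
        return (type(obj).__name__, [canonicalize(v) for v in obj])
    return obj


def stable_hash(obj):
    """Hash the canonical form instead of the raw object."""
    return naive_hash(canonicalize(obj))


a = {"alpha": 1, "beta": 2}
b = {"beta": 2, "alpha": 1}  # equal to `a`, different insertion order

same_value = a == b
naive_differs = naive_hash(a) != naive_hash(b)
stable_matches = stable_hash(a) == stable_hash(b)
```

Relying on `repr` for keys and set members is a simplification; objects whose `repr` embeds a memory address would still hash unstably.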