Cache Dependency Graph — Deep Dive¶

cachepy automatically tracks which cached function called which and which files each depends on. This tutorial covers the full graph API.

Notebook version

This tutorial is also available as a Jupyter notebook: notebooks/04_cache_graph.ipynb

Building a Pipeline¶

When cached functions call other cached functions, parent → child edges are recorded automatically.

from cachepy import cache_file, cache_tree_nodes, cache_tree_reset

cache_dir = "/tmp/graph_demo"

@cache_file(cache_dir)
def load_counts(path):
    return {"genes": ["TP53", "BRCA1", "EGFR"], "counts": [100, 250, 80]}

@cache_file(cache_dir)
def normalize(counts):
    total = sum(counts["counts"])
    return {g: c / total for g, c in zip(counts["genes"], counts["counts"])}

@cache_file(cache_dir)
def pipeline(path):
    return normalize(load_counts(path))

result = pipeline("samples/counts.csv")

Inspecting Nodes¶

for nid, node in cache_tree_nodes().items():
    print(f"{node['fname']:15s}  parents={node.get('parents', [])}  "
          f"children={node.get('children', [])}")

Each node contains: fname, hash, parents, children, files, file_hashes, outfile.

File Tracking¶

track_file(path) registers a file dependency and stores its content hash:

from cachepy import track_file

@cache_file(cache_dir)
def read_data(path):
    p = track_file(path)  # registers dependency
    return p.read_text()

Staleness Detection¶

from cachepy import cache_tree_changed_files

stale = cache_tree_changed_files()
for nid, info in stale.items():
    print(f"{info['node']['fname']}: {info['changed_files']}")

Visualisation¶

from cachepy import plot_cache_graph

fig = plot_cache_graph(highlight_stale=True)

Node colours:

Colour	Meaning
Navy (#1D3557)	Cached and up-to-date
Amber (#FBBC04)	Stale — tracked file changed
Gray (#F1F3F5)	Cache file missing
Light blue (#E8F0FE)	Tracked file node

Save to file:

plot_cache_graph(output="graph.png")

Graph Persistence¶

from cachepy import cache_tree_save, cache_tree_load, cache_tree_sync

# Save / load
cache_tree_save("my_graph.pkl")
cache_tree_load("my_graph.pkl")

# Sync merges disk graph into memory (non-destructive)
cache_tree_sync(cache_dir)  # reads graph.pkl

The graph is also auto-persisted to graph.pkl in the cache directory on every function execution.

Querying by File¶

from cachepy import cache_tree_for_file

dependents = cache_tree_for_file("/path/to/data.tsv")
for nid, node in dependents.items():
    print(f"  {node['fname']}")

Complex DAG¶

Diamond-shaped dependencies (two branches merging) are tracked correctly:

@cache_file(cache_dir)
def integrate(sample):
    expr = branch_expression(sample)   # branch A
    muts = branch_mutations(sample)    # branch B
    return {g: expr[g] for g in muts if g in expr}

See the notebook for the full diamond DAG example with visualisation.