Content addressed cache #1383

@jaimergp

Checklist

  • I added a descriptive title
  • I searched open requests and couldn't find a duplicate

What is the idea?

Deduplicate sha256-identical files in the pkgs/ cache.

Right now we have something like this:

pkgs/
  package-1.0-0/
     file1
     file2
     file3
  package-2.0-0/
     file1
     file2
     file3
     file4

Many packages keep identical files across releases, and even more across variant builds. If a file is identified by its sha256 instead of its path, we can restructure the cache like this:

pkgs/
  content-addressed/  # probably with some more subdirectory levels per hash prefix
    48/
      4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
    d0/
      d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
    14/
      146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
    ba/
      baed639523840f550125dc6f7ed12cccdbb5ee15f5d1ccf957155b13eb8debad
  package-1.0-0/
     file1 -> ../content-addressed/48/4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
     file2 -> ../content-addressed/d0/d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
     file3 -> ../content-addressed/14/146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
  package-2.0-0/
     file1 -> ../content-addressed/48/4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
     file2 -> ../content-addressed/d0/d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
     file3 -> ../content-addressed/14/146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
     file4 -> ../content-addressed/ba/baed639523840f550125dc6f7ed12cccdbb5ee15f5d1ccf957155b13eb8debad

with -> denoting a hardlink.
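
A minimal sketch of how an already-extracted package directory could be converted into this layout. The function names (`sha256_of`, `dedupe_package`) and the single two-character prefix level are assumptions for illustration, not existing conda API:

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_package(pkg_dir: Path, store: Path) -> int:
    """Replace regular files in pkg_dir with hardlinks into store.

    Returns the number of files whose content was already in the store.
    """
    hits = 0
    for path in sorted(pkg_dir.rglob("*")):
        # Only regular files are deduplicated; symlinks and directories are left alone.
        if path.is_symlink() or not path.is_file():
            continue
        sha = sha256_of(path)
        blob = store / sha[:2] / sha
        blob.parent.mkdir(parents=True, exist_ok=True)
        if blob.exists():
            hits += 1
            # Link under a temporary name, then rename over the duplicate,
            # so the package file is never missing in between.
            tmp = path.with_name(path.name + ".tmp-link")
            os.link(blob, tmp)
            os.replace(tmp, path)
        else:
            # First time this content is seen: the store entry becomes
            # a second name for the same inode.
            os.link(path, blob)
    return hits
```

One caveat with this approach: hardlinks share permissions and timestamps, so two files with identical content but different modes would end up with the same mode. A real implementation would probably need to account for that.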

There might be even more cache hits if we standardize the prefix placeholder to the path of the cache plus the necessary padding.

Why is this needed?

My preliminary analysis shows substantial storage savings (about 11.8GiB, or roughly 41%):

  • Global size 28.8GiB
  • Dedup’d size 17.0GiB

I used this Marimo notebook:

import marimo

__generated_with = "0.13.0"
app = marimo.App(width="medium")


@app.cell
def _():
    import json
    from pathlib import Path

    return Path, json


@app.cell
def _(Path):
    CONDA_PKGS = Path("~/PATH/TO/YOUR/PKGS/DIRS").expanduser()

    def sizeof_fmt(num, suffix="B"):
        for unit in ("", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"):
            if abs(num) < 1024.0:
                return f"{num:3.1f}{unit}{suffix}"
            num /= 1024.0
        return f"{num:.1f}Yi{suffix}"

    return CONDA_PKGS, sizeof_fmt


@app.cell
def _(CONDA_PKGS, json):
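    # Map each unique sha256 to its size; total_size also counts duplicates.
    paths = {}
    total_size = 0
    for paths_json in CONDA_PKGS.glob("**/info/paths.json"):
        for path in json.loads(paths_json.read_text()).get("paths", ()):
            sha256 = path.get("sha256")
            if sha256 is None:
                # Skip entries that carry no hash; they cannot be deduplicated.
                continue
            size = path.get("size_in_bytes", 0)
            if size > paths.get(sha256, 0):
                paths[sha256] = size
            total_size += size
    return paths, total_size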


@app.cell
def _(paths, sizeof_fmt, total_size):
    print("Global size", sizeof_fmt(total_size))
    print("Dedup'd size", sizeof_fmt(sum(paths.values())))
    return


@app.cell
def _():
    return


if __name__ == "__main__":
    app.run()

What should happen?

The user probably won't notice anything other than a leaner cache. This might come at the cost of some overhead at extraction time, but that overhead is likely to shrink over time, because cache hits will let us skip unnecessary IO (just create the hardlink in the respective place).
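
A rough sketch of that cache-hit path, assuming the sha256 of each file is known before extraction (for example from `info/paths.json`). `link_or_extract` and its `read_content` callback are hypothetical names, not part of any existing extraction API:

```python
import os
from pathlib import Path
from typing import Callable


def link_or_extract(
    sha256: str,
    read_content: Callable[[], bytes],
    dest: Path,
    store: Path,
) -> bool:
    """Place a file at dest, reusing the store when the hash is already present.

    Returns True on a cache hit, in which case read_content is never called.
    """
    blob = store / sha256[:2] / sha256
    hit = blob.exists()
    if not hit:
        blob.parent.mkdir(parents=True, exist_ok=True)
        # Write under a temporary name first so a partial file never
        # appears under its final hash name.
        tmp = blob.with_name(blob.name + ".partial")
        tmp.write_bytes(read_content())
        os.replace(tmp, blob)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Assumes dest does not exist yet (fresh extraction); os.link raises otherwise.
    os.link(blob, dest)
    return hit
```

On a hit the only work done is one `exists()` check and one `os.link` call, which is where the IO savings would come from. This sketch trusts the declared hash and does not verify the content it writes.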

Additional Context

Same as conda/conda#14906
