Content addressed cache

### Checklist

- [x] I added a descriptive title
- [x] I searched open requests and couldn't find a duplicate

### What is the idea?

Deduplicate sha256-identical files in the `pkgs/` cache.

Right now we have something like this:

```bash
pkgs/
  package-1.0-0/
     file1
     file2
     file3
  package-2.0-0/
     file1
     file2
     file3
     file4
```

Many files stay identical across releases of a package, and even more across variant builds. If a file's "identity" is its sha256 rather than its path, we can restructure the cache like this:

```bash
pkgs/
  content-addressed/  # probably with some more subdirectory levels per hash prefix
    48/
      4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
    d0/
      d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
    14/
      146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
    ba/
      baed639523840f550125dc6f7ed12cccdbb5ee15f5d1ccf957155b13eb8debad
  package-1.0-0/
     file1 -> ../content-addressed/48/4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
     file2 -> ../content-addressed/d0/d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
     file3 -> ../content-addressed/14/146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
  package-2.0-0/
     file1 -> ../content-addressed/48/4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
     file2 -> ../content-addressed/d0/d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
     file3 -> ../content-addressed/14/146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
     file4 -> ../content-addressed/ba/baed639523840f550125dc6f7ed12cccdbb5ee15f5d1ccf957155b13eb8debad
```

with `->` denoting a hardlink.
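
A minimal sketch of how an already extracted package directory could be folded into this layout. The names `sha256_of` and `dedupe_package_dir` are hypothetical, and the two-character fan-out simply mirrors the diagram above; none of this is an existing conda API. It assumes the store and the package directories live on the same filesystem, since hardlinks cannot cross filesystems.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_package_dir(pkg_dir: Path, store: Path) -> int:
    """Hardlink every regular file in pkg_dir into store (hypothetical helper).

    Returns how many files were replaced by a link to an already stored blob.
    """
    replaced = 0
    for path in sorted(pkg_dir.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue  # only regular files are deduplicated in this sketch
        sha = sha256_of(path)
        blob = store / sha[:2] / sha
        if not blob.exists():
            # First time we see this content: the package file becomes the blob.
            blob.parent.mkdir(parents=True, exist_ok=True)
            os.link(path, blob)
        elif not path.samefile(blob):
            # Same content already stored: swap the copy for a hardlink.
            tmp = path.with_name(path.name + ".tmp-link")
            os.link(blob, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Running it over `package-1.0-0` and then `package-2.0-0` from the example would leave `file1`, `file2` and `file3` of both packages pointing at the same three blobs.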

There might be even more cache hits if we standardize the prefix placeholder to the path of the cache plus the necessary padding.

### Why is this needed?

My preliminary analysis shows substantial storage savings (about 11.8GiB, roughly 41%):

- Global size 28.8GiB
- Dedup’d size 17.0GiB

I used this Marimo notebook:

```python
import marimo

__generated_with = "0.13.0"
app = marimo.App(width="medium")


@app.cell
def _():
    import json
    from pathlib import Path

    return Path, json


@app.cell
def _(Path):
    CONDA_PKGS = Path("~/PATH/TO/YOUR/PKGS/DIRS").expanduser()

    def sizeof_fmt(num, suffix="B"):
        for unit in ("", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"):
            if abs(num) < 1024.0:
                return f"{num:3.1f}{unit}{suffix}"
            num /= 1024.0
        return f"{num:.1f}Yi{suffix}"

    return CONDA_PKGS, sizeof_fmt


@app.cell
def _(CONDA_PKGS, json):
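    # Map each unique sha256 to the size of the file it identifies.
    paths = {}
    # Sum of all file sizes, counting duplicates every time they appear.
    total_size = 0
    for paths_json in CONDA_PKGS.glob("**/info/paths.json"):
        for path in json.loads(paths_json.read_text()).get("paths", ()):
            size = path.get("size_in_bytes", 0)
            # Record each hash once; keep the largest size reported for it.
            if size > paths.get(path["sha256"], 0):
                paths[path["sha256"]] = size
            total_size += size
    return paths, total_size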


@app.cell
def _(paths, sizeof_fmt, total_size):
    print("Global size", sizeof_fmt(total_size))
    print("Dedup'd size", sizeof_fmt(sum(paths.values())))
    return


@app.cell
def _():
    return


if __name__ == "__main__":
    app.run()
```

### What should happen?

The user probably won't notice anything other than a leaner cache. This might add some overhead at extraction time, but that overhead is likely to shrink over time: cache hits let us skip unnecessary IO and just create the hardlink in the right place.
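
The cache-hit path could look roughly like this at extraction time. This is a sketch under assumptions: it supposes the expected sha256 of each file is known before its bytes are written (for example from `info/paths.json`, which the notebook above already reads), and `read_payload` is a hypothetical callable that returns the file's bytes from the archive. The function name `materialize` and the store layout are illustrative, not an existing conda API.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


def materialize(
    sha256: str,
    read_payload: Callable[[], bytes],
    dest: Path,
    store: Path,
) -> bool:
    """Place a file at dest, reusing the store when possible (hypothetical helper).

    Returns True on a cache hit, in which case read_payload is never called.
    Raises FileExistsError if dest already exists.
    """
    blob = store / sha256[:2] / sha256
    hit = blob.exists()
    if not hit:
        data = read_payload()
        if hashlib.sha256(data).hexdigest() != sha256:
            raise ValueError(f"payload does not match expected sha256 {sha256}")
        blob.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file in the same directory, then rename it
        # into place so a partially written blob is never visible.
        fd, tmp = tempfile.mkstemp(dir=blob.parent)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, blob)
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.link(blob, dest)  # the only IO needed on a cache hit
    return hit
```

One design caveat: hardlinked files share a single inode, so permissions and timestamps are shared across every package that links to the same blob. A real implementation would need to decide whether file mode is part of the blob's identity.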

### Additional Context

Same as https://github.com/conda/conda/issues/14906
