Description
Checklist
- I added a descriptive title
- I searched open requests and couldn't find a duplicate
What is the idea?
Deduplicate sha256-identical files in the pkgs/ cache.
Right now we have something like this:
    pkgs/
        package-1.0-0/
            file1
            file2
            file3
        package-2.0-0/
            file1
            file2
            file3
            file4

Across releases, many packages will keep identical files. Even more across variant builds. If the file "identity" (path) is instead its sha256, we can restructure the cache as such:
    pkgs/
        content-addressed/  # probably with some more subdirectory levels per hash prefix
            48/
                4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
            d0/
                d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
            14/
                146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
            ba/
                baed639523840f550125dc6f7ed12cccdbb5ee15f5d1ccf957155b13eb8debad
        package-1.0-0/
            file1 -> ../content-addressed/48/4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
            file2 -> ../content-addressed/d0/d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
            file3 -> ../content-addressed/14/146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
        package-2.0-0/
            file1 -> ../content-addressed/48/4819c16dc70219ac16888407d77e97a7774d5f877efec3998272248eed40b4b9
            file2 -> ../content-addressed/d0/d0229d43699e06e8f4efc5de64bdffa2b41d92f1aeb4bd27981f46f322eb1847
            file3 -> ../content-addressed/14/146e0dafbc5327c347f29ceffe904132a37c0b72b26819827d91832dfb8ec263
            file4 -> ../content-addressed/ba/baed639523840f550125dc6f7ed12cccdbb5ee15f5d1ccf957155b13eb8debad

with -> denoting a hardlink.
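To make the layout concrete, here is a minimal sketch of how an already-extracted file could be folded into such a store. It is not conda's implementation: the helper names (`blob_path`, `sha256_of`, `dedupe_file`), the single two-character shard level, and the `.dedup-tmp` suffix are all assumptions made for illustration, and it presumes the store and the package directories live on the same filesystem so hardlinks are possible.

```python
import hashlib
import os
from pathlib import Path


def blob_path(store: Path, sha256: str) -> Path:
    """Blob location in the store, sharded by the first two hex digits (assumed layout)."""
    return store / sha256[:2] / sha256


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_file(store: Path, file: Path) -> Path:
    """Make `file` a hardlink to its content-addressed blob and return the blob path."""
    blob = blob_path(store, sha256_of(file))
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():
        # First occurrence: the blob becomes a second name for the same inode.
        os.link(file, blob)
        return blob
    # Duplicate content: link the blob under a temporary name, then swap it in
    # atomically so `file` never disappears from its package directory.
    tmp = file.with_name(file.name + ".dedup-tmp")  # hypothetical suffix
    os.link(blob, tmp)
    os.replace(tmp, file)
    return blob
```

Running `dedupe_file` over `package-1.0-0/file1` and an identical `package-2.0-0/file1` would leave both paths pointing at the same inode, so the bytes are stored once.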
There might be even more cache hits if we standardize the prefix placeholder to the path of the cache plus the necessary padding.
Why is this needed?
My preliminary analysis shows substantial storage savings (about 41% in this case):
- Global size 28.8GiB
- Dedup’d size 17.0GiB
I used this Marimo notebook:
    import marimo

    __generated_with = "0.13.0"
    app = marimo.App(width="medium")


    @app.cell
    def _():
        import json
        import os
        from pathlib import Path
        return Path, json


    @app.cell
    def _(Path):
        CONDA_PKGS = Path("~/PATH/TO/YOUR/PKGS/DIRS").expanduser()

        def sizeof_fmt(num, suffix="B"):
            for unit in ("", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"):
                if abs(num) < 1024.0:
                    return f"{num:3.1f}{unit}{suffix}"
                num /= 1024.0
            return f"{num:.1f}Yi{suffix}"
        return CONDA_PKGS, sizeof_fmt


    @app.cell
    def _(CONDA_PKGS, json):
        paths = {}
        total_size = 0
        for paths_json in CONDA_PKGS.glob("**/info/paths.json"):
            for path in json.loads(paths_json.read_text()).get("paths", ()):
                size = path.get("size_in_bytes", 0)
                if size > paths.get(path["sha256"], 0):
                    paths[path["sha256"]] = size
                total_size += size
        return paths, total_size


    @app.cell
    def _(paths, sizeof_fmt, total_size):
        print("Global size", sizeof_fmt(total_size))
        print("Dedup'd size", sizeof_fmt(sum(paths.values())))
        return


    @app.cell
    def _():
        return


    if __name__ == "__main__":
        app.run()

What should happen?
The user probably won't notice anything other than a leaner cache. This might come at the cost of some overhead at extraction time, but that overhead is likely to shrink over time: cache hits will let us skip unnecessary IO and just create the hardlink in the respective place.
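The cache-hit shortcut could look roughly like the sketch below. This is an illustration only, not how conda extracts packages today: `place_file`, the `extract` callback, and the `.partial` suffix are hypothetical, and the sha256 is assumed to come from the package's `info/paths.json` entry so the blob can be looked up before any bytes are written.

```python
import os
from pathlib import Path
from typing import Callable


def place_file(
    store: Path,
    target: Path,
    sha256: str,
    extract: Callable[[Path], None],
) -> bool:
    """Materialize `target` from the store; return True if it was a cache hit."""
    blob = store / sha256[:2] / sha256  # assumed two-character shard level
    hit = blob.exists()
    if not hit:
        # Miss: pay the extraction IO once, writing to a temporary name first
        # so a half-written blob is never visible under its final hash.
        blob.parent.mkdir(parents=True, exist_ok=True)
        tmp = blob.with_name(blob.name + ".partial")  # hypothetical suffix
        extract(tmp)
        os.replace(tmp, blob)
    # Hit or miss, the package directory only ever receives a hardlink.
    target.parent.mkdir(parents=True, exist_ok=True)
    os.link(blob, target)
    return hit
```

With this shape, the second package that ships the same file never calls `extract` at all. A real implementation would also need to handle concurrent extractions and filesystems without hardlink support, which this sketch ignores.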
Additional Context
Same as conda/conda#14906