
Is there a way to cmorize out-of-memory datasets? #166

@aulemahal


Hi!

I'm tasked with making our CORDEX (Ouranos MRCC5) data publishable. Hourly files at NAM-11 are, of course, quite large. Opening a single year of tas, I get an array of shape (8759, 628, 655), which would take 13.4 GB of RAM (float32). Of course, xarray and dask can help me here, and I could, in theory, process this chunk by chunk. However, it seems at first glance that the cmorization tools in py-cordex will load the data into memory, making dask useless.
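
For context, here is a minimal sketch of how I open the data lazily today; the file pattern and chunk size are just placeholders for my setup:

```python
import xarray as xr

# Open the monthly hourly files lazily as one dask-backed dataset
# (pattern and chunk size are illustrative, not the actual paths).
ds = xr.open_mfdataset(
    "tas_NAM-11_*.nc",
    chunks={"time": 744},   # roughly one month of hourly steps per chunk
    combine="by_coords",
    parallel=True,
)

# ds.tas is a lazy dask array; nothing is loaded until .compute() or a write.
print(ds.tas)
```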

I think I see that the in-memory requirement comes from cmor itself, but I am asking here as this is where the xarray-compatible implementation is. Sorry if this isn't the best channel.

What do others do in that situation? Is having enough RAM a hard requirement for using cmor?

Similarly, the 1-year-per-file rule comes from the CORDEX file spec (I have access to the Feb 2023 draft). My data is stored in monthly netCDFs. Could the standardization process be done on the full dataset (all simulated years), with multiple files written afterwards? The one-year subsetting could even be automatic, based on the specs.

Finally, it seems to me that all of this would be much easier if there were a function that takes an xarray dataset and returns one or more standardized, cmorized xarray datasets, which I could then save with xr.save_mfdataset. Does that exist?
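
To make the request concrete, here is a rough sketch of the workflow I have in mind. `cmorize_dataset` is the hypothetical function I am asking about (it does not exist as far as I know), and the file names are only illustrative:

```python
import xarray as xr


def cmorize_dataset(ds: xr.Dataset) -> xr.Dataset:
    # Hypothetical placeholder for the function described above:
    # rename variables/dims, attach CF/CORDEX attributes, fix coordinates, etc.,
    # while keeping the dask-backed arrays lazy.
    return ds


# Open all simulated years lazily (pattern and chunking are illustrative).
ds = xr.open_mfdataset("tas_NAM-11_*.nc", chunks={"time": 744})
ds_cmor = cmorize_dataset(ds)

# Split into one dataset per simulated year and write them all at once,
# so each output file follows the 1-year-per-file rule.
years, datasets = zip(*ds_cmor.groupby("time.year"))
paths = [f"tas_NAM-11_{year}.nc" for year in years]
xr.save_mfdataset(list(datasets), paths)
```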

Thanks, and sorry for the long issue that's not a real issue.
