
Is there a way to cmorize out-of-memory datasets? #166

@aulemahal


Hi!

I'm tasked with making our CORDEX (Ouranos MRCC5) data publishable. Hourly files at NAM-11 are, of course, quite large. Opening a single year of tas, I get an array of shape (8759, 628, 655), which would take 13.4 GB of RAM (float32). Of course, xarray and dask can help me here, and I could, in theory, process this chunk by chunk. However, it seems at first glance that the cmorization tools in py-cordex will load the data into memory, making dask useless.
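
For context, here is a minimal sketch of how I open the data lazily today; the file pattern and chunk size are just placeholders for my setup:

```python
import xarray as xr

# Open the monthly hourly files lazily as one dask-backed dataset
# (pattern and chunk size are illustrative, not the actual paths).
ds = xr.open_mfdataset(
    "tas_NAM-11_*.nc",
    chunks={"time": 744},   # roughly one month of hourly steps per chunk
    combine="by_coords",
    parallel=True,
)

# ds.tas is a lazy dask array; nothing is loaded until .compute() or a write.
print(ds.tas)
```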

I think I see that the in-memory requirement comes from cmor itself, but I am asking here as this is where the xarray-compatible implementation is. Sorry if this isn't the best channel.

What do others do in that situation? Is having enough RAM a hard requirement for using cmor?

Similarly, the 1-year-per-file rule comes from the CORDEX file spec (I have access to the Feb 2023 draft). My data is stored in monthly netCDFs. Could the standardization process be done on the full dataset (all simulated years), with multiple files written afterwards? The one-year subsetting could even be automatic, based on the specs.

Finally, it seems to me that all of this would be much easier if there were a function that takes an xarray dataset and returns one or more standardized, cmorized xarray datasets, which I could then save with xr.save_mfdataset. Does that exist?
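
To make the request concrete, here is a rough sketch of the workflow I have in mind. `cmorize_dataset` is the hypothetical function I am asking about (it does not exist as far as I know), and the file names are only illustrative:

```python
import xarray as xr


def cmorize_dataset(ds: xr.Dataset) -> xr.Dataset:
    # Hypothetical placeholder for the function described above:
    # rename variables/dims, attach CF/CORDEX attributes, fix coordinates, etc.,
    # while keeping the dask-backed arrays lazy.
    return ds


# Open all simulated years lazily (pattern and chunking are illustrative).
ds = xr.open_mfdataset("tas_NAM-11_*.nc", chunks={"time": 744})
ds_cmor = cmorize_dataset(ds)

# Split into one dataset per simulated year and write them all at once,
# so each output file follows the 1-year-per-file rule.
years, datasets = zip(*ds_cmor.groupby("time.year"))
paths = [f"tas_NAM-11_{year}.nc" for year in years]
xr.save_mfdataset(list(datasets), paths)
```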

Thanks, and sorry for the long issue that's not a real issue.
