Hi!
I'm tasked with making our CORDEX (Ouranos MRCC5) data publishable. Hourly files at NAM-11 are, of course, quite large. Opening a single year of tas, I get an array of shape (8759, 628, 655), which would take 13.4 GB of RAM as float32. Of course, xarray and dask can help me here, and I could in theory process this chunk by chunk. However, at first glance it seems that the cmorization tools in py-cordex load the whole array into memory, which defeats the purpose of dask.
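For context, this is roughly how I open the data today; the file name and chunk size below are just placeholders:

```python
import xarray as xr

# Hypothetical file name and chunking, for illustration only.
ds = xr.open_dataset(
    "tas_NAM-11_1hr_2000.nc",   # placeholder name for one year of hourly tas
    chunks={"time": 744},       # ~1 month of hourly steps per dask chunk
)
tas = ds["tas"]                 # lazy dask array, shape (8759, 628, 655); nothing is loaded yet
```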
I think the in-memory requirement comes from cmor itself, but I'm asking here because this is where the xarray-compatible implementation lives. Sorry if this isn't the best channel.
What do others do in that situation? Is having enough RAM a hard requirement for using cmor?
Similarly, the one-year-per-file rule comes from the CORDEX file specification (I have access to the Feb 2023 draft). My data is stored in monthly netCDFs. Could the standardization be done on the full dataset (all simulated years), with the multiple yearly files written at the end? The one-year subsetting could even be automatic, based on the specs.
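A rough sketch of what I have in mind, with made-up file names and assuming the monthly files share the same grid and a proper datetime coordinate:

```python
import xarray as xr

# Open all monthly files lazily as a single multi-year dataset.
ds = xr.open_mfdataset("tas_NAM-11_1hr_*.nc", chunks={"time": 744}, combine="by_coords")

# Split (still lazily) into one dataset per simulated year, per the CORDEX one-year-per-file rule.
years, yearly = zip(*ds.groupby("time.year"))
paths = [f"cmorized_tas_NAM-11_1hr_{y}.nc" for y in years]  # illustrative output names

# Let dask stream each year to disk without holding everything in memory.
xr.save_mfdataset(list(yearly), paths)
```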
Finally, it seems to me that all of this would be much easier if there were a function that takes an xarray Dataset and returns one or more standardized, cmorized xarray Datasets, which I could then save with xr.save_mfdataset. Does that exist?
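Purely to illustrate the kind of interface I'm imagining (the function name and signature below are made up, not an existing py-cordex or cmor API):

```python
import xarray as xr

def cmorize_to_xarray(ds: xr.Dataset, table: str, domain: str) -> dict[str, xr.Dataset]:
    """Imagined helper: return standardized, per-year datasets keyed by their target filename."""
    raise NotImplementedError  # wishful thinking, not something that exists today

# How I would then use it, keeping everything lazy until the final write:
ds = xr.open_mfdataset("tas_NAM-11_1hr_*.nc", chunks={"time": 744})
out = cmorize_to_xarray(ds, table="CORDEX-CMIP6_1hr", domain="NAM-11")
xr.save_mfdataset(list(out.values()), list(out.keys()))
```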
Thanks and sorry for the long issue that's not a real issue.