[ENH] add feature scaling support for EncoderDecoderDataModule #1983

Open
PranavBhatP wants to merge 40 commits into sktime:main from PranavBhatP:feature-scaling

Conversation

@PranavBhatP
Contributor

@PranavBhatP PranavBhatP commented Oct 12, 2025

Reference Issues/PRs

fixes #1979

What does this implement/fix? Explain your changes.

Implements feature scaling for continuous targets and features in the D2 layer, using the input scalers and target_normalizer params of the EncoderDecoderDataModule.

Current state of this PR:

  • Extends the functionality of the existing preprocess_data to handle scaling of the target and continuous features using TorchNormalizer, EncoderNormalizer, StandardScaler and RobustScaler.
  • Fits all scalers and normalizers during setup and generates a preprocessed version of the dataset to feed into the respective train-val-test splits of the actual dataset.
  • NOTE: The EncoderNormalizer, unlike the others, needs to be fit on every encoding sequence and transform the sequence "on-the-fly". Its fitting and normalizing logic is kept separate and is done inside __getitem__ when the D2 dataset feeds data into the model during training (see the sketch below).
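A minimal sketch of that on-the-fly EncoderNormalizer path, for illustration only; _build_window, max_encoder_length and target_normalizer are assumed names, not the exact code in this PR:

```python
# Illustrative only: per-window fitting of EncoderNormalizer inside __getitem__.
from pytorch_forecasting.data.encoders import EncoderNormalizer

def __getitem__(self, idx):
    x, y = self._build_window(idx)  # hypothetical helper returning window tensors
    if isinstance(self.target_normalizer, EncoderNormalizer):
        # EncoderNormalizer is fit on each encoder window and applied on the fly
        self.target_normalizer.fit(y[: self.max_encoder_length])
        y = self.target_normalizer.transform(y)
    return x, y
```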

Issues (EDIT: resolved):

  • An issue arises with the usage of sklearn scalers. sklearn scalers accept only numpy or pandas data. However, when we cast the torch data (from the TimeSeries D1 object) into pandas/numpy data, the computation graph is broken. Refer to the comment.

UPDATE: THIS HAS BEEN FIXED. We move preprocessing to the datamodule level (during setup()) instead of calling the transform inside the D2 dataset's __getitem__. This ensures we don't break the computation graph, by moving the normalization pipeline entirely out of the training stage and performing it during the data module's setup stage.

DRAWBACK: We cannot work with larger-than-memory data formats with this method; that is the price of keeping compatibility with sklearn scalers.
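A rough sketch of that setup-time flow (illustrative only; _preprocess_all_data mirrors the diff, while _fit_scalers and the index attributes are assumed names):

```python
# Rough sketch, not the exact PR code: scalers are fitted once on the training split
# during setup(), and each split is transformed up front rather than in __getitem__.
def setup(self, stage=None):
    if stage == "fit":
        self._fit_scalers(self._train_indices)  # fit target_normalizer + feature scalers
        self._train_preprocessed = self._preprocess_all_data(self._train_indices)
        self._val_preprocessed = self._preprocess_all_data(self._val_indices)
    elif stage == "test":
        self._test_preprocessed = self._preprocess_all_data(self._test_indices)
```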

What should a reviewer concentrate their feedback on?

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@codecov

codecov bot commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 87.93970% with 24 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@7125289). Learn more about missing BASE report.

Files with missing lines Patch % Lines
pytorch_forecasting/data/data_module.py 87.79% 21 Missing ⚠️
pytorch_forecasting/base/_base_pkg.py 88.88% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1983   +/-   ##
=======================================
  Coverage        ?   86.63%           
=======================================
  Files           ?      165           
  Lines           ?     9919           
  Branches        ?        0           
=======================================
  Hits            ?     8593           
  Misses          ?     1326           
  Partials        ?        0           
Flag Coverage Δ
cpu 86.63% <87.93%> (?)
pytest 86.63% <87.93%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@PranavBhatP PranavBhatP marked this pull request as ready for review October 17, 2025 11:08
@PranavBhatP
Contributor Author

Tests are failing on the notebook. Seems like it's an issue caused by the training script in the notebook. Will try changing accelerator = "cpu"; that might resolve the issue.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@phoeenniixx
Member

I think we could look at the v1 implementation? Maybe that could help us with this issue: how are the sklearn scalers handled there?

@PranavBhatP
Contributor Author

PranavBhatP commented Oct 21, 2025

Yes, we will need a design doc to analyse the v1 implementation. This is the best way forward.

@fkiraly
Collaborator

fkiraly commented Oct 22, 2025

agree - it seems like one more rabbit hole...

Is there perhaps something we can learn from pytorch-tabular?

@phoeenniixx phoeenniixx mentioned this pull request Oct 28, 2025
@PranavBhatP
Contributor Author

I've replaced the loss function used in the notebook with nn.L1Loss(). I cannot think of a fix for the current issue with PTF metrics. Any thoughts yet about what is causing it? @phoeenniixx @agobbifbk @fkiraly? Do we proceed with this PR by running the model in the notebook with nn metrics?

Collaborator

@fkiraly fkiraly left a comment

May I request an explanation why the notebook is running prior to the changes but now failing?

@PranavBhatP
Contributor Author

PranavBhatP commented Nov 3, 2025

May I request an explanation why the notebook is running prior to the changes but now failing?

I did not get your question. Are you referring to why the tests were failing prior to the changes in the most recent commits? It is passing now... For context, the tests were failing on the notebook because training simply does not seem to work when we use in-house PTF metrics. More context on the initial failing test: https://github.com/sktime/pytorch-forecasting/actions/runs/18570677570/job/52943552548

However, the pipeline works perfectly fine when we use nn metrics. That is why we replace the metric parameter for the TFT model with nn.L1Loss() instead of MAE().

@agobbifbk

Hello @PranavBhatP, I looked at the scalers issue (that we can't fit sklearn scalers in the datamodule, as this breaks the computation graph). I have some comments, please lmk if you think this could work.

Currently we call _preprocess_data inside __getitem__ (see here). This means that whenever we perform scaling (inside _preprocess_data), we perform the scaling INSIDE the nested dataset class, which is then passed to the dataloaders. So scaling is actually happening DURING training, when the dataloader calls the __getitem__ of the dataset. And when we try calling sklearn scalers, it breaks the computation graph.

So, I looked at pytorch-tabular - how they actually solve this issue:

  • They perform scaling BEFORE training, as preprocess_data (where scaling etc. happens) is called not in __getitem__ but in setup. (see here)
  • The training dataloaders are not created at this point; they just cache the datasets (see here). These cached datasets hold the scaled, normalised data - they don't scale one __getitem__ at a time; rather the whole train split, val split etc. are scaled and normalised at once.
  • These cached datasets are then saved in the respective train_dataset, val_dataset properties (see here), which are then sent to the dataloaders (see here).
  • These dataloaders are then called directly during the training (see here).

I think we should do something similar, do the scaling BEFORE the training, not DURING the training.

What are your thoughts about it? Also FYI @agobbifbk @fkiraly

If we want to be able to work with data that cannot fit in memory, the scalers' transform method must happen in the __getitem__ function, during the training. Precomputing the scaled data in the different dataloaders can be done only if we are sure that the whole data can fit in memory. I know that this is time consuming and inefficient, but if we want to keep this feature I don't see any other possible solution. We can rewrite the partial fit for the StandardScaler and MinMaxScaler in pure torch so that we don't need to pass through numpy, but the transformation should still happen in the __getitem__ (see the sketch below). We can decide that this D2 layer is only compliant with a D1 layer holding a single dataframe that fits in memory and postpone the problem, but probably with a little effort we can manage to do both with a single flag (easier also to maintain). @fkiraly what do you think?
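A sketch of what such a pure-torch partial_fit standard scaler could look like, following the suggestion above; the class and method names are illustrative and not part of this PR:

```python
# Hedged sketch of a running standardizer with a partial_fit-style API, in pure torch,
# so batches can be scaled inside __getitem__ without converting to numpy.
import torch

class TorchRunningStandardScaler:
    """Accumulates mean/variance batch by batch (Welford/Chan update)."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = None
        self.m2 = None  # running sum of squared deviations
        self.eps = eps

    def partial_fit(self, x: torch.Tensor) -> "TorchRunningStandardScaler":
        # expects x of shape (n_samples, n_features)
        x = x.reshape(-1, x.shape[-1]).double()
        n_b = x.shape[0]
        mean_b = x.mean(dim=0)
        m2_b = ((x - mean_b) ** 2).sum(dim=0)
        if self.mean is None:
            self.count, self.mean, self.m2 = n_b, mean_b, m2_b
        else:
            delta = mean_b - self.mean
            total = self.count + n_b
            self.mean = self.mean + delta * n_b / total
            self.m2 = self.m2 + m2_b + delta**2 * self.count * n_b / total
            self.count = total
        return self

    def transform(self, x: torch.Tensor) -> torch.Tensor:
        std = torch.sqrt(self.m2 / max(self.count - 1, 1) + self.eps)
        return (x - self.mean.to(x.dtype)) / std.to(x.dtype)
```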

@PranavBhatP
Contributor Author

PranavBhatP commented Dec 19, 2025

@agobbifbk I say we merge with whatever we have (after review) and extend scaling via partial_fit to larger-than-memory datasets in a new PR?

@agobbifbk

Ok!

@PranavBhatP
Contributor Author

PranavBhatP commented Dec 28, 2025

@fkiraly @phoeenniixx @agobbifbk would appreciate another review, this PR is complete from my side, do take a look whenever you're free. Thanks.

Member

@phoeenniixx phoeenniixx left a comment

Thanks!
Please add the complete docstrings in the functions and also remove the TODOs that are addressed by this PR.
I also have some suggestions:

  • I think we should also add support for the GroupNormalizer and MultiNormalizer of ptf (see the comment below).
  • I am a little worried about the test stage of the datamodule, here: if we are NOT using the same datamodule that we used for fitting, the code may break as the fitted scalers etc. are not present?
    We should also think of saving the scalers/transformers in this case? I think this is where the pkg layer comes in? I think we need some discussion about this?

What do you think? Also FYI @fkiraly @agobbifbk


elif stage == "test":
    if not hasattr(self, "test_dataset"):
        self._test_preprocessed = self._preprocess_all_data(self._test_indices)
Member

If we are creating a new datamodule for testing (not using the dm that we used for fit), will this break as we don't have the fitted scalers etc. here?

Contributor Author

Yes it will break, thanks for pointing that out, we need to figure out a way to persist the scalers...

Contributor Author

I have fixed it, it works as intended in the latest version, please review it when possible.

@PranavBhatP
Contributor Author

PranavBhatP commented Jan 1, 2026

I think we should also add the support of GroupNormalizer, MultiNormalizer of ptf (see the comment below).

MultiNormalizer is something we could take a look at for sure, but the data from it would be unusable at the model layer since we don't support multi-target forecasts in the D2 layer.

GroupNormalizer would be much harder to implement imo. It would demand much more change to the current architecture since it requires handling with pandas and so on. I think supporting groups should be a separate rework/PR; it would require more time.

We should also think of saving the scalers, transformers in this case? I think this is where pkg layer comes in? I think we need some discussion about this?

Yes, that is a valid concern, we need to globalise the transforms/scalers used. We also need to figure out how to save the state across data modules. For now it throws a RuntimeError (see the commit below).
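Roughly, the guard looks like this (illustrative sketch only; the attribute name and message are assumptions, not the exact code from the commit):

```python
# Illustrative sketch of a fitted-scaler guard for the test/predict stages.
def _check_scalers_fitted(self):
    if not getattr(self, "_scalers_fitted", False):  # assumed flag set during setup(stage="fit")
        raise RuntimeError(
            "Scalers/normalizers have not been fitted. Run setup(stage='fit') on this "
            "datamodule, or load previously saved scalers, before the test stage."
        )
```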

@PranavBhatP
Contributor Author

Looks like the lack of fitting will lead to failures... we have to add persistence before we proceed.

@PranavBhatP
Contributor Author

PranavBhatP commented Jan 5, 2026

We should also think of saving the scalers, transformers in this case? I think this is where pkg layer comes in? I think we need some discussion about this?

@phoeenniixx Could you give a rough mental idea about how the package layer would work on this? I am trying to come up with something and a reference would be great!

@PranavBhatP
Contributor Author

I have added saving and loading of scalers. The PR is complete with respect to the current requirements.

@agobbifbk

Nice one!
Just a comment: a more efficient and less memory-demanding approach can rely on partial_fit, like here. Since there are different scalers in this PR, and not all have the partial_fit method implemented, I think we can solve it in a different issue. Can we also add MinMaxScaler, or any scaler in sklearn? (so the user can pass sklearn.preprocessing.MinMaxScaler(feature_range=(0.05, 0.95)) directly as an argument and evaluate the string afterwards (eval(scaler)))?
THX!
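A quick sketch of the string-based option suggested above; the _resolve_scaler helper is illustrative and not part of this PR:

```python
# Illustrative only: accept either a scaler instance or an sklearn expression string.
import sklearn.preprocessing

def _resolve_scaler(scaler):
    if isinstance(scaler, str):
        # eval on user-supplied strings is only reasonable for trusted configuration
        return eval(scaler)
    return scaler

scaler = _resolve_scaler("sklearn.preprocessing.MinMaxScaler(feature_range=(0.05, 0.95))")
```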

@PranavBhatP
Contributor Author

PranavBhatP commented Jan 24, 2026

Since there are different scalers in this PR, and not all have the partial_fit method implemented, I think we can solve it in a different issue. Can we also add MinMaxScaler, or any scaler in sklearn? (so the user can pass sklearn.preprocessing.MinMaxScaler(feature_range=(0.05, 0.95)) directly as an argument and evaluate the string afterwards (eval(scaler)))?

Maybe we need a torch version of partial_fit? That's for another issue I think. We can merge this one and work on that next or someone can take it up.

Adding MinMaxScaler also makes perfect sense, it would be a good addition in a future PR (good first issue?)

@PranavBhatP
Contributor Author

@phoeenniixx @fkiraly I would like a review.

Member

@phoeenniixx phoeenniixx left a comment

Thanks! I have added a few suggestions related to loading and saving scalers.

if self.datamodule is not None:
    self.datamodule.time_series_dataset = data
    self.datamodule.setup(stage="predict")
    return self.datamodule.predict_dataloader()
Member

Why aren't we using the dm from the next line (line 181) here? I think we could directly use that to set up the predict stage? (See line 193)

Contributor Author

@PranavBhatP PranavBhatP Jan 27, 2026

The reason for this (according to my interpretation): the dm in line 181 is a new data module created from scratch using the input data, which in this case is the data you want to generate predictions for. But if the datamodule has already been set up for training and has fitted scalers, there is no necessity to load scalers from a saved state; you simply set the dataset to the data you want to predict on and call setup(stage="predict") on this, right? It also makes the process more efficient, since we don't have to create new dataloaders every time we reuse the same datamodule in the predict stage.

But there is also a tiny concern: since we are replacing the training data entirely, we won't have access to the old data. But I don't see how this would cause any issues.

I hope this logic seems alright, if there is anything wrong do let me know.

Contributor Author

Maybe instead of overriding the data completely, I could create a copy and then pass the scaler states to the new datamodule, without the need for a pickle file.
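Something along these lines, as an illustrative sketch (attribute and method names follow the snippet above; the copy-based reuse is an assumption, not the final implementation):

```python
# Illustrative sketch: reuse the fitted scaler state by copying the trained datamodule
# instead of overwriting its data in place.
import copy

predict_dm = copy.copy(self.datamodule)   # shallow copy keeps references to the fitted scalers
predict_dm.time_series_dataset = data     # point the copy at the data to predict on
predict_dm.setup(stage="predict")
dataloader = predict_dm.predict_dataloader()
```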

save_ckpt: bool = True,
ckpt_dir: str | Path = "checkpoints",
ckpt_kwargs: dict[str, Any] | None = None,
save_scalers: bool = True,
Member

Should we expose this param? I mean if the user is checkpointing the model, they will always save the scalers, no? Are there any cases when the user only uses the models and not the scalers?

Contributor Author

@PranavBhatP PranavBhatP Jan 27, 2026

I am not 100% sure whether this param should be exposed. I was thinking the user might not want to save their scalers after training, due to memory constraints or simply because they know they are not going to be testing or using the scalers afterwards in a new session, and hence won't require any kind of persistence?


return windows

def save_scalers(self, path: str | Path):
Member

Should we separate the concerns here?

  • The datamodule just does the data loading and the preprocessing as intended
  • The pkg layer actually saves and loads these scalers?

What I mean is that the logic of saving and loading should be moved to the pkg layer. The datamodule just returns the scaler instances and accepts scaler instances; the pkg layer saves and loads the scalers and then sends them to the datamodule as instances?

What do you think?
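A minimal sketch of that separation (get_scalers/set_scalers are hypothetical datamodule hooks, not existing API):

```python
# Illustrative only: the pkg layer owns persistence; the datamodule only exposes
# and accepts scaler instances via hypothetical get_scalers()/set_scalers() hooks.
import pickle
from pathlib import Path

def save_scalers(datamodule, path: str | Path) -> None:
    with open(path, "wb") as f:
        pickle.dump(datamodule.get_scalers(), f)

def load_scalers(datamodule, path: str | Path) -> None:
    with open(path, "rb") as f:
        datamodule.set_scalers(pickle.load(f))
```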


Development

Successfully merging this pull request may close these issues.

[ENH] Add Support for Feature Scaling in D2 DataModule Layer.
