Description
Environment Details
- SDV version: 1.33.1
Error Description
SDV offers many different demo datasets for testing. When downloading a dataset, SDV returns the actual data as well as the SDV metadata file that describes it.
For the financial dataset, the metadata file does describe the data. However, for each table, it lists the columns in a different order than the actual data. SDV is still able to model and sample synthetic data for it, but because SDV always treats the metadata as the source of truth, the synthetic data follows the column order of the metadata (which is different from the original data).
Expected Behavior
The metadata should be updated. I expect that the downloaded metadata for the financial demo should list the columns in the same order as the actual data tables (from left to right).
Steps to reproduce
Metadata is out of order: Download the financial demo dataset and inspect the data for a particular table (say, account), noting the order of the columns.
from sdv.datasets.demo import download_demo

data, metadata = download_demo(
    modality='multi_table',
    dataset_name='financial'
)
data['account'].head()
But notice that the metadata has a different order.
print(metadata)
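Rather than comparing the printed metadata against the table by eye, the mismatch can be checked programmatically. The following is a minimal sketch, not part of SDV: `find_column_order_mismatches` is a hypothetical helper, and it assumes the metadata has been converted to a plain dictionary shaped like `{'tables': {table_name: {'columns': {...}}}}`. The table and column names in the toy example are placeholders, not the real financial schema.

```python
def find_column_order_mismatches(metadata_dict, data_columns):
    """Return {table_name: (metadata_order, data_order)} for every table
    whose shared columns appear in a different order in the metadata
    than in the data.

    metadata_dict: plain dict, assumed shape {'tables': {name: {'columns': {...}}}}
    data_columns:  dict mapping table name to a list of column names
    """
    mismatches = {}
    for table_name, table_meta in metadata_dict.get('tables', {}).items():
        meta_cols = list(table_meta.get('columns', {}))
        data_cols = list(data_columns.get(table_name, []))
        # Compare only columns present on both sides, so that a missing
        # or extra column is not reported as an ordering problem.
        metadata_order = [c for c in meta_cols if c in data_cols]
        data_order = [c for c in data_cols if c in meta_cols]
        if metadata_order != data_order:
            mismatches[table_name] = (metadata_order, data_order)
    return mismatches


# Toy example with placeholder names (hypothetical, not the financial schema)
toy_metadata = {'tables': {'toy_table': {'columns': {'b': {}, 'a': {}, 'c': {}}}}}
toy_columns = {'toy_table': ['a', 'b', 'c']}
print(find_column_order_mismatches(toy_metadata, toy_columns))
```

With the demo objects above, this could be called as `find_column_order_mismatches(metadata.to_dict(), {name: list(df.columns) for name, df in data.items()})`, assuming `metadata.to_dict()` returns a dictionary in the shape described.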
Now, if you create synthetic data, it will (correctly) follow the order of the metadata. But since the metadata order doesn't match the real data, there is a mismatch.
from sdv.multi_table import HSASynthesizer
synthesizer = HSASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(scale=1.0)
synthetic_data['account'].head()
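Until the demo metadata is corrected, one possible workaround is to reorder the synthetic tables after sampling so they match the real data. This is only a sketch: `reorder_like_real` is a hypothetical helper, not an SDV function, and it assumes both arguments are dictionaries mapping table names to pandas DataFrames.

```python
import pandas as pd


def reorder_like_real(synthetic_tables, real_tables):
    """Return a new dict of synthetic tables whose columns follow the
    left-to-right order of the corresponding real tables."""
    reordered = {}
    for name, synthetic_df in synthetic_tables.items():
        real_df = real_tables.get(name)
        if real_df is None:
            # No real table to compare against: leave this table unchanged
            reordered[name] = synthetic_df
            continue
        # Columns in real-data order, limited to those that were synthesized
        real_order = [c for c in real_df.columns if c in synthetic_df.columns]
        # Keep any synthetic-only columns at the end rather than dropping them
        extras = [c for c in synthetic_df.columns if c not in real_order]
        reordered[name] = synthetic_df[real_order + extras]
    return reordered


# Toy example with placeholder tables (hypothetical names and values)
real = {'toy_table': pd.DataFrame({'a': [1], 'b': [2]})}
synthetic = {'toy_table': pd.DataFrame({'b': [3], 'a': [4]})}
print(list(reorder_like_real(synthetic, real)['toy_table'].columns))
```

For the reproduction above, the call would be `synthetic_data = reorder_like_real(synthetic_data, data)`. This only changes how the output is presented; the column order in the metadata itself is unchanged.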