In the financial demo dataset, the metadata does not list columns in the correct order #2803

@npatki

Environment Details

  • SDV version: 1.33.1

Error Description

SDV offers many different demo datasets for testing. When downloading a dataset, SDV returns the actual data as well as the SDV metadata file that describes it.

For the financial dataset, the metadata file does describe the data. However, for each table, it lists the columns in a different order than the actual data. SDV can still model the data and sample synthetic data from it, but because SDV always treats the metadata as the source of truth, the synthetic data follows the column order of the metadata (which differs from the original data).
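One way to confirm the mismatch programmatically is to compare the column order in the metadata dictionary against the column order of each table. The sketch below is illustrative only: `find_column_order_mismatches` is a hypothetical helper (not part of SDV), it assumes the dictionary has the `tables -> <table> -> columns` layout returned by `metadata.to_dict()`, and the toy column names are made up rather than taken from the financial dataset.

```python
def find_column_order_mismatches(metadata_dict, data_columns):
    """Return {table: (metadata_order, data_order)} for tables whose orders differ.

    metadata_dict: assumed to look like the output of metadata.to_dict()
    data_columns:  mapping of table name -> list of column names in the data
    """
    mismatches = {}
    for table_name, table_meta in metadata_dict['tables'].items():
        metadata_order = list(table_meta['columns'])
        data_order = list(data_columns[table_name])
        if metadata_order != data_order:
            mismatches[table_name] = (metadata_order, data_order)
    return mismatches


# Toy stand-in for the real metadata and data (column names are hypothetical).
toy_metadata = {
    'tables': {
        'account': {
            'columns': {
                'col_b': {'sdtype': 'categorical'},
                'col_a': {'sdtype': 'id'},
            }
        }
    }
}
toy_columns = {'account': ['col_a', 'col_b']}

mismatches = find_column_order_mismatches(toy_metadata, toy_columns)
print(mismatches)
```

With the real demo, the same helper could presumably be called as `find_column_order_mismatches(metadata.to_dict(), {name: list(df.columns) for name, df in data.items()})`.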

Expected Behavior

The metadata should be updated. I expect that the downloaded metadata for the financial demo should list the columns in the same order as the actual data tables (from left to right).
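Until the hosted metadata is corrected, a user-side workaround might be to reorder the columns in the metadata dictionary so they match the data. This is only a sketch under assumptions: `reorder_metadata_columns` is a hypothetical helper, the dictionary layout is assumed to match `metadata.to_dict()`, and the toy column names are invented for illustration.

```python
import copy


def reorder_metadata_columns(metadata_dict, data_columns):
    """Return a copy of metadata_dict whose columns follow the data's order."""
    reordered = copy.deepcopy(metadata_dict)
    for table_name, table_meta in reordered['tables'].items():
        columns = table_meta['columns']
        # Columns present in the data come first, in the data's order.
        ordered = {
            name: columns[name]
            for name in data_columns[table_name]
            if name in columns
        }
        # Any column listed only in the metadata is kept at the end.
        for name, spec in columns.items():
            ordered.setdefault(name, spec)
        table_meta['columns'] = ordered
    return reordered


# Toy example (hypothetical column names).
toy_metadata = {
    'tables': {
        'account': {
            'columns': {
                'col_b': {'sdtype': 'categorical'},
                'col_a': {'sdtype': 'id'},
            }
        }
    }
}
fixed = reorder_metadata_columns(toy_metadata, {'account': ['col_a', 'col_b']})
print(list(fixed['tables']['account']['columns']))
```

The reordered dictionary could then likely be turned back into a metadata object with `Metadata.load_from_dict(...)` from `sdv.metadata`, assuming the loader preserves the dictionary's column order.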

Steps to reproduce

Metadata is out-of-order: Download the financial demo dataset and inspect the data for a particular table (say, account). Observe the order of the columns.

from sdv.datasets.demo import download_demo

data, metadata = download_demo(
    modality='multi_table',
    dataset_name='financial')

data['account'].head()

But notice that the metadata has a different order.

print(metadata)

Now, if you create synthetic data, it will (correctly) follow the order of the metadata. But since the metadata order doesn't match the real data, there is a mismatch.

from sdv.multi_table import HSASynthesizer

synthesizer = HSASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(scale=1.0)

synthetic_data['account'].head()
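As a stopgap on the output side, the synthetic tables can be reordered after sampling so that their columns match the real data. The helper below is a hypothetical sketch (not an SDV function) that only computes the target column order. It works on plain lists of column names, and the names in the example are made up.

```python
def columns_in_real_order(real_columns, synthetic_columns):
    """Order the synthetic columns to follow the real data's column order."""
    synthetic = list(synthetic_columns)
    # Columns shared with the real data come first, in the real data's order.
    ordered = [name for name in real_columns if name in synthetic]
    # Columns that exist only in the synthetic data are appended at the end.
    ordered += [name for name in synthetic if name not in ordered]
    return ordered


# Toy example (hypothetical column names).
order = columns_in_real_order(['col_a', 'col_b', 'col_c'], ['col_c', 'col_a', 'col_b'])
print(order)
```

With pandas DataFrames, this could be applied per table, for example `synthetic_data[name] = synthetic_data[name][columns_in_real_order(data[name].columns, synthetic_data[name].columns)]`.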

Labels: bug