Submission by Benels, containing code for both the Flat and Sequential Data Challenges.
The solution is essentially a fine-tuned version of mostlyai-engine, so its dependencies match those of the GPU version of the engine, as described in the original pyproject.toml file.
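For reference, a minimal environment setup could look as follows (a sketch only; the `[gpu]` extra comes from the engine's own installation instructions, reproduced further below):

```bash
# install the GPU variant of the engine, matching the solution's dependencies
pip install -U 'mostlyai-engine[gpu]'
```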
To run an experiment, place the test CSV in the same folder as the solution's main.py, then run main.py on GPU (an AWS EC2 g5.2xlarge instance, as specified in the competition description) using the command for the respective Flat or Sequential challenge below.
For the Flat challenge evaluation, please run:

```bash
python main.py [CSV FILE NAME].csv --folder_name [SAVE FOLDER NAME] --choice flat
```

- [CSV FILE NAME]: Substitute this with your file name (e.g., flat_training).
- [SAVE FOLDER NAME]: Specify the folder where results will be saved (e.g., flat_training_folder). Using the same folder name multiple times will overwrite the contents of the folder, including the generated CSV file!
- The output CSV, named synthetic_flat.csv, will be placed in the save folder.
For instance, a complete version of the command could be:

```bash
python main.py flat_test.csv --folder_name benels_flat_submission --choice flat
```
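To sanity-check the run, the generated file can then be loaded with pandas (a minimal sketch; it assumes the save folder was created in the current working directory):

```python
import pandas as pd

# load the synthetic output written by the flat run above
df = pd.read_csv("benels_flat_submission/synthetic_flat.csv")
print(df.shape)
```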
For the Sequential challenge evaluation, please run:

```bash
python main.py [CSV FILE NAME].csv --folder_name [SAVE FOLDER NAME] --choice sequential
```

- [CSV FILE NAME]: Substitute this with your file name (e.g., sequential_training).
- [SAVE FOLDER NAME]: Specify the folder where results will be saved (e.g., sequential_training_folder). Using the same folder name multiple times will overwrite the contents of the folder, including the generated CSV file!
- The output CSV, named synthetic_seq.csv, will be placed in the save folder.
For instance, a complete version of the command could be:

```bash
python main.py sequential_test.csv --folder_name benels_sequential_submission --choice sequential
```

The original MostlyAI README is left below for reference:
Documentation | Technical Paper | Free Cloud Service
Create high-fidelity privacy-safe synthetic data:
- prepare, analyze, and encode original data
- train a generative model on the encoded data
- generate synthetic data samples to your needs:
  - up-sample / down-sample
  - conditionally generate
  - rebalance categories
  - impute missings
  - incorporate fairness
  - adjust sampling temperature
...all within your safe compute environment, all with a few lines of Python code 💥.
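For example, several of the sampling options listed above are exposed as parameters of the generation step. A minimal sketch (`sample_size` appears in the engine's own examples below; `sampling_temperature` is an assumption about the generation options, so verify it against the installed version):

```python
from pathlib import Path
from mostlyai import engine

ws = Path("ws-tabular-flat")  # workspace containing an already trained model

# up-sample to 100k rows and soften the sampling distribution;
# `sampling_temperature` is assumed here - check the API of your engine version
engine.generate(
    workspace_dir=ws,
    sample_size=100_000,
    sampling_temperature=0.9,
)
```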
Note: This library is the underlying model engine of the Synthetic Data SDK. Please refer to the latter for an easy-to-use, higher-level software toolkit.
The latest release of mostlyai-engine can be installed via pip:
```bash
pip install -U mostlyai-engine
```

or alternatively for a GPU setup (needed for LLM finetuning and inference):

```bash
pip install -U 'mostlyai-engine[gpu]'
```

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-engine:

```bash
pip install -U torch==2.6.0+cpu torchvision==0.21.0+cpu mostlyai-engine --extra-index-url https://download.pytorch.org/whl/cpu
```

Tabular Model: flat data, without context

```python
from pathlib import Path
import pandas as pd
from mostlyai import engine
# set up workspace and default logging
ws = Path("ws-tabular-flat")
engine.init_logging()
# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/census"
trn_df = pd.read_csv(f"{url}/census.csv.gz")
# execute the engine steps
engine.split( # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
workspace_dir=ws,
tgt_data=trn_df,
model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)  # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws) # encode training data to `{ws}/OriginalData/encoded-data`
engine.train( # train model and store to `{ws}/ModelStore/model-data`
workspace_dir=ws,
max_training_time=1, # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws) # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData")  # load synthetic data
```

Tabular Model: sequential data, with context

```python
from pathlib import Path
import pandas as pd
from mostlyai import engine
# set up workspace and default logging
ws = Path("ws-tabular-sequential")
engine.init_logging()
# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball"
trn_ctx_df = pd.read_csv(f"{url}/players.csv.gz") # context data
trn_tgt_df = pd.read_csv(f"{url}/batting.csv.gz") # target data
# execute the engine steps
engine.split( # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/(tgt|ctx)-data`
workspace_dir=ws,
tgt_data=trn_tgt_df,
ctx_data=trn_ctx_df,
tgt_context_key="players_id",
ctx_primary_key="id",
model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)  # generate column-level statistics to `{ws}/ModelStore/(tgt|ctx)-stats/stats.json`
engine.encode(workspace_dir=ws) # encode training data to `{ws}/OriginalData/encoded-data`
engine.train( # train model and store to `{ws}/ModelStore/model-data`
workspace_dir=ws,
max_training_time=1, # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws) # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData")  # load synthetic data
```

Language Model: flat data, without context

```python
from pathlib import Path
import pandas as pd
from mostlyai import engine
# init workspace and logging
ws = Path("ws-language-flat")
engine.init_logging()
# load original data
trn_df = pd.read_parquet("https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/headlines/headlines.parquet")
trn_df = trn_df.sample(n=10_000, random_state=42)
# execute the engine steps
engine.split( # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
workspace_dir=ws,
tgt_data=trn_df,
tgt_encoding_types={
'category': 'LANGUAGE_CATEGORICAL',
'date': 'LANGUAGE_DATETIME',
'headline': 'LANGUAGE_TEXT',
}
)
engine.analyze(workspace_dir=ws) # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws) # encode training data to `{ws}/OriginalData/encoded-data`
engine.train( # train model and store to `{ws}/ModelStore/model-data`
workspace_dir=ws,
    max_training_time=2,  # limit TRAIN to 2 minutes for demo purposes
model="MOSTLY_AI/LSTMFromScratch-3m", # use a light-weight LSTM model, trained from scratch (GPU recommended)
# model="microsoft/phi-1.5", # alternatively use a pre-trained HF-hosted LLM model (GPU required)
)
engine.generate( # use model to generate synthetic samples to `{ws}/SyntheticData`
workspace_dir=ws,
sample_size=10,
)
pd.read_parquet(ws / "SyntheticData")  # load synthetic data
```