This repository provides a system to submit, run, and evaluate machine learning models on a private testset. It integrates a Discord bot user interface, a Flask API for leaderboards, a Celery task queue for asynchronous evaluation, Docker sandboxing for secure model execution, and a small database to store submissions and team scores.
The bot was developed for the Haick Datathon, a competition organized by the School of AI scientific club, where I currently serve as technical manager. This README replaces the previous brief description with a full developer-focused guide: development setup, technical architecture, Celery and Docker usage, deployment notes, testing, and troubleshooting.
- Project overview
- Quick start (local, without Docker)
- Docker & docker-compose (recommended)
- Celery (workers, scheduling, monitoring)
- Database and migrations
- Running inference and scoring
- API & Discord bot
- Testing
- Deployment notes
- Troubleshooting and tips
- File layout and responsibilities
- Security and sandboxing
- Contributing
Core responsibilities:
- Accept user submissions (model files + optional inference script) via Discord commands
- Run inference securely on a private testset inside containers
- Score outputs and update public/private leaderboards
- Persist submissions, teams, and results to the database
- Provide a REST API for leaderboard consumption
Key files and modules:
- `main.py`: Discord bot entrypoint; defines slash commands and the submission flow.
- `api_server.py`: Flask app exposing leaderboard endpoints.
- `celery_app.py`: Celery application configuration.
- `celery_tasks.py`: Celery tasks for scheduling and running evaluations.
- `run_inference_job.py`: entrypoint used to perform inference inside a sandbox/container.
- `scoring.py`: scoring logic for predictions vs. ground truth.
- `inference.py`: helpers used by inference scripts to run a model on the testset.
- `utils.py`: misc utilities.
- `create_teams.py`: DB bootstrap utility for inserting initial teams/participants.
- `database/`: SQLAlchemy models, operations, and session wiring.
- `models/`: directory where user-provided model artifacts (`.pth`, `.onnx`) are stored.
- `testset/`: private test dataset (not distributed).
Prereqs:
- Python 3.8+
- A virtual environment (venv/conda)
- Redis (for Celery) — local or containerized
Dataset
- Place your dataset under the `testset/` directory at the project root before running evaluations. The repository does not include the dataset by default.
- You can use the provided helper script to split your `testset/` into public and private subsets (the script creates a `data_split/` folder and saves index files):

```bash
python helpers/create_data_split.py
```

The script performs stratified sampling, preserves class directories, and saves `public_idx_...npy` and `private_idx_...npy` files in `data_split/` for reproducibility.
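To illustrate what such a stratified, index-based split looks like, here is a minimal sketch. It is not the repository's script: the per-class directory layout under `testset/`, the 20% public fraction, and the output file names are assumptions for illustration.

```python
# Illustrative stratified public/private split; helpers/create_data_split.py is authoritative.
import os
import numpy as np

def stratified_split(testset_dir="testset", out_dir="data_split",
                     public_fraction=0.2, seed=42):
    rng = np.random.default_rng(seed)
    files, labels = [], []
    # Assumes one subdirectory per class: testset/<class_name>/<sample>
    for cls in sorted(os.listdir(testset_dir)):
        cls_dir = os.path.join(testset_dir, cls)
        if not os.path.isdir(cls_dir):
            continue
        for name in sorted(os.listdir(cls_dir)):
            files.append(os.path.join(cls, name))
            labels.append(cls)

    labels = np.array(labels)
    public_idx, private_idx = [], []
    # Sample a fixed fraction per class so both splits keep the class balance.
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        cut = int(len(cls_idx) * public_fraction)
        public_idx.extend(cls_idx[:cut])
        private_idx.extend(cls_idx[cut:])

    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, "public_idx.npy"), np.array(sorted(public_idx)))
    np.save(os.path.join(out_dir, "private_idx.npy"), np.array(sorted(private_idx)))
    return files  # the saved indices refer to positions in this sorted file list
```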
- Create and activate a virtualenv and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

- Create a `.env` file in the project root with at least the following keys (example):

```env
DISCORD_TOKEN=your_discord_token_here
REDIS_URL=redis://localhost:6379/0
DATABASE_URL=sqlite:///./evaluation.db
```

- Initialize the SQLite DB and seed teams/participants:

```bash
python create_teams.py
```

- Start Redis if you don't already have it running (for example with Docker):

```bash
docker run -d --name redis -p 6379:6379 redis:7
```

- Start a Celery worker (see the Celery section below for recommended flags):

```bash
# from project root
celery -A celery_app.celery worker --loglevel=info -Q default
```

- Run the Discord bot locally:

```bash
python main.py
```

- (Optional) Run the Flask API server for leaderboards:

```bash
python api_server.py
```

Note: when running locally without Dockerized sandboxing, do not run untrusted user code. The repository's Docker-based sandbox is designed for secure execution of third-party models.
There is a docker-compose.yml included to wire up services (Redis, optional db, API) and to help build worker images for sandboxing model runs.
Common commands:
```bash
# Build and start redis + api + any defined services
docker-compose up --build

# Start detached
docker-compose up -d --build

# Stop
docker-compose down
```

Container recommendations:
- Use the provided Dockerfiles to build the evaluation runner image (see `Dockerfile` and `Dockerfile.celery`). The runner image includes the required Python packages and a minimal runtime to execute `run_inference_job.py` inside an isolated container.
Security note: The Docker container used for running inference should mount only the model and the testset artifacts required to produce predictions, never the host root or secrets.
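As a concrete illustration of that constraint, here is a hedged docker-py sketch of how a worker might launch the runner with read-only mounts and resource limits. The image name `evaluation-runner`, the mount paths, and the limit values are assumptions, not the repository's actual configuration.

```python
# Sketch: launching the runner in a locked-down container with docker-py.
# Image name, paths, and limits are illustrative assumptions.
import docker

def run_sandboxed_inference(model_path: str, testset_path: str, output_dir: str) -> bytes:
    client = docker.from_env()
    logs = client.containers.run(
        image="evaluation-runner",            # assumed image built from the provided Dockerfile
        command=["python", "run_inference_job.py",
                 "--model", "/job/model.pth",
                 "--split", "private",
                 "--output", "/job/out"],
        volumes={
            model_path:   {"bind": "/job/model.pth", "mode": "ro"},  # model only, read-only
            testset_path: {"bind": "/job/testset",   "mode": "ro"},  # private testset, read-only
            output_dir:   {"bind": "/job/out",       "mode": "rw"},  # predictions come back here
        },
        network_disabled=True,    # no network access for untrusted code
        mem_limit="4g",           # cap memory
        nano_cpus=2_000_000_000,  # roughly 2 CPUs
        user="1000:1000",         # never run as root inside the container
        remove=True,              # clean up the container afterwards
    )
    return logs
```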
This project uses Celery to manage asynchronous evaluation jobs. Celery configuration is located in celery_app.py. Worker tasks are defined in celery_tasks.py and call the job runner (run_inference_job.py) inside a sandbox.
How tasks flow:
- A user submits a model via Discord. `main.py` uploads/places the model into `models/` and enqueues a Celery task to evaluate it.
- The Celery worker receives the evaluation task and launches a Docker sandbox (or runs a local runner) that executes `run_inference_job.py` with the submission context.
- The runner writes predictions to disk. `scoring.py` is invoked to compute metrics, and the results are saved to the DB and the leaderboard is updated.
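A compressed sketch of what such a task might look like follows; the real tasks live in `celery_tasks.py`, and the task name, output file names, and the commented-out DB helper are hypothetical.

```python
# Hypothetical shape of an evaluation task; see celery_tasks.py for the real one.
import json
import subprocess
from pathlib import Path

from celery_app import celery  # assumes celery_app.py exposes `celery`

@celery.task(bind=True)
def evaluate_submission_task(self, submission_id: int, model_path: str):
    out_dir = Path(f"tmp/job-{self.request.id}")
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. Run inference (shown here with the local runner for brevity;
    #    in production this step launches the Docker sandbox instead).
    subprocess.run(
        ["python", "run_inference_job.py",
         "--model", model_path, "--split", "private", "--output", str(out_dir)],
        check=True, timeout=1800,
    )

    # 2. Score the predictions.
    subprocess.run(
        ["python", "scoring.py",
         "--predictions", str(out_dir / "preds.json"), "--split", "private"],
        check=True,
    )

    # 3. Persist results (hypothetical helper; assumes the scorer wrote metrics.json here).
    metrics = json.loads((out_dir / "metrics.json").read_text())
    # database.operations.save_result(submission_id, metrics)
    return metrics
```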
Starting workers

```bash
# Start a worker
celery -A celery_app.celery worker --loglevel=info -Q default

# Start multiple workers (or use --concurrency=N). Example:
celery -A celery_app.celery worker --loglevel=info -Q default -c 4
```

Scheduling and periodic tasks

If you use periodic scheduling (beat), run:

```bash
celery -A celery_app.celery beat --loglevel=info
# Or run worker + beat in separate terminals/containers
```

Monitoring
- Use Flower (optional) for monitoring tasks: `pip install flower`, then run:

```bash
celery -A celery_app.celery flower --port=5555
```

Redis
Celery requires a broker. The project expects a `REDIS_URL` (or similar) environment variable. The default docker-compose provides Redis; for local development you can run Redis via Docker as shown above.
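For reference, a minimal `celery_app.py` along these lines might look like the sketch below. The actual file may differ; the queue name and serializer settings are assumptions.

```python
# Minimal sketch of a Celery app wired to Redis via REDIS_URL (illustrative).
import os
from celery import Celery

redis_url = os.getenv("REDIS_URL", "redis://localhost:6379/0")

celery = Celery(
    "evaluation",
    broker=redis_url,
    backend=redis_url,  # store task results in Redis as well
)

celery.conf.update(
    task_default_queue="default",  # matches the -Q default flag above
    task_serializer="json",
    accept_content=["json"],
    task_track_started=True,
)
```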
The project uses SQLAlchemy in `database/` and a small operations layer in `database/operations.py` to interact with submissions, teams, and scores. By default the database is SQLite, controlled via the `DATABASE_URL` environment variable.
Bootstrap the DB and seed data:
```bash
python create_teams.py
```

If you move to Postgres for production, update `DATABASE_URL` and adjust `docker-compose.yml` to include a Postgres service, then run normal SQLAlchemy migrations (if you integrate Alembic).
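As a hedged sketch of the kind of session wiring that lives in `database/`: the real models and operations differ, and the `Team` columns below are assumptions made only to keep the example self-contained.

```python
# Illustrative SQLAlchemy wiring driven by DATABASE_URL; see database/ for the real code.
import os
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./evaluation.db")

engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(bind=engine)
Base = declarative_base()

class Team(Base):  # hypothetical model, for illustration only
    __tablename__ = "teams"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)
    best_score = Column(Float, default=0.0)

def init_db():
    """Create tables if they do not exist (create_teams.py seeds them afterwards)."""
    Base.metadata.create_all(engine)
```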
The actual per-submission evaluation happens in run_inference_job.py. It expects arguments describing:
- which model file to load (path under `models/`)
- which dataset split to evaluate (private/public)
- an output directory for predictions

High-level runner contract (inputs/outputs):
- Inputs: model path, dataset split id, device (cpu/cuda, optional), inference timeout
- Outputs: prediction file(s) in a structured format (CSV/JSON/NPY), plus a metrics JSON written by `scoring.py` or the task wrapper
Scoring is implemented in scoring.py. The runner should produce outputs compatible with the scorer.
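The sketch below shows the general shape of such a CLI. The argument names mirror the example invocation further down, while the `--device`/`--timeout` flags and the inline prediction-writing stub are assumptions; the repository's `run_inference_job.py` is authoritative.

```python
# Sketch of the runner's CLI contract (not the repository's actual implementation).
import argparse
import json
from pathlib import Path

def main():
    parser = argparse.ArgumentParser(description="Run inference for one submission")
    parser.add_argument("--model", required=True, help="path to the model artifact under models/")
    parser.add_argument("--split", choices=["public", "private"], default="private")
    parser.add_argument("--output", required=True, help="directory for prediction files")
    parser.add_argument("--device", default="cpu", help="cpu or cuda (assumed optional flag)")
    parser.add_argument("--timeout", type=int, default=1800, help="per-run timeout in seconds")
    args = parser.parse_args()

    out_dir = Path(args.output)
    out_dir.mkdir(parents=True, exist_ok=True)

    # In the real runner, inference.py helpers load the model and iterate the testset split.
    predictions = {}  # e.g. {sample_id: predicted_label}
    (out_dir / "preds.json").write_text(json.dumps(predictions))

if __name__ == "__main__":
    main()
```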
Example (local run for testing):
```bash
python run_inference_job.py --model models/example.pth --split private --output tmp/outdir
python scoring.py --predictions tmp/outdir/preds.json --split private
```

Notes on timeouts and resource limits
- The Celery task wrapper should set a hard timeout for evaluation tasks to avoid long-running or hung jobs.
- In Docker-based sandboxing, enforce CPU/memory limits (via `docker run` flags or compose `deploy.resources`) so that user models cannot exhaust host resources.
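One way to express the hard timeout on the Celery side is through per-task time limits; the sketch below is illustrative, and the limit values and task name are assumptions.

```python
# Illustrative hard/soft time limits on an evaluation task (values are assumptions).
from celery.exceptions import SoftTimeLimitExceeded

from celery_app import celery  # assumes celery_app.py exposes `celery`

@celery.task(time_limit=2100, soft_time_limit=1800)
def evaluate_with_timeout(submission_id: int):
    try:
        # ... run the sandboxed inference and scoring here ...
        pass
    except SoftTimeLimitExceeded:
        # Soft limit hit before the hard kill: record the failure so the
        # submission is not left stuck in a "running" state.
        raise
```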
API:
- `api_server.py` exposes endpoints under `/api/leaderboard/*` to fetch public/private leaderboards and team stats.
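A minimal sketch of what one such endpoint could look like is below. The route follows the `/api/leaderboard/*` convention above, but the handler body is a placeholder, not the actual implementation in `api_server.py`.

```python
# Illustrative Flask endpoint in the style of api_server.py (placeholder data).
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/leaderboard/public")
def public_leaderboard():
    # In the real server this reads teams/scores via database/operations.py.
    rows = [
        {"team": "example-team", "score": 0.91, "submissions": 3},  # placeholder row
    ]
    return jsonify(sorted(rows, key=lambda r: r["score"], reverse=True))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```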
Discord bot:
- `main.py` contains the Discord bot implementation. It registers slash commands such as `/register_participant`, `/evaluate_submission`, and `/leaderboard`.
- The bot should be run with `DISCORD_TOKEN` set in the environment.
Submitting a model via Discord (high-level):
- The user uploads a `.pth` or `.onnx` file and triggers `/evaluate_submission`.
- The bot stores the artifact in `models/` under a unique name and enqueues a Celery evaluation task.
- The user receives updates when the evaluation finishes, along with a link to the leaderboard entry.
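For orientation, here is a hedged discord.py 2.x sketch of that flow. The command signature, the commented-out task call, and the response text are assumptions; `main.py` is the real implementation.

```python
# Illustrative slash command for model submission (main.py is authoritative).
import uuid
from pathlib import Path

import discord
from discord.ext import commands

bot = commands.Bot(command_prefix="!", intents=discord.Intents.default())

@bot.tree.command(name="evaluate_submission", description="Submit a model for evaluation")
async def evaluate_submission(interaction: discord.Interaction, model_file: discord.Attachment):
    await interaction.response.defer(ephemeral=True)

    # Store the artifact under models/ with a unique name.
    dest = Path("models") / f"{uuid.uuid4().hex}_{model_file.filename}"
    dest.parent.mkdir(exist_ok=True)
    await model_file.save(dest)

    # Enqueue the Celery evaluation task (hypothetical task name):
    # evaluate_submission_task.delay(submission_id, str(dest))

    await interaction.followup.send(f"Submission received; evaluation queued for `{dest.name}`.")

# bot.run(os.environ["DISCORD_TOKEN"])  # started from main.py with DISCORD_TOKEN set
```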
There are a few tests under tests/ (for example tests/test_parser.py). Run them with pytest:
```bash
pytest -q
```

Add tests for new functionality, especially for scoring and runner I/O, and for the API endpoints.
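As a template for the kind of unit test worth adding around scoring logic, here is a small, self-contained example. It deliberately does not import the repository's `scoring.py`; the accuracy helper is defined inline for illustration.

```python
# tests/test_scoring_example.py (illustrative; not tied to the real scoring.py API)
def accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of IDs whose predicted label matches the ground-truth label."""
    correct = sum(1 for k, v in ground_truth.items() if predictions.get(k) == v)
    return correct / len(ground_truth)

def test_accuracy_matches_expected_fraction():
    preds = {"img_1": "cat", "img_2": "dog", "img_3": "cat"}
    truth = {"img_1": "cat", "img_2": "cat", "img_3": "cat"}
    assert accuracy(preds, truth) == 2 / 3
```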
Small-scale production deployment suggestions:
- Use Docker Compose or Kubernetes. For higher scale prefer k8s.
- Use a managed Redis (or a resilient cluster) as Celery broker.
- Use Postgres for DB in production and run migrations with Alembic.
- Run Celery workers with autoscaling (K8s HPA or a horizontal worker autoscaler) depending on queue depth.
- Run the Discord bot in a separate deployment and add health checks.
- Secure the API (authentication/authorization) if exposing beyond internal usage.
Example docker-compose production snippet (conceptual):
```yaml
services:
  redis:
    image: redis:7
  api:
    build: .
    command: python api_server.py
    environment:
      - DATABASE_URL=${DATABASE_URL}
  worker:
    build: .
    command: celery -A celery_app.celery worker --loglevel=info
    environment:
      - REDIS_URL=${REDIS_URL}
```

Troubleshooting

- If Celery tasks are not running: confirm Redis is reachable and that the worker was started with the correct app (`-A` flag) and queue.
- If the Docker runner cannot access the model/testset: ensure proper mounts in `docker run` or compose volumes, and avoid mounting any sensitive host path.
- If scoring results don't match expectations: check that the predictions format matches what `scoring.py` expects (IDs, ordering, label format).
Logs
- Worker logs: stdout where Celery worker runs.
- API logs: stdout of `api_server.py`.
- Runner logs: captured by the task wrapper; ensure each run writes a run-specific log file under `tmp/job-<uuid>/`.
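A minimal sketch of per-job logging of this kind is shown below. The `tmp/job-<uuid>/` layout follows the convention above; the logger name and format are assumptions rather than the repository's actual wrapper.

```python
# Illustrative per-job logger writing under tmp/job-<uuid>/ (not the repo's actual wrapper).
import logging
import uuid
from pathlib import Path
from typing import Optional

def make_job_logger(job_id: Optional[str] = None) -> logging.Logger:
    job_id = job_id or uuid.uuid4().hex
    job_dir = Path("tmp") / f"job-{job_id}"
    job_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"runner.{job_id}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(job_dir / "run.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```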
- `main.py`: Discord bot
- `api_server.py`: Flask API
- `celery_app.py`, `celery_tasks.py`: Celery config + tasks
- `run_inference_job.py`: runner invoked by tasks
- `scoring.py`: evaluation metrics
- `inference.py`: inference utilities
- `create_teams.py`: DB seeding
- `database/`: models, operations
- `models/`: uploaded model artifacts
- `testset/`: private dataset (not to be shared)
- `tmp/`: job-specific temporary outputs and logs
- Always run user-submitted code in an isolated environment (Docker container with strict resource limits and no host mounts except necessary inputs/outputs).
- Never run untrusted code as root inside containers.
- Validate model artifacts and uploaded files (size limits, file type checks) before enqueuing tasks.
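A hedged sketch of such a pre-enqueue check follows; the allowed extensions come from the submission flow above, while the size cap is an assumption to adjust to your competition rules.

```python
# Illustrative validation of an uploaded artifact before enqueuing (size cap is an assumption).
from pathlib import Path

ALLOWED_EXTENSIONS = {".pth", ".onnx"}
MAX_SIZE_BYTES = 500 * 1024 * 1024  # 500 MB cap; tune to your rules

def validate_artifact(path: str) -> None:
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {p.suffix}")
    if p.stat().st_size > MAX_SIZE_BYTES:
        raise ValueError("Model artifact exceeds the size limit")
    # Only enqueue the Celery evaluation task once these checks pass.
```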
- Fork, create a feature branch, add tests for new behavior, and open a pull request.
- Keep changes small, and add documentation updates in `README.md` or the `docs/` directory if one is created.
- Add CI (GitHub Actions) to run tests and flake/lint on PRs.
- Add Alembic migrations if you move to an RDBMS other than SQLite.
- Add a minimal integration test that runs a small model inside a container to validate the runner + scoring pipeline.
If you need help or want to extend the project, open an issue on the repo with details about your use-case and environment.