ml-jku/moleculariq-core

MolecularIQ Core

The complete library for molecular reasoning: question generation, property computation, and answer evaluation

License: MIT | Python 3.9+ | RDKit | Tests

Everything you need for molecular reasoning benchmarks in one package

Installation | Quick Start | API Overview | MolecularIQ Family


Overview

MolecularIQ Core is the central library for the MolecularIQ benchmark ecosystem:

  • MolecularIQD: High-level API for dynamic question generation and evaluation
  • Molecule Pools: Access training molecules (validation pools hidden to prevent leakage)
  • SymbolicSolver: Compute 100+ molecular properties from SMILES
  • Reward Functions: Evaluate model predictions against ground truth
  • NaturalLanguageFormatter: Convert between technical keys and natural language

Installation

# From GitHub
pip install git+https://github.com/ml-jku/moleculariq-core.git

# From source (for development)
git clone https://github.com/ml-jku/moleculariq-core.git
cd moleculariq-core
pip install -e ".[dev]"

Requirements: Python 3.9+ and RDKit

Quick Start

The Easy Way: MolecularIQD

For most users, MolecularIQD is the recommended entry point:

from moleculariq_core import MolecularIQD

mqd = MolecularIQD(seed=42)

# Generate a count question
question, answer, metadata = mqd.generate_count_question(
    smiles="c1ccccc1",
    count_properties="aromatic_ring_count"
)
print(question)
# "How many aromatic rings are in c1ccccc1? Return the result as JSON with key `aromatic_ring_count`."
print(answer)
# {'aromatic_ring_count': 1}

# Generate an index question
question, answer, metadata = mqd.generate_index_question(
    smiles="CCO",
    index_properties="carbon_atom_index"
)
print(answer)
# {'carbon_atom_index': [0, 1]}

# Validate predictions
score = mqd.validate_count_answer("c1ccccc1", {"aromatic_ring_count": 1})
# 1.0

# Generate a constraint question (molecule generation task)
question, metadata = mqd.generate_constraint_question(
    constraints=[{"property": "ring_count", "operator": ">=", "value": 2}]
)
# "Generate a molecule with at least 2 rings..."

# Validate a generated molecule against constraints
score = mqd.validate_constraint_answer(
    "c1ccc2ccccc2c1",  # Naphthalene (2 fused rings)
    metadata["constraints"]
)
# 1.0
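
To make the constraint format concrete, the sketch below checks a constraint list against property values that have already been computed. It is a minimal illustration, not the library's implementation: `satisfies_constraints` and the `OPS` table are hypothetical names, and only the `>=` operator is confirmed by the example above; the other operator strings are assumptions.

```python
import operator

# Comparison functions keyed by the operator strings used in constraint dicts.
# Only ">=" appears in the README; the remaining entries are assumed.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def satisfies_constraints(properties, constraints):
    """Return True if every constraint holds for the given property values.

    `properties` maps property names to values computed beforehand.
    """
    for c in constraints:
        value = properties.get(c["property"])
        if value is None:
            return False  # property missing: treat as not satisfied
        if not OPS[c["operator"]](value, c["value"]):
            return False
    return True


constraints = [{"property": "ring_count", "operator": ">=", "value": 2}]

# Hand-entered property values: naphthalene has 2 rings, benzene has 1.
print(satisfies_constraints({"ring_count": 2}, constraints))  # True
print(satisfies_constraints({"ring_count": 1}, constraints))  # False
```

In practice the property values would come from the library, for example via `compute_property()` or `SymbolicSolver`, rather than being entered by hand.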

Load Training Molecules

from moleculariq_core import load_molecule_pool

# Load training pool for development
train_smiles = load_molecule_pool("train")
print(f"Loaded {len(train_smiles)} training molecules")

# Validation pools are hidden to prevent data leakage
load_molecule_pool("val_hard")  # Raises MoleculePoolHiddenError

Training Loop Example

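import random

from moleculariq_core import MolecularIQD, load_molecule_pool

mqd = MolecularIQD(seed=42)
train_smiles = load_molecule_pool("train")

num_steps = 1000  # placeholder: number of training steps


def model(question):
    """Placeholder for your model; should return its answer to the question."""
    raise NotImplementedError


for step in range(num_steps):
    # Sample one training molecule and build a ring-count question for it
    smiles = random.choice(train_smiles)
    question, answer, _ = mqd.generate_count_question(smiles, "ring_count")

    prediction = model(question)
    reward = mqd.validate_count_answer(smiles, prediction, answer)
    # ... update model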
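
Generated questions ask for the result as JSON, so a model that returns free text needs its answer parsed before validation. The helper below is one way to do that, assuming the validator expects a dict shaped like the `answer` values shown in Quick Start; `extract_json_answer` is a hypothetical name and not part of `moleculariq_core`.

```python
import json
import re


def extract_json_answer(text):
    """Return the last flat JSON object found in `text` as a dict, or None."""
    # Match {...} spans that contain no inner braces; nested objects
    # are deliberately not handled in this sketch.
    candidates = re.findall(r"\{[^{}]*\}", text)
    for candidate in reversed(candidates):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the previous candidate
    return None


raw = 'Benzene has one ring, so the answer is {"ring_count": 1}.'
print(extract_json_answer(raw))        # {'ring_count': 1}
print(extract_json_answer("no json"))  # None
```

Taking the last object in the text is a design choice: models often restate the question or show intermediate work before giving a final answer.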

Low-Level Access (Advanced)

For custom pipelines, access the primitives directly:

from moleculariq_core import SymbolicSolver, evaluate_answer

solver = SymbolicSolver()
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin

# Compute properties
solver.get_ring_count(smiles)           # 1
solver.get_aromatic_ring_count(smiles)  # 1
solver.get_carbon_atom_count(smiles)    # 9

# Evaluate predictions
score = evaluate_answer(
    task_type="single_count",
    predicted={"ring_count": 1},
    target={"ring_count": 1}
)  # 1.0
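
As a rough mental model of what a reward function does, the sketch below scores a prediction with an all-or-nothing exact match, comparing index lists without regard to order. This is an assumption made for illustration: `exact_match_reward` is a hypothetical helper, and the scoring rules actually used by `evaluate_answer` may differ (for example in how partial matches are handled).

```python
def exact_match_reward(predicted, target):
    """Return 1.0 if `predicted` matches `target` on every key, else 0.0.

    List values (atom indices) are compared as sets, so order is ignored.
    """
    if set(predicted) != set(target):
        return 0.0  # missing or extra keys
    for key, expected in target.items():
        got = predicted[key]
        if isinstance(expected, list):
            if not isinstance(got, list) or set(got) != set(expected):
                return 0.0
        elif got != expected:
            return 0.0
    return 1.0


print(exact_match_reward({"ring_count": 1}, {"ring_count": 1}))  # 1.0
print(exact_match_reward({"carbon_atom_index": [1, 0]},
                         {"carbon_atom_index": [0, 1]}))         # 1.0
print(exact_match_reward({"ring_count": 2}, {"ring_count": 1}))  # 0.0
```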

API Overview

High-Level API

| Component | Description |
|---|---|
| MolecularIQD | All-in-one class for question generation and evaluation |
| load_molecule_pool | Load training molecules from HuggingFace |

MolecularIQD Methods

| Method | Description |
|---|---|
| generate_count_question() | Generate counting questions |
| generate_index_question() | Generate atom index identification questions |
| generate_constraint_question() | Generate molecule generation questions |
| validate_count_answer() | Validate count predictions |
| validate_index_answer() | Validate index predictions |
| validate_constraint_answer() | Validate generated molecules |
| generate_paired_question() | Generate matched count/index pairs |
| compute_property() | Compute any molecular property |

Low-Level Primitives

| Module | Description |
|---|---|
| SymbolicSolver | Compute 100+ molecular properties from SMILES |
| FunctionalGroupSolver | SMARTS-based functional group detection |
| NaturalLanguageFormatter | Bidirectional NL ↔ technical key conversion |
| evaluate_answer | Unified reward function dispatcher |

Property Categories

| Category | Examples |
|---|---|
| Topology | ring_count, fused_ring_count, bridgehead_atom_count |
| Aromaticity | aromatic_ring_count, aliphatic_ring_count |
| Atoms | carbon_atom_count, hetero_atom_count, halogen_atom_count |
| Bonds | rotatable_bond_count, double_bond_count |
| H-bonding | hba_count, hbd_count |
| Stereochemistry | stereocenter_count, r_s_stereocenter_r_count |
| Functional Groups | 60+ groups (alcohol, ketone, amine, etc.) |
| Reactions | 40+ templates (bromination, oxidation, etc.) |

Task Types

| Task | Output | Description |
|---|---|---|
| single_count | INTEGER | Count one property |
| multi_count | DICT | Count multiple properties |
| single_index | LIST | Identify atom indices for one property |
| multi_index | DICT | Identify indices for multiple properties |
| constraint_generation | SMILES | Generate molecule satisfying constraints |

MolecularIQ Family

This package is part of the MolecularIQ ecosystem:

| Repository | Purpose |
|---|---|
| moleculariq | Central hub for the MolecularIQ benchmark ecosystem |
| moleculariq-leaderboard | Leaderboard: HuggingFace space for results and submissions |
| moleculariq-core | Core library: question generation, property computation, evaluation, molecule pools |
| moleculariq-benchmark | Dataset creation pipeline |
| moleculariq-eval | Evaluation code: lm-eval-harness integration |

License

MIT License - see LICENSE for details.

