Traditional educational content is linear text, but learning is a network of interconnected concepts. K2-18 bridges this gap by automatically extracting the hidden knowledge structure from textbooks and educational materials.
The Problem: Educational platforms need structured content for adaptive learning, prerequisite tracking, and personalized paths, but manually creating this structure is prohibitively expensive.
The Solution: K2-18 automatically converts any educational text into a semantic knowledge graph with:
- Extracted concepts with definitions and relationships
- Learning dependencies (what to learn first)
- Difficulty levels and assessment points
- Semantic connections between distant but related topics
The pipeline produces two main outputs:
- ConceptDictionary - comprehensive vocabulary of all concepts with definitions, aliases, and cross-references
- LearningChunkGraph - semantic graph connecting content chunks, concepts, and assessments with typed relationships
K2-18 implements the iText2KG (Incremental Text to Knowledge Graph) approach - incremental knowledge graph construction from text, designed to work within LLM context window limitations.
Raw Content (.md, .txt, .html)
↓
1. Slicer → Semantic Chunks (respecting paragraph boundaries)
↓
2. iText2KG Concepts → Concept Dictionary (with all concepts extracted)
↓
3. iText2KG Graph → Knowledge Graph (using Concept Dictionary)
↓
4. Dedup → Knowledge Graph (with semantic duplicates removed)
↓
5. Refiner Longrange → Knowledge Graph (with long-range connections added)
After running the pipeline, you can:
- Curriculum Analysis: Identify knowledge gaps and redundancies
- Build Learning Paths: Find optimal prerequisite chains between topics (see the sketch after this list)
- Adaptive Learning: Power recommendation systems with concept dependencies
- Content Quality: Detect missing prerequisites or circular dependencies
- Import into Neo4j: Use the JSON graph directly with Neo4j's import tools (see Neo4j documentation)
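For instance, the final graph can be loaded into networkx for path and cycle analysis. A minimal sketch, with caveats: networkx is not a stated K2-18 dependency, and the field names ("nodes", "edges", "id", "source", "target") plus the node ids are assumptions for illustration; LearningChunkGraph.schema.json defines the real layout.
import json
import networkx as nx  # third-party; not a stated K2-18 dependency

with open("data/out/LearningChunkGraph_longrange.json") as f:
    data = json.load(f)

g = nx.DiGraph()
for node in data["nodes"]:  # top-level keys assumed from typical graph JSON
    g.add_node(node["id"])
for edge in data["edges"]:
    g.add_edge(edge["source"], edge["target"], type=edge.get("type"))

# A prerequisite chain between two topics is a path in the directed graph
print(nx.shortest_path(g, "intro_chunk", "advanced_chunk"))  # hypothetical node ids

# Circular dependencies surface as cycles
print(list(nx.simple_cycles(g))[:5])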
- Incremental Processing: Handles book-length materials by processing them in chunks
- Context Preservation: Maintains semantic continuity across chunk boundaries
- Smart Deduplication: Uses embeddings to identify and merge semantically identical content (see the sketch after this list)
- Long-range Connections: Discovers relationships between concepts separated by many pages (forward/backward pass)
- Language Support: Any UTF-8 text content
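The deduplication idea in miniature: this is not the actual src.dedup code, only the core comparison it is built on, and the embedding vectors here are random stand-ins for real OpenAI embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity of two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.random.rand(1536)  # stand-in for a real chunk embedding
emb_b = np.random.rand(1536)

# Chunks whose similarity clears sim_threshold (0.85 in [dedup]) are merge candidates
if cosine_similarity(emb_a, emb_b) >= 0.85:
    print("semantic duplicate - merge")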
K2-18 includes a powerful visualization module that enriches your knowledge graphs with educational metrics and creates interactive tools for exploration.
What it does:
- Computes 12 network metrics revealing graph structure and learning paths
- Identifies fundamental concepts, knowledge bridges, and topic clusters
- Generates two complementary HTML tools:
- Interactive graph - visual exploration with Cytoscape.js
- Detailed viewer - three-column interface for methodical analysis
Perfect for quality control, curriculum analysis, and presenting results to stakeholders. Both tools work standalone in any browser.
For detailed documentation, see Visualization Module Guide.
- Python 3.11+
- OpenAI API access (Responses/Embeddings API)
- Memory: Not tested (entire corpus processed in memory)
- OS: Windows, macOS
# Clone the repository
git clone https://github.com/yourusername/k2-18.git
cd k2-18
# Create virtual environment
python -m venv .venv
# Activate it (choose your platform):
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key" # Linux/macOS
set OPENAI_API_KEY=your-api-key # Windows
# Same initial setup as above, then:
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/
1. Prepare content: place your educational materials in data/raw/. Supported formats: .md, .txt, .html. All content must be merged into one file.
2. Configure processing (optional): edit src/config.toml to adjust parameters like chunk size, timeouts, and model selection.
3. Run the pipeline:
# Step-by-step processing
python -m src.slicer # Split into chunks
python -m src.itext2kg_concepts # Extract concepts
python -m src.itext2kg_graph # Build knowledge graph
python -m src.dedup # Remove duplicates if any
python -m src.refiner_longrange # Add distant connections
4. Find your results:
data/out/
├── ConceptDictionary.json # All extracted concepts
├── LearningChunkGraph_raw.json # Initial graph
├── LearningChunkGraph_dedup.json # After deduplication
└── LearningChunkGraph_longrange.json # Final graph
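To sanity-check a run, the outputs are plain JSON. A quick peek at the concept dictionary; note the top-level "concepts" key is an assumption for this sketch, since ConceptDictionary.schema.json defines the real structure.
import json

with open("data/out/ConceptDictionary.json") as f:
    data = json.load(f)

print(f"{len(data['concepts'])} concepts extracted")  # key name assumed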
Main settings in src/config.toml:
[slicer]
max_tokens = 5000 # Chunk size in tokens
soft_boundary = true # Respect semantic boundaries
[itext2kg]
model = "..." # OpenAI model selection
tpm_limit = 150000 # API rate limit (tokens/minute) based on your Tier
max_output_tokens = 25000 # Max response size
[dedup]
sim_threshold = 0.85 # Similarity threshold for duplicates
[refiner]
run = true # Enable/disable refiner stage
sim_threshold = 0.7 # Threshold for new connections
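Since the project requires Python 3.11+, the config can also be read programmatically with the standard-library tomllib; a minimal sketch:
import tomllib  # standard library in Python 3.11+

with open("src/config.toml", "rb") as f:  # tomllib requires binary mode
    config = tomllib.load(f)

print(config["slicer"]["max_tokens"])    # e.g. 5000
print(config["dedup"]["sim_threshold"])  # e.g. 0.85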
All data formats are defined by JSON schemas in /src/schemas/:
- ConceptDictionary.schema.json - concept vocabulary structure
- LearningChunkGraph.schema.json - knowledge graph structure
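Pipeline outputs can be validated against these schemas, for example with the third-party jsonschema package (an assumption; it is not listed here as a project dependency):
import json
from jsonschema import validate  # pip install jsonschema

with open("src/schemas/LearningChunkGraph.schema.json") as f:
    schema = json.load(f)
with open("data/out/LearningChunkGraph_longrange.json") as f:
    graph = json.load(f)

validate(instance=graph, schema=schema)  # raises ValidationError on mismatch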
- Component specifications: /docs/specs/
  - Pipeline utilities: cli_slicer.md, cli_itext2kg_concepts.md, cli_itext2kg_graph.md, cli_dedup.md, cli_refiner_longrange.md
  - Core modules: util_llm_client.md, util_config.md, util_validation.md, util_tokenizer.md, etc.
- Data schemas: /src/schemas/
  - ConceptDictionary.schema.json - concept vocabulary structure
  - LearningChunkGraph.schema.json - knowledge graph structure
- LLM prompts: /src/prompts/
  - itext2kg_concepts_extraction.md - concept extraction from text
  - itext2kg_graph_extraction.md - knowledge graph construction
  - refiner_longrange_fw.md - forward pass for long-range connections
  - refiner_longrange_bw.md - backward pass for long-range connections
- Memory-bound: Entire corpus processed in memory
- Sequential: No parallel processing (to maintain context/TPM limits)
- API-dependent: Requires stable OpenAI API access
- Token limits: Constrained by LLM context windows
- Check your OpenAI API Tier limits
- Adjust tpm_limit in config
- Pipeline will auto-retry with backoff (see the sketch after this list)
- Check exit codes and logs in /logs/
- Use previous_response_id for context continuity
- Use timeout and max_retries to manage retries
- Utilities do not support resuming from last successful slice
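The auto-retry behavior described above follows the standard exponential-backoff pattern. As a generic illustration only (this is not the actual util_llm_client implementation):
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(request: Callable[[], T], max_retries: int = 3, base_delay: float = 1.0) -> T:
    # Retry a callable, doubling the delay (plus jitter) after each failure
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("unreachable")  # satisfies type checkers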
- Follow TDD approach - write tests first (see the sketch after this list)
- All functions must have type hints
- Run quality checks before commits
- Update relevant specifications in /docs/specs/
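A hypothetical illustration of the test-first, fully type-hinted style expected here (function and test names are invented for the example, not real K2-18 code):
# Written before the implementation, per TDD
def test_normalize_alias_lowercases() -> None:
    assert normalize_alias("  Graph Theory ") == "graph theory"

# Minimal implementation that makes the test pass
def normalize_alias(alias: str) -> str:
    return alias.strip().lower()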
# Format code
black src/
isort src/
# Check quality
ruff check src/
flake8 src/
mypy src/
# Activate virtual environment first
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
# Run all tests but integration (fast!)
pytest tests/ -v -m "not integration"
# Full test suite incl. integration (~5-7 min; requires an OpenAI API key)
pytest tests/ -v
Test markers:
- integration - Tests requiring real API calls (see the sketch after this list)
- slow - Tests taking >30 seconds
- timeout - Tests with explicit timeout settings
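Markers are applied with the standard pytest decorator; for instance (test name hypothetical):
import pytest

@pytest.mark.integration
def test_concept_extraction_live_api() -> None:
    ...  # hits the real OpenAI API; excluded by -m "not integration"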
Copyright (c) 2025 Askold Romanov
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Check /docs/specs/ for detailed component documentation
- Review logs in /logs/ for debugging
- Open a GitHub Issue