Traditional educational content is linear text, but learning is a network of interconnected concepts. K2-18 bridges this gap by automatically extracting the hidden knowledge structure from textbooks and educational materials.
The Problem: Educational platforms need structured content for adaptive learning, prerequisite tracking, and personalized paths, but manually creating this structure is prohibitively expensive.
The Solution: K2-18 automatically converts any educational text into a semantic knowledge graph with:
- Extracted concepts with definitions and relationships
- Learning dependencies (what to learn first)
- Difficulty levels and assessment points
- Semantic connections between distant but related topics
The pipeline produces two main outputs:
- ConceptDictionary - comprehensive vocabulary of all concepts with definitions, aliases, and cross-references
- LearningChunkGraph - semantic graph connecting content chunks, concepts, and assessments with typed relationships
K2-18 implements the iText2KG (Incremental Text to Knowledge Graph) approach - incremental knowledge graph construction from text, designed to work within LLM context window limitations.
Raw Content (.md, .txt, .html)
↓
1. Slicer → Semantic Chunks (respecting paragraph boundaries)
↓
2. iText2KG Concepts → Concept Dictionary (with all concepts extracted)
↓
3. iText2KG Graph → Knowledge Graph (using Concept Dictionary)
↓
4. Dedup → Knowledge Graph (with semantic duplicates removed)
↓
5. Refiner Longrange → Knowledge Graph (with long-range connections added)
After running the pipeline, you can:
- Curriculum Analysis: Identify knowledge gaps and redundancies
- Build Learning Paths: Find optimal prerequisite chains between topics (see the sketch after this list)
- Adaptive Learning: Power recommendation systems with concept dependencies
- Content Quality: Detect missing prerequisites or circular dependencies
- Import into Neo4j: Use the JSON graph directly with Neo4j's import tools (see Neo4j documentation)
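For instance, the final graph can be loaded into networkx for path and cycle analysis. A minimal sketch, with caveats: networkx is not a stated K2-18 dependency, and the field names ("nodes", "edges", "id", "source", "target") plus the node ids are assumptions for illustration; LearningChunkGraph.schema.json defines the real layout.
import json
import networkx as nx  # third-party; not a stated K2-18 dependency

with open("data/out/LearningChunkGraph_longrange.json") as f:
    data = json.load(f)

g = nx.DiGraph()
for node in data["nodes"]:  # top-level keys assumed from typical graph JSON
    g.add_node(node["id"])
for edge in data["edges"]:
    g.add_edge(edge["source"], edge["target"], type=edge.get("type"))

# A prerequisite chain between two topics is a path in the directed graph
print(nx.shortest_path(g, "intro_chunk", "advanced_chunk"))  # hypothetical node ids

# Circular dependencies surface as cycles
print(list(nx.simple_cycles(g))[:5])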
- Incremental Processing: Handles book-length materials by processing them in chunks
- Context Preservation: Maintains semantic continuity across chunk boundaries
- Smart Deduplication: Uses embeddings to identify and merge semantically identical content (see the sketch after this list)
- Long-range Connections: Discovers relationships between concepts separated by many pages (forward/backward pass)
- Language Support: Any UTF-8 text content
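The deduplication idea in miniature: this is not the actual src.dedup code, only the core comparison it is built on, and the embedding vectors here are random stand-ins for real OpenAI embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity of two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.random.rand(1536)  # stand-in for a real chunk embedding
emb_b = np.random.rand(1536)

# Chunks whose similarity clears sim_threshold (0.85 in [dedup]) are merge candidates
if cosine_similarity(emb_a, emb_b) >= 0.85:
    print("semantic duplicate - merge")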
K2-18 includes a powerful visualization module that enriches your knowledge graphs with educational metrics and creates interactive tools for exploration.
What it does:
- Computes 12 network metrics revealing graph structure and learning paths
- Identifies fundamental concepts, knowledge bridges, and topic clusters
- Generates two complementary HTML tools:
- Interactive graph - visual exploration with Cytoscape.js
- Detailed viewer - three-column interface for methodical analysis
Perfect for quality control, curriculum analysis, and presenting results to stakeholders. Both tools work standalone in any browser.
For detailed documentation, see Visualization Module Guide.
- Python 3.11+
- OpenAI API access (Responses/Embeddings API)
- Memory: Not tested (entire corpus processed in memory)
- OS: Windows, macOS
# Clone the repository
git clone https://github.com/yourusername/k2-18.git
cd k2-18
# Create virtual environment
python -m venv .venv
# Activate it (choose your platform):
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key" # Linux/macOS
set OPENAI_API_KEY=your-api-key # Windows
# Same initial setup as above, then:
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/
1. Prepare content: place your educational materials in data/raw/. Supported formats: .md, .txt, .html. All content must be merged into one file.
2. Configure processing (optional): edit src/config.toml to adjust parameters like chunk size, timeouts, and model selection.
3. Run the pipeline:
# Step-by-step processing
python -m src.slicer # Split into chunks
python -m src.itext2kg_concepts # Extract concepts
python -m src.itext2kg_graph # Build knowledge graph
python -m src.dedup # Remove duplicates if any
python -m src.refiner_longrange # Add distant connections
4. Find your results:
data/out/
├── ConceptDictionary.json # All extracted concepts
├── LearningChunkGraph_raw.json # Initial graph
├── LearningChunkGraph_dedup.json # After deduplication
└── LearningChunkGraph_longrange.json # Final graph
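To sanity-check a run, the outputs are plain JSON. A quick peek at the concept dictionary; note the top-level "concepts" key is an assumption for this sketch, since ConceptDictionary.schema.json defines the real structure.
import json

with open("data/out/ConceptDictionary.json") as f:
    data = json.load(f)

print(f"{len(data['concepts'])} concepts extracted")  # key name assumed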
Main settings in src/config.toml:
[slicer]
max_tokens = 5000 # Chunk size in tokens
soft_boundary = true # Respect semantic boundaries
[itext2kg]
model = "..." # OpenAI model selection
tpm_limit = 150000 # API rate limit (tokens/minute) based on your Tier
max_output_tokens = 25000 # Max response size
[dedup]
sim_threshold = 0.85 # Similarity threshold for duplicates
[refiner]
run = true # Enable/disable refiner stage
sim_threshold = 0.7 # Threshold for new connections
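Since the project requires Python 3.11+, the config can also be read programmatically with the standard-library tomllib; a minimal sketch:
import tomllib  # standard library in Python 3.11+

with open("src/config.toml", "rb") as f:  # tomllib requires binary mode
    config = tomllib.load(f)

print(config["slicer"]["max_tokens"])    # e.g. 5000
print(config["dedup"]["sim_threshold"])  # e.g. 0.85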
All data formats are defined by JSON schemas in /src/schemas/:
- ConceptDictionary.schema.json - concept vocabulary structure
- LearningChunkGraph.schema.json - knowledge graph structure
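Pipeline outputs can be validated against these schemas, for example with the third-party jsonschema package (an assumption; it is not listed here as a project dependency):
import json
from jsonschema import validate  # pip install jsonschema

with open("src/schemas/LearningChunkGraph.schema.json") as f:
    schema = json.load(f)
with open("data/out/LearningChunkGraph_longrange.json") as f:
    graph = json.load(f)

validate(instance=graph, schema=schema)  # raises ValidationError on mismatch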
- Component specifications: /docs/specs/
  - Pipeline utilities: cli_slicer.md, cli_itext2kg_concepts.md, cli_itext2kg_graph.md, cli_dedup.md, cli_refiner_longrange.md
  - Core modules: util_llm_client.md, util_config.md, util_validation.md, util_tokenizer.md, etc.
- Data schemas: /src/schemas/
  - ConceptDictionary.schema.json - concept vocabulary structure
  - LearningChunkGraph.schema.json - knowledge graph structure
- LLM prompts: /src/prompts/
  - itext2kg_concepts_extraction.md - concept extraction from text
  - itext2kg_graph_extraction.md - knowledge graph construction
  - refiner_longrange_fw.md - forward pass for long-range connections
  - refiner_longrange_bw.md - backward pass for long-range connections
- Memory-bound: Entire corpus processed in memory
- Sequential: No parallel processing (to maintain context/TPM limits)
- API-dependent: Requires stable OpenAI API access
- Token limits: Constrained by LLM context windows
- Check your OpenAI API Tier limits
- Adjust tpm_limit in config
- Pipeline will auto-retry with backoff (see the sketch after this list)
- Check exit codes and logs in /logs/
- Use previous_response_id for context continuity
- Use timeout and max_retries to manage retries
- Utilities do not support resuming from last successful slice
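The auto-retry behavior described above follows the standard exponential-backoff pattern. As a generic illustration only (this is not the actual util_llm_client implementation):
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(request: Callable[[], T], max_retries: int = 3, base_delay: float = 1.0) -> T:
    # Retry a callable, doubling the delay (plus jitter) after each failure
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("unreachable")  # satisfies type checkers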
- Follow TDD approach - write tests first (see the sketch after this list)
- All functions must have type hints
- Run quality checks before commits
- Update relevant specifications in /docs/specs/
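A hypothetical illustration of the test-first, fully type-hinted style expected here (function and test names are invented for the example, not real K2-18 code):
# Written before the implementation, per TDD
def test_normalize_alias_lowercases() -> None:
    assert normalize_alias("  Graph Theory ") == "graph theory"

# Minimal implementation that makes the test pass
def normalize_alias(alias: str) -> str:
    return alias.strip().lower()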
# Format code
black src/
isort src/
# Check quality
ruff check src/
flake8 src/
mypy src/
# Activate virtual environment first
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
# Run all tests but integration (fast!)
pytest tests/ -v -m "not integration"
# Full test suite incl. integration (~5-7 min; requires an OpenAI API key)
pytest tests/ -v
Test markers:
- integration - Tests requiring real API calls (see the sketch after this list)
- slow - Tests taking >30 seconds
- timeout - Tests with explicit timeout settings
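Markers are applied with the standard pytest decorator; for instance (test name hypothetical):
import pytest

@pytest.mark.integration
def test_concept_extraction_live_api() -> None:
    ...  # hits the real OpenAI API; excluded by -m "not integration"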
Copyright (c) 2025 Askold Romanov
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Check /docs/specs/ for detailed component documentation
- Review logs in /logs/ for debugging
- Open a GitHub Issue