K2-18 - Educational Knowledge Graph Converter


Why K2-18?

Traditional educational content is linear text, but learning is a network of interconnected concepts. K2-18 bridges this gap by automatically extracting the hidden knowledge structure from textbooks and educational materials.

The Problem: Educational platforms need structured content for adaptive learning, prerequisite tracking, and personalized paths, but manually creating this structure is prohibitively expensive.

The Solution: K2-18 automatically converts any educational text into a semantic knowledge graph with:

  • Extracted concepts with definitions and relationships
  • Learning dependencies (what to learn first)
  • Difficulty levels and assessment points
  • Semantic connections between distant but related topics

What You Get

The pipeline produces two main outputs:

  • ConceptDictionary - comprehensive vocabulary of all concepts with definitions, aliases, and cross-references
  • LearningChunkGraph - semantic graph connecting content chunks, concepts, and assessments with typed relationships

Architecture

K2-18 implements the iText2KG (Incremental Text to Knowledge Graph) approach: the knowledge graph is built incrementally from text, so the pipeline stays within LLM context window limitations.

Processing Pipeline

Raw Content (.md, .txt, .html)
    ↓
1. Slicer             → Semantic Chunks (respecting paragraph boundaries)
    ↓
2. iText2KG Concepts  → Concept Dictionary (with all concepts extracted)
    ↓
3. iText2KG Graph     → Knowledge Graph (using Concept Dictionary)
    ↓
4. Dedup              → Knowledge Graph (with semantic duplicates removed)
    ↓
5. Refiner Longrange  → Knowledge Graph (with long-range connections added)
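
The five stages above can be chained with a small driver script. A minimal sketch, assuming each stage is runnable via `python -m` (as in the Quick Start) and signals failure with a nonzero exit code; the `dry_run` flag is purely illustrative and not a real pipeline option:

```python
# Sketch of running the pipeline stages in order, stopping at the first failure.
import subprocess
import sys

STAGES = [
    "src.slicer",
    "src.itext2kg_concepts",
    "src.itext2kg_graph",
    "src.dedup",
    "src.refiner_longrange",
]

def run_pipeline(stages=STAGES, dry_run=False):
    """Run each stage sequentially; stop at the first nonzero exit code."""
    completed = []
    for stage in stages:
        if not dry_run:
            result = subprocess.run([sys.executable, "-m", stage])
            if result.returncode != 0:
                break
        completed.append(stage)
    return completed
```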

Use Cases

After running the pipeline, you can:

  • Curriculum Analysis: Identify knowledge gaps and redundancies
  • Build Learning Paths: Find optimal prerequisite chains between topics
  • Adaptive Learning: Power recommendation systems with concept dependencies
  • Content Quality: Detect missing prerequisites or circular dependencies
  • Import into Neo4j: Use the JSON graph directly with Neo4j's import tools (see Neo4j documentation)
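
For example, prerequisite chains can be read straight out of the exported graph with a topological sort. A hedged sketch using only the standard library: the field names (`nodes`, `edges`, `source`, `target`) and the edge direction are assumptions for illustration, not taken from LearningChunkGraph.schema.json:

```python
# Derive a study order in which prerequisites always come first.
from graphlib import TopologicalSorter

def study_order(graph: dict) -> list[str]:
    """Topologically sort node ids so that prerequisites precede dependents."""
    # Map each node to the set of its prerequisites (edge source -> target
    # is read here as "source must be learned before target").
    deps: dict[str, set[str]] = {n["id"]: set() for n in graph["nodes"]}
    for edge in graph["edges"]:
        deps.setdefault(edge["target"], set()).add(edge["source"])
    return list(TopologicalSorter(deps).static_order())

demo = {
    "nodes": [{"id": "variables"}, {"id": "loops"}, {"id": "functions"}],
    "edges": [
        {"source": "variables", "target": "loops"},
        {"source": "loops", "target": "functions"},
    ],
}
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which doubles as the content-quality check mentioned above.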

Key Features

  • Incremental Processing: Handles book-length content by processing it in chunks
  • Context Preservation: Maintains semantic continuity across chunk boundaries
  • Smart Deduplication: Uses embeddings to identify and merge semantically identical content
  • Long-range Connections: Discovers relationships between concepts separated by many pages (forward/backward pass)
  • Language Support: Any UTF-8 text content
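
The deduplication idea can be pictured with plain cosine similarity over embedding vectors. A stdlib-only sketch (the real pipeline uses OpenAI embeddings; the tiny vectors below are illustrative, and the 0.85 default matches `sim_threshold` in the dedup configuration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_duplicates(embeddings: dict[str, list[float]], threshold: float = 0.85):
    """Return id pairs whose embeddings are semantically near-identical."""
    ids = sorted(embeddings)
    return [
        (i, j)
        for idx, i in enumerate(ids)
        for j in ids[idx + 1:]
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]
```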

Visualization & Analytics Module

K2-18 includes a powerful visualization module that enriches your knowledge graphs with educational metrics and creates interactive tools for exploration.

What it does:

  • Computes 12 network metrics revealing graph structure and learning paths
  • Identifies fundamental concepts, knowledge bridges, and topic clusters
  • Generates two complementary HTML tools:
    • Interactive graph - visual exploration with Cytoscape.js
    • Detailed viewer - three-column interface for methodical analysis

Perfect for quality control, curriculum analysis, and presenting results to stakeholders. Both tools work standalone in any browser.
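
The interactive graph is rendered with Cytoscape.js, which consumes elements of the form `{"data": {...}}`. A sketch of converting a graph dict into that format; the input field names (`nodes`, `edges`, `id`, `source`, `target`) are assumptions about the export, not confirmed from the schema:

```python
def to_cytoscape_elements(graph: dict) -> list[dict]:
    """Convert a nodes/edges dict into Cytoscape.js element objects."""
    elements = [{"data": {"id": n["id"]}} for n in graph["nodes"]]
    elements += [
        {"data": {"source": e["source"], "target": e["target"]}}
        for e in graph["edges"]
    ]
    return elements
```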

For detailed documentation, see Visualization Module Guide.

Requirements

  • Python 3.11+
  • OpenAI API access (Responses/Embeddings API)
  • Memory: no formal requirement established (the entire corpus is processed in memory)
  • OS: Windows, macOS

Installation

For Users

# Clone the repository
git clone https://github.com/zebrr/k2-18.git
cd k2-18

# Create virtual environment
python -m venv .venv

# Activate it (choose your platform):
source .venv/bin/activate         # Linux/macOS
.venv\Scripts\activate            # Windows

# Install dependencies
pip install -r requirements.txt

# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key"  # Linux/macOS
set OPENAI_API_KEY=your-api-key       # Windows

For Developers

# Same initial setup as above, then:

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

Quick Start

  1. Prepare content:

    # Place educational materials in:
    data/raw/

    Supported formats: .md, .txt, .html. All content must be merged into one file.

  2. Configure processing (optional): Edit src/config.toml to adjust parameters like chunk size, timeouts, and model selection.

  3. Run the pipeline:

    # Step-by-step processing
    python -m src.slicer               # Split into chunks
    python -m src.itext2kg_concepts    # Extract concepts
    python -m src.itext2kg_graph       # Build knowledge graph
    python -m src.dedup                # Remove semantic duplicates
    python -m src.refiner_longrange    # Add distant connections
  4. Find your results:

    data/out/
    ├── ConceptDictionary.json             # All extracted concepts
    ├── LearningChunkGraph_raw.json        # Initial graph
    ├── LearningChunkGraph_dedup.json      # After deduplication
    └── LearningChunkGraph_longrange.json  # Final graph
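
Each output is plain JSON, so a quick sanity check needs only the standard library. A small sketch; the actual top-level key names are defined by the schemas in /src/schemas/ and are not assumed here:

```python
import json
from pathlib import Path

def summarize(path: Path) -> str:
    """Report a JSON file's top-level keys for a quick sanity check."""
    data = json.loads(path.read_text(encoding="utf-8"))
    keys = ", ".join(sorted(data)) if isinstance(data, dict) else type(data).__name__
    return f"{path.name}: {keys}"

# e.g.: for p in Path("data/out").glob("*.json"): print(summarize(p))
```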

Configuration

Main settings in src/config.toml:

[slicer]
max_tokens = 5000          # Chunk size in tokens
soft_boundary = true       # Respect semantic boundaries

[itext2kg]
model = "..."              # OpenAI model selection
tpm_limit = 150000         # API rate limit (tokens/minute) based on your Tier
max_output_tokens = 25000  # Max response size

[dedup]
sim_threshold = 0.85       # Similarity threshold for duplicates

[refiner]
run = true                 # Enable/disable refiner stage
sim_threshold = 0.7        # Threshold for new connections

Data Formats

All data formats are defined by JSON schemas in /src/schemas/:

  • ConceptDictionary.schema.json - concept vocabulary structure
  • LearningChunkGraph.schema.json - knowledge graph structure
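
One lightweight way to use these schemas is to check a document's top-level `required` keys before running full validation (for which a complete validator such as the `jsonschema` package is the right tool). A minimal illustrative check:

```python
def missing_required(instance: dict, schema: dict) -> list[str]:
    """Top-level 'required' keys from the schema that the instance lacks."""
    return [key for key in schema.get("required", []) if key not in instance]

# e.g.:
# schema = json.loads(Path("src/schemas/LearningChunkGraph.schema.json").read_text())
# print(missing_required(my_graph, schema))
```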

Documentation

  • Component specifications: /docs/specs/

    • Pipeline utilities: cli_slicer.md, cli_itext2kg_concepts.md, cli_itext2kg_graph.md, cli_dedup.md, cli_refiner_longrange.md
    • Core modules: util_llm_client.md, util_config.md, util_validation.md, util_tokenizer.md, etc.
  • Data schemas: /src/schemas/

    • ConceptDictionary.schema.json - concept vocabulary structure
    • LearningChunkGraph.schema.json - knowledge graph structure
  • LLM prompts: /src/prompts/

    • itext2kg_concepts_extraction.md - concept extraction from text
    • itext2kg_graph_extraction.md - knowledge graph construction
    • refiner_longrange_fw.md - forward pass for long-range connections
    • refiner_longrange_bw.md - backward pass for long-range connections

⚠️ Important: Current prompts are optimized for Computer Science, Management, or Economics content in Russian. For other domains (history, biology, etc.) or languages, prompts REQUIRE adaptation to domain-specific terminology and concept patterns.

Limitations

  • Memory-bound: Entire corpus processed in memory
  • Sequential: No parallel processing (to maintain context/TPM limits)
  • API-dependent: Requires stable OpenAI API access
  • Token limits: Constrained by LLM context windows

Troubleshooting

API Rate Limits

  • Check your OpenAI API Tier limits
  • Adjust tpm_limit in config
  • Pipeline will auto-retry with backoff
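
The auto-retry behavior can be pictured as exponential backoff with jitter. A generic sketch, not the pipeline's actual retry code (the real knobs are `timeout`, `max_retries`, and `tpm_limit` in the config):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call`, doubling the delay after each failure, plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt) + random.random())
```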

Incomplete Processing

  • Check exit codes and logs in /logs/
  • Use previous_response_id for context continuity
  • Use timeout and max_retries to manage retries
  • Utilities do not support resuming from the last successful slice

Development

Contributing

  1. Follow TDD approach - write tests first
  2. All functions must have type hints
  3. Run quality checks before commits
  4. Update relevant specifications in /docs/specs/

Code Quality

# Format code
black src/
isort src/

# Check quality
ruff check src/
flake8 src/
mypy src/

Running Tests

# Activate virtual environment first
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

# Run all tests but integration (fast!)
pytest tests/ -v -m "not integration"

# Full test suite incl. integration (~5-7 min; requires an OpenAI API key)
pytest tests/ -v

Test markers:

  • integration - Tests requiring real API calls
  • slow - Tests taking >30 seconds
  • timeout - Tests with explicit timeout settings
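
Markers are applied with standard pytest decorators; a hypothetical example of the `integration` marker (the test name and body are invented for illustration):

```python
import pytest

@pytest.mark.integration
def test_concept_extraction_live():
    """Hypothetical test that would exercise the real OpenAI API."""
    assert True
```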

MIT License

Copyright (c) 2025 Askold Romanov

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Support

  • Check /docs/specs/ for detailed component documentation
  • Review logs in /logs/ for debugging
  • Open a GitHub Issue
