A knowledge graph generation demo that extracts processes and steps from instructional documents and loads them into Neo4j for analysis and visualization.
This project takes documents containing business processes and their associated steps as input, then automatically extracts structured process information and creates both lexical and entity graphs in Neo4j. It's particularly useful for analyzing complex multi-step business processes like supply chain management, manufacturing workflows, and operational procedures.
The system uses advanced AI models to understand process hierarchies, step dependencies, and relationships between different process components, making it easier to visualize and analyze complex organizational workflows.
- Neo4j: Graph database for storing lexical and entity relationships
- Unstructured.IO: Document partitioning and chunking
- Pydantic: Data validation and modeling
- Instructor: LLM response modeling and retry logic
- OpenAI GPT-4: Entity extraction and process understanding
- Pandas: Data processing and transformation
This project uses `uv` for Python package management. Make sure you have `uv` installed on your system.
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd process-extraction-demo
  ```
- Install dependencies:

  ```bash
  uv sync
  ```
- Set up environment variables: create a `.env` file or set the following environment variables:

  ```bash
  export OPENAI_API_KEY="your-openai-api-key"
  export NEO4J_URI="bolt://localhost:7687"
  export NEO4J_USERNAME="neo4j"
  export NEO4J_PASSWORD="your-neo4j-password"
  export NEO4J_DATABASE="neo4j"  # optional, defaults to "neo4j"
  ```
- Start the Neo4j database: make sure you have a Neo4j instance running and accessible with the credentials above.
The `process.ipynb` notebook performs three main operations:
- Takes input text files (like `sample.txt`) containing process documentation
- Uses Unstructured.IO to partition documents by title sections
- Creates structured records for documents and text chunks
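The notebook relies on Unstructured.IO's title-based partitioning for this step. Purely to illustrate the idea (this sketch does not reproduce Unstructured.IO's actual heuristics, and the title rule below is an assumption), a minimal stdlib version that groups lines under the most recent title might look like:

```python
def chunk_by_title_sketch(text: str) -> list[dict]:
    """Group a document's lines into chunks keyed by the most recent title.

    Assumed heuristic for illustration only: a short line that does not end
    with a period is treated as a section title; subsequent lines belong to
    that section until the next title appears.
    """
    chunks: list[dict] = []
    title, body = "", []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if len(stripped) < 60 and not stripped.endswith("."):
            # New section title: flush the previous chunk, if any
            if body:
                chunks.append({"title": title, "text": " ".join(body)})
                body = []
            title = stripped
        else:
            body.append(stripped)
    if body:
        chunks.append({"title": title, "text": " ".join(body)})
    return chunks
```

Each resulting chunk (a title plus its text) then becomes one `Chunk` record in the lexical graph.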
- Uses GPT-4 with structured output to extract processes and steps from text
- Employs Pydantic models to ensure a consistent data structure:
  - `Process`: contains a name and an ordered sequence of steps
  - `Step`: contains an ID, description, parent/child relationships, and sequence information
- Implements retry logic and error handling for robust extraction
- Maintains context across chunks to handle processes that span multiple sections
- Supports hierarchical steps with parent-child relationships
The notebook creates two interconnected graph structures:
Lexical Graph:
- `Document` nodes representing source files
- `Chunk` nodes representing text sections
- `HAS_CHUNK` relationships connecting documents to their chunks
Entity Graph:
- `Process` nodes representing identified business processes
- `Step` nodes representing individual process steps
- `NEXT_STEP` relationships showing step sequences
- `HAS_CHILD_STEP` relationships for hierarchical steps
- `HAS_FIRST_STEP` relationships connecting processes to their starting steps
Bridge Connections:
- `HAS_ENTITY` relationships linking chunks to the processes and steps they contain
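Using the node labels and relationship types above (the property names in this sketch are assumptions), the combined structure can be outlined in Cypher:

```cypher
// Uniqueness constraints for data integrity
CREATE CONSTRAINT process_name IF NOT EXISTS
FOR (p:Process) REQUIRE p.name IS UNIQUE;

// Lexical graph: a document and one of its chunks
MERGE (d:Document {name: "sample.txt"})
MERGE (c:Chunk {id: "sample.txt-chunk-1"})
MERGE (d)-[:HAS_CHUNK]->(c)

// Entity graph: a process with two sequential steps
MERGE (p:Process {name: "Order fulfillment"})
MERGE (s1:Step {id: "s1", description: "Receive order"})
MERGE (s2:Step {id: "s2", description: "Ship goods"})
MERGE (p)-[:HAS_FIRST_STEP]->(s1)
MERGE (s1)-[:NEXT_STEP]->(s2)

// Bridge: link the chunk to the entities it mentions
MERGE (c)-[:HAS_ENTITY]->(p)
MERGE (c)-[:HAS_ENTITY]->(s1);
```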
- Hierarchical Process Modeling: Supports nested steps and sub-processes
- Sequence Tracking: Maintains proper step ordering and dependencies
- Context Preservation: Handles processes that span multiple document sections
- Error Recovery: Robust handling of extraction failures with detailed logging
- Graph Constraints: Implements proper database constraints for data integrity
- Flexible Input: Works with various text document formats
- Leaf child steps are sometimes not connected to the next step in the sequence; instead, their parent step is connected to the next parent step
The resulting knowledge graph enables powerful queries for process analysis, dependency mapping, and workflow optimization.
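For example, assuming the relationship types described above, tracing every process from its first step to its last is a single variable-length pattern:

```cypher
// List each process with its steps in execution order
MATCH (p:Process)-[:HAS_FIRST_STEP]->(first:Step)
MATCH path = (first)-[:NEXT_STEP*0..]->(last:Step)
WHERE NOT (last)-[:NEXT_STEP]->()
RETURN p.name AS process,
       [s IN nodes(path) | s.description] AS orderedSteps
```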
Below is the resulting knowledge graph from the `process.ipynb` notebook.
And here is the isolated entity graph containing only a single Process node and the sequential Steps.

