A knowledge graph generation demo that extracts processes and steps from instructional documents and loads them into Neo4j for analysis and visualization.
This project takes documents containing business processes and their associated steps as input, then automatically extracts structured process information and creates both lexical and entity graphs in Neo4j. It's particularly useful for analyzing complex multi-step business processes like supply chain management, manufacturing workflows, and operational procedures.
The system uses advanced AI models to understand process hierarchies, step dependencies, and relationships between different process components, making it easier to visualize and analyze complex organizational workflows.
- Neo4j: Graph database for storing lexical and entity relationships
- Unstructured.IO: Document partitioning and chunking
- Pydantic: Data validation and modeling
- Instructor: LLM response modeling and retry logic
- OpenAI GPT-4: Entity extraction and process understanding
- Pandas: Data processing and transformation
This project uses `uv` for Python package management. Make sure you have `uv` installed on your system.
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd process-extraction-demo
  ```
- Install dependencies:

  ```bash
  uv sync
  ```
- Set up environment variables: create a `.env` file or set the following environment variables:

  ```bash
  export OPENAI_API_KEY="your-openai-api-key"
  export NEO4J_URI="bolt://localhost:7687"
  export NEO4J_USERNAME="neo4j"
  export NEO4J_PASSWORD="your-neo4j-password"
  export NEO4J_DATABASE="neo4j"  # optional, defaults to "neo4j"
  ```
- Start the Neo4j database: make sure you have a Neo4j instance running and accessible with the credentials above.
The `process.ipynb` notebook performs three main operations:
- Takes input text files (like `sample.txt`) containing process documentation
- Uses Unstructured.IO to partition documents by title sections
- Creates structured records for documents and text chunks
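The notebook relies on Unstructured.IO's title-based partitioning for this step. Purely to illustrate the idea (this sketch does not reproduce Unstructured.IO's actual heuristics, and the title rule below is an assumption), a minimal stdlib version that groups lines under the most recent title might look like:

```python
def chunk_by_title_sketch(text: str) -> list[dict]:
    """Group a document's lines into chunks keyed by the most recent title.

    Assumed heuristic for illustration only: a short line that does not end
    with a period is treated as a section title; subsequent lines belong to
    that section until the next title appears.
    """
    chunks: list[dict] = []
    title, body = "", []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if len(stripped) < 60 and not stripped.endswith("."):
            # New section title: flush the previous chunk, if any
            if body:
                chunks.append({"title": title, "text": " ".join(body)})
                body = []
            title = stripped
        else:
            body.append(stripped)
    if body:
        chunks.append({"title": title, "text": " ".join(body)})
    return chunks
```

Each resulting chunk (a title plus its text) then becomes one `Chunk` record in the lexical graph.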
- Uses GPT-4 with structured output to extract processes and steps from text
- Employs Pydantic models to ensure a consistent data structure:
  - `Process`: contains a name and an ordered sequence of steps
  - `Step`: contains an ID, description, parent/child relationships, and sequence information
- Implements retry logic and error handling for robust extraction
- Maintains context across chunks to handle processes that span multiple sections
- Supports hierarchical steps with parent-child relationships
The notebook creates two interconnected graph structures:
Lexical Graph:
- `Document` nodes representing source files
- `Chunk` nodes representing text sections
- `HAS_CHUNK` relationships connecting documents to their chunks
Entity Graph:
- `Process` nodes representing identified business processes
- `Step` nodes representing individual process steps
- `NEXT_STEP` relationships showing step sequences
- `HAS_CHILD_STEP` relationships for hierarchical steps
- `HAS_FIRST_STEP` relationships connecting processes to their starting steps
Bridge Connections:
- `HAS_ENTITY` relationships linking chunks to the processes and steps they contain
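Using the node labels and relationship types above (the property names in this sketch are assumptions), the combined structure can be outlined in Cypher:

```cypher
// Uniqueness constraints for data integrity
CREATE CONSTRAINT process_name IF NOT EXISTS
FOR (p:Process) REQUIRE p.name IS UNIQUE;

// Lexical graph: a document and one of its chunks
MERGE (d:Document {name: "sample.txt"})
MERGE (c:Chunk {id: "sample.txt-chunk-1"})
MERGE (d)-[:HAS_CHUNK]->(c)

// Entity graph: a process with two sequential steps
MERGE (p:Process {name: "Order fulfillment"})
MERGE (s1:Step {id: "s1", description: "Receive order"})
MERGE (s2:Step {id: "s2", description: "Ship goods"})
MERGE (p)-[:HAS_FIRST_STEP]->(s1)
MERGE (s1)-[:NEXT_STEP]->(s2)

// Bridge: link the chunk to the entities it mentions
MERGE (c)-[:HAS_ENTITY]->(p)
MERGE (c)-[:HAS_ENTITY]->(s1);
```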
- Hierarchical Process Modeling: Supports nested steps and sub-processes
- Sequence Tracking: Maintains proper step ordering and dependencies
- Context Preservation: Handles processes that span multiple document sections
- Error Recovery: Robust handling of extraction failures with detailed logging
- Graph Constraints: Implements proper database constraints for data integrity
- Flexible Input: Works with various text document formats
- Leaf child steps are sometimes not connected to the next step in the sequence; instead, their parent step is connected to the next parent step
The resulting knowledge graph enables powerful queries for process analysis, dependency mapping, and workflow optimization.
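For example, assuming the relationship types described above, tracing every process from its first step to its last is a single variable-length pattern:

```cypher
// List each process with its steps in execution order
MATCH (p:Process)-[:HAS_FIRST_STEP]->(first:Step)
MATCH path = (first)-[:NEXT_STEP*0..]->(last:Step)
WHERE NOT (last)-[:NEXT_STEP]->()
RETURN p.name AS process,
       [s IN nodes(path) | s.description] AS orderedSteps
```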
Below is the resulting knowledge graph from the `process.ipynb` notebook.
And here is the isolated entity graph containing only a single Process node and the sequential Steps.

