Skip to content

neo4j-field/process-extraction-demo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Process Extraction Demo

A knowledge graph generation demo that extracts processes and steps from instructional documents and loads them into Neo4j for analysis and visualization.

Introduction

This project takes documents containing business processes and their associated steps as input, then automatically extracts structured process information and creates both lexical and entity graphs in Neo4j. It's particularly useful for analyzing complex multi-step business processes like supply chain management, manufacturing workflows, and operational procedures.

The system uses advanced AI models to understand process hierarchies, step dependencies, and relationships between different process components, making it easier to visualize and analyze complex organizational workflows.

Tech Stack

  • Neo4j: Graph database for storing lexical and entity relationships
  • Unstructured.IO: Document partitioning and chunking
  • Pydantic: Data validation and modeling
  • Instructor: LLM response modeling and retry logic
  • OpenAI GPT-4: Entity extraction and process understanding
  • Pandas: Data processing and transformation

Installation

This project uses uv for Python package management. Make sure you have uv installed on your system.

  1. Clone the repository:

    git clone <repository-url>
    cd process-extraction-demo
  2. Install dependencies:

    uv sync
  3. Set up environment variables: Create a .env file or set the following environment variables:

    export OPENAI_API_KEY="your-openai-api-key"
    export NEO4J_URI="bolt://localhost:7687"
    export NEO4J_USERNAME="neo4j"
    export NEO4J_PASSWORD="your-neo4j-password"
    export NEO4J_DATABASE="neo4j"  # optional, defaults to "neo4j"
  4. Start Neo4j database: Make sure you have a Neo4j instance running and accessible with the credentials above.

What the Notebook Does

The process.ipynb notebook performs three main operations:

1. Document Partitioning and Chunking

  • Takes input text files (like sample.txt) containing process documentation
  • Uses Unstructured.IO to partition documents by title sections
  • Creates structured records for documents and text chunks

2. Entity Extraction

  • Uses GPT-4 with structured output to extract processes and steps from text
  • Employs Pydantic models to ensure consistent data structure:
    • Process: Contains name and sequence of steps
    • Step: Contains ID, description, parent/child relationships, and sequence information
  • Implements retry logic and error handling for robust extraction
  • Maintains context across chunks to handle processes that span multiple sections
  • Supports hierarchical steps with parent-child relationships

3. Neo4j Knowledge Graph Creation

The notebook creates two interconnected graph structures:

Lexical Graph:

  • Document nodes representing source files
  • Chunk nodes representing text sections
  • HAS_CHUNK relationships connecting documents to their chunks

Entity Graph:

  • Process nodes representing identified business processes
  • Step nodes representing individual process steps
  • NEXT_STEP relationships showing step sequences
  • HAS_CHILD_STEP relationships for hierarchical steps
  • HAS_FIRST_STEP relationships connecting processes to their starting steps

Bridge Connections:

  • HAS_ENTITY relationships linking chunks to the processes and steps they contain

Key Features

  • Hierarchical Process Modeling: Supports nested steps and sub-processes
  • Sequence Tracking: Maintains proper step ordering and dependencies
  • Context Preservation: Handles processes that span multiple document sections
  • Error Recovery: Robust handling of extraction failures with detailed logging
  • Graph Constraints: Implements proper database constraints for data integrity
  • Flexible Input: Works with various text document formats

Known Bugs

  • Leaf child steps sometimes are not connected to the next step in sequence, instead the parent step is connected to the next parent step

Example Knowledge Graph

The resulting knowledge graph enables powerful queries for process analysis, dependency mapping, and workflow optimization.

Below is the resulting knowledge graph from the process.ipynb notebook.

full-graph

And here is the isolated entity graph containing only a single Process node and the sequential Steps.

example-kg

About

Knowledge graph generation demo extracting processes and steps from instructional document

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors