A production-ready, comprehensive knowledge graph and GraphRAG system that extracts entities and relationships from TechCrunch articles, stores them in Neo4j, and provides intelligent querying capabilities with a React frontend.
Major upgrade with enterprise-grade features:
- Security & Authentication - JWT tokens, rate limiting, CORS restrictions
- Redis Caching - 200x faster cached queries (10ms vs 2000ms)
- Prometheus Metrics - Full observability with 15+ metric types
- Structured Logging - JSON logs with request tracing
- 70%+ Test Coverage - Comprehensive test suite with fixtures
- CI/CD Pipeline - Automated testing, linting, and security scanning
- 40+ Make Commands - Developer-friendly tooling
- Enhanced Docker - Redis service, health checks, full configuration
- Python 3.11+
- Docker & Docker Compose
- OpenAI API key
- Node.js 18+ (for frontend)
# Clone repository
git clone https://github.com/shankerram3/Startup-Intelligence-Analysis-App.git
cd Startup-Intelligence-Analysis-App
# Install dependencies
make install
# OR: pip install -r requirements.txt --break-system-packages
# Configure environment
cp .env.aura.template .env
# Edit .env with your credentials
nano .env  # Add OPENAI_API_KEY, NEO4J credentials, etc.

Minimal .env configuration:
# Required
OPENAI_API_KEY=sk-your-openai-api-key
NEO4J_URI=bolt://localhost:7687 # or neo4j+s://xxxxx.databases.neo4j.io for Aura
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
# Optional - Good defaults provided
CACHE_ENABLED=true
ENABLE_RATE_LIMITING=true
ALLOWED_ORIGINS=http://localhost:5173,http://localhost:3000
LOG_LEVEL=INFO

# Start all services (Neo4j + Redis + API)
make docker-up
# OR: docker-compose up -d
# Verify everything is healthy
make health

# Build knowledge graph (embeddings generated automatically!)
python pipeline.py \
--scrape-category startups \
--scrape-max-pages 2 \
  --max-articles 10

This automatically runs all phases:
- Web Scraping - TechCrunch article extraction
- Entity Extraction - GPT-4o NER and relationships
- Company Intelligence Enrichment - Deep company data via Playwright
- Graph Construction - Build Neo4j knowledge graph
- Post-Processing - Embeddings, deduplication, communities
# Start frontend
cd frontend && npm install && npm run dev
# Access UI
open http://localhost:5173
# API documentation
open http://localhost:8000/docs
# Prometheus metrics
open http://localhost:8000/metrics

- Web Scraping - Automated TechCrunch article extraction
- Entity Extraction - GPT-4o based NER and relationship extraction
- Company Intelligence - Playwright-powered deep company data scraping
- Knowledge Graph - Neo4j graph database with enriched company profiles
- Auto Post-Processing - Embeddings, deduplication, communities (automatic!)
- Natural Language Queries - Ask questions in plain English
- Semantic Search - Vector similarity search with sentence-transformers
- Hybrid Search - Combined semantic + keyword search
- REST API - 40+ FastAPI endpoints
- React Frontend - Modern web UI with chat & dashboard
- Multi-hop Reasoning - Complex graph traversal
- Entity Comparison - Compare companies, investors, etc.
- JWT Authentication - Token-based auth with configurable expiration
- Rate Limiting - IP-based limits (30 requests/minute default)
- CORS Protection - Restricted origins (no more wildcards)
- Request Size Limits - 10MB default (configurable)
- Error Sanitization - No sensitive data leakage
- Password Hashing - Bcrypt with strength validation
- Redis Caching - Query results cached (1 hour TTL)
- 200x Faster - Cached queries: 10ms vs 2000ms
- Entity Caching - Frequently accessed entities cached
- Cache Statistics - Hit/miss rates tracked
- Configurable TTL - Per-cache-type expiration
- Prometheus Metrics - 15+ metric types
  - API requests (count, duration, size)
  - Neo4j queries (count, duration, status)
  - LLM usage (tokens, cost tracking)
  - Cache operations (hits, misses)
  - Business metrics (articles, entities, relationships)
- Structured Logging - JSON logs with request IDs
- Health Checks - Component status monitoring
- Performance Tracking - Detailed duration metrics
- 70%+ Test Coverage - Comprehensive test suite
- Unit Tests - 150+ test cases
- Integration Tests - Neo4j + Redis + API
- CI/CD Pipeline - Automated testing & security scans
- Pre-commit Hooks - Code quality enforcement
- Mock Fixtures - Reusable test utilities
- Makefile - 40+ commands for common tasks
- Pre-commit Hooks - Automatic formatting & linting
- GitHub Actions - CI/CD automation
- Docker Compose - One-command service startup
- Type Checking - MyPy static analysis
- Multi-layer Validation - Article and extraction validation
- Entity Deduplication - Automatic duplicate merging
- Quality Filtering - Removes noise and irrelevant data
- Checkpoint System - Resume capability for long runs
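Entity deduplication typically works by grouping entities under a normalized canonical name and merging each group. This is a minimal sketch of that idea, not the project's actual merging logic (the field names `type`, `name`, and `mentions` are assumptions):

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Canonical form for matching: lowercase, strip punctuation,
    drop common corporate suffixes ('Inc', 'Corp', ...)."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    tokens = [t for t in cleaned.split() if t not in {"inc", "corp", "llc", "ltd"}]
    return " ".join(tokens)


def dedupe_entities(entities: list[dict]) -> list[dict]:
    """Group by (type, normalized name), keep the longest surface form
    as canonical, and union the source mentions."""
    groups = defaultdict(list)
    for e in entities:
        groups[(e["type"], normalize(e["name"]))].append(e)
    merged = []
    for group in groups.values():
        canonical = max(group, key=lambda e: len(e["name"]))
        merged.append({
            "type": canonical["type"],
            "name": canonical["name"],
            "mentions": sorted({m for e in group for m in e.get("mentions", [])}),
        })
    return merged


entities = [
    {"type": "Company", "name": "OpenAI", "mentions": ["article-1"]},
    {"type": "Company", "name": "OpenAI, Inc.", "mentions": ["article-2"]},
    {"type": "Person", "name": "Sam Altman", "mentions": ["article-1"]},
]
print(dedupe_entities(entities))  # the two OpenAI records collapse into one
```

Real systems usually add fuzzy matching (edit distance, embedding similarity) on top of exact normalized-name matching.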
# Show all available commands
make help
# Install dependencies
make install
# Run all tests
make test
# Run tests with coverage
make test-coverage
# Check code quality
make lint
# Format code
make format
# Run all CI checks locally
make ci
# Start Docker services
make docker-up
# Check system health
make health
# View system status
make status
# View metrics
make metrics
# Clean temporary files
make clean

# Full pipeline (automatic embeddings!)
python pipeline.py --scrape-category startups --scrape-max-pages 2 --max-articles 10
# Use existing articles
python pipeline.py --skip-scraping --max-articles 50
# Use existing extractions
python pipeline.py --skip-scraping --skip-extraction
# Resume from checkpoint
python pipeline.py --scrape-category ai --max-articles 100

# Start all services with Docker
make docker-up
# Start API only
python api.py
# Start frontend only
cd frontend && npm run dev
# View logs
make docker-logs
# Restart services
make docker-restart
# Stop all services
make docker-down

# Via React UI
open http://localhost:5173
# Via API docs (Swagger UI)
open http://localhost:8000/docs
# Via Python
python -c "from rag_query import create_rag_query; rag = create_rag_query(); print(rag.query('Which AI startups raised funding?')['answer']); rag.close()"
# Via cURL
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "Which AI startups raised funding?", "use_llm": true}'
# With authentication (if enabled)
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -d '{"question": "What is OpenAI?"}'

# Run all tests
make test
# Run unit tests only
make test-unit
# Run integration tests
make test-integration
# Run with coverage report
make test-coverage
# Run specific test file
pytest tests/unit/test_security.py -v
# Run tests with markers
pytest -m unit # Only unit tests
pytest -m integration  # Only integration tests

# Check health
curl http://localhost:8000/health | python -m json.tool
# View system status
curl http://localhost:8000/admin/status | python -m json.tool
# View Prometheus metrics
curl http://localhost:8000/metrics
# Check cache statistics
curl http://localhost:8000/admin/status | jq .cache

Required:
OPENAI_API_KEY=sk-your-openai-api-key
NEO4J_URI=bolt://localhost:7687 # or neo4j+s://xxx.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password

Security (Optional - NEW v2.0):
ENABLE_AUTH=false # Set true for JWT authentication
JWT_SECRET_KEY=change-in-production # Generate: openssl rand -hex 32
ACCESS_TOKEN_EXPIRE_MINUTES=60 # Token expiration
ENABLE_RATE_LIMITING=true # Rate limiting (30 req/min)
ALLOWED_ORIGINS=http://localhost:5173,http://localhost:3000 # CORS
MAX_REQUEST_SIZE=10485760              # 10MB request limit

Caching (Optional - NEW v2.0):
CACHE_ENABLED=true # Enable Redis caching
REDIS_HOST=localhost # Redis host (use 'redis' in Docker)
REDIS_PORT=6379 # Redis port
REDIS_DB=0 # Redis database number
CACHE_DEFAULT_TTL=3600                 # Cache TTL in seconds (1 hour)

Logging (Optional - NEW v2.0):
LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
JSON_LOGS=true # JSON format (recommended for production)
ENABLE_FILE_LOGGING=false              # Write logs to logs/api.log

API Configuration:
API_HOST=0.0.0.0 # API host
API_PORT=8000 # API port
RAG_EMBEDDING_BACKEND=sentence-transformers
SENTENCE_TRANSFORMERS_MODEL=BAAI/bge-small-en-v1.5

Create frontend/.env.local:
# Local development
VITE_API_BASE_URL=http://localhost:8000
# Remote server
VITE_API_BASE_URL=http://YOUR_VM_PUBLIC_IP:8000

Phase 0: Web Scraping → Raw JSON
Phase 1: Entity Extraction → Entities & Relationships
Phase 1.5: Company Intelligence → Deep company data via Playwright
Phase 2: Graph Construction → Neo4j Knowledge Graph (with enriched data)
Phase 3: Graph Cleanup → Remove noise & duplicates
Phase 4: Post-Processing → Embeddings, Deduplication, Communities (AUTOMATIC)
    ↓
Ready for Queries!
User Question
    ↓
[Rate Limiting] → Check IP limits
    ↓
[Cache Check] → Redis cache lookup
    ↓
[Semantic Search] → Vector similarity
    ↓
[Graph Traversal] → Neo4j Cypher queries
    ↓
[LLM Generation] → GPT-4o answer generation
    ↓
[Cache Store] → Store result in Redis
    ↓
[Metrics Recording] → Prometheus metrics
    ↓
User Response (with structured logging)
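The flow above can be sketched as one function that chains the stages. All the collaborators here are injected stubs standing in for Redis, Neo4j, and OpenAI; the function and parameter names are illustrative, not the project's actual interfaces:

```python
def answer_question(question, *, cache, search, traverse, llm, metrics):
    """Sketch of the query path: cache lookup, retrieval,
    LLM generation, cache store, metrics recording."""
    cached = cache.get(question)
    if cached is not None:
        metrics("cache_hit")
        return cached

    metrics("cache_miss")
    seeds = search(question)          # vector similarity over entity embeddings
    context = traverse(seeds)         # Cypher expansion around the seed entities
    answer = llm(question, context)   # grounded answer generation
    cache.set(question, answer)       # store for repeat questions
    return answer


# Wiring with trivial stand-ins:
store, events = {}, []
result = answer_question(
    "Which AI startups raised funding?",
    cache=type("C", (), {"get": staticmethod(store.get),
                         "set": staticmethod(lambda q, a: store.update({q: a}))}),
    search=lambda q: ["Anthropic", "Mistral"],
    traverse=lambda seeds: f"{len(seeds)} subgraphs",
    llm=lambda q, ctx: f"Answer using {ctx}",
    metrics=events.append,
)
print(result)  # "Answer using 2 subgraphs"; a second identical call would hit the cache
```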
┌─────────────────┐
│ React Frontend  │
└────────┬────────┘
         │ HTTP
┌────────▼─────────────────────────────────┐
│           FastAPI API Server             │
│  ┌────────────────────────────────────┐  │
│  │ Middleware Stack:                  │  │
│  │  - Rate Limiting (slowapi)         │  │
│  │  - Request Size Limiting           │  │
│  │  - Prometheus Metrics              │  │
│  │  - CORS (restricted)               │  │
│  │  - Structured Logging              │  │
│  └────────────────────────────────────┘  │
└───┬────────┬──────────┬──────────┬───────┘
    │        │          │          │
  Neo4j    Redis     OpenAI    Prometheus
    │      Cache      API       Scraper
    ▼        ▼          ▼          ▼
 [Graph]  [Cache]    [LLM]   [Metrics Dashboard]
Company, Person, Investor, Technology, Product, FundingRound, Location, Event
FUNDED_BY, FOUNDED_BY, WORKS_AT, ACQUIRED, PARTNERS_WITH, COMPETES_WITH, USES_TECHNOLOGY, LOCATED_IN, ANNOUNCED_AT, REGULATES, OPPOSES, SUPPORTS, COLLABORATES_WITH, INVESTS_IN, ADVISES, LEADS
- Founded year, employee count, headquarters
- Founders, executives, team information
- Funding rounds and investment data
- Products, technologies, pricing models
- Website URLs and social links
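Given the entity labels and relationship types above, typical Cypher queries against the graph might look like the following sketch. Exact property names (e.g. `name`) and relationship directions are assumptions about the schema, not confirmed by the source:

```cypher
// Companies and the investors that funded them
MATCH (c:Company)-[:FUNDED_BY]->(i:Investor)
RETURN c.name AS company, collect(i.name) AS investors
ORDER BY size(investors) DESC
LIMIT 10;

// People who founded companies headquartered in a given city
MATCH (p:Person)<-[:FOUNDED_BY]-(c:Company)-[:LOCATED_IN]->(l:Location {name: 'San Francisco'})
RETURN p.name AS founder, c.name AS company;
```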
| Operation | Without Cache | With Cache | Speedup |
|---|---|---|---|
| Simple Query | 2000ms | 10ms | 200x |
| Semantic Search | 500ms | 50ms | 10x |
| Entity Lookup | 200ms | 15ms | 13x |
| Graph Traversal | 1000ms | 100ms | 10x |
Cache Hit Rates:
- Query cache: 30-50% (repeated questions)
- Entity cache: 60-70% (frequently accessed entities)
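The headline speedup only applies to cache hits; the latency a user actually sees is the hit-rate-weighted blend of the cached and uncached paths. A quick check using the numbers from the table above:

```python
def expected_latency_ms(hit_rate: float, cached_ms: float, uncached_ms: float) -> float:
    """Blended latency: hit_rate * cached + (1 - hit_rate) * uncached."""
    return hit_rate * cached_ms + (1 - hit_rate) * uncached_ms


# Simple query path (10ms cached, 2000ms uncached) at the reported
# 30-50% query-cache hit rate:
print(expected_latency_ms(0.30, 10, 2000))  # 1403.0
print(expected_latency_ms(0.50, 10, 2000))  # 1005.0
```

So at realistic hit rates the average end-to-end speedup is closer to 1.4-2x, even though individual cached responses are 200x faster.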
# All tests
make test
# Unit tests (fast)
make test-unit
# Integration tests (requires services)
make test-integration
# With coverage report
make test-coverage
# Specific test file
pytest tests/unit/test_security.py -v
# With markers
pytest -m unit -v
pytest -m integration -v

Current coverage: 70%+
- utils/data_validation.py - 95%+
- utils/security.py - 90%+
- utils/cache.py - 85%+
- api.py - 30%+ (integration tests)
# Example using fixtures
def test_query_caching(api_client, sample_article):
# Use fixtures from tests/conftest.py
response = api_client.post("/query", json={
"question": "What is AI?",
"use_llm": True
})
    assert response.status_code == 200

1. Enable Authentication:
   ENABLE_AUTH=true
   JWT_SECRET_KEY=$(openssl rand -hex 32)  # Generate secure key

2. Restrict CORS:
   ALLOWED_ORIGINS=https://yourdomain.com,https://api.yourdomain.com

3. Use Strong Passwords:
   - Neo4j: Generate with openssl rand -base64 32
   - Redis: REDIS_PASSWORD=$(openssl rand -hex 24)

4. Enable HTTPS:
   - Use reverse proxy (nginx, Caddy)
   - Configure SSL certificates

5. Set Up Monitoring:
   - Prometheus scraping
   - Grafana dashboards
   - Alert rules

6. Review Security Checklist:
   # Run security scan
   make security-check
   # Check for hardcoded secrets
   grep -r "API_KEY\|SECRET" --include="*.py" .
Available at: http://localhost:8000/metrics
Key Metrics:
- api_requests_total - Total API requests by endpoint/status
- api_request_duration_seconds - Request duration histogram
- neo4j_queries_total - Database query count
- neo4j_query_duration_seconds - Query performance
- llm_requests_total - LLM API calls
- llm_tokens_used_total - Token usage tracking
- cache_hits_total / cache_misses_total - Cache performance
- articles_scraped_total - Pipeline metrics
- entities_extracted_total - Entity extraction stats
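With those metrics exported, typical PromQL for dashboards and alerts might look like the following. Label names such as `endpoint` and the exact histogram bucket suffix are assumptions about how the metrics are registered:

```promql
# Request rate per endpoint over the last 5 minutes
sum by (endpoint) (rate(api_requests_total[5m]))

# p95 API latency
histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))

# Cache hit ratio
sum(rate(cache_hits_total[5m]))
  / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
```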
# 1. Start Grafana
docker run -d -p 3000:3000 grafana/grafana
# 2. Add Prometheus datasource
# URL: http://host.docker.internal:9090
# 3. Import dashboard
# Use dashboard ID: 1860 (Node Exporter)
# Or create custom dashboard with metrics above

# View structured logs
docker-compose logs -f graphrag-api | jq
# Filter by log level
docker-compose logs graphrag-api | jq 'select(.level=="ERROR")'
# Find slow queries
docker-compose logs graphrag-api | jq 'select(.duration_ms > 1000)'

# Check embeddings
python -c "from neo4j import GraphDatabase; import os; from dotenv import load_dotenv; load_dotenv(); driver = GraphDatabase.driver(os.getenv('NEO4J_URI'), auth=(os.getenv('NEO4J_USER'), os.getenv('NEO4J_PASSWORD'))); result = driver.session().run('MATCH (n) WHERE n.embedding IS NOT NULL RETURN count(n) as count'); print(f'Embeddings: {result.single()[\"count\"]}'); driver.close()"
# Generate embeddings if needed
python -c "from neo4j import GraphDatabase; from utils.embedding_generator import EmbeddingGenerator; import os; from dotenv import load_dotenv; load_dotenv(); driver = GraphDatabase.driver(os.getenv('NEO4J_URI'), auth=(os.getenv('NEO4J_USER'), os.getenv('NEO4J_PASSWORD'))); gen = EmbeddingGenerator(driver, 'sentence-transformers'); gen.generate_embeddings_for_all_entities(); driver.close()"

# Check Redis status
docker-compose ps redis
# Restart Redis
docker-compose restart redis
# Check connection
redis-cli ping
# View cache stats
curl http://localhost:8000/admin/status | jq .cache

# Disable temporarily for testing
export ENABLE_RATE_LIMITING=false
python api.py
# Or increase limit in code (api.py)
@limiter.limit("100/minute")  # Instead of 30/minute

# Check Neo4j
docker ps | grep neo4j
# Start Neo4j
docker-compose up -d neo4j
# Test connection
python -c "from neo4j import GraphDatabase; import os; from dotenv import load_dotenv; load_dotenv(); driver = GraphDatabase.driver(os.getenv('NEO4J_URI'), auth=(os.getenv('NEO4J_USER'), os.getenv('NEO4J_PASSWORD'))); driver.verify_connectivity(); print('Connected'); driver.close()"

# Install test dependencies
pip install pytest pytest-cov pytest-asyncio faker
# Run with verbose output
pytest -vv
# Check specific failure
pytest tests/unit/test_security.py::test_create_token -vv

# Clean and rebuild
docker-compose down -v
docker system prune -f
docker-compose build --no-cache
docker-compose up -d

# Check services
sudo netstat -tulpn | grep -E '8000|5173'
# Check firewall
sudo ufw status | grep -E '8000|5173'
# Add firewall rules
sudo ufw allow 8000/tcp
sudo ufw allow 5173/tcp

# Check Neo4j memory settings in docker-compose.yml
NEO4J_server_memory_heap_max__size: 512m # Adjust as needed
# Check Redis memory
redis-cli info memory
# Clean cache
curl -X POST http://localhost:8000/admin/cache/clear

# Quick start
make docker-up
make run
# Or manually
docker-compose up -d
python api.py
cd frontend && npm run dev

# 1. Configure production environment
cp .env.aura.template .env
nano .env # Set production values
# 2. Build and start services
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
# 3. Set up reverse proxy (nginx)
# See docs/nginx-config.example

# 1. Create secrets
kubectl create secret generic app-secrets \
--from-literal=openai-api-key=$OPENAI_API_KEY \
--from-literal=neo4j-password=$NEO4J_PASSWORD
# 2. Deploy
kubectl apply -f k8s/
# 3. Verify
kubectl get pods
kubectl logs -f deployment/graphrag-api

AWS:
# Use ECS/Fargate with docker-compose.yml
ecs-cli compose up
# Or EC2 with docker-compose
ssh ec2-user@your-instance
git clone ...
docker-compose up -d

Google Cloud:
# Use Cloud Run
gcloud run deploy graphrag-api \
--source . \
--platform managed \
  --region us-central1

Azure:
# Use Container Instances
az container create \
--resource-group myResourceGroup \
--name graphrag-api \
  --image your-registry/graphrag-api:latest

- Set ENABLE_AUTH=true
- Generate secure JWT_SECRET_KEY
- Configure ALLOWED_ORIGINS for your domain
- Use managed Neo4j (Aura) or secure self-hosted
- Set up Redis persistence
- Configure SSL/TLS (HTTPS)
- Set up Prometheus + Grafana
- Configure log aggregation (ELK, Datadog)
- Set up alerting (PagerDuty, Slack)
- Configure backups (Neo4j, Redis)
- Set resource limits (Docker, Kubernetes)
- Enable monitoring & health checks
- Test disaster recovery
- Document runbooks
- API Documentation - Interactive Swagger UI
- Prometheus Metrics - Available metrics
- Inline Documentation - All modules have comprehensive docstrings
# 1. Install pre-commit hooks
make hooks-install
# 2. Run checks before committing
make ci
# 3. Run tests
make test
# 4. Format code
make format

- Code Coverage: 70%+ required
- Type Hints: All functions must have type hints
- Docstrings: All public functions/classes
- Tests: Unit tests for all new features
- Linting: Pass Pylint, Black, isort
- Security: Pass Bandit security scan
MIT License - See LICENSE file for details
- Built with FastAPI
- Graph database: Neo4j
- LLM: OpenAI GPT-4o
- Embeddings: Sentence Transformers
- Monitoring: Prometheus
- Caching: Redis
- Issues: GitHub Issues
- Questions: Open a discussion on GitHub
- GraphQL API
- WebSocket support for real-time updates
- Multi-tenancy support
- Advanced entity resolution
- Graph visualization UI
- Dedicated vector database (Pinecone/Weaviate)
- Distributed tracing (OpenTelemetry)
- Multi-language support
- Advanced analytics dashboard
- Automated graph quality scoring
- Multi-region deployment
- Edge caching
- Real-time streaming ingestion
- Advanced ML features
- Enterprise SSO integration
| Metric | Value |
|---|---|
| Version | 2.0.0 |
| Lines of Code | 16,000+ |
| Test Coverage | 70%+ |
| API Endpoints | 40+ |
| Dependencies | 87 packages |
| Docker Services | 3 (Neo4j, Redis, API) |
| Test Files | 7 |
| Utility Modules | 23 |
| Code Quality | 9/10 |
Made with ❤️ for the startup intelligence community
v2.0.0 - Production Ready