๐Ÿท๏ธ Project Title

Enterprise-Scale Unified Business Intelligence & Lead Mining Engine

Cloud-native data extraction and business intelligence platform that aggregates and normalizes high-volume lead data from Google Maps APIs, Google My Business via Playwright, and arbitrary websites using distributed BullMQ queues, Redis, MongoDB Atlas, and TypeScript.


🧾 Executive Summary

This project implements a high-performance, production-grade, unified data extraction system capable of scraping, parsing, normalizing, and storing business-related data from:

  • Google Maps Places API
  • Google My Business Keyword Search (Browser Automation)
  • General Web Data Extraction (Dynamic & Static Websites)

The system supports concurrency, scaling, structured storage, job monitoring, and async processing using BullMQ + Redis. It stores normalized business entities in MongoDB Atlas and exposes a REST API for job orchestration, analytics, and future CRM integration for Saubh Tech Campaign automation and lead enrichment.


🧩 Project Overview

A scalable backend system that extracts and normalizes business details such as name, address, phone, website, ratings, Maps links, reviews, geo-coordinates, categories, operational data, and metadata. It combines APIs, browser automation, and web scraping with clean storage and async processing.
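
For illustration, here is a minimal sketch of what the unified Place schema (see place.model.ts in the project structure below) might look like; fields beyond those listed above are assumptions, not the repository's actual model:

// Minimal sketch of src/database/place.model.ts (assumed shape, not the actual model)
import { Schema, model } from "mongoose";

const placeSchema = new Schema(
  {
    name: { type: String, required: true },
    address: String,
    phone: String,
    website: String,
    rating: Number,
    reviewCount: Number,
    mapsUrl: String,
    categories: [String],
    location: { lat: Number, lng: Number },
    source: { type: String, enum: ["google-maps", "gmb", "web-scrape"] },
    metadata: Schema.Types.Mixed, // source-specific extras extend the schema here
  },
  { timestamps: true }
);

export const Place = model("Place", placeSchema);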


🎯 Objectives & Goals

  • Extract high-quality lead/business data
  • Process scraping asynchronously using a Queue Worker
  • Normalize heterogeneous sources into a single Place model
  • Provide REST APIs for CRM ingestion
  • Enable scaling via distributed workers & Redis

✅ Acceptance Criteria

  • Google Maps data extracted & saved in MongoDB
  • GMB keyword scraping operational via Playwright Worker
  • Web scraping support for business websites
  • No blocking or synchronous bottlenecks
  • Queue must retry failed jobs
  • MongoDB must store normalized structure
  • Testable via curl, Postman, or API calls

💻 Prerequisites

  • Node.js: >= 18.x
  • TypeScript: >= 5.x
  • MongoDB Atlas account
  • Google Maps API Key (Billing Enabled)
  • Redis Server or Memurai (Windows)
  • Playwright (browsers installed during setup via npx playwright install)

โš™๏ธ Installation & Setup

git clone <repo_url>
cd backend
npm install
npx playwright install

.env

PORT=5000
REDIS_HOST=localhost
REDIS_PORT=6379
MONGO_URI=mongodb+srv://...
GOOGLE_MAPS_API_KEY=xxxx

npm run dev
npx ts-node src/queue/extractWorker.ts

🔗 API Documentation

Method   Endpoint                       Description
POST     /extract/google-maps           Triggers Google Places API extraction
POST     /extract/google-my-business    Triggers Playwright-based GMB scraping
POST     /extract/web-scrape            Triggers generic website extraction

Client → Express API → BullMQ Queue → Worker → MongoDB Atlas
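
For example, a hypothetical request (the keyword and location body fields follow the workflow described below; the exact payload may differ):

curl -X POST http://localhost:5000/extract/google-maps \
  -H "Content-Type: application/json" \
  -d '{ "keyword": "dentist", "location": "Mumbai" }'

The API should respond immediately (202 Accepted) with something like success=true and a jobId; the extracted documents appear in MongoDB once the worker completes the job.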

๐Ÿ–ฅ๏ธ UI / Frontend (Scope Placeholder)

The frontend is intended for CRM visualization and trigger control. This backend supports any UI (Next.js recommended).

  • Dashboard → Trigger extraction jobs
  • Display result table linked to MongoDB
  • Filter by keyword, listId, location
  • Network calls via fetch or axios
  • Styling through Tailwind / MUI
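
A minimal trigger helper for such a dashboard might look like this (endpoint from the API table above; the response shape is assumed):

// Hypothetical dashboard helper: queue an extraction job from the UI
async function triggerExtraction(keyword: string, location: string) {
  const res = await fetch("http://localhost:5000/extract/google-maps", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keyword, location }),
  });
  if (!res.ok) throw new Error(`Extraction request failed: ${res.status}`);
  return res.json(); // assumed shape: { success: true, jobId: "..." }
}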

🔢 Status Codes

  • 200 Success / Data Found
  • 202 Accepted (Job Queued)
  • 400 Bad Input
  • 401/403 Unauthorized / Forbidden (invalid or restricted API key)
  • 404 Invalid Resource
  • 500 Server Error / Worker Failure

🚀 Features

This platform is designed as a full-scale enterprise data intelligence system that transforms raw web signals into structured, CRM-ready business entities.

  • Multi-Source Data Mining – Simultaneously extracts data from Google Maps API, Google My Business via Playwright, and generic websites.
  • Distributed Job Processing – Uses BullMQ with Redis to process millions of extraction jobs asynchronously (see the queue sketch below).
  • High-Volume Lead Generation – Built for bulk keyword-, location-, and category-based business discovery.
  • Unified Data Model – Normalizes heterogeneous data into a single Place schema.
  • Fault Tolerance – Automatic retries, job recovery, and dead-letter queues.
  • Cloud Native – Works on Vercel, Docker, Kubernetes, and serverless platforms.
  • CRM & Analytics Ready – Outputs structured data for enrichment, scoring, and sales pipelines.
[Raw Web Data]
        ↓
[APIs + Browser Automation]
        ↓
[Normalized Business Intelligence]
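
A minimal sketch of that queueing layer (the queue name, job payload, and retry options here are illustrative; compare src/queue/queue.ts in the project structure):

// Sketch of src/queue/queue.ts: a BullMQ queue with retry-friendly job options
import { Queue } from "bullmq";

const connection = {
  host: process.env.REDIS_HOST ?? "localhost",
  port: Number(process.env.REDIS_PORT ?? 6379),
};

export const extractQueue = new Queue("extract", { connection });

export async function enqueueExtraction(keyword: string, location: string) {
  // attempts + backoff satisfy the "queue must retry failed jobs" criterion
  return extractQueue.add(
    "google-maps",
    { keyword, location },
    { attempts: 3, backoff: { type: "exponential", delay: 5000 }, removeOnComplete: true }
  );
}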

🧱 Tech Stack & Architecture

Layer         Technology                      Role
API Gateway   Express.js + TypeScript         Accepts extraction requests
Queue         BullMQ + Redis                  Distributed job orchestration
Scraping      Google Places API, Playwright   Data acquisition
Processing    Node.js Workers                 Normalization & validation
Storage       MongoDB Atlas                   Unified business records
Cloud         Vercel, Atlas, Redis Cloud      Production deployment
[Client]
    ↓
[Express API]
    ↓
[BullMQ Queue] ⇆ [Redis]
    ↓
[Distributed Workers]
    ↓
[Scraping Engines]
    ↓
[MongoDB Atlas]

๐Ÿ› ๏ธ Workflow & Implementation

1. User submits keyword + location
2. Express validates and queues job
3. Redis stores job metadata
4. Worker picks job
5. API / Browser scraping runs
6. Data normalized to Place schema
7. MongoDB Atlas stores results
8. Job status updated
9. CRM or dashboard consumes data
[User]
   ↓
[API Request]
   ↓
[Queue]
   ↓
[Worker]
   ↓
[Scrapers]
   ↓
[Normalizer]
   ↓
[Database]
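
A condensed sketch of the worker side of this flow (the scrape and normalizeToPlace helpers are hypothetical stand-ins for the real extraction modules; compare src/queue/extractWorker.ts):

// Sketch of src/queue/extractWorker.ts: consume jobs, scrape, normalize, persist
import { Worker } from "bullmq";
import { Place } from "../database/place.model";

const connection = {
  host: process.env.REDIS_HOST ?? "localhost",
  port: Number(process.env.REDIS_PORT ?? 6379),
};

// Hypothetical stand-ins for the google-maps / google-my-business / web-scraping modules
async function scrape(source: string, data: unknown): Promise<Record<string, unknown>[]> {
  return []; // dispatch to the matching extraction engine here
}

function normalizeToPlace(raw: Record<string, unknown>): Record<string, unknown> {
  return raw; // map source-specific fields onto the unified Place shape
}

const worker = new Worker(
  "extract",
  async (job) => {
    const raw = await scrape(job.name, job.data);
    const places = raw.map(normalizeToPlace);
    await Place.insertMany(places); // persist normalized entities to MongoDB Atlas
    return { saved: places.length };
  },
  { connection, concurrency: 5 } // tune concurrency per worker instance
);

worker.on("completed", (job) => console.log(`Job ${job.id} completed`));
worker.on("failed", (job, err) => console.error(`Job ${job?.id} failed: ${err.message}`));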

🧪 Testing & Validation

ID     Test Area      Command                          Expected Output             Explanation
T-01   Queue Trigger  curl POST /extract/google-maps   success=true, jobId         Request is accepted async
T-02   DB Insert      Check Mongo Atlas                50 docs saved               Normalized data persisted
T-03   Worker Logs    npx ts-node worker               Processing Job, Completed   Worker fetches & saves
T-04   ENV Load       console.log(process.env)         All keys visible            Secrets loaded
T-05   Redis Ping     memurai-cli ping                 PONG                        Queue backend alive

๐Ÿ” Validation Summary

  • Google Maps → ✔ Data fetched and saved
  • MongoDB Atlas → ✔ Correct database and collection
  • Queue + Redis/Memurai → ✔ Functional
  • Worker → ✔ Async job execution complete
  • API → ✔ Accepting requests & returning job status

🧰 Verification Tools & Commands

  • Postman / Thunder Client
  • memurai-cli → ping
  • MongoDB Compass → visual DB checks
  • Google Cloud API console quotas
  • Browser DevTools for Playwright debug

🧯 Troubleshooting & Debugging

  • No API results? Enable billing and Places API (New)
  • Worker not saving? Ensure MONGO_URI includes DB name
  • Queue Idle? Start Memurai/Redis first
  • Browser blocked? Change User-Agent + use proxy
  • Rate limit reached? Add a delay and a queue limiter (see the sketch below)
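
For the rate-limit case, BullMQ's built-in worker limiter can throttle job throughput (values here are illustrative):

// Throttle the worker: process at most 10 jobs per minute (illustrative values)
import { Worker } from "bullmq";

const throttledWorker = new Worker(
  "extract",
  async (job) => {
    /* scraping logic as in the worker sketch above */
  },
  {
    connection: { host: "localhost", port: 6379 },
    limiter: { max: 10, duration: 60_000 }, // 10 jobs per 60s window
  }
);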

🔒 Security & Secrets

  • Never commit .env to Git
  • Use IP Whitelisting for MongoDB
  • Restrict Google API to IP + Domain
  • Use HTTPS for remote Worker
  • Rotate keys quarterly

โ˜๏ธ Deployment (Vercel)

  • Build as Node backend serverless APIs
  • Store secrets in Vercel's environment variables
  • Use external worker deployment (separate process)
  • MongoDB connects via cloud URI

⚡ Quick-Start Cheat Sheet

npm run dev
npx ts-node src/queue/extractWorker.ts
curl -X POST http://localhost:5000/extract/google-maps
memurai-cli ping

🧾 Usage Notes

  • Extraction is asynchronous
  • Worker must be kept running
  • GMB scraping may require proxies
  • Quota costs apply for Google

🧠 Performance & Optimization

  • Reuse Playwright browser contexts (browserPool)
  • Batch DB writes using bulkWrite (see the sketch below)
  • Normalize schema once, extend via metadata
  • Use distributed workers for scale
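
A sketch of the bulkWrite batching mentioned above (the upsert key, mapsUrl, is an assumed natural identifier, not confirmed by the repository):

// Batch upserts instead of one save() per document; upserting on a natural key
// also keeps repeated scrapes idempotent.
import { Place } from "../database/place.model";

type PlaceInput = { mapsUrl: string } & Record<string, unknown>;

export async function savePlaces(places: PlaceInput[]) {
  if (places.length === 0) return;
  await Place.bulkWrite(
    places.map((p) => ({
      updateOne: {
        filter: { mapsUrl: p.mapsUrl }, // assumed unique per business listing
        update: { $set: p },
        upsert: true,
      },
    })),
    { ordered: false } // continue past individual failures
  );
}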

🌟 Planned Enhancements

  • Admin Dashboard with job tracking
  • CRON scheduling for auto scraping
  • Advanced analytics / Lead scoring
  • CSV / Excel export via API
  • S3 media scraping support

🧩 Maintenance & Future Work

  • Refactor into microservices as scale grows
  • Add billing monitoring for API usage
  • Integrate lead enrichment services

๐Ÿ† Key Achievements

  • Fully functional enterprise scraper
  • Normalized data model
  • Async queues with retries
  • Browser & API integration
  • Production-ready & secure

🧮 High-Level Architecture

┌─────────────┐
│ UI / CRM    │
└─────┬───────┘
      ↓
┌─────────────┐
│ Express API │
└─────┬───────┘
      ↓
┌─────────────┐
│ BullMQ      │
│ Redis       │
└─────┬───────┘
      ↓
┌─────────────────────┐
│ Distributed Workers │
└─────┬───────────────┘
      ↓
┌─────────────────────┐
│ Scraping Engines    │
│ (API + Browser)     │
└─────┬───────────────┘
      ↓
┌─────────────┐
│ MongoDB     │
└─────────────┘

๐Ÿ—‚๏ธ Project Structure

backend/
└── src/
    ├── modules/
    │   ├── google-maps/
    │   ├── google-my-business/
    │   └── web-scraping/
    ├── queue/
    │   ├── queue.ts
    │   └── extractWorker.ts
    ├── database/
    │   └── place.model.ts
    ├── routes/
    │   └── extract.routes.ts
    ├── utils/
    ├── app.ts
    └── server.ts

This modular structure allows each extraction engine to evolve independently while sharing a common queue, database, and orchestration layer.


🧭 How to Demonstrate Live

  1. Run Memurai/Redis
  2. Run the backend API → npm run dev
  3. Run the worker → npx ts-node src/queue/extractWorker.ts
  4. Send a POST request via curl to trigger scraping
  5. Watch documents being ingested into MongoDB Atlas live

💡 Summary, Closure & Compliance

All requirements of the Saubh Tech Campaign data extraction pipeline are met, with future-proof scalability, asynchronous processing, secure cloud storage, and enterprise compliance covering data access, API billing, and storage privacy. The system is production-ready and can be extended to CRM, lead-scoring, or marketing-automation pipelines with minimal changes.
