Cloud-native data extraction and business intelligence platform that aggregates and normalizes high-volume lead data from Google Maps APIs, Google My Business via Playwright, and arbitrary websites using distributed BullMQ queues, Redis, MongoDB Atlas, and TypeScript.
This project implements a high-performance, production-grade, unified data extraction system capable of scraping, parsing, normalizing, and storing business-related data from:
- Google Maps Places API
- Google My Business Keyword Search (Browser Automation)
- General Web Data Extraction (Dynamic & Static Websites)
The system supports concurrency, scaling, structured storage, job monitoring, and async processing using BullMQ + Redis. It stores normalized business entities in MongoDB Atlas and exposes a REST API for job orchestration, analytics, and future CRM integration for Saubh Tech Campaign automation and lead enrichment.
A scalable backend system that extracts and normalizes business details like name, address, phone, website, ratings, maps links, reviews, geo-coordinates, categories, operational data, and metadata. Uses APIs + Browser Automation + Web Scraping with clean storage and async processing.
- Extract high-quality lead/business data
- Process scraping asynchronously using a Queue Worker
- Normalize heterogeneous sources into a single `Place` model
- Provide REST APIs for CRM ingestion
- Enable scaling via distributed workers & Redis
- Google Maps data extracted & saved in MongoDB
- GMB keyword scraping operational via Playwright Worker
- Web scraping support for business websites
- No blocking or synchronous bottlenecks
- Queue must retry failed jobs
- MongoDB must store the normalized structure (a `Place` model sketch follows this list)
- Testable via curl, Postman, or API calls
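A minimal sketch of what the normalized model in `place.model.ts` could look like; the exact field list is an assumption based on the data points named above, not the project's actual schema:

```typescript
// src/database/place.model.ts — illustrative sketch; field names are assumptions
import { Schema, model } from 'mongoose';

const placeSchema = new Schema(
  {
    name: { type: String, required: true },
    address: String,
    phone: String,
    website: String,
    rating: Number,
    reviewCount: Number,
    mapsUrl: String,
    categories: [String],
    location: { lat: Number, lng: Number },
    source: { type: String, enum: ['google-maps', 'google-my-business', 'web-scrape'] },
    metadata: Schema.Types.Mixed, // source-specific extras without schema changes
  },
  { timestamps: true }
);

export const Place = model('Place', placeSchema);
```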
- Node.js: >= 18.x
- TypeScript: >= 5.x
- MongoDB Atlas account
- Google Maps API Key (Billing Enabled)
- Redis Server or Memurai (Windows)
- Playwright (Installed Automatically)
Clone and install:

```bash
git clone <repo_url>
cd backend
npm install
npx playwright install
```

Create a `.env` file:

```env
PORT=5000
REDIS_HOST=localhost
REDIS_PORT=6379
MONGO_URI=mongodb+srv://...
GOOGLE_MAPS_API_KEY=xxxx
```
Start the API server and the queue worker in separate terminals:

```bash
npm run dev
npx ts-node src/queue/extractWorker.ts
```
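For orientation, a stripped-down sketch of what `extractWorker.ts` might contain; the queue name, payload shape, and dispatch logic are assumptions for illustration:

```typescript
// src/queue/extractWorker.ts — simplified sketch; the 'extract' queue name is assumed
import { Worker } from 'bullmq';

const connection = {
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
};

const worker = new Worker(
  'extract',
  async (job) => {
    console.log(`Processing job ${job.id} (${job.name})`);
    // Dispatch to the right engine based on the payload, e.g.
    // if (job.data.source === 'google-maps') await scrapeGoogleMaps(job.data);
  },
  { connection, concurrency: 5 } // process up to 5 jobs in parallel per worker
);

worker.on('completed', (job) => console.log(`Completed ${job.id}`));
worker.on('failed', (job, err) => console.error(`Failed ${job?.id}: ${err.message}`));
```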
| Method | Endpoint | Description |
|---|---|---|
| POST | /extract/google-maps | Triggers Google Places API extraction |
| POST | /extract/google-my-business | Triggers Playwright-based GMB scraping |
| POST | /extract/web-scrape | Triggers generic website extraction |
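As a usage sketch, any TypeScript client (or the future CRM UI) could trigger a job like this; the request body fields are assumptions, so adjust them to the actual route contract:

```typescript
// Trigger a Google Maps extraction job — the keyword/location fields are assumed
const res = await fetch('http://localhost:5000/extract/google-maps', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ keyword: 'coffee shops', location: 'Mumbai' }),
});

const { jobId } = await res.json(); // expect 202 Accepted with a job id
console.log('Queued job:', jobId);
```

Because extraction is asynchronous, the response only confirms queuing; results land in MongoDB once the worker finishes.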
Client → Express API → BullMQ Queue → Worker → MongoDB Atlas
The frontend is intended for CRM visualization and trigger control. This backend supports any UI (Next.js recommended).
- Dashboard → trigger extraction jobs
- Display result table linked to MongoDB
- Filter by keyword, listId, location
- Network calls via fetch or axios
- Styling through Tailwind / MUI
- 200 OK: success / data found
- 202 Accepted: job queued (see the handler sketch after this list)
- 400 Bad Request: invalid input
- 401/403 Unauthorized / Forbidden: missing or invalid API key
- 404 Not Found: invalid resource
- 500 Internal Server Error: server or worker failure
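The 202 path is the key pattern: the API never scrapes synchronously, it only validates and enqueues. A minimal handler sketch (route shape and payload names are illustrative, and it assumes `express.json()` is applied upstream):

```typescript
// src/routes/extract.routes.ts — sketch of the async 202 pattern
import { Router } from 'express';
import { Queue } from 'bullmq';

const router = Router();
const extractQueue = new Queue('extract', {
  connection: {
    host: process.env.REDIS_HOST ?? 'localhost',
    port: Number(process.env.REDIS_PORT ?? 6379),
  },
});

router.post('/extract/google-maps', async (req, res) => {
  const { keyword, location } = req.body;
  if (!keyword || !location) {
    return res.status(400).json({ success: false, error: 'keyword and location are required' });
  }
  const job = await extractQueue.add('google-maps', { source: 'google-maps', keyword, location });
  return res.status(202).json({ success: true, jobId: job.id }); // queued, not yet processed
});

export default router;
```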
This platform is designed as a full-scale enterprise data intelligence system that transforms raw web signals into structured, CRM-ready business entities.
- Multi-Source Data Mining: simultaneously extracts data from the Google Maps API, Google My Business via Playwright, and generic websites.
- Distributed Job Processing: uses BullMQ with Redis to process millions of extraction jobs asynchronously.
- High-Volume Lead Generation: built for bulk keyword-, location-, and category-based business discovery.
- Unified Data Model: normalizes heterogeneous data into a single `Place` schema.
- Fault Tolerance: automatic retries, job recovery, and dead-letter queues (see the retry sketch after this list).
- Cloud Native: works on Vercel, Docker, Kubernetes, and serverless platforms.
- CRM & Analytics Ready: outputs structured data for enrichment, scoring, and sales pipelines.
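The fault-tolerance bullet maps onto BullMQ job options; a sketch of queue-level retry defaults, with illustrative numbers (BullMQ retains exhausted jobs in its failed set, which serves as the dead-letter store here):

```typescript
// Queue-level retry defaults — the specific numbers are assumptions
import { Queue } from 'bullmq';

const extractQueue = new Queue('extract', {
  connection: { host: 'localhost', port: 6379 },
  defaultJobOptions: {
    attempts: 3,                                   // retry each failed job up to 3 times
    backoff: { type: 'exponential', delay: 5000 }, // wait 5s, 10s, 20s between attempts
    removeOnComplete: 1000,                        // keep only the last 1000 completed jobs
    removeOnFail: false,                           // retain failed jobs for inspection
  },
});
```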
```
[Raw Web Data]
       ↓
[APIs + Browser Automation]
       ↓
[Normalized Business Intelligence]
```
| Layer | Technology | Role |
|---|---|---|
| API Gateway | Express.js + TypeScript | Accepts extraction requests |
| Queue | BullMQ + Redis | Distributed job orchestration |
| Scraping | Google Places API, Playwright | Data acquisition |
| Processing | Node.js Workers | Normalization & validation |
| Storage | MongoDB Atlas | Unified business records |
| Cloud | Vercel, Atlas, Redis Cloud | Production deployment |
```
[Client]
    ↓
[Express API]
    ↓
[BullMQ Queue] → [Redis]
    ↓
[Distributed Workers]
    ↓
[Scraping Engines]
    ↓
[MongoDB Atlas]
```
1. User submits keyword + location
2. Express validates and queues the job
3. Redis stores the job metadata
4. A worker picks up the job
5. API / browser scraping runs
6. Data is normalized to the `Place` schema (see the normalizer sketch below)
7. MongoDB Atlas stores the results
8. Job status is updated
9. CRM or dashboard consumes the data
[User] → [API Request] → [Queue] → [Worker] → [Scrapers] → [Normalizer] → [Database]
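Step 6 is where the heterogeneous sources converge; a sketch of mapping a raw result onto the unified shape. The raw field names follow the legacy Places API response (the newer Places API uses different names), and the target fields are assumptions:

```typescript
// Map a raw (legacy) Google Places API result onto the unified shape.
// Target field names are illustrative assumptions.
interface NormalizedPlace {
  name: string;
  address?: string;
  rating?: number;
  reviewCount?: number;
  location?: { lat: number; lng: number };
  source: string;
}

function normalizeGooglePlace(raw: any): NormalizedPlace {
  return {
    name: raw.name,
    address: raw.formatted_address,
    rating: raw.rating,
    reviewCount: raw.user_ratings_total,
    location: raw.geometry?.location, // { lat, lng } in the legacy response
    source: 'google-maps',
  };
}
```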
| ID | Test Area | Command | Expected Output | Explanation |
|---|---|---|---|---|
| T-01 | Queue Trigger | curl POST /extract/google-maps | success=true, jobId | Request is accepted async |
| T-02 | DB Insert | Check Mongo Atlas | 50 docs saved | Normalized data persisted |
| T-03 | Worker Logs | npx ts-node worker | Processing Job, Completed | Worker fetches & saves |
| T-04 | ENV Load | console.log(process.env) | All keys visible | Secrets loaded |
| T-05 | Redis Ping | memurai-cli ping | PONG | Queue backend alive |
- Google Maps → ✅ data fetched and saved
- MongoDB Atlas → ✅ correct database and collection
- Queue + Redis/Memurai → ✅ functional
- Worker → ✅ async job execution complete
- API → ✅ accepting requests & returning job status
- Postman / Thunder Client
- `memurai-cli ping`
- MongoDB Compass → visual DB checks
- Google Cloud API console for quotas
- Browser DevTools for Playwright debugging
- No API results? Enable billing and the Places API (New)
- Worker not saving? Ensure `MONGO_URI` includes the database name
- Queue idle? Start Memurai/Redis first
- Browser blocked? Change the User-Agent and use a proxy (see the sketch below)
- Rate limit reached? Add delays and a rate limiter
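For the "browser blocked" case, a sketch of launching Playwright with a proxy and a desktop User-Agent; the proxy address is a placeholder assumption:

```typescript
// Launch Chromium behind a proxy with a desktop User-Agent to reduce blocking.
import { chromium } from 'playwright';

async function openPage(url: string) {
  const browser = await chromium.launch({
    headless: true,
    proxy: { server: 'http://proxy.example.com:8080' }, // placeholder — use a real proxy
  });
  const context = await browser.newContext({
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  });
  const page = await context.newPage();
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  return { browser, page }; // caller closes the browser when done
}
```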
- Never commit `.env` to Git
- Use IP whitelisting for MongoDB
- Restrict the Google API key to IP + domain
- Use HTTPS for the remote worker
- Rotate keys quarterly
- Build as Node backend serverless APIs
- Store secrets in Vercel's environment variables
- Use external worker deployment (separate process)
- MongoDB connects via cloud URI
```bash
npm run dev
npx ts-node src/queue/extractWorker.ts
curl -X POST http://localhost:5000/extract/google-maps
memurai-cli ping
```
- Extraction is asynchronous
- Worker must be kept running
- GMB scraping may require proxies
- Quota costs apply for Google
- Reuse Playwright browser contexts (`browserPool`)
- Batch DB writes using `bulkWrite` (see the sketch after this list)
- Normalize the schema once, extend via metadata
- Use distributed workers for scale
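A sketch of the `bulkWrite` tip using upserts; treating `mapsUrl` as the dedup key is an assumption, so substitute whatever uniquely identifies a place:

```typescript
// Batch upserts instead of one save() per document.
import { Place } from '../database/place.model'; // assumed model path

async function savePlaces(places: Array<Record<string, unknown> & { mapsUrl: string }>) {
  if (places.length === 0) return;

  await Place.bulkWrite(
    places.map((p) => ({
      updateOne: {
        filter: { mapsUrl: p.mapsUrl }, // dedup key — an assumption
        update: { $set: p },
        upsert: true,                   // insert new places, update existing ones
      },
    }))
  );
}
```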
- Admin Dashboard with job tracking
- CRON scheduling for auto scraping
- Advanced analytics / Lead scoring
- CSV / Excel export via API
- S3 media scraping support
- Refactor into microservices if scaling grows
- Add billing monitoring for API usage
- Integrate lead enrichment services
- Fully functional enterprise scraper
- Normalized data model
- Async queues with retries
- Browser & API integration
- Production-ready & secure
```
┌─────────────┐
│  UI / CRM   │
└──────┬──────┘
       │
┌─────────────┐
│ Express API │
└──────┬──────┘
       │
┌─────────────┐
│   BullMQ    │
│    Redis    │
└──────┬──────┘
       │
┌─────────────────────┐
│ Distributed Workers │
└──────┬──────────────┘
       │
┌─────────────────────┐
│  Scraping Engines   │
│   (API + Browser)   │
└──────┬──────────────┘
       │
┌─────────────┐
│   MongoDB   │
└─────────────┘
```
```
backend/
└── src/
    ├── modules/
    │   ├── google-maps/
    │   ├── google-my-business/
    │   └── web-scraping/
    ├── queue/
    │   ├── queue.ts
    │   └── extractWorker.ts
    ├── database/
    │   └── place.model.ts
    ├── routes/
    │   └── extract.routes.ts
    ├── utils/
    ├── app.ts
    └── server.ts
```
This modular structure allows each extraction engine to evolve independently while sharing a common queue, database, and orchestration layer.
- Run Memurai/Redis
- Run the backend API → `npm run dev`
- Run the worker → `npx ts-node src/queue/extractWorker.ts`
- Send a POST request via curl to trigger scraping
- Show MongoDB Atlas document ingestion live
All requirements of the Saubh Tech Campaign data extraction pipeline are met, with future-proof scalability, asynchronous processing, secure cloud storage, and enterprise compliance for data access, API billing, and storage privacy. The system is production-ready and extendable to CRM, lead scoring, or marketing automation pipelines with minimal changes.