| 1 | +# Haystack x Bright Data Integration |
| 2 | + |
| 3 | +[PyPI version](https://badge.fury.io/py/haystack-brightdata) |
| 4 | +[PyPI package](https://pypi.org/project/haystack-brightdata/) |
| 5 | +[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) |
| 6 | + |
| 7 | +Integrate Bright Data's powerful web scraping and data extraction capabilities into your Haystack pipelines. This package provides three Haystack components for: |
| 8 | + |
| 9 | +- 🔍 **SERP API** - Search engine results from Google, Bing, Yahoo, and more |
| 10 | +- 🌐 **Web Unlocker** - Access geo-restricted and bot-protected websites |
| 11 | +- 📊 **Web Scraper** - Extract structured data from 43+ supported websites |
| 12 | + |
| 13 | +## Features |
| 14 | + |
| 15 | +- **Seamless Haystack Integration** - Works natively with Haystack 2.0+ pipelines |
| 16 | +- **43+ Supported Datasets** - Extract data from Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more |
| 17 | +- **Geo-Targeting** - Access content from specific countries |
| 18 | +- **Anti-Bot Bypass** - Automatically handle CAPTCHAs and bot detection |
| 19 | +- **Structured Data** - Get clean, structured JSON data ready for RAG pipelines |
| 20 | +- **Async Support** - Built-in async support for high-performance applications |
| 21 | + |
| 22 | +## Installation |
| 23 | + |
| 24 | +```bash |
| 25 | +pip install haystack-brightdata |
| 26 | +``` |
| 27 | + |
| 28 | +## Quick Start |
| 29 | + |
| 30 | +### Prerequisites |
| 31 | + |
| 32 | +1. Get your Bright Data API key from [https://brightdata.com/cp/api_access](https://brightdata.com/cp/api_access) |
| 33 | +2. Set the environment variable: |
| 34 | + |
| 35 | +```bash |
| 36 | +export BRIGHT_DATA_API_KEY="your-api-key-here" |
| 37 | +``` |
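Before constructing any component, it can help to fail fast when the key is missing. The following is a minimal sketch using only the standard library; the helper name `require_api_key` is hypothetical and not part of this package.

```python
import os


def require_api_key(var_name: str = "BRIGHT_DATA_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    # Treat an unset variable and a blank value the same way.
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before creating a Bright Data component."
        )
    return key
```

Calling `require_api_key()` once at startup surfaces a configuration problem before any request is made.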
| 38 | + |
| 39 | +### Example 1: SERP Search |
| 40 | + |
| 41 | +```python |
| 42 | +from haystack_brightdata import BrightDataSERP |
| 43 | + |
| 44 | +# Initialize the component |
| 45 | +serp = BrightDataSERP() |
| 46 | + |
| 47 | +# Execute a search |
| 48 | +result = serp.run( |
| 49 | +    query="Haystack AI framework tutorials", |
| 50 | +    num_results=10, |
| 51 | +    country="us" |
| 52 | +) |
| 53 | + |
| 54 | +print(result["results"])  # Parsed results, returned as a JSON string |
| 55 | +``` |
| 56 | + |
| 57 | +### Example 2: Web Unlocker |
| 58 | + |
| 59 | +```python |
| 60 | +from haystack_brightdata import BrightDataUnlocker |
| 61 | + |
| 62 | +# Initialize the component |
| 63 | +unlocker = BrightDataUnlocker() |
| 64 | + |
| 65 | +# Access a restricted website |
| 66 | +result = unlocker.run( |
| 67 | +    url="https://example.com", |
| 68 | +    country="gb", |
| 69 | +    output_format="markdown" |
| 70 | +) |
| 71 | + |
| 72 | +print(result["content"]) # Clean markdown content |
| 73 | +``` |
| 74 | + |
| 75 | +### Example 3: Web Scraper |
| 76 | + |
| 77 | +```python |
| 78 | +from haystack_brightdata import BrightDataWebScraper |
| 79 | + |
| 80 | +# Initialize the component |
| 81 | +scraper = BrightDataWebScraper() |
| 82 | + |
| 83 | +# Extract Amazon product data |
| 84 | +result = scraper.run( |
| 85 | +    dataset="amazon_product", |
| 86 | +    url="https://www.amazon.com/dp/B08N5WRWNW" |
| 87 | +) |
| 88 | + |
| 89 | +print(result["data"])  # Structured data, returned as a JSON string |
| 90 | +``` |
| 91 | + |
| 92 | +### Example 4: In a Haystack Pipeline |
| 93 | + |
| 94 | +```python |
| 95 | +from haystack import Pipeline |
| 96 | +from haystack_brightdata import BrightDataSERP |
| 97 | + |
| 98 | +# Create a pipeline |
| 99 | +pipeline = Pipeline() |
| 100 | +pipeline.add_component("search", BrightDataSERP()) |
| 101 | + |
| 102 | +# Run the pipeline |
| 103 | +result = pipeline.run({ |
| 104 | +    "search": { |
| 105 | +        "query": "Python web scraping", |
| 106 | +        "num_results": 20 |
| 107 | +    } |
| 108 | +}) |
| 109 | + |
| 110 | +print(result["search"]["results"]) |
| 111 | +``` |
| 112 | + |
| 113 | +## Components |
| 114 | + |
| 115 | +### BrightDataSERP |
| 116 | + |
| 117 | +Execute search queries across multiple search engines with geo-targeting and result parsing. |
| 118 | + |
| 119 | +**Parameters:** |
| 120 | +- `bright_data_api_key` (Optional[str]): API key (defaults to `BRIGHT_DATA_API_KEY` env var) |
| 121 | +- `zone` (str): Bright Data zone name (default: "serp") |
| 122 | +- `default_search_engine` (str): Default search engine (default: "google") |
| 123 | +- `default_country` (str): Default country code (default: "us") |
| 124 | +- `default_language` (str): Default language code (default: "en") |
| 125 | +- `default_num_results` (int): Default number of results (default: 10) |
| 126 | + |
| 127 | +**Outputs:** |
| 128 | +- `results` (str): Search results as JSON string (when `parse_results=True`, default) or raw HTML |
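Because `results` is a string, it has to be decoded before individual fields can be read. The sketch below uses only the standard `json` module and returns the input unchanged when it is not JSON (for example, raw HTML). The helper name `parse_results` and the `organic` key in the sample payload are assumptions for illustration; the real response schema may differ.

```python
import json
from typing import Any


def parse_results(raw: str) -> Any:
    """Decode a JSON string; return the input unchanged if it is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON (e.g. raw HTML), so hand the original string back.
        return raw


# Hypothetical payload for illustration only; not the documented schema.
sample = '{"organic": [{"title": "Haystack", "link": "https://haystack.deepset.ai/"}]}'
parsed = parse_results(sample)
```

The same helper can be applied to the `data` output of `BrightDataWebScraper`, which is also described as a JSON string.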
| 129 | + |
| 130 | +### BrightDataUnlocker |
| 131 | + |
| 132 | +Access geo-restricted and bot-protected websites with automatic CAPTCHA solving. |
| 133 | + |
| 134 | +**Parameters:** |
| 135 | +- `bright_data_api_key` (Optional[str]): API key (defaults to `BRIGHT_DATA_API_KEY` env var) |
| 136 | +- `zone` (str): Bright Data zone name (default: "unlocker") |
| 137 | +- `default_country` (str): Default country code (default: "us") |
| 138 | +- `default_output_format` (str): Default output format - html, markdown, or screenshot (default: "html") |
| 139 | + |
| 140 | +**Outputs:** |
| 141 | +- `content` (str): Web page content in the specified format |
| 142 | + |
| 143 | +### BrightDataWebScraper |
| 144 | + |
| 145 | +Extract structured data from 43+ supported websites. |
| 146 | + |
| 147 | +**Parameters:** |
| 148 | +- `bright_data_api_key` (Optional[str]): API key (defaults to `BRIGHT_DATA_API_KEY` env var) |
| 149 | +- `default_include_errors` (bool): Include errors in output (default: False) |
| 150 | + |
| 151 | +**Outputs:** |
| 152 | +- `data` (str): Structured data as JSON string |
| 153 | + |
| 154 | +**Helper Methods:** |
| 155 | +```python |
| 156 | +# Get all supported datasets |
| 157 | +datasets = BrightDataWebScraper.get_supported_datasets() |
| 158 | + |
| 159 | +# Get info about a specific dataset |
| 160 | +info = BrightDataWebScraper.get_dataset_info("amazon_product") |
| 161 | +``` |
| 162 | + |
| 163 | +## Supported Datasets (43+) |
| 164 | + |
| 165 | +### E-commerce (10) |
| 166 | +- Amazon: Products, Reviews, Search, Bestsellers |
| 167 | +- Walmart: Products, Seller |
| 168 | +- eBay, Home Depot, Zara, Etsy, Best Buy |
| 169 | + |
| 170 | +### LinkedIn (5) |
| 171 | +- Person Profile, Company Profile, Job Listings, Posts, People Search |
| 172 | + |
| 173 | +### Social Media (16) |
| 174 | +- **Instagram**: Profiles, Posts, Reels, Comments |
| 175 | +- **Facebook**: Posts, Marketplace, Company Reviews, Events |
| 176 | +- **TikTok**: Profiles, Posts, Shop, Comments |
| 177 | +- **YouTube**: Profiles, Videos, Comments |
| 178 | +- **X/Twitter**: Posts |
| 179 | +- **Reddit**: Posts |
| 180 | + |
| 181 | +### Business Intelligence (2) |
| 182 | +- Crunchbase, ZoomInfo |
| 183 | + |
| 184 | +### Search & Commerce (6) |
| 185 | +- Google Maps Reviews, Google Shopping, Google Play Store |
| 186 | +- Apple App Store, Zillow, Booking.com |
| 187 | + |
| 188 | +### Other (5) |
| 189 | +- GitHub, Yahoo Finance, Reuters |
| 190 | + |
| 191 | +[See full dataset list](https://github.com/brightdata/haystack-brightdata#supported-datasets) |
| 192 | + |
| 193 | +## Advanced Usage |
| 194 | + |
| 195 | +### Custom Zone Configuration |
| 196 | + |
| 197 | +```python |
| 198 | +serp = BrightDataSERP(zone="my_custom_serp_zone") |
| 199 | +``` |
| 200 | + |
| 201 | +### Geo-Targeted Search |
| 202 | + |
| 203 | +```python |
| 204 | +result = serp.run( |
| 205 | +    query="local restaurants", |
| 206 | +    country="fr",  # France |
| 207 | +    language="fr", |
| 208 | +    num_results=20 |
| 209 | +) |
| 210 | +``` |
| 211 | + |
| 212 | +### Multi-Format Web Unlocker |
| 213 | + |
| 214 | +```python |
| 215 | +# Get as markdown |
| 216 | +markdown = unlocker.run(url="https://example.com", output_format="markdown") |
| 217 | + |
| 218 | +# Get as screenshot |
| 219 | +screenshot = unlocker.run(url="https://example.com", output_format="screenshot") |
| 220 | +``` |
| 221 | + |
| 222 | +### Dataset-Specific Parameters |
| 223 | + |
| 224 | +```python |
| 225 | +# LinkedIn people search |
| 226 | +result = scraper.run( |
| 227 | +    dataset="linkedin_people_search", |
| 228 | +    url="https://www.linkedin.com", |
| 229 | +    first_name="John", |
| 230 | +    last_name="Doe" |
| 231 | +) |
| 232 | + |
| 233 | +# Google Maps reviews (last 7 days) |
| 234 | +result = scraper.run( |
| 235 | +    dataset="google_maps_reviews", |
| 236 | +    url="https://www.google.com/maps/place/...", |
| 237 | +    days_limit="7" |
| 238 | +) |
| 239 | +``` |
| 240 | + |
| 241 | +## Environment Variables |
| 242 | + |
| 243 | +- `BRIGHT_DATA_API_KEY` - Your Bright Data API key (required) |
| 244 | +- `REQUESTS_CA_BUNDLE` - Custom CA bundle for corporate proxies (optional) |
| 245 | +- `SSL_CERT_FILE` - Alternative SSL certificate file (optional) |
| 246 | + |
| 247 | +## Requirements |
| 248 | + |
| 249 | +- Python >= 3.8 |
| 250 | +- haystack-ai >= 2.0.0 |
| 251 | +- pydantic >= 2.0.0 |
| 252 | +- requests >= 2.28.0 |
| 253 | +- aiohttp >= 3.8.0 |
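To see which of the listed dependencies are present in the current environment, a short standard-library check along these lines may be useful. It is a sketch: the helper name `installed_versions` is hypothetical, and it only reports installed versions without comparing them against the minimums above.

```python
import sys
from importlib import metadata  # part of the standard library since Python 3.8

# Distribution names taken from the requirements list above.
REQUIRED = ["haystack-ai", "pydantic", "requests", "aiohttp"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'not installed'}")
```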
| 254 | + |
| 255 | +## Examples |
| 256 | + |
| 257 | +Check out the [examples directory](https://github.com/brightdata/haystack-brightdata/tree/main/examples) for more detailed examples: |
| 258 | + |
| 259 | +- `example_serp.py` - SERP API examples |
| 260 | +- `example_unlocker.py` - Web Unlocker examples |
| 261 | +- `example_scraper.py` - Web Scraper examples |
| 262 | +- `example_pipeline.py` - Pipeline integration examples |
| 263 | + |
| 264 | +## Documentation |
| 265 | + |
| 266 | +- [Bright Data API Documentation](https://docs.brightdata.com/) |
| 267 | +- [Haystack Documentation](https://docs.haystack.deepset.ai/) |
| 268 | +- [Component API Reference](https://github.com/brightdata/haystack-brightdata#api-reference) |
| 269 | + |
| 270 | +## Contributing |
| 271 | + |
| 272 | +Contributions are welcome! Please feel free to submit a Pull Request. |
| 273 | + |
| 274 | +## License |
| 275 | + |
| 276 | +This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. |
| 277 | + |
| 278 | +## Support |
| 279 | + |
| 280 | +- **Issues**: [GitHub Issues](https://github.com/brightdata/haystack-brightdata/issues) |
| 281 | +- **Bright Data Support**: [support@brightdata.com](mailto:support@brightdata.com) |
| 282 | +- **Haystack Community**: [Haystack Discord](https://discord.gg/haystack) |
| 283 | + |
| 284 | +## Acknowledgments |
| 285 | + |
| 286 | +- Built for [Haystack](https://haystack.deepset.ai/) by [deepset](https://www.deepset.ai/) |
| 287 | +- Powered by [Bright Data](https://brightdata.com/) |
| 288 | + |
| 289 | +--- |
| 290 | + |
| 291 | +**Note**: You need a valid Bright Data subscription to use this package. Get started at [brightdata.com](https://brightdata.com/). |