Commit 483cbc2

add README
1 parent 6f70a0c commit 483cbc2

1 file changed

+291
-0
lines changed

README.md

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
# Haystack x Bright Data Integration

[![PyPI version](https://badge.fury.io/py/haystack-brightdata.svg)](https://badge.fury.io/py/haystack-brightdata)
[![Python Version](https://img.shields.io/pypi/pyversions/haystack-brightdata.svg)](https://pypi.org/project/haystack-brightdata/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Integrate Bright Data's web scraping and data extraction capabilities into your Haystack pipelines. This package provides three Haystack components:

- 🔍 **SERP API** - Search engine results from Google, Bing, Yahoo, and more
- 🌐 **Web Unlocker** - Access geo-restricted and bot-protected websites
- 📊 **Web Scraper** - Extract structured data from 43+ supported websites

## Features

- **Seamless Haystack Integration** - Works natively with Haystack 2.0+ pipelines
- **43+ Supported Datasets** - Extract data from Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more
- **Geo-Targeting** - Access content from specific countries
- **Anti-Bot Bypass** - Automatically handle CAPTCHAs and bot detection
- **Structured Data** - Get clean, structured JSON data ready for RAG pipelines
- **Async Support** - Built-in async support for high-performance applications

## Installation

```bash
pip install haystack-brightdata
```

## Quick Start

### Prerequisites

1. Get your Bright Data API key from [https://brightdata.com/cp/api_access](https://brightdata.com/cp/api_access)
2. Set the environment variable:

```bash
export BRIGHT_DATA_API_KEY="your-api-key-here"
```

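Before constructing any component, it can help to fail fast when the key is missing. The following is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper name chosen for illustration and is not part of `haystack-brightdata`.

```python
import os


def require_api_key(env_var: str = "BRIGHT_DATA_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of haystack-brightdata.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before creating a component."
        )
    return api_key
```

Calling `require_api_key()` at startup surfaces a missing key as one readable error rather than a failure inside the first request.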
### Example 1: SERP Search

```python
from haystack_brightdata import BrightDataSERP

# Initialize the component (the API key is read from BRIGHT_DATA_API_KEY)
serp = BrightDataSERP()

# Execute a search
result = serp.run(
    query="Haystack AI framework tutorials",
    num_results=10,
    country="us"
)

print(result["results"])  # Parsed results as a JSON string
```

### Example 2: Web Unlocker

```python
from haystack_brightdata import BrightDataUnlocker

# Initialize the component
unlocker = BrightDataUnlocker()

# Access a restricted website
result = unlocker.run(
    url="https://example.com",
    country="gb",
    output_format="markdown"
)

print(result["content"])  # Clean markdown content
```

### Example 3: Web Scraper

```python
from haystack_brightdata import BrightDataWebScraper

# Initialize the component
scraper = BrightDataWebScraper()

# Extract Amazon product data
result = scraper.run(
    dataset="amazon_product",
    url="https://www.amazon.com/dp/B08N5WRWNW"
)

print(result["data"])  # Structured data as a JSON string
```

### Example 4: In a Haystack Pipeline

```python
from haystack import Pipeline
from haystack_brightdata import BrightDataSERP

# Create a pipeline
pipeline = Pipeline()
pipeline.add_component("search", BrightDataSERP())

# Run the pipeline
result = pipeline.run({
    "search": {
        "query": "Python web scraping",
        "num_results": 20
    }
})

print(result["search"]["results"])
```

## Components

### BrightDataSERP

Execute search queries across multiple search engines with geo-targeting and result parsing.

**Parameters:**
- `bright_data_api_key` (Optional[str]): API key (defaults to `BRIGHT_DATA_API_KEY` env var)
- `zone` (str): Bright Data zone name (default: "serp")
- `default_search_engine` (str): Default search engine (default: "google")
- `default_country` (str): Default country code (default: "us")
- `default_language` (str): Default language code (default: "en")
- `default_num_results` (int): Default number of results (default: 10)

**Outputs:**
- `results` (str): Search results as a JSON string when `parse_results=True` (the default), otherwise raw HTML

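Because `results` is a string in both cases, downstream code usually has to decode it before use. The sketch below shows one way to do that with the standard library. `decode_results` is a hypothetical helper, and the sample payloads are invented for illustration; the real response fields are not documented here and may differ.

```python
import json
from typing import Any, Union


def decode_results(results: str) -> Union[Any, str]:
    """Decode the `results` string if it is JSON, else return it unchanged.

    Hypothetical helper: the fallback covers the raw-HTML case.
    """
    try:
        return json.loads(results)
    except json.JSONDecodeError:
        return results


# Invented sample payloads for illustration; real field names may differ.
parsed = decode_results(
    '{"organic": [{"title": "Haystack", "link": "https://haystack.deepset.ai/"}]}'
)
raw = decode_results("<html><body>results page</body></html>")
```

Here `parsed` is a Python dictionary, while `raw` stays a string because HTML is not valid JSON.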
### BrightDataUnlocker

Access geo-restricted and bot-protected websites with automatic CAPTCHA solving.

**Parameters:**
- `bright_data_api_key` (Optional[str]): API key (defaults to `BRIGHT_DATA_API_KEY` env var)
- `zone` (str): Bright Data zone name (default: "unlocker")
- `default_country` (str): Default country code (default: "us")
- `default_output_format` (str): Default output format - html, markdown, or screenshot (default: "html")

**Outputs:**
- `content` (str): Web page content in the specified format

### BrightDataWebScraper

Extract structured data from 43+ supported websites.

**Parameters:**
- `bright_data_api_key` (Optional[str]): API key (defaults to `BRIGHT_DATA_API_KEY` env var)
- `default_include_errors` (bool): Include errors in output (default: False)

**Outputs:**
- `data` (str): Structured data as JSON string

**Helper Methods:**
```python
# Get all supported datasets
datasets = BrightDataWebScraper.get_supported_datasets()

# Get info about a specific dataset
info = BrightDataWebScraper.get_dataset_info("amazon_product")
```

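Since `data` arrives as a JSON string, a small normalization step can make it easier to feed into later pipeline stages. The sketch below assumes the payload is either a single JSON object or an array of objects; that shape is an assumption, not something this README specifies, and `to_records` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List


def to_records(data: str) -> List[Dict[str, Any]]:
    """Decode the `data` JSON string into a list of dict records.

    Assumption: the payload is one JSON object or an array of objects.
    Non-dict array items are skipped.
    """
    decoded = json.loads(data)
    if isinstance(decoded, dict):
        return [decoded]
    if isinstance(decoded, list):
        return [item for item in decoded if isinstance(item, dict)]
    raise ValueError(f"Unexpected JSON payload of type {type(decoded).__name__}")
```

With the scraper from Example 3, this would be called as `to_records(result["data"])`.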
## Supported Datasets (43+)

### E-commerce (10)
- Amazon: Products, Reviews, Search, Bestsellers
- Walmart: Products, Seller
- eBay, Home Depot, Zara, Etsy, Best Buy

### LinkedIn (5)
- Person Profile, Company Profile, Job Listings, Posts, People Search

### Social Media (16)
- **Instagram**: Profiles, Posts, Reels, Comments
- **Facebook**: Posts, Marketplace, Company Reviews, Events
- **TikTok**: Profiles, Posts, Shop, Comments
- **YouTube**: Profiles, Videos, Comments
- **X/Twitter**: Posts
- **Reddit**: Posts

### Business Intelligence (2)
- Crunchbase, ZoomInfo

### Search & Commerce (6)
- Google Maps Reviews, Google Shopping, Google Play Store
- Apple App Store, Zillow, Booking.com

### Other (5)
- GitHub, Yahoo Finance, Reuters

[See full dataset list](https://github.com/brightdata/haystack-brightdata#supported-datasets)

## Advanced Usage

### Custom Zone Configuration

```python
serp = BrightDataSERP(zone="my_custom_serp_zone")
```

### Geo-Targeted Search

```python
# Reuses the `serp` component created earlier
result = serp.run(
    query="local restaurants",
    country="fr",  # France
    language="fr",
    num_results=20
)
```

### Multi-Format Web Unlocker

```python
# Reuses the `unlocker` component from Example 2

# Get as markdown
markdown = unlocker.run(url="https://example.com", output_format="markdown")

# Get as screenshot
screenshot = unlocker.run(url="https://example.com", output_format="screenshot")
```

### Dataset-Specific Parameters

```python
# LinkedIn people search
result = scraper.run(
    dataset="linkedin_people_search",
    url="https://www.linkedin.com",
    first_name="John",
    last_name="Doe"
)

# Google Maps reviews (last 7 days)
result = scraper.run(
    dataset="google_maps_reviews",
    url="https://www.google.com/maps/place/...",
    days_limit="7"
)
```

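When several datasets are queried in one job, the per-dataset keyword arguments can be kept as plain dictionaries and passed through in a loop. This is a minimal sketch under two assumptions drawn from the examples above: `run` accepts keyword arguments, and it returns a dictionary with a `data` key. `SupportsRun` and `run_requests` are hypothetical names, not part of the package.

```python
from typing import Any, Dict, Iterable, List, Protocol


class SupportsRun(Protocol):
    """Anything exposing run(**kwargs) -> dict, such as a scraper component."""

    def run(self, **kwargs: Any) -> Dict[str, Any]: ...


def run_requests(
    scraper: SupportsRun, requests: Iterable[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Call scraper.run(**request) per request and tag each output with its dataset."""
    outputs = []
    for request in requests:
        result = scraper.run(**request)
        outputs.append({"dataset": request["dataset"], "data": result["data"]})
    return outputs
```

Each request dictionary carries the same keys as the keyword arguments shown above, for example `{"dataset": "google_maps_reviews", "url": "...", "days_limit": "7"}`.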
## Environment Variables

- `BRIGHT_DATA_API_KEY` - Your Bright Data API key (required)
- `REQUESTS_CA_BUNDLE` - Custom CA bundle for corporate proxies (optional)
- `SSL_CERT_FILE` - Alternative SSL certificate file (optional)

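A mistyped certificate path tends to show up later as an opaque TLS error, so it can be worth checking the optional variables up front. The sketch below is a hypothetical helper (`missing_ca_paths` is not part of the package) that only inspects the two optional variables listed above and reports any that point to a path that does not exist.

```python
import os
from typing import List, Mapping


def missing_ca_paths(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a message for each CA variable that is set to a missing path."""
    problems = []
    for name in ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE"):
        path = env.get(name)
        # Both variables are optional, so unset or empty values are fine.
        if path and not os.path.exists(path):
            problems.append(f"{name} points to a missing path: {path}")
    return problems
```

An empty list means either the variables are unset or every configured path exists.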
## Requirements

- Python >= 3.8
- haystack-ai >= 2.0.0
- pydantic >= 2.0.0
- requests >= 2.28.0
- aiohttp >= 3.8.0

## Examples

See the [examples directory](https://github.com/brightdata/haystack-brightdata/tree/main/examples) for more detailed examples:

- `example_serp.py` - SERP API examples
- `example_unlocker.py` - Web Unlocker examples
- `example_scraper.py` - Web Scraper examples
- `example_pipeline.py` - Pipeline integration examples

## Documentation

- [Bright Data API Documentation](https://docs.brightdata.com/)
- [Haystack Documentation](https://docs.haystack.deepset.ai/)
- [Component API Reference](https://github.com/brightdata/haystack-brightdata#api-reference)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Support

- **Issues**: [GitHub Issues](https://github.com/brightdata/haystack-brightdata/issues)
- **Bright Data Support**: [support@brightdata.com](mailto:support@brightdata.com)
- **Haystack Community**: [Haystack Discord](https://discord.gg/haystack)

## Acknowledgments

- Built for [Haystack](https://haystack.deepset.ai/) by [deepset](https://www.deepset.ai/)
- Powered by [Bright Data](https://brightdata.com/)

---

**Note**: You need a valid Bright Data subscription to use this package. Get started at [brightdata.com](https://brightdata.com/).
