A PHP package that fetches product information from any URL and returns structured data. It parses JSON-LD structured data, Open Graph meta tags, and HTML image elements to extract product details like name, description, price, and images.
composer require givetwice/product-info-fetcheruse GiveTwice\ProductInfoFetcher\ProductInfoFetcher;
$product = (new ProductInfoFetcher('https://example.com/product'))
->fetchAndParse();
// Core fields
echo $product->name; // "iPhone 15 Pro"
echo $product->description; // "The latest iPhone with A17 Pro chip"
echo $product->priceInCents; // 99900
echo $product->priceCurrency; // "USD"
echo $product->url; // "https://example.com/product"
echo $product->imageUrl; // "https://example.com/images/iphone.jpg"
echo $product->allImageUrls; // ["https://...", "https://..."] (all found images)
// Additional fields
echo $product->brand; // "Apple"
echo $product->sku; // "IPHONE15PRO-256"
echo $product->gtin; // "0194253392200"
echo $product->availability; // ProductAvailability::InStock
echo $product->condition; // ProductCondition::New
echo $product->rating; // 4.8
echo $product->reviewCount; // 1250
// For display purposes
echo $product->getFormattedPrice(); // "USD 999.00"Prices are stored as integers in cents to avoid floating-point precision issues. This follows the same approach used by payment systems like Stripe.
$product->priceInCents; // 139099 (integer)
$product->priceCurrency; // "EUR" (ISO 4217 currency code)
// For display
$product->getFormattedPrice(); // "EUR 1390.99"
// For calculations (no floating-point issues)
$total = $product->priceInCents * $quantity;
$displayPrice = number_format($total / 100, 2);The parser normalizes various price formats:
- String prices:
"999.00"→99900 - Integer prices:
1479→147900 - European format:
"1.234,56"→123456
The availability and condition fields return enum instances:
use GiveTwice\ProductInfoFetcher\Enum\ProductAvailability;
use GiveTwice\ProductInfoFetcher\Enum\ProductCondition;
// Availability values
ProductAvailability::InStock
ProductAvailability::OutOfStock
ProductAvailability::PreOrder
ProductAvailability::BackOrder
ProductAvailability::Discontinued
// Condition values
ProductCondition::New
ProductCondition::Used
ProductCondition::Refurbished
ProductCondition::Damaged
// Usage
if ($product->availability === ProductAvailability::InStock) {
// Product is available
}
// Get string value
$product->availability?->value; // "InStock"The imageUrl field contains the primary image (first found). The allImageUrls array contains all unique images found across all sources (JSON-LD, meta tags, and HTML image elements):
$product->imageUrl; // Primary image (first found)
$product->allImageUrls; // All unique images from all sources
// Example: different resolutions from different sources
// [0] "http://example.com/product-370x370.jpg" (from JSON-LD)
// [1] "//example.com/product-big.jpg" (from og:image)This is useful when sources provide different image sizes or when you want fallback options.
$product = (new ProductInfoFetcher('https://example.com/product'))
->setUserAgent('MyApp/1.0 (https://myapp.com)')
->setTimeout(10)
->setConnectTimeout(5)
->setAcceptLanguage('nl-BE,nl;q=0.9,en;q=0.8')
->fetchAndParse();For sites with stricter bot detection, you can add extra HTTP headers to mimic a real browser:
$product = (new ProductInfoFetcher('https://example.com/product'))
->withExtraHeaders([
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Cache-Control' => 'no-cache',
'DNT' => '1',
'Sec-CH-UA' => '"Google Chrome";v="131", "Chromium";v="131"',
'Sec-CH-UA-Mobile' => '?0',
'Sec-CH-UA-Platform' => '"macOS"',
'Sec-Fetch-Dest' => 'document',
'Sec-Fetch-Mode' => 'navigate',
'Sec-Fetch-Site' => 'none',
'Sec-Fetch-User' => '?1',
'Upgrade-Insecure-Requests' => '1',
])
->fetchAndParse();Extra headers are merged with defaults and can override them. Multiple withExtraHeaders() calls can be chained.
Route requests through an HTTP proxy:
$product = (new ProductInfoFetcher('https://example.com/product'))
->viaProxy('http://proxy.example.com:3128')
->fetchAndParse();
// With authentication
$product = (new ProductInfoFetcher('https://example.com/product'))
->viaProxy('http://username:password@proxy.example.com:3128')
->fetchAndParse();The proxy is used for both regular HTTP requests (via Guzzle/cURL) and headless browser requests (via Chrome's --proxy-server flag). Proxy authentication is handled automatically.
Some sites use advanced bot protection (Akamai, Cloudflare) that blocks simple HTTP requests. For these sites, you can use a headless Chrome browser via Puppeteer to fetch the page.
Prefer Headless (recommended for known-protected sites):
$product = (new ProductInfoFetcher('https://example.com/product'))
->preferHeadless()
->fetchAndParse();Use preferHeadless() when you know the site has bot protection. This skips the HTTP request entirely and uses headless Chrome directly. This is more efficient because attempting an HTTP request first can get your IP flagged, causing subsequent headless requests to also fail.
Fallback Mode:
$product = (new ProductInfoFetcher('https://example.com/product'))
->enableHeadlessFallback()
->fetchAndParse();When enabled, the fetcher will:
- First attempt a normal HTTP request
- If blocked with a 403 status, automatically retry using a headless Chrome browser
- Parse the resulting HTML as usual
Use enableHeadlessFallback() when you're unsure if the site has bot protection and want to try the faster HTTP request first.
Requirements:
To use the headless fallback, you need Node.js 18+ and Puppeteer installed:
npm install puppeteer@^23.0 puppeteer-extra@^3.3 puppeteer-extra-plugin-stealth@^2.11On Linux servers, you may also need system dependencies for headless Chrome:
# Ubuntu 24.04+
apt-get install -y libnss3 libatk1.0-0t64 libatk-bridge2.0-0t64 \
libcups2t64 libdrm2 libxkbcommon0 libxcomposite1 \
libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2t64 \
libpango-1.0-0 libcairo2
# Ubuntu 22.04 and earlier
apt-get install -y libnss3 libatk1.0-0 libatk-bridge2.0-0 \
libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2Custom Paths:
$product = (new ProductInfoFetcher('https://example.com/product'))
->enableHeadlessFallback()
->setNodeBinary('/usr/local/bin/node')
->setChromePath('/usr/bin/chromium')
->fetchAndParse();$fetcher = new ProductInfoFetcher('https://example.com/product');
$fetcher->fetch();
$product = $fetcher->parse();$product = (new ProductInfoFetcher())
->setHtml($html)
->parse();$product = (new ProductInfoFetcher($url))->fetchAndParse();
$data = $product->toArray();
// [
// 'name' => 'iPhone 15 Pro',
// 'description' => 'The latest iPhone...',
// 'url' => 'https://example.com/product',
// 'priceInCents' => 99900,
// 'priceCurrency' => 'USD',
// 'imageUrl' => 'https://example.com/image.jpg',
// 'allImageUrls' => ['https://...', 'https://...'],
// 'brand' => 'Apple',
// 'sku' => 'IPHONE15PRO-256',
// 'gtin' => '0194253392200',
// 'availability' => 'InStock',
// 'condition' => 'New',
// 'rating' => 4.8,
// 'reviewCount' => 1250,
// ]if ($product->isComplete()) {
// Product has name and description
}The package attempts to extract product information in the following order:
- JSON-LD - Looks for
<script type="application/ld+json">with@type: Productor@type: ProductGroup - Meta Tags - Falls back to Open Graph (
og:), Twitter Cards (twitter:), and standard meta tags - HTML Images - Extracts product images directly from
<img>elements using common patterns (Amazon'slandingImage, product image classes, data attributes)
If the first parser returns complete data (name and description), it returns immediately. Otherwise, it merges results from multiple parsers. Images from all three sources are always combined.
- schema.org Product - Standard product markup including
offers,brand,sku,gtin,aggregateRating - schema.org ProductGroup - Product variants (e.g., bol.com) with
hasVariant[] - Open Graph -
og:title,og:description,og:image,product:price:amount,product:price:currency,product:availability,product:condition
Both short ("@type": "Product") and full URL ("@type": "http://schema.org/Product") formats are supported for all schema.org types.
When JSON-LD is unavailable, the parser tries multiple sources:
- name:
og:title→twitter:title→<title> - description:
og:description→twitter:description→<meta name="description"> - image:
og:image→twitter:image→ HTML image elements - url:
<link rel="canonical">→og:url
For sites without structured data or meta tags (e.g., Amazon), the package extracts images directly from HTML:
- Amazon pattern:
<img id="landingImage">withdata-old-hiresfor high-res images - Common IDs:
main-image,product-image,hero-image - Common classes:
product-image,main-image,gallery-image - Data attributes:
data-zoom-image,data-large-image,data-src
High-resolution images are prioritized when available.
composer testPlease see CHANGELOG for more information on what has changed recently.
Please see CONTRIBUTING for details.
Please review our security policy on how to report security vulnerabilities.
The MIT License (MIT). Please see License File for more information.