A focused tool for collecting structured product data from the Massimo Dutti website across countries and languages. It helps teams turn complex product pages into clean, usable datasets for analysis, monitoring, and automation. Built for reliability, speed, and clarity around real-world product information.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for massimo-dutti, you've just found your team. Let's chat.
This project extracts detailed product information from Massimo Dutti's online catalog and organizes it into a consistent, machine-readable format. It removes the manual effort of browsing categories, variants, and product pages one by one. It's designed for developers, analysts, and e-commerce teams who need accurate fashion product data at scale.
- Handles full site, category-level, or single product extraction
- Normalizes color, size, and image variants under one product
- Works across different country storefronts and languages
- Produces structured data suitable for JSON, CSV, or analytics pipelines
| Feature | Description |
|---|---|
| Full catalog scraping | Collects products from the entire website or selected sections |
| Category targeting | Scrape specific product categories with controlled depth |
| Product-level detail | Extracts rich attributes from individual product pages |
| Variant aggregation | Groups colors and sizes into a single product record |
| Deduplication logic | Reduces duplicate products across overlapping categories |
| Structured output | Returns clean, nested data ready for further processing |
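
The variant aggregation and deduplication rows above can be illustrated with a minimal sketch. It assumes raw listings arrive as one dictionary per color/size combination, keyed by a shared `id`; the `merge_variants` helper and the `color`/`size` input keys are hypothetical and are not taken from the project's source.

```python
def merge_variants(raw_items):
    """Group raw listings by product id, merging colors and sizes.

    A product seen in several overlapping categories (or once per
    variant) ends up as a single record.
    """
    merged = {}  # dicts keep insertion order, so first-seen order is preserved
    for item in raw_items:
        record = merged.setdefault(
            item["id"],
            {"id": item["id"], "name": item["name"], "colors": [], "sizes": []},
        )
        for key, value in (("colors", item.get("color")), ("sizes", item.get("size"))):
            # Skip missing values and anything already recorded.
            if value is not None and value not in record[key]:
                record[key].append(value)
    return list(merged.values())


# Hypothetical raw listings: the same jacket appears three times.
raw = [
    {"id": 1, "name": "Jacket", "color": "Red", "size": "10"},
    {"id": 1, "name": "Jacket", "color": "Russet", "size": "10"},
    {"id": 1, "name": "Jacket", "color": "Red", "size": "12"},
    {"id": 2, "name": "Shirt", "color": "White", "size": "M"},
]
products = merge_variants(raw)
print(products)
```

Keying on `id` is the simplest choice for this sketch; a real pipeline might key on `reference` instead if ids differ between country storefronts.
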
| Field Name | Field Description |
|---|---|
| id | Unique product identifier |
| name | Product name |
| description | Short product description |
| reference | Internal product reference code |
| price | Current product price |
| oldPrice | Previous price if discounted |
| colors | Available color names |
| sizes | Available size labels |
| category | Product category path |
| images | Product image URLs |
| availabilityDate | First availability timestamp |
| composition | Material composition details |
| care | Care and washing instructions |
| sustainability | Sustainability-related attributes |
| traceability | Production and sourcing countries |
| productPage | Direct product page URL |
[
{
"id": 46503392,
"name": "Russet cotton jacket with pocket details",
"reference": "06736991-V2025",
"price": 14900,
"colors": "Red, Russet",
"sizes": "10, 12, 14",
"category": "women/jackets-n1450",
"productPage": "https://www.massimodutti.com/gb/russet-cotton-jacket-with-pocket-details-l06736991",
"composition": "100% cotton",
"availabilityDate": "2025-01-13"
}
]
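For downstream analysis, a record like the one above usually needs light post-processing. The sketch below splits the comma-separated `colors` and `sizes` strings into lists and converts `price` to a decimal amount. It assumes `price` is expressed in minor currency units (so `14900` would mean 149.00); that interpretation is a guess based on the sample, not something the source confirms. The `normalize` and `split_list` helpers are hypothetical.

```python
import json
from decimal import Decimal

# A trimmed copy of the sample record shown above.
SAMPLE_OUTPUT = """
[
  {
    "id": 46503392,
    "name": "Russet cotton jacket with pocket details",
    "price": 14900,
    "colors": "Red, Russet",
    "sizes": "10, 12, 14"
  }
]
"""


def split_list(value):
    """Turn 'Red, Russet' into ['Red', 'Russet'], dropping empty parts."""
    return [part.strip() for part in value.split(",") if part.strip()]


def normalize(record, minor_units=True):
    """Return a copy with list-valued variants and a Decimal price."""
    out = dict(record)
    out["colors"] = split_list(record.get("colors", ""))
    out["sizes"] = split_list(record.get("sizes", ""))
    price = Decimal(record["price"])
    # Assumption: prices are stored in minor units (e.g. pence or cents).
    out["price"] = price / 100 if minor_units else price
    return out


products = [normalize(r) for r in json.loads(SAMPLE_OUTPUT)]
print(products[0]["colors"], products[0]["sizes"], products[0]["price"])
```

If prices turn out to be stored in major units, pass `minor_units=False` to leave them unscaled.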
Massimo Dutti/
├── src/
│   ├── runner.py
│   ├── extractors/
│   │   ├── product_parser.py
│   │   ├── category_parser.py
│   │   └── utils_normalize.py
│   ├── outputs/
│   │   ├── json_writer.py
│   │   └── csv_writer.py
│   └── config/
│       └── settings.example.json
├── data/
│   ├── sample_input.json
│   └── sample_output.json
├── requirements.txt
└── README.md
- E-commerce analysts use it to monitor product availability and pricing, so they can track assortment changes over time.
- Market researchers use it to study fashion trends, so they can analyze materials, colors, and categories at scale.
- Developers use it to feed product data into internal tools, so they can automate catalog updates.
- Retail teams use it to audit online listings, so they can ensure consistency across regions.
Can I scrape only one product or category? Yes. You can target a single product page, a category page, or the entire site depending on your input configuration.
How are color and size variants handled? All variants are grouped under one product record, making it easier to work with complete product bundles instead of fragmented listings.
Why might the result count be lower than expected? Some pages temporarily expose placeholder products with incomplete data. These are filtered out to maintain data quality.
Is the output suitable for spreadsheets? Yes. Key fields are flattened for easy export, while detailed variant data remains available in structured form.
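The last two answers (filtering incomplete placeholder products and flattening fields for spreadsheets) can be sketched together. This is an illustration only: the `REQUIRED_FIELDS` tuple and the `is_complete` and `to_csv` helpers are hypothetical, and the project's actual filtering rules are not documented here.

```python
import csv
import io

# Assumption: a record counts as a placeholder if any of these is missing.
REQUIRED_FIELDS = ("id", "name", "price")


def is_complete(record):
    """True when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def to_csv(records, fields):
    """Write complete records as CSV text, joining list values with ', '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        if not is_complete(record):
            continue  # drop placeholder products
        row = {}
        for field in fields:
            value = record.get(field, "")
            row[field] = ", ".join(map(str, value)) if isinstance(value, list) else value
        writer.writerow(row)
    return buffer.getvalue()


records = [
    {"id": 1, "name": "Jacket", "price": 14900, "colors": ["Red", "Russet"]},
    {"id": 2, "name": "", "price": None, "colors": []},  # incomplete placeholder
]
csv_text = to_csv(records, ["id", "name", "price", "colors"])
print(csv_text)
```

Joining list values into one cell keeps each product on a single spreadsheet row; the nested variant data stays available in the JSON output.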
Primary Metric: Processes roughly 1,000 products in about 5 minutes under normal conditions.
Reliability Metric: High completion rate with stable retries when temporary access blocks occur.
Efficiency Metric: Optimized data transfer minimizes bandwidth usage and runtime costs.
Quality Metric: Returns clean, deduplicated product records with consistent field naming and structure.
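The retry behaviour mentioned under the reliability metric is not specified in detail. One common way to handle temporary blocks is exponential backoff, sketched below with the standard library only. The `TemporaryBlock` exception and `with_retries` helper are hypothetical names, and the delays are illustrative rather than the project's real settings.

```python
import time


class TemporaryBlock(Exception):
    """Hypothetical error raised when the site temporarily refuses a request."""


def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on TemporaryBlock with exponential backoff.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...). The final
    failure is re-raised so the caller can log or skip the item.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TemporaryBlock:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage with a fake fetcher that is blocked twice before succeeding.
calls = {"count": 0}
delays = []


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TemporaryBlock("blocked")
    return "ok"


# Record delays instead of actually sleeping, so the demo runs instantly.
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.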
