Automated pipeline for building Peruvian GDP real-time datasets from BCRP Weekly Reports
This project builds Real-Time Datasets (RTD) of Peruvian GDP revisions from the Central Reserve Bank of Peru (BCRP) Weekly Reports. The pipeline downloads PDFs, shortens them to key tables, cleans and structures the data, and produces vintage and release datasets for analysis.
Key features:
- Automated BCRP PDF download with record-based idempotency
- Shortened PDFs with key GDP tables only
- Table extraction and cleaning for old (CSV) and new (PDF) sources
- OCR pipeline for pre-2013 scanned documents (demonstrated on year 2001, see OCR/README.md)
- Vintage dataset construction and concatenation
- Base-year and benchmark revision handling
- Configuration-driven execution with a one-button CLI
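The record-based idempotency mentioned in the feature list can be pictured with a minimal sketch. It assumes a record file is a plain text file with one processed file name per line; the helper names `load_record` and `process_once` are hypothetical and are not taken from the package.

```python
from pathlib import Path


def load_record(record_path: Path) -> set[str]:
    """Return the set of file names already listed in the record file."""
    if not record_path.exists():
        return set()
    lines = record_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def process_once(names: list[str], record_path: Path, action) -> list[str]:
    """Run `action` only on names not yet recorded, then record them."""
    done = load_record(record_path)
    processed = []
    record_path.parent.mkdir(parents=True, exist_ok=True)
    with record_path.open("a", encoding="utf-8") as record:
        for name in names:
            if name in done:
                continue  # handled in a previous run, so skip it
            action(name)
            record.write(name + "\n")  # written only after the action succeeds
            done.add(name)
            processed.append(name)
    return processed
```

Calling `process_once` twice with the same names does nothing the second time, which is the property that makes re-running a pipeline step safe.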
Choose one of the two installation methods below, conda or venv with pip; each needs only a few commands:
# Clone the repository
git clone https://github.com/JasonCruz18/peru_gdp_revisions.git
cd peru_gdp_revisions
# Create environment with all dependencies
conda env create -f environment.yml
# Activate environment
conda activate peru_gdp_rtd

# Clone the repository
git clone https://github.com/JasonCruz18/peru_gdp_revisions.git
cd peru_gdp_revisions
# Create and activate virtual environment
python -m venv peru_gdp_rtd
source peru_gdp_rtd/bin/activate # On Windows: peru_gdp_rtd\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Note: Java (JRE) must be installed separately for PDF processing.
# Copy example configuration
cp config/config.example.yaml config/config.yaml
# One-button update - runs complete pipeline
python scripts/update_rtd.py
# Run specific steps only
python scripts/update_rtd.py --steps 3,4,5,6
# Skip PDF download (useful for testing)
python scripts/update_rtd.py --skip-download
# Verbose output for debugging
python scripts/update_rtd.py --verbose
Outputs are written to data/output/vintages/ and data/output/releases/. File extensions follow features.persist_format (csv or parquet).
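The flags used in the commands above can be sketched as a small argparse parser. This is an illustration only: the actual parser in scripts/update_rtd.py may differ. The sketch also assumes that step 1 is the download step, and the helper names `parse_steps`, `build_parser` and `steps_to_run` are hypothetical.

```python
import argparse


def parse_steps(value: str) -> list[int]:
    """Turn a string such as '3,4,5,6' into a sorted list of step numbers."""
    try:
        steps = sorted({int(part) for part in value.split(",")})
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid step list: {value!r}") from None
    if any(step < 1 or step > 6 for step in steps):
        raise argparse.ArgumentTypeError("steps must be between 1 and 6")
    return steps


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags shown in the README commands."""
    parser = argparse.ArgumentParser(description="Update the GDP real-time datasets")
    parser.add_argument("--steps", type=parse_steps, default=list(range(1, 7)))
    parser.add_argument("--skip-download", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser


def steps_to_run(args: argparse.Namespace) -> list[int]:
    """Drop step 1 (assumed to be the download) when --skip-download is set."""
    return [s for s in args.steps if not (args.skip_download and s == 1)]
```

With this sketch, `--steps 3,4,5,6` selects four steps, and `--skip-download` alone selects steps 2 through 6.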
peru_gdp_revisions/
|-- peru_gdp_rtd/
|   |-- config/
|   |   |-- settings.py
|   |   `-- __init__.py
|   |-- scrapers/
|   |   `-- bcrp_scraper.py
|   |-- processors/
|   |   |-- pdf_processor.py
|   |   |-- file_organizer.py
|   |   `-- metadata.py
|   |-- cleaners/
|   |   `-- ...
|   |-- transformers/
|   |   |-- vintage_preparator.py
|   |   |-- concatenator.py
|   |   |-- metadata_handler.py
|   |   `-- releases_converter.py
|   |-- orchestration/
|   |   |-- runners.py
|   |   `-- validation.py
|   `-- utils/
|       |-- data_manager.py
|       |-- alerts.py
|       `-- progress.py
|-- OCR/  # Standalone OCR pipeline (year 2001 demonstration)
|   |-- ocr_config/
|   |   |-- config.yaml
|   |   `-- settings.py
|   |-- ocr_processors/
|   |   |-- image_preprocessor.py
|   |   |-- table_extractor.py
|   |   |-- ocr_engine.py
|   |   |-- csv_converter.py
|   |   `-- validator.py
|   |-- ocr_utils/
|   |   |-- logger.py
|   |   |-- progress_tracker.py
|   |   `-- file_manager.py
|   |-- output/  # OCR results for year 2001
|   |   `-- table_1/2001/
|   |-- raw/  # gitignored; scanned PDFs
|   |   `-- 2001/
|   |-- README.md
|   |-- MANUAL_REVIEW_GUIDE.md
|   `-- requirements.txt
|-- config/
|   |-- config.yaml
|   `-- config.example.yaml
|-- scripts/
|   |-- update_rtd.py
|   |-- validate_rtd.py
|   `-- run_ocr_pipeline.py  # OCR pipeline runner
|-- data/  # gitignored; shown for reference
|   |-- raw/
|   |   |-- new_weekly_reports/
|   |   |   |-- 2013/
|   |   |   |-- ...
|   |   |   |-- shortened_pdfs/
|   |   |   `-- _quarantine/
|   |   `-- old_weekly_reports/  # Manually-curated pre-2013 data
|   |       |-- table_1/
|   |       `-- table_2/
|   |-- input/
|   |   |-- table_1/
|   |   `-- table_2/
|   `-- output/
|       |-- vintages/
|       `-- releases/
|-- metadata/
|   `-- wr_metadata.csv
|-- record/  # gitignored
|   |-- 1_downloaded_pdfs.txt
|   `-- 2_shortened_pdfs.txt
|-- docs/
|-- notebooks/
|-- tests/
|-- requirements.txt
|-- requirements-dev.txt
`-- README.md
The pipeline consists of 6 sequential steps:
- Scrapes the BCRP Weekly Reports page
- Downloads new PDFs to data/raw/new_weekly_reports/
- Tracks downloads in record/1_downloaded_pdfs.txt
- Organizes PDFs into year folders
- Extracts key pages with GDP tables
- Writes shortened PDFs to data/raw/new_weekly_reports/shortened_pdfs/<year>/
- Tracks processed files in record/2_shortened_pdfs.txt
- Extracts and cleans tables from old CSVs and shortened PDFs
- Creates vintage-format files in data/input/table_1/ and data/input/table_2/
- Merges vintages across years
- Outputs RTDs to data/output/vintages/
- Updates metadata/wr_metadata.csv
- Applies base-year sentinel adjustments
- Generates benchmark datasets in data/output/vintages/
- Converts vintages to release-format datasets
- Outputs to data/output/releases/
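The last step, converting vintages to releases, can be illustrated with a minimal sketch. It assumes a vintage layout in which each vintage label maps to the series published in that vintage, and a release layout in which each target period lists its values in order of publication. Both layouts, the function names and the numbers are illustrative assumptions; the package's own converter and file schema may differ.

```python
def vintages_to_releases(
    vintages: dict[str, dict[str, float]],
) -> dict[str, list[float]]:
    """For each target period, list its values in order of publication.

    `vintages` maps a sortable vintage label (for example '2020-03') to the
    series published in that vintage: a mapping of target period -> value.
    """
    releases: dict[str, list[float]] = {}
    for vintage in sorted(vintages):  # oldest vintage first
        for period, value in vintages[vintage].items():
            releases.setdefault(period, []).append(value)
    return releases


def total_revision(release_values: list[float]) -> float:
    """Difference between the latest and the first published value."""
    return release_values[-1] - release_values[0]


# Made-up numbers, for illustration only.
example_vintages = {
    "2020-03": {"2020-01": 1.0},
    "2020-04": {"2020-01": 1.5, "2020-02": 2.0},
    "2020-05": {"2020-01": 1.25, "2020-02": 2.5},
}
example_releases = vintages_to_releases(example_vintages)
```

In this toy example the period 2020-01 has three releases, and `total_revision` compares the first with the latest.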
All outputs are written to data/output/ with extension based on features.persist_format.
In data/output/vintages/:
- monthly_gdp_vintages.<ext>
- quarterly_gdp_vintages.<ext>
- monthly_gdp_vintages_adjusted.<ext>
- quarterly_gdp_vintages_adjusted.<ext>
- monthly_gdp_vintages_benchmark.<ext>
- quarterly_gdp_vintages_benchmark.<ext>
- monthly_gdp_vintages_adjusted_benchmark.<ext>
- quarterly_gdp_vintages_adjusted_benchmark.<ext>
In data/output/releases/:
- monthly_gdp_releases.<ext>
- quarterly_gdp_releases.<ext>
- monthly_gdp_releases_adjusted.<ext>
- quarterly_gdp_releases_adjusted.<ext>
- monthly_gdp_releases_benchmark.<ext>
- quarterly_gdp_releases_benchmark.<ext>
- monthly_gdp_releases_adjusted_benchmark.<ext>
- quarterly_gdp_releases_adjusted_benchmark.<ext>
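The naming pattern of the output files is regular enough to generate. The sketch below enumerates the expected paths for a given persist_format. It assumes the file extension equals the format name and that release files live under data/output/releases/; the function name `expected_outputs` is hypothetical.

```python
from itertools import product


def expected_outputs(persist_format: str = "parquet") -> dict[str, list[str]]:
    """List the expected output paths for vintages and releases."""
    if persist_format not in {"csv", "parquet"}:
        raise ValueError(f"unsupported persist_format: {persist_format!r}")
    variants = ["", "_adjusted", "_benchmark", "_adjusted_benchmark"]
    frequencies = ("monthly", "quarterly")
    outputs: dict[str, list[str]] = {}
    for kind in ("vintages", "releases"):
        outputs[kind] = [
            f"data/output/{kind}/{freq}_gdp_{kind}{variant}.{persist_format}"
            for variant, freq in product(variants, frequencies)
        ]
    return outputs
```

A list like this could be used after a run to check that all sixteen files are present.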
# Run complete pipeline
python scripts/update_rtd.py
# Run steps 3-6 only
python scripts/update_rtd.py --steps 3,4,5,6
# Use custom configuration
python scripts/update_rtd.py --config path/to/custom_config.yaml
# Dry run (see what would be executed)
python scripts/update_rtd.py --dry-run
python scripts/validate_rtd.py
Open notebooks/new_gdp_rtd.ipynb for a full walkthrough with explanations and examples.
Key settings in config/config.yaml:
scraper:
  browser: "chrome"  # chrome, firefox, edge
  headless: false
  max_downloads: 60
features:
  enable_alerts: false
  persist_format: "parquet"  # csv or parquet
  validate_data: true
record_files:
  downloaded_pdfs: "1_downloaded_pdfs.txt"
  shortened_pdfs: "2_shortened_pdfs.txt"
See config/config.example.yaml for all options.
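Settings like these are easier to trust when validated at load time. The sketch below checks a few of them once the YAML has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). It is not the code in peru_gdp_rtd/config/settings.py: the `Settings` class and `settings_from_dict` are hypothetical, and using the values shown above as defaults is an assumption.

```python
from dataclasses import dataclass

ALLOWED_BROWSERS = {"chrome", "firefox", "edge"}
ALLOWED_FORMATS = {"csv", "parquet"}


@dataclass(frozen=True)
class Settings:
    browser: str
    headless: bool
    max_downloads: int
    persist_format: str


def settings_from_dict(raw: dict) -> Settings:
    """Build validated settings from a parsed configuration dictionary."""
    scraper = raw.get("scraper", {})
    features = raw.get("features", {})
    settings = Settings(
        browser=scraper.get("browser", "chrome"),  # assumed default
        headless=bool(scraper.get("headless", False)),
        max_downloads=int(scraper.get("max_downloads", 60)),
        persist_format=features.get("persist_format", "parquet"),
    )
    if settings.browser not in ALLOWED_BROWSERS:
        raise ValueError(f"unsupported browser: {settings.browser!r}")
    if settings.persist_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported persist_format: {settings.persist_format!r}")
    if settings.max_downloads < 0:
        raise ValueError("max_downloads must not be negative")
    return settings
```

Failing early on a mistyped browser or format avoids discovering the problem partway through a long run.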
- Python 3.10 or higher
- Java Runtime Environment (JRE) for tabula-py
- Chrome, Firefox, or Edge for Selenium web scraping
Main source: BCRP Weekly Reports
The pipeline processes two types of data:
- New data (2013+): digital PDFs with editable tables
- Old data (pre-2013): scanned PDFs converted to CSV
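The split between the two data types can be expressed as a small routing rule. The sketch below is a hypothetical helper, not part of the package; it takes only the 2013 cut-off and the two raw directories from this README.

```python
FIRST_DIGITAL_YEAR = 2013  # first year with digital PDFs, per the list above


def source_kind(year: int) -> str:
    """Return 'new' for digital PDFs (2013+) and 'old' for pre-2013 data."""
    return "new" if year >= FIRST_DIGITAL_YEAR else "old"


def raw_directory(year: int) -> str:
    """Return the raw-data directory assumed to hold a given year's source."""
    if source_kind(year) == "new":
        return f"data/raw/new_weekly_reports/{year}/"
    # Old data is stored as curated CSVs grouped by table, not by year.
    return "data/raw/old_weekly_reports/"


def split_years(years: list[int]) -> dict[str, list[int]]:
    """Group years by the kind of source that covers them."""
    groups: dict[str, list[int]] = {"old": [], "new": []}
    for year in sorted(years):
        groups[source_kind(year)].append(year)
    return groups
```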
pip install -r requirements-dev.txt
black peru_gdp_rtd/
isort peru_gdp_rtd/
flake8 peru_gdp_rtd/
pytest tests/
- Installation Guide: docs/INSTALLATION.md
- Usage Guide: docs/USAGE.md
- Architecture: docs/ARCHITECTURE.md
- Data Availability: docs/DATA_AVAILABILITY.md
- FAQ: FAQ.md
This project supports the research paper:
"Rationality and Nowcasting on Peruvian GDP Revisions" by Jason Cruz, Diego Winkelried, and Javier Torres (Universidad del Pacifico - CIUP)
The datasets generated by this pipeline enable analysis of:
- GDP revision patterns in emerging markets
- Nowcasting accuracy using real-time data
- Information content of preliminary releases
Contributions are welcome. Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Format code with Black (black .)
- Run tests (pytest)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
If you use this dataset or research in your work, please cite:
@article{cruz_etal_2025,
title={Rationality and Nowcasting on Peruvian GDP Revisions},
author={Cruz, Jason and Winkelried, Diego and Torres, Javier},
year={2025},
journal={Data in Brief},
institution={Universidad del Pacifico - CIUP}
}
If you use this code repository or pipeline, please cite:
@software{cruz2024pipeline,
title={Peru GDP Real-Time Dataset Construction Pipeline},
author={Cruz, Jason},
year={2024},
url={https://github.com/JasonCruz18/peru_gdp_revisions},
institution={Universidad del Pacifico - CIUP}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- Central Reserve Bank of Peru (BCRP) for public access to Weekly Reports
- Universidad del Pacifico - CIUP for research support
Jason Cruz
- Email: jj.cruza@up.edu.pe
- GitHub: @JasonCruz18
For issues, questions, or contributions:
- GitHub Issues: https://github.com/JasonCruz18/peru_gdp_revisions/issues
- Email: jj.cruza@up.edu.pe
- Documentation: FAQ.md and docs/