Peru GDP Real-Time Dataset

Automated pipeline for building Peruvian GDP real-time datasets from BCRP Weekly Reports


Overview

This project builds Real-Time Datasets (RTD) of Peruvian GDP revisions from the Central Reserve Bank of Peru (BCRP) Weekly Reports. The pipeline downloads PDFs, shortens them to key tables, cleans and structures the data, and produces vintage and release datasets for analysis.

Key features:

  • Automated BCRP PDF download with record-based idempotency
  • Shortened PDFs with key GDP tables only
  • Table extraction and cleaning for old (CSV) and new (PDF) sources
  • OCR pipeline for pre-2013 scanned documents (demonstrated on year 2001, see OCR/README.md)
  • Vintage dataset construction and concatenation
  • Base-year and benchmark revision handling
  • Configuration-driven execution with a one-button CLI

Quick Start

Installation

Choose either installation method; both take only a few commands:

Option A: Conda (Recommended - Includes Java)

# Clone the repository
git clone https://github.com/JasonCruz18/peru_gdp_revisions.git
cd peru_gdp_revisions

# Create environment with all dependencies
conda env create -f environment.yml

# Activate environment
conda activate peru_gdp_rtd

Option B: Pip + Virtual Environment

# Clone the repository
git clone https://github.com/JasonCruz18/peru_gdp_revisions.git
cd peru_gdp_revisions

# Create and activate virtual environment
python -m venv peru_gdp_rtd
source peru_gdp_rtd/bin/activate  # On Windows: peru_gdp_rtd\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Note: Java (JRE) must be installed separately for PDF processing.
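
If you are unsure whether a JRE is available, the quick check below (standard library only) reports whether a java executable is on the PATH; it is a convenience sketch, not part of the pipeline:

# Quick sanity check that a Java runtime is visible (and thus usable by tabula-py)
import shutil

java_path = shutil.which("java")
if java_path is None:
    print("Java not found on PATH; install a JRE before running the PDF steps.")
else:
    print("Java found at:", java_path)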

Configuration

# Copy example configuration
cp config/config.example.yaml config/config.yaml

Run Pipeline

# One-button update - runs complete pipeline
python scripts/update_rtd.py

# Run specific steps only
python scripts/update_rtd.py --steps 3,4,5,6

# Skip PDF download (useful for testing)
python scripts/update_rtd.py --skip-download

# Verbose output for debugging
python scripts/update_rtd.py --verbose

Outputs are written to data/output/vintages/ and data/output/releases/. File extensions follow features.persist_format (csv or parquet).
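
Once the pipeline has run, an output file can be inspected with pandas. A minimal sketch (the file name is taken from the Output Datasets list below; parquet is assumed as persist_format):

import pandas as pd

# Load one of the generated vintage datasets (parquet assumed;
# use pd.read_csv and the .csv extension if persist_format is "csv")
vintages = pd.read_parquet("data/output/vintages/monthly_gdp_vintages.parquet")
print(vintages.head())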


Project Structure

peru_gdp_revisions/
|-- peru_gdp_rtd/
|   |-- config/
|   |   |-- settings.py
|   |   `-- __init__.py
|   |-- scrapers/
|   |   `-- bcrp_scraper.py
|   |-- processors/
|   |   |-- pdf_processor.py
|   |   |-- file_organizer.py
|   |   `-- metadata.py
|   |-- cleaners/
|   |   `-- ...
|   |-- transformers/
|   |   |-- vintage_preparator.py
|   |   |-- concatenator.py
|   |   |-- metadata_handler.py
|   |   `-- releases_converter.py
|   |-- orchestration/
|   |   |-- runners.py
|   |   `-- validation.py
|   `-- utils/
|       |-- data_manager.py
|       |-- alerts.py
|       `-- progress.py
|-- OCR/                       # Standalone OCR pipeline (year 2001 demonstration)
|   |-- ocr_config/
|   |   |-- config.yaml
|   |   `-- settings.py
|   |-- ocr_processors/
|   |   |-- image_preprocessor.py
|   |   |-- table_extractor.py
|   |   |-- ocr_engine.py
|   |   |-- csv_converter.py
|   |   `-- validator.py
|   |-- ocr_utils/
|   |   |-- logger.py
|   |   |-- progress_tracker.py
|   |   `-- file_manager.py
|   |-- output/                # OCR results for year 2001
|   |   `-- table_1/2001/
|   |-- raw/                   # gitignored; scanned PDFs
|   |   `-- 2001/
|   |-- README.md
|   |-- MANUAL_REVIEW_GUIDE.md
|   `-- requirements.txt
|-- config/
|   |-- config.yaml
|   `-- config.example.yaml
|-- scripts/
|   |-- update_rtd.py
|   |-- validate_rtd.py
|   `-- run_ocr_pipeline.py   # OCR pipeline runner
|-- data/                      # gitignored; shown for reference
|   |-- raw/
|   |   |-- new_weekly_reports/
|   |   |   |-- 2013/
|   |   |   |-- ...
|   |   |   |-- shortened_pdfs/
|   |   |   `-- _quarantine/
|   |   `-- old_weekly_reports/  # Manually-curated pre-2013 data
|   |       |-- table_1/
|   |       `-- table_2/
|   |-- input/
|   |   |-- table_1/
|   |   `-- table_2/
|   `-- output/
|       |-- vintages/
|       `-- releases/
|-- metadata/
|   `-- wr_metadata.csv
|-- record/                    # gitignored
|   |-- 1_downloaded_pdfs.txt
|   `-- 2_shortened_pdfs.txt
|-- docs/
|-- notebooks/
|-- tests/
|-- requirements.txt
|-- requirements-dev.txt
`-- README.md

Pipeline Steps

The pipeline consists of 6 sequential steps:

Step 1: Download PDFs

  • Scrapes the BCRP Weekly Reports page
  • Downloads new PDFs to data/raw/new_weekly_reports/
  • Tracks downloads in record/1_downloaded_pdfs.txt
  • Organizes PDFs into year folders
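
The idempotency check is record-based: a PDF is downloaded only if its name is not yet listed in the record file. A minimal sketch of that idea (illustrative only; the helper names are hypothetical and the real scraper is peru_gdp_rtd/scrapers/bcrp_scraper.py):

from pathlib import Path

RECORD = Path("record/1_downloaded_pdfs.txt")

def already_downloaded(pdf_name: str) -> bool:
    # Skip a PDF whose name is already listed in the record file
    return RECORD.exists() and pdf_name in RECORD.read_text().splitlines()

def mark_downloaded(pdf_name: str) -> None:
    # Append the file name so future runs skip it
    with RECORD.open("a") as f:
        f.write(pdf_name + "\n")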

Step 2: Shorten PDFs

  • Extracts key pages with GDP tables
  • Writes shortened PDFs to data/raw/new_weekly_reports/shortened_pdfs/<year>/
  • Tracks processed files in record/2_shortened_pdfs.txt
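
As an illustration of what this step does, the sketch below copies a subset of pages into a new PDF using pypdf; the page-selection logic and library used in the project's pdf_processor.py may differ:

from pypdf import PdfReader, PdfWriter

def shorten_pdf(src: str, dst: str, keep_pages: list[int]) -> None:
    # Copy only the pages containing the GDP tables into a shorter PDF
    reader = PdfReader(src)
    writer = PdfWriter()
    for i in keep_pages:
        writer.add_page(reader.pages[i])
    with open(dst, "wb") as f:
        writer.write(f)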

Step 3: Clean Tables and Build Vintages

  • Extracts and cleans tables from old CSVs and shortened PDFs
  • Creates vintage-format files in data/input/table_1/ and data/input/table_2/
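
Table extraction from the shortened PDFs relies on tabula-py (hence the Java requirement). A minimal extraction call looks like the sketch below; the file path is hypothetical and the project's cleaners apply substantial post-processing on top of this:

import tabula

# Read every table on every page of a shortened PDF into pandas DataFrames
tables = tabula.read_pdf(
    "data/raw/new_weekly_reports/shortened_pdfs/2023/example.pdf",  # hypothetical path
    pages="all",
    multiple_tables=True,
)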

Step 4: Concatenate RTDs

  • Merges vintages across years
  • Outputs RTDs to data/output/vintages/
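
Conceptually this step is a pandas concatenation of the per-vintage files. A rough sketch (the file layout shown is an assumption; the real logic lives in transformers/concatenator.py):

from pathlib import Path
import pandas as pd

# Stack all vintage files for one table into a single real-time dataset
files = sorted(Path("data/input/table_1").glob("*.csv"))
rtd = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)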

Step 5: Metadata and Benchmarks

  • Updates metadata/wr_metadata.csv
  • Applies base-year sentinel adjustments
  • Generates benchmark datasets in data/output/vintages/
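
The metadata file maintained by this step can be inspected directly (no column names are assumed here; the sentinel and benchmark logic itself lives in the transformers):

import pandas as pd

# Inspect the Weekly Report metadata updated by this step
meta = pd.read_csv("metadata/wr_metadata.csv")
print(meta.tail())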

Step 6: Convert to Releases

  • Converts vintages to release-format datasets
  • Outputs to data/output/releases/
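
In real-time-data terminology, a release dataset records the n-th available estimate of each reference period across vintages. The sketch below shows that idea for an assumed layout (rows = reference periods, columns = vintages); the actual column layout and conversion logic are in transformers/releases_converter.py:

import pandas as pd

def nth_release(vintages: pd.DataFrame, n: int = 1) -> pd.Series:
    # Take the n-th non-missing estimate of each reference period
    # (assumed layout: one row per reference period, one column per vintage)
    def pick(row: pd.Series):
        obs = row.dropna()
        return obs.iloc[n - 1] if len(obs) >= n else pd.NA
    return vintages.apply(pick, axis=1)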

Output Datasets

All outputs are written to data/output/ with extension based on features.persist_format.

Vintage datasets (data/output/vintages/)

  • monthly_gdp_vintages.<ext>
  • quarterly_gdp_vintages.<ext>
  • monthly_gdp_vintages_adjusted.<ext>
  • quarterly_gdp_vintages_adjusted.<ext>
  • monthly_gdp_vintages_benchmark.<ext>
  • quarterly_gdp_vintages_benchmark.<ext>
  • monthly_gdp_vintages_adjusted_benchmark.<ext>
  • quarterly_gdp_vintages_adjusted_benchmark.<ext>

Releases datasets (data/output/releases/)

  • monthly_gdp_releases.<ext>
  • quarterly_gdp_releases.<ext>
  • monthly_gdp_releases_adjusted.<ext>
  • quarterly_gdp_releases_adjusted.<ext>
  • monthly_gdp_releases_benchmark.<ext>
  • quarterly_gdp_releases_benchmark.<ext>
  • monthly_gdp_releases_adjusted_benchmark.<ext>
  • quarterly_gdp_releases_adjusted_benchmark.<ext>

Usage Examples

Command-Line Interface

# Run complete pipeline
python scripts/update_rtd.py

# Run steps 3-6 only
python scripts/update_rtd.py --steps 3,4,5,6

# Use custom configuration
python scripts/update_rtd.py --config path/to/custom_config.yaml

# Dry run (see what would be executed)
python scripts/update_rtd.py --dry-run

Validate Outputs

python scripts/validate_rtd.py

Jupyter Notebooks

Open notebooks/new_gdp_rtd.ipynb for a full walkthrough with explanations and examples.


Configuration

Key settings in config/config.yaml:

scraper:
  browser: "chrome"     # chrome, firefox, edge
  headless: false
  max_downloads: 60

features:
  enable_alerts: false
  persist_format: "parquet"   # csv or parquet
  validate_data: true

record_files:
  downloaded_pdfs: "1_downloaded_pdfs.txt"
  shortened_pdfs: "2_shortened_pdfs.txt"

See config/config.example.yaml for all options.
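
The CLI reads this file for you, but it can also be loaded manually for inspection (sketch assuming PyYAML is installed):

import yaml

# Load and inspect the active configuration
with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg["features"]["persist_format"])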


Requirements

  • Python 3.10 or higher
  • Java Runtime Environment (JRE) for tabula-py
  • Chrome, Firefox, or Edge for Selenium web scraping

Data Sources

Main source: BCRP Weekly Reports

The pipeline processes two types of data:

  • New data (2013+): digital PDFs with editable tables
  • Old data (pre-2013): scanned PDFs converted to CSV

Development

Install Development Dependencies

pip install -r requirements-dev.txt

Code Formatting

black peru_gdp_rtd/
isort peru_gdp_rtd/
flake8 peru_gdp_rtd/

Running Tests

pytest tests/

Documentation

Additional documentation lives in the docs/ directory; the notebooks/ folder contains worked examples.

Research Context

This project supports the research paper:

"Rationality and Nowcasting on Peruvian GDP Revisions" by Jason Cruz, Diego Winkelried, and Javier Torres (Universidad del Pacifico - CIUP)

The datasets generated by this pipeline enable analysis of:

  • GDP revision patterns in emerging markets
  • Nowcasting accuracy using real-time data
  • Information content of preliminary releases

Contributing

Contributions are welcome. Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Format code with Black (black .)
  4. Run tests (pytest)
  5. Commit changes (git commit -m 'Add amazing feature')
  6. Push to branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Citation

For the Research Paper & Dataset

If you use this dataset or research in your work, please cite:

@article{cruz_etal_2025,
  title={Rationality and Nowcasting on Peruvian GDP Revisions},
  author={Cruz, Jason and Winkelried, Diego and Torres, Javier},
  year={2025},
  journal={Data in Brief},
  institution={Universidad del Pacifico - CIUP}
}

For the Code/Software

If you use this code repository or pipeline, please cite:

@software{cruz2024pipeline,
  title={Peru GDP Real-Time Dataset Construction Pipeline},
  author={Cruz, Jason},
  year={2024},
  url={https://github.com/JasonCruz18/peru_gdp_revisions},
  institution={Universidad del Pacifico - CIUP}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Central Reserve Bank of Peru (BCRP) for public access to Weekly Reports
  • Universidad del Pacifico - CIUP for research support

Contact

Jason Cruz
Email: jj.cruza@up.edu.pe
GitHub: @JasonCruz18


Support

For issues, questions, or contributions, please open an issue on the repository or contact the author above.
