Peru GDP Real-Time Dataset

Automated pipeline for building Peruvian GDP real-time datasets from BCRP Weekly Reports


Overview

This project builds Real-Time Datasets (RTD) of Peruvian GDP revisions from the Central Reserve Bank of Peru (BCRP) Weekly Reports. The pipeline downloads PDFs, shortens them to key tables, cleans and structures the data, and produces vintage and release datasets for analysis.

Key features:

  • Automated BCRP PDF download with record-based idempotency
  • Shortened PDFs with key GDP tables only
  • Table extraction and cleaning for old (CSV) and new (PDF) sources
  • OCR pipeline for pre-2013 scanned documents (demonstrated on year 2001, see OCR/README.md)
  • Vintage dataset construction and concatenation
  • Base-year and benchmark revision handling
  • Configuration-driven execution with a one-button CLI

Quick Start

Installation

Choose either installation method; both take only a few commands:

Option A: Conda (Recommended - Includes Java)

# Clone the repository
git clone https://github.com/JasonCruz18/peru_gdp_revisions.git
cd peru_gdp_revisions

# Create environment with all dependencies
conda env create -f environment.yml

# Activate environment
conda activate peru_gdp_rtd

Option B: Pip + Virtual Environment

# Clone the repository
git clone https://github.com/JasonCruz18/peru_gdp_revisions.git
cd peru_gdp_revisions

# Create and activate virtual environment
python -m venv peru_gdp_rtd
source peru_gdp_rtd/bin/activate  # On Windows: peru_gdp_rtd\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Note: Java (JRE) must be installed separately for PDF processing.
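
If you are unsure whether a JRE is available, the quick check below (standard library only) reports whether a java executable is on the PATH; it is a convenience sketch, not part of the pipeline:

# Quick sanity check that a Java runtime is visible (and thus usable by tabula-py)
import shutil

java_path = shutil.which("java")
if java_path is None:
    print("Java not found on PATH; install a JRE before running the PDF steps.")
else:
    print("Java found at:", java_path)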

Configuration

# Copy example configuration
cp config/config.example.yaml config/config.yaml

Run Pipeline

# One-button update - runs complete pipeline
python scripts/update_rtd.py

# Run specific steps only
python scripts/update_rtd.py --steps 3,4,5,6

# Skip PDF download (useful for testing)
python scripts/update_rtd.py --skip-download

# Verbose output for debugging
python scripts/update_rtd.py --verbose

Outputs are written to data/output/vintages/ and data/output/releases/. File extensions follow features.persist_format (csv or parquet).
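
Once the pipeline has run, an output file can be inspected with pandas. A minimal sketch (the file name is taken from the Output Datasets list below; parquet is assumed as persist_format):

import pandas as pd

# Load one of the generated vintage datasets (parquet assumed;
# use pd.read_csv and the .csv extension if persist_format is "csv")
vintages = pd.read_parquet("data/output/vintages/monthly_gdp_vintages.parquet")
print(vintages.head())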


Project Structure

peru_gdp_revisions/
|-- peru_gdp_rtd/
|   |-- config/
|   |   |-- settings.py
|   |   `-- __init__.py
|   |-- scrapers/
|   |   `-- bcrp_scraper.py
|   |-- processors/
|   |   |-- pdf_processor.py
|   |   |-- file_organizer.py
|   |   `-- metadata.py
|   |-- cleaners/
|   |   `-- ...
|   |-- transformers/
|   |   |-- vintage_preparator.py
|   |   |-- concatenator.py
|   |   |-- metadata_handler.py
|   |   `-- releases_converter.py
|   |-- orchestration/
|   |   |-- runners.py
|   |   `-- validation.py
|   `-- utils/
|       |-- data_manager.py
|       |-- alerts.py
|       `-- progress.py
|-- OCR/                       # Standalone OCR pipeline (year 2001 demonstration)
|   |-- ocr_config/
|   |   |-- config.yaml
|   |   `-- settings.py
|   |-- ocr_processors/
|   |   |-- image_preprocessor.py
|   |   |-- table_extractor.py
|   |   |-- ocr_engine.py
|   |   |-- csv_converter.py
|   |   `-- validator.py
|   |-- ocr_utils/
|   |   |-- logger.py
|   |   |-- progress_tracker.py
|   |   `-- file_manager.py
|   |-- output/                # OCR results for year 2001
|   |   `-- table_1/2001/
|   |-- raw/                   # gitignored; scanned PDFs
|   |   `-- 2001/
|   |-- README.md
|   |-- MANUAL_REVIEW_GUIDE.md
|   `-- requirements.txt
|-- config/
|   |-- config.yaml
|   `-- config.example.yaml
|-- scripts/
|   |-- update_rtd.py
|   |-- validate_rtd.py
|   `-- run_ocr_pipeline.py   # OCR pipeline runner
|-- data/                      # gitignored; shown for reference
|   |-- raw/
|   |   |-- new_weekly_reports/
|   |   |   |-- 2013/
|   |   |   |-- ...
|   |   |   |-- shortened_pdfs/
|   |   |   `-- _quarantine/
|   |   `-- old_weekly_reports/  # Manually-curated pre-2013 data
|   |       |-- table_1/
|   |       `-- table_2/
|   |-- input/
|   |   |-- table_1/
|   |   `-- table_2/
|   `-- output/
|       |-- vintages/
|       `-- releases/
|-- metadata/
|   `-- wr_metadata.csv
|-- record/                    # gitignored
|   |-- 1_downloaded_pdfs.txt
|   `-- 2_shortened_pdfs.txt
|-- docs/
|-- notebooks/
|-- tests/
|-- requirements.txt
|-- requirements-dev.txt
`-- README.md

Pipeline Steps

The pipeline consists of 6 sequential steps:

Step 1: Download PDFs

  • Scrapes the BCRP Weekly Reports page
  • Downloads new PDFs to data/raw/new_weekly_reports/
  • Tracks downloads in record/1_downloaded_pdfs.txt
  • Organizes PDFs into year folders
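
The idempotency check is record-based: a PDF is downloaded only if its name is not yet listed in the record file. A minimal sketch of that idea (illustrative only; the helper names are hypothetical and the real scraper is peru_gdp_rtd/scrapers/bcrp_scraper.py):

from pathlib import Path

RECORD = Path("record/1_downloaded_pdfs.txt")

def already_downloaded(pdf_name: str) -> bool:
    # Skip a PDF whose name is already listed in the record file
    return RECORD.exists() and pdf_name in RECORD.read_text().splitlines()

def mark_downloaded(pdf_name: str) -> None:
    # Append the file name so future runs skip it
    with RECORD.open("a") as f:
        f.write(pdf_name + "\n")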

Step 2: Shorten PDFs

  • Extracts key pages with GDP tables
  • Writes shortened PDFs to data/raw/new_weekly_reports/shortened_pdfs/<year>/
  • Tracks processed files in record/2_shortened_pdfs.txt
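
As an illustration of what this step does, the sketch below copies a subset of pages into a new PDF using pypdf; the page-selection logic and library used in the project's pdf_processor.py may differ:

from pypdf import PdfReader, PdfWriter

def shorten_pdf(src: str, dst: str, keep_pages: list[int]) -> None:
    # Copy only the pages containing the GDP tables into a shorter PDF
    reader = PdfReader(src)
    writer = PdfWriter()
    for i in keep_pages:
        writer.add_page(reader.pages[i])
    with open(dst, "wb") as f:
        writer.write(f)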

Step 3: Clean Tables and Build Vintages

  • Extracts and cleans tables from old CSVs and shortened PDFs
  • Creates vintage-format files in data/input/table_1/ and data/input/table_2/
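
Table extraction from the shortened PDFs relies on tabula-py (hence the Java requirement). A minimal extraction call looks like the sketch below; the file path is hypothetical and the project's cleaners apply substantial post-processing on top of this:

import tabula

# Read every table on every page of a shortened PDF into pandas DataFrames
tables = tabula.read_pdf(
    "data/raw/new_weekly_reports/shortened_pdfs/2023/example.pdf",  # hypothetical path
    pages="all",
    multiple_tables=True,
)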

Step 4: Concatenate RTDs

  • Merges vintages across years
  • Outputs RTDs to data/output/vintages/
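
Conceptually this step is a pandas concatenation of the per-vintage files. A rough sketch (the file layout shown is an assumption; the real logic lives in transformers/concatenator.py):

from pathlib import Path
import pandas as pd

# Stack all vintage files for one table into a single real-time dataset
files = sorted(Path("data/input/table_1").glob("*.csv"))
rtd = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)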

Step 5: Metadata and Benchmarks

  • Updates metadata/wr_metadata.csv
  • Applies base-year sentinel adjustments
  • Generates benchmark datasets in data/output/vintages/
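
The metadata file maintained by this step can be inspected directly (no column names are assumed here; the sentinel and benchmark logic itself lives in the transformers):

import pandas as pd

# Inspect the Weekly Report metadata updated by this step
meta = pd.read_csv("metadata/wr_metadata.csv")
print(meta.tail())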

Step 6: Convert to Releases

  • Converts vintages to release-format datasets
  • Outputs to data/output/releases/
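
In real-time-data terminology, a release dataset records the n-th available estimate of each reference period across vintages. The sketch below shows that idea for an assumed layout (rows = reference periods, columns = vintages); the actual column layout and conversion logic are in transformers/releases_converter.py:

import pandas as pd

def nth_release(vintages: pd.DataFrame, n: int = 1) -> pd.Series:
    # Take the n-th non-missing estimate of each reference period
    # (assumed layout: one row per reference period, one column per vintage)
    def pick(row: pd.Series):
        obs = row.dropna()
        return obs.iloc[n - 1] if len(obs) >= n else pd.NA
    return vintages.apply(pick, axis=1)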

Output Datasets

All outputs are written to data/output/ with extension based on features.persist_format.

Vintage datasets (data/output/vintages/)

  • monthly_gdp_vintages.<ext>
  • quarterly_gdp_vintages.<ext>
  • monthly_gdp_vintages_adjusted.<ext>
  • quarterly_gdp_vintages_adjusted.<ext>
  • monthly_gdp_vintages_benchmark.<ext>
  • quarterly_gdp_vintages_benchmark.<ext>
  • monthly_gdp_vintages_adjusted_benchmark.<ext>
  • quarterly_gdp_vintages_adjusted_benchmark.<ext>

Releases datasets (data/output/releases/)

  • monthly_gdp_releases.<ext>
  • quarterly_gdp_releases.<ext>
  • monthly_gdp_releases_adjusted.<ext>
  • quarterly_gdp_releases_adjusted.<ext>
  • monthly_gdp_releases_benchmark.<ext>
  • quarterly_gdp_releases_benchmark.<ext>
  • monthly_gdp_releases_adjusted_benchmark.<ext>
  • quarterly_gdp_releases_adjusted_benchmark.<ext>

Usage Examples

Command-Line Interface

# Run complete pipeline
python scripts/update_rtd.py

# Run steps 3-6 only
python scripts/update_rtd.py --steps 3,4,5,6

# Use custom configuration
python scripts/update_rtd.py --config path/to/custom_config.yaml

# Dry run (see what would be executed)
python scripts/update_rtd.py --dry-run

Validate Outputs

python scripts/validate_rtd.py

Jupyter Notebooks

Open notebooks/new_gdp_rtd.ipynb for a full walkthrough with explanations and examples.


Configuration

Key settings in config/config.yaml:

scraper:
  browser: "chrome"     # chrome, firefox, edge
  headless: false
  max_downloads: 60

features:
  enable_alerts: false
  persist_format: "parquet"   # csv or parquet
  validate_data: true

record_files:
  downloaded_pdfs: "1_downloaded_pdfs.txt"
  shortened_pdfs: "2_shortened_pdfs.txt"

See config/config.example.yaml for all options.
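
The CLI reads this file for you, but it can also be loaded manually for inspection (sketch assuming PyYAML is installed):

import yaml

# Load and inspect the active configuration
with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg["features"]["persist_format"])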


Requirements

  • Python 3.10 or higher
  • Java Runtime Environment (JRE) for tabula-py
  • Chrome, Firefox, or Edge for Selenium web scraping

Data Sources

Main source: BCRP Weekly Reports

The pipeline processes two types of data:

  • New data (2013+): digital PDFs with editable tables
  • Old data (pre-2013): scanned PDFs converted to CSV

Development

Install Development Dependencies

pip install -r requirements-dev.txt

Code Formatting

black peru_gdp_rtd/
isort peru_gdp_rtd/
flake8 peru_gdp_rtd/

Running Tests

pytest tests/

Documentation

Additional documentation lives in the docs/ directory; the notebooks/ folder contains worked examples.

Research Context

This project supports the research paper:

"Rationality and Nowcasting on Peruvian GDP Revisions" by Jason Cruz, Diego Winkelried, and Javier Torres (Universidad del Pacifico - CIUP)

The datasets generated by this pipeline enable analysis of:

  • GDP revision patterns in emerging markets
  • Nowcasting accuracy using real-time data
  • Information content of preliminary releases

Contributing

Contributions are welcome. Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Format code with Black (black .)
  4. Run tests (pytest)
  5. Commit changes (git commit -m 'Add amazing feature')
  6. Push to branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Citation

For the Research Paper & Dataset

If you use this dataset or research in your work, please cite:

@article{cruz_etal_2025,
  title={Rationality and Nowcasting on Peruvian GDP Revisions},
  author={Cruz, Jason and Winkelried, Diego and Torres, Javier},
  year={2025},
  journal={Data in Brief},
  institution={Universidad del Pacifico - CIUP}
}

For the Code/Software

If you use this code repository or pipeline, please cite:

@software{cruz2024pipeline,
  title={Peru GDP Real-Time Dataset Construction Pipeline},
  author={Cruz, Jason},
  year={2024},
  url={https://github.com/JasonCruz18/peru_gdp_revisions},
  institution={Universidad del Pacifico - CIUP}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Central Reserve Bank of Peru (BCRP) for public access to Weekly Reports
  • Universidad del Pacifico - CIUP for research support

Contact

Jason Cruz
Email: jj.cruza@up.edu.pe
GitHub: @JasonCruz18


Support

For issues, questions, or contributions, please open an issue on the repository or contact the author above.
