IRCC XFA Form Extractor

A Python command-line tool to extract filled data from Canadian immigration (IRCC) XFA PDF forms.

Features

🔍 Smart Extraction: Extracts only filled data, ignoring form templates and options
📄 Multiple Forms: Works with IRCC XFA-based forms (IMM 5257, IMM 5710, etc.)
💾 JSON Output: Saves extracted data in structured JSON format
🚀 Easy to Use: Simple command-line interface
🔧 Flexible: Process single or multiple files at once

Installation

Method 1: Install from Source (Recommended)

# Clone the repository
git clone https://github.com/adelrzouga/ircc-xfa-extractor.git
cd ircc-xfa-extractor

# Install the package
pip install -e .

Method 2: Direct Installation

pip install git+https://github.com/adelrzouga/ircc-xfa-extractor.git

Requirements

Python 3.8 or higher
pikepdf (automatically installed with the package)

Usage

Basic Usage

Extract data from a single PDF:

ircc /path/to/form.pdf

This will create a JSON file named form_filled.json in the same directory as the PDF.
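
The naming rule described above can be sketched in Python. This is only an illustration of the documented convention, not the tool's actual code: the helper name `default_output_path` is hypothetical, and keeping the `_filled.json` suffix when an output directory is given is an assumption.

```python
from pathlib import Path
from typing import Optional


def default_output_path(pdf_path: str, output_dir: Optional[str] = None) -> Path:
    """Return where the JSON for `pdf_path` would land under the documented rule.

    Hypothetical helper for illustration; not part of the ircc package.
    """
    pdf = Path(pdf_path)
    # Default to the PDF's own directory unless an output directory is given.
    target_dir = Path(output_dir) if output_dir else pdf.parent
    # "form.pdf" becomes "form_filled.json".
    # Assumption: the same suffix is used when an output directory is set.
    return target_dir / f"{pdf.stem}_filled.json"


print(default_output_path("/path/to/form.pdf").name)  # form_filled.json
```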

Advanced Usage

# Process multiple files
ircc /path/to/imm5257.pdf /path/to/imm5710.pdf

# Specify output directory
ircc /path/to/form.pdf -o /output/directory

# Create combined output file
ircc /path/to/*.pdf --combined

# Compact JSON format
ircc /path/to/form.pdf -f json

# Verbose output with details
ircc /path/to/form.pdf --verbose

Command-Line Options

Option	Description
`files`	One or more PDF files to process (required)
`-o, --output DIR`	Output directory for JSON files (default: same as PDF)
`-f, --format {json,pretty}`	Output format: compact or indented (default: pretty)
`-c, --combined`	Create a combined JSON file with all extracted data
`-v, --verbose`	Show detailed extraction information
`--version`	Show version information
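
For readers who want to see how such an interface is typically wired up, here is a minimal `argparse` sketch that mirrors the options table above. It is an assumption about the shape of the parser, not the package's real implementation; the function name `build_parser` is hypothetical, and the version string is taken from the sample output later in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="ircc",
        description="Extract filled data from IRCC XFA PDF forms.",
    )
    # Positional argument: at least one PDF is required.
    parser.add_argument("files", nargs="+", help="One or more PDF files to process")
    parser.add_argument("-o", "--output", metavar="DIR", default=None,
                        help="Output directory for JSON files (default: same as PDF)")
    # "json" means compact output, "pretty" means indented output.
    parser.add_argument("-f", "--format", choices=["json", "pretty"], default="pretty",
                        help="Output format: compact or indented")
    parser.add_argument("-c", "--combined", action="store_true",
                        help="Create a combined JSON file with all extracted data")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Show detailed extraction information")
    parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
    return parser


args = build_parser().parse_args(["a.pdf", "b.pdf", "-o", "out", "--combined"])
print(args.files, args.output, args.format, args.combined)
# ['a.pdf', 'b.pdf'] out pretty True
```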

Examples

Example 1: Process Immigration Application

# Extract data from application form
ircc ~/Downloads/imm5257.pdf

# Output: ~/Downloads/imm5257_filled.json

Example 2: Process Multiple Forms

# Extract from all PDF forms in a directory
ircc ~/Documents/immigration/*.pdf -o ~/Documents/extracted --combined

# Creates individual JSON files for each PDF plus a combined file

Example 3: Detailed Extraction

# Get detailed output about what's being extracted
ircc imm5257.pdf --verbose

Output:

================================================================================
IRCC XFA Form Extractor v1.0.0
================================================================================

📄 Processing: imm5257.pdf
  ✓ Extracted 25 fields
  ✓ Saved to: imm5257_filled.json

  Sample extracted data:
    • Schedule1.FamilyName: CHAFROUD EP BAHRI
    • Schedule1.GivenName: RIM
    • Schedule1.ApplicantBirthDate.Day: 08
    • Schedule1.ApplicantBirthDate.Month: 01
    • Schedule1.ApplicantBirthDate.Year: 1984
    ... and 20 more fields

================================================================================
Summary: 1 succeeded, 0 failed
================================================================================

Supported Forms

This tool works with any IRCC XFA-based PDF form, including but not limited to:

IMM 5257 - Application for Temporary Resident Visa
IMM 5710 - Application to Change Conditions, Extend Stay or Remain in Canada
IMM 5406 - Additional Family Information
IMM 5476 - Use of a Representative
And many more...

Output Format

The tool extracts data into clean, structured JSON:

{
  "Schedule1.FamilyName": "CHAFROUD EP BAHRI",
  "Schedule1.GivenName": "RIM",
  "Schedule1.ApplicantBirthDate.Day": "08",
  "Schedule1.ApplicantBirthDate.Month": "01",
  "Schedule1.ApplicantBirthDate.Year": "1984",
  "Schedule1.UCI": "1129825035",
  "Page1.PersonalDetails.Name.FamilyName": "CHAFROUD EP BAHRI",
  "Page2.ContactInformation.q5-6.Email.Email": "rim.chafroud@gmail.com",
  "Page2.Passport.PassportNum": "J454097"
}
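
Because the keys are flat, dot-separated paths, downstream code may prefer a nested structure. The sketch below shows one way to regroup them after loading the JSON. The helper `nest` is hypothetical and not provided by the tool, and it assumes no key is both a value and a prefix of another key.

```python
import json
from typing import Any, Dict


def nest(flat: Dict[str, str]) -> Dict[str, Any]:
    """Turn {"A.B.C": "x"} into {"A": {"B": {"C": "x"}}}.

    Hypothetical helper; assumes no key is both a leaf and a parent.
    """
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# A few keys from the sample output above, as they would appear on disk.
flat = json.loads("""{
  "Schedule1.ApplicantBirthDate.Day": "08",
  "Schedule1.ApplicantBirthDate.Month": "01",
  "Schedule1.ApplicantBirthDate.Year": "1984"
}""")

print(nest(flat)["Schedule1"]["ApplicantBirthDate"])
# {'Day': '08', 'Month': '01', 'Year': '1984'}
```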

How It Works

1. Opens PDF: Uses pikepdf to read the PDF structure
2. Locates XFA Data: Finds the XFA datasets section containing form data
3. Parses XML: Extracts the XML structure from XFA streams
4. Filters Data: Intelligently distinguishes between:
   - Actual filled values (names, dates, addresses)
   - Form template options (dropdown choices, checkboxes)
5. Saves JSON: Outputs clean, structured data
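
The steps above can be approximated with the following sketch. It is a simplified illustration, not the tool's actual code. The function names (`read_xfa_datasets`, `flatten_filled`, `local_name`) and the `form1` root element in the sample are hypothetical. The filter here simply keeps leaf elements with non-empty text, which is an assumption and likely cruder than the real filtering. The reader assumes `/XFA` is stored as an array of alternating packet names and streams.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/"


def read_xfa_datasets(pdf_path: str) -> bytes:
    """Return the raw XML of the XFA "datasets" packet (requires pikepdf)."""
    import pikepdf  # imported lazily so the parsing helpers work without it

    with pikepdf.open(pdf_path) as pdf:
        acroform = pdf.Root.get("/AcroForm")
        xfa = acroform.get("/XFA") if acroform is not None else None
        if xfa is None:
            raise ValueError("PDF does not contain XFA data")
        # Assumption: /XFA is an array of [name, stream, name, stream, ...].
        for i in range(0, len(xfa) - 1, 2):
            if str(xfa[i]) == "datasets":
                return xfa[i + 1].read_bytes()
    raise ValueError("PDF does not contain XFA data")


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def flatten_filled(datasets_xml: bytes) -> Dict[str, str]:
    """Map dotted element paths to text for leaf elements that have a value."""
    root = ET.fromstring(datasets_xml)
    data = root.find(f"{{{XFA_DATA_NS}}}data")
    if data is None or len(data) == 0:
        return {}
    filled: Dict[str, str] = {}

    def walk(element: ET.Element, path: List[str]) -> None:
        children = list(element)
        if children:
            for child in children:
                walk(child, path + [local_name(child.tag)])
        elif element.text and element.text.strip():
            # Repeated sibling names would overwrite each other in this sketch.
            filled[".".join(path)] = element.text.strip()

    # The top-level form element is left out of the keys.
    walk(data[0], [])
    return filled


sample = b"""<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <Schedule1>
        <ApplicantBirthDate><Day>08</Day><Month>01</Month><Year>1984</Year></ApplicantBirthDate>
        <UCI/>
      </Schedule1>
    </form1>
  </xfa:data>
</xfa:datasets>"""

print(flatten_filled(sample))
```

With a real file, `flatten_filled(read_xfa_datasets("form.pdf"))` would chain the two halves. The empty `UCI` element in the sample is skipped because it carries no value.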

Troubleshooting

"PDF does not contain XFA data"

Some PDF forms may not use XFA format. This tool specifically works with XFA-based forms. Most IRCC forms from the official website are XFA-based.

"No filled data found"

This means the PDF has no filled fields, usually because the form has not been completed yet. Open the PDF and verify that the fields are actually filled in.

Permission Errors

Make sure you have read access to the PDF file and write access to the output directory.
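
If you want to check permissions before running the tool, a small standalone script along these lines may help. It is a sketch using only the standard library; the function name `preflight` is hypothetical and nothing here is part of the ircc package.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def preflight(pdf_path: str, output_dir: Optional[str] = None) -> List[str]:
    """Return a list of permission problems; an empty list means none found."""
    pdf = Path(pdf_path)
    # Assumption: output goes next to the PDF unless a directory is given.
    out_dir = Path(output_dir) if output_dir else pdf.parent
    problems: List[str] = []
    if not pdf.is_file():
        problems.append(f"PDF not found: {pdf}")
    elif not os.access(pdf, os.R_OK):
        problems.append(f"No read access to: {pdf}")
    if not out_dir.is_dir():
        problems.append(f"Output directory not found: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"No write access to: {out_dir}")
    return problems


# Demonstration with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_pdf = Path(tmp) / "form.pdf"
    demo_pdf.write_bytes(b"%PDF-1.7\n")
    # Prints an empty list when both checks pass.
    print(preflight(str(demo_pdf)))
```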

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/adelrzouga/ircc-xfa-extractor.git
cd ircc-xfa-extractor

# Install in development mode with dev dependencies
pip install -e ".[dev]"

Running Tests

pytest

Code Formatting

black ircc_xfa/

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with pikepdf for PDF processing
Designed for Canadian immigration (IRCC) XFA forms

Disclaimer

This tool is for personal use and data extraction purposes. Always verify extracted data against the original PDF forms. This is not an official IRCC tool and is not affiliated with Immigration, Refugees and Citizenship Canada.

Author

Adel Rzouga

GitHub: @adelrzouga

Support

If you encounter any issues or have questions:

Open an issue on GitHub
Check existing issues for solutions

Made with ❤️ for the Canadian immigration community
