Fix scraper validation rules for allowed keys and remote field format #653

Merged
lalalaurentiu merged 7 commits into main from copilot/fix-99100e8d-7416-46dd-9c73-c28cfae6ee4a
Oct 2, 2025

Copilot AI commented Sep 30, 2025

Overview

This PR fixes the validation logic in __test__/runTest.py to enforce correct key rules and proper remote field format validation, addressing inconsistent scraper outputs.

Changes Made

Updated Validation Rules

Before:

  • Required keys: job_title, job_link, city, county, country, company
  • No validation for field types or values
  • Accepted extra fields without validation

After:

  • Required keys (must be present with non-None values):

    • company
    • job_title
    • job_link
  • Optional keys (may be present):

    • city
    • county
    • remote
  • All other keys rejected (including country)
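The key rules above can be sketched roughly as follows. This is a minimal illustration, not the code in __test__/runTest.py: the function name `check_keys` and the constant names are hypothetical, and only the two message formats are taken from the Error Messages section of this PR.

```python
# Hypothetical constants mirroring the rules listed in this PR description.
REQUIRED_KEYS = ("company", "job_title", "job_link")
OPTIONAL_KEYS = ("city", "county", "remote")
ALLOWED_KEYS = REQUIRED_KEYS + OPTIONAL_KEYS


def check_keys(job: dict) -> list:
    """Return error messages for one scraped job; an empty list means valid."""
    errors = []
    for key in REQUIRED_KEYS:
        # A required key must be present and must not be None.
        if job.get(key) is None:
            errors.append(f"Required key '{key}' is missing!")
    for key in job:
        # Anything outside the required and optional sets is rejected.
        if key not in ALLOWED_KEYS:
            allowed = ", ".join(ALLOWED_KEYS)
            errors.append(
                f"Key '{key}' is not allowed! Allowed keys are: {allowed}"
            )
    return errors
```

Collecting messages in a list, rather than raising on the first problem, is a design choice made for this sketch so that a scraper author sees every violation at once; the actual validator may stop at the first error.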

Remote Field Validation

Added comprehensive validation for the remote field when present:

  • Must be a list (not string or other type)
  • All values must be lowercase
  • Only allowed values: remote, on-site, hybrid

Valid examples:

{"remote": ["remote"]}
{"remote": ["on-site", "remote"]}
{"remote": ["hybrid"]}
{"remote": []}

Invalid examples:

{"remote": "remote"}           // ❌ Must be a list
{"remote": ["Remote"]}          // ❌ Must be lowercase
{"remote": ["work-from-home"]}  // ❌ Invalid value

Error Messages

Added clear, descriptive error messages for all validation failures:

  • Missing required keys: Required key 'company' is missing!
  • Invalid keys: Key 'country' is not allowed! Allowed keys are: company, job_title, job_link, city, county, remote
  • Invalid remote format: Key 'remote' must be a list, got str!
  • Invalid remote values: Remote value 'Remote' must be lowercase!

Bug Fix

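Changed the regex pattern from the non-greedy (\[.*?\]) to the greedy (\[.*\]). The non-greedy form stopped at the first closing bracket, so output containing nested arrays (e.g., "remote": ["remote"]) was cut off before the end of the JSON; the greedy form matches through the last closing bracket.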
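The difference between the two patterns can be reproduced in isolation. The sample `output` string below is hypothetical; how runTest.py actually captures scraper output is not shown in this PR description.

```python
import json
import re

# Hypothetical scraper stdout: a log line, then the JSON array of jobs.
output = (
    "Scraping done\n"
    '[{"company": "Acme", "job_title": "Dev", '
    '"job_link": "https://example.com/1", "remote": ["remote"]}]\n'
)

# Non-greedy: stops at the first "]", which closes the inner "remote" list.
lazy_text = re.search(r"(\[.*?\])", output).group(1)

# Greedy: runs to the last "]" on the line, which closes the outer job array.
greedy_text = re.search(r"(\[.*\])", output).group(1)


def parses(text: str) -> bool:
    """Report whether text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(lazy_text))    # False
print(parses(greedy_text))  # True
```

One trade-off worth noting: a greedy match would also swallow any bracketed text printed after the JSON array, so this approach relies on the array being the last bracketed content in the output.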

Testing

Added comprehensive unit tests in __test__/test_validation.py:

  • 7 test cases for valid scraper outputs
  • 8 test cases for invalid outputs
  • All tests pass ✅

Run tests with:

python3 __test__/test_validation.py

Documentation

Added complete documentation in __test__/README.md including:

  • Detailed validation rules
  • Example valid and invalid outputs
  • Common validation errors with explanations
  • Usage instructions

Breaking Changes ⚠️

This PR introduces breaking changes. Existing scrapers that use:

  • country field (e.g., inetum.py, uipath.py)
  • remote as string instead of list (e.g., amazon.py)
  • Uppercase remote values (e.g., "Remote")

...will now fail validation and need to be updated to comply with the new rules.
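For scraper authors, the update typically amounts to dropping country and normalizing remote. The helper below is a hypothetical sketch of that adjustment, not part of this PR; it assumes an old-style string value holds a single work mode, and it does not check that the result is one of the allowed values.

```python
def migrate_job(job: dict) -> dict:
    """Return a copy of an old-style job dict adjusted toward the new rules."""
    # 'country' is no longer an allowed key, so it is dropped.
    migrated = {key: value for key, value in job.items() if key != "country"}
    remote = migrated.get("remote")
    # A bare string becomes a one-element list (assumes a single work mode).
    if isinstance(remote, str):
        remote = [remote]
    # Every value is lowercased, e.g. "Remote" becomes "remote".
    if isinstance(remote, list):
        migrated["remote"] = [str(value).lower() for value in remote]
    return migrated


old = {
    "company": "Acme",
    "job_title": "Dev",
    "job_link": "https://example.com/1",
    "country": "Romania",
    "remote": "Remote",
}
new = migrate_job(old)
```

Here `new` keeps company, job_title, and job_link, loses country, and carries `"remote": ["remote"]`; the input dict is left unmodified.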

Files Changed

  • __test__/runTest.py - Updated validation logic with new rules
  • __test__/test_validation.py - Added comprehensive unit tests (new file)
  • __test__/README.md - Added complete documentation (new file)

Fixes #[issue_number]

Original prompt

This section details the original issue you should resolve

<issue_title>Fix scraper validation rules for allowed keys</issue_title>
<issue_description>## Description
Currently, the validation logic inside __test__/runTest.py checks for outdated keys and does not enforce correct formats.
This causes inconsistent scraper outputs.

Problems

  • Wrong set of required/optional keys.
  • Extra fields are sometimes accepted.
  • The remote field is not properly validated.

Expected behavior

  1. Required keys:

    • company
    • job_title
    • job_link
  2. Optional keys:

    • city
    • county
    • remote
  3. Remote field validation:

    • Must be a list containing one or more of these values:
      • remote
      • on-site
      • hybrid
    • Values must always be lowercase.
    • Example valid outputs:
      {"remote": ["remote"]}
      {"remote": ["on-site", "remote"]}
      {"remote": ["hybrid"]}
      {"remote": ["remote", "on-site", "hybrid"]}
  4. All other keys must be rejected.

Tasks

  • Update validation logic only in __test__/runTest.py to enforce new key rules.
  • Add check for remote format (list, lowercase, allowed values only).
  • Ensure invalid scrapers throw a clear error message.
  • Add tests for valid and invalid scraper outputs.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #652

Copilot AI and others added 5 commits September 30, 2025 21:28
… remote field validation

Co-authored-by: lalalaurentiu <67306273+lalalaurentiu@users.noreply.github.com>
… examples

Co-authored-by: lalalaurentiu <67306273+lalalaurentiu@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix scraper validation rules for allowed keys" to "Fix scraper validation rules for allowed keys and remote field format" Sep 30, 2025
Copilot AI requested a review from lalalaurentiu September 30, 2025 21:37
lalalaurentiu marked this pull request as ready for review October 2, 2025 04:12
lalalaurentiu merged commit 645f331 into main Oct 2, 2025
2 checks passed