Merged
135 changes: 135 additions & 0 deletions __test__/README.md
@@ -0,0 +1,135 @@
# Scraper Validation Tests

This directory contains validation tests for the scraper system.

## Files

### runTest.py
Main validation script. It runs every changed scraper file under `sites/`, parses the JSON list
that each scraper prints to stdout, and checks every job in that list against the rules below.

**Validation Rules:**

1. **Required keys** (must be present with non-None values):
   - `company`: The company name
   - `job_title`: The job title
   - `job_link`: URL to the job posting

2. **Optional keys** (may be present):
   - `city`: City location(s)
   - `county`: County location(s)
   - `remote`: Work arrangement options (must be a list)

3. **Remote field validation** (when present):
   - Must be a list (even if empty)
   - All values must be lowercase
   - Only allowed values: `remote`, `on-site`, `hybrid`
   - Valid examples:
     ```json
     {"remote": ["remote"]}
     {"remote": ["on-site", "hybrid"]}
     {"remote": ["remote", "on-site", "hybrid"]}
     {"remote": []}
     ```

4. **All other keys are rejected** - no additional fields are allowed.
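
The four rules above can be sketched as a single function. This is a minimal illustration, not the code in `runTest.py`: the name `validate_job`, the use of `ValueError`, the order of the checks, and the extra check that each remote value is a string are all assumptions made for the example.

```python
REQUIRED_KEYS = ["company", "job_title", "job_link"]
OPTIONAL_KEYS = ["city", "county", "remote"]
ALLOWED_KEYS = REQUIRED_KEYS + OPTIONAL_KEYS
ALLOWED_REMOTE_VALUES = ["remote", "on-site", "hybrid"]


def validate_job(job):
    """Raise ValueError on the first rule that `job` (a dict) violates.

    Hypothetical helper for illustration; returns None when the job is valid.
    """
    # Rule 1: every required key is present and not None.
    for key in REQUIRED_KEYS:
        if key not in job:
            raise ValueError(f"Required key '{key}' is missing!")
        if job[key] is None:
            raise ValueError(f"Required key '{key}' has no value!")

    # Rule 4: nothing outside the required and optional keys (rule 2).
    for key in job:
        if key not in ALLOWED_KEYS:
            raise ValueError(
                f"Key '{key}' is not allowed! "
                f"Allowed keys are: {', '.join(ALLOWED_KEYS)}"
            )

    # Rule 3: `remote`, when present, is a list of allowed lowercase values.
    if "remote" in job:
        remote = job["remote"]
        if not isinstance(remote, list):
            raise ValueError(
                f"Key 'remote' must be a list, got {type(remote).__name__}!"
            )
        for value in remote:
            # Assumption: non-string entries are rejected up front so that
            # the lowercase comparison below cannot fail on them.
            if not isinstance(value, str):
                raise ValueError(f"Remote value {value!r} must be a string!")
            if value != value.lower():
                raise ValueError(f"Remote value '{value}' must be lowercase!")
            if value not in ALLOWED_REMOTE_VALUES:
                raise ValueError(
                    f"Remote value '{value}' is not allowed! "
                    f"Allowed values are: {', '.join(ALLOWED_REMOTE_VALUES)}"
                )
```

Calling `validate_job(job)` on each element of a scraper's output list would then either return quietly or raise with a message naming the first problem found.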

### test_validation.py
Unit tests for the validation logic. They exercise both valid and invalid scraper outputs
to confirm that the validation rules are enforced.

Run with:
```bash
python3 __test__/test_validation.py
```

### publish.py
Script for publishing validated jobs to the API.

## Running Tests

To run the validation tests:

```bash
# Run unit tests
python3 __test__/test_validation.py

# Run validation on changed scrapers
python3 __test__/runTest.py
```

## Example Valid Scraper Output

```json
[
{
"company": "ExampleCompany",
"job_title": "Software Engineer",
"job_link": "https://example.com/jobs/123",
"city": "București",
"county": "București",
"remote": ["remote", "on-site"]
},
{
"company": "ExampleCompany",
"job_title": "Data Scientist",
"job_link": "https://example.com/jobs/124"
}
]
```

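## Common Validation Errors

Each example below shows the message raised by `runTest.py`. Apart from the disallowed-key error, the script also appends the offending job to the message; that part is omitted here.
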
### ❌ Using 'country' field
```json
{
"company": "Test",
"job_title": "Engineer",
"job_link": "https://example.com/job",
"country": "Romania" // NOT ALLOWED
}
```
**Error:** `Key 'country' is not allowed! Allowed keys are: company, job_title, job_link, city, county, remote`

### ❌ Remote as string
```json
{
"company": "Test",
"job_title": "Engineer",
"job_link": "https://example.com/job",
"remote": "remote" // Should be ["remote"]
}
```
**Error:** `Key 'remote' must be a list, got str!`

### ❌ Uppercase remote value
```json
{
"company": "Test",
"job_title": "Engineer",
"job_link": "https://example.com/job",
"remote": ["Remote"] // Should be lowercase
}
```
**Error:** `Remote value 'Remote' must be lowercase!`

### ❌ Invalid remote value
```json
{
"company": "Test",
"job_title": "Engineer",
"job_link": "https://example.com/job",
"remote": ["work-from-home"] // Not in allowed list
}
```
**Error:** `Remote value 'work-from-home' is not allowed! Allowed values are: remote, on-site, hybrid`

### ❌ Missing required field
```json
{
"job_title": "Engineer",
"job_link": "https://example.com/job"
// Missing 'company'
}
```
**Error:** `Required key 'company' is missing!`
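
Several of the errors above are purely mechanical, so a scraper could clean its jobs before printing them. The sketch below is one possible way to do that. `normalize_job` is a hypothetical helper that does not exist in this repository, and silently dropping disallowed keys is an assumption that may not suit every scraper.

```python
ALLOWED_KEYS = {"company", "job_title", "job_link", "city", "county", "remote"}


def normalize_job(raw):
    """Return a copy of `raw` with mechanical format problems fixed.

    Hypothetical helper: it does not add missing required keys and does not
    translate unknown remote values such as "work-from-home".
    """
    # Drop keys the validator rejects, such as 'country'.
    job = {key: value for key, value in raw.items() if key in ALLOWED_KEYS}

    if "remote" in job:
        remote = job["remote"]
        if remote is None:
            # Treat a missing value as "no work arrangement known".
            remote = []
        elif isinstance(remote, str):
            # A bare string such as "remote" becomes a one-element list.
            remote = [remote]
        # Lowercase and trim every entry: "Remote" -> "remote".
        job["remote"] = [str(value).strip().lower() for value in remote]

    return job
```

A missing required key or a remote value outside `remote`, `on-site`, `hybrid` still has to be fixed in the scraper itself, because the right value depends on the source site.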
79 changes: 72 additions & 7 deletions __test__/runTest.py
@@ -1,3 +1,35 @@
+"""
+Scraper Validation Test Script
+
+This script validates scrapers by checking their output against required rules.
+
+Validation Rules:
+-----------------
+1. Required keys (must be present with non-None values):
+- company: The company name
+- job_title: The job title
+- job_link: URL to the job posting
+
+2. Optional keys (may be present):
+- city: City location(s)
+- county: County location(s)
+- remote: Work arrangement options (must be a list)
+
+3. Remote field validation (when present):
+- Must be a list (even if empty)
+- All values must be lowercase
+- Only allowed values: 'remote', 'on-site', 'hybrid'
+- Examples:
+* {"remote": ["remote"]}
+* {"remote": ["on-site", "hybrid"]}
+* {"remote": []}
+
+4. All other keys are rejected.
+
+The script runs all changed scraper files (in sites/ directory) and validates
+their JSON output against these rules.
+"""
+
 import subprocess
 import json
 import re
@@ -12,22 +44,55 @@
 for file in files:
     print(f'Running {file} ...')
     if file.startswith('sites/'):
-        run_file = subprocess.run(["python3", os.getcwd() + "/" + file], capture_output=True).stdout.decode('utf-8')
+        directory = os.path.abspath(file).rsplit('/', 1)[0].replace('__test__/', '')
+        file_name = file.rsplit('/', 1)[1]

-        pattern = re.compile(r"(\[.*?\])", re.DOTALL)
+        run_file = subprocess.run(["python3", directory + "/" + file_name], capture_output=True).stdout.decode('utf-8')
+
+        pattern = re.compile(r"(\[.*\])", re.DOTALL)
         matches = pattern.findall(run_file)

         scraper_obj = json.loads(matches[0])

-        keys = ['job_title', 'job_link', 'city', 'county','country', 'company']
+        # Define required and optional keys
+        required_keys = ['company', 'job_title', 'job_link']
+        optional_keys = ['city', 'county', 'remote']
+        allowed_keys = required_keys + optional_keys
+
+        # Define allowed remote values
+        allowed_remote_values = ['remote', 'on-site', 'hybrid']

         for job in scraper_obj:
+            # Check that all required keys are present
+            for req_key in required_keys:
+                if req_key not in job:
+                    raise Exception(f"Required key '{req_key}' is missing! \n {job}")
+
+            # Check each key in the job
             for key, value in job.items():
-                if key not in keys:
-                    raise Exception(f"Key {key} is not allowed!")
+                # Reject keys that are not in the allowed list
+                if key not in allowed_keys:
+                    raise Exception(f"Key '{key}' is not allowed! Allowed keys are: {', '.join(allowed_keys)}")

-                if value == None:
-                    raise Exception(f"Key {key} has no value! \n {job}")
+                # Check that required keys have non-None values
+                if key in required_keys and value == None:
+                    raise Exception(f"Required key '{key}' has no value! \n {job}")
+
+                # Validate remote field format
+                if key == 'remote':
+                    # Remote must be a list
+                    if not isinstance(value, list):
+                        raise Exception(f"Key 'remote' must be a list, got {type(value).__name__}! \n {job}")
+
+                    # Check each value in the remote list
+                    for remote_val in value:
+                        # Must be lowercase
+                        if remote_val != remote_val.lower():
+                            raise Exception(f"Remote value '{remote_val}' must be lowercase! \n {job}")
+
+                        # Must be in allowed values
+                        if remote_val not in allowed_remote_values:
+                            raise Exception(f"Remote value '{remote_val}' is not allowed! Allowed values are: {', '.join(allowed_remote_values)} \n {job}")

         print(f'✅ {file}')
     else:
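
The diff changes the extraction pattern from `\[.*?\]` to `\[.*\]`. The sketch below illustrates why that may matter once a job contains a nested list such as `remote`. The sample stdout is invented for the example; only the two patterns and the `findall` / `json.loads(matches[0])` steps come from the diff.

```python
import json
import re

# Simulated scraper stdout (made up for this sketch): a log line followed
# by the JSON list of jobs, one of which has a nested "remote" list.
stdout = (
    "Scraping ExampleCompany ...\n"
    '[{"company": "ExampleCompany", "job_title": "Software Engineer", '
    '"job_link": "https://example.com/jobs/123", '
    '"remote": ["remote", "on-site"]}]\n'
)

lazy = re.compile(r"(\[.*?\])", re.DOTALL)   # pattern before the change
greedy = re.compile(r"(\[.*\])", re.DOTALL)  # pattern after the change

# The lazy pattern stops at the first "]", which closes the nested
# "remote" list, so the outer object and list are left unterminated.
lazy_match = lazy.findall(stdout)[0]
try:
    json.loads(lazy_match)
    lazy_ok = True
except json.JSONDecodeError:
    lazy_ok = False

# The greedy pattern runs from the first "[" to the last "]" in the output.
greedy_match = greedy.findall(stdout)[0]
jobs = json.loads(greedy_match)

print(lazy_ok)            # False
print(jobs[0]["remote"])  # ['remote', 'on-site']
```

One trade-off of the greedy pattern is that it captures everything between the first `[` and the last `]`, so a scraper that prints square brackets in its log lines would likely break the parse.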