Here is the core of the peviitor project.
Mostly, it's about data and the quality of data. But it's also about retrieving the data you are searching for through a full-text indexed search engine. The workflows, pipelines, and code that validate the rules and keep the index up to date are also part of peviitor_core.
| Field | Type | Required | Description and rules |
|---|---|---|---|
| url | string | Yes | Full URL to the job detail page. Unique. Must be a valid HTTP/HTTPS URL pointing to the canonical job detail page. |
| title | string | Yes | Exact position title. Max 200 chars, no HTML, trimmed whitespace, DIACRITICS ACCEPTED (ăâîșțĂÂÎȘȚ). |
| company | string | No | Name of the hiring company: the full legal name, not just a brand or a code. Must match Company.name exactly (case insensitive, DIACRITICS REQUIRED). |
| cif | string | No | CIF/CUI (fiscal identification code). Needed to disambiguate companies: e.g. Systematic SRL exists under the same name in three counties (Bihor, Arad, Timiș). |
| location | string[] | No | Location or detailed address. Romanian cities/addresses, DIACRITICS ACCEPTED (e.g. "București", "Cluj-Napoca"). Multi-valued, stored as an array. |
| tags | string[] | No | Skills/education/experience tags. Lowercase, max 20 entries, standardized values only, NO DIACRITICS. |
| workmode | string | No | Work mode. Only: "remote", "on-site", "hybrid". |
| date | date | No | Scrape/index date. UTC ISO 8601 timestamp of the scrape (e.g. "2026-01-18T10:00:00Z"). |
| status | string | No | Validation status. Starts as "scraped" and progresses: scraped → tested → published → verified. |
| vdate | date | No | Verification date (ISO 8601). Set only when status="verified". |
| expirationdate | date | No | Estimated job expiration date. At most vdate + 30 days, or extracted from the job page. |
| salary | string | No | Salary range + currency (e.g. "5000-8000 RON", "4000 EUR"). Format: "MIN-MAX CURRENCY". Must be a string, not an array. |
Note: Fields marked as `string[]` are multi-valued arrays. In SOLR/OpenSearch these are stored as arrays (e.g., `["București", "Cluj-Napoca"]`).
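As an illustration, a job document that satisfies the rules in the table above could look like the following Python dict (all values are hypothetical; the field names are the ones defined in the schema):

```python
# A hypothetical job document following the schema rules above
job = {
    "url": "https://www.example.ro/cariere/backend-developer",  # canonical detail page (made-up URL)
    "title": "Programator Backend",            # max 200 chars, diacritics accepted
    "company": "Example Software SRL",         # full legal name, not a brand
    "cif": "12345678",
    "location": ["București", "Cluj-Napoca"],  # multi-valued, diacritics accepted
    "tags": ["python", "solr", "experienta"],  # lowercase, no diacritics, max 20
    "workmode": "remote",                      # one of: remote / on-site / hybrid
    "date": "2026-01-18T10:00:00Z",            # UTC ISO 8601 scrape timestamp
    "status": "scraped",                       # lifecycle starts at "scraped"
    "salary": "5000-8000 RON",                 # "MIN-MAX CURRENCY", a string, not an array
}
```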
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | CIF/CUI of the company (e.g. "12345678"). Exactly 8 digits, without the "RO" prefix. |
| company | string | Yes | Exact name for job matching: the legal name from the Trade Register, DIACRITICS REQUIRED (e.g. "Tehnologia Informației"). |
| status | string | No | Company status. Only: "activ", "suspendat", "inactiv", "radiat". If the company is not active, remove its jobs and then remove the company. |
| location | string[] | No | Location or detailed address. Romanian cities/addresses, DIACRITICS ACCEPTED (e.g. "București", "Cluj-Napoca"). Multi-valued, stored as an array. |
| website | string[] | No | Official company website. Must be a valid HTTP/HTTPS URL, preferably canonical, without a trailing slash (e.g. "https://www.example.ro"). Multi-valued, stored as an array. |
| career | string[] | No | Official company career page. Must be a valid HTTP/HTTPS URL, preferably canonical, without a trailing slash, pointing to the jobs/careers section (e.g. "https://www.example.ro/careers"). Multi-valued, stored as an array. |
Note: Fields marked as `string[]` are multi-valued arrays. In SOLR/OpenSearch these are stored as arrays (e.g., `["https://example.com", "https://careers.example.com"]`).
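A company document that follows these rules could be sketched the same way (values are hypothetical):

```python
# A hypothetical company document following the schema rules above
company = {
    "id": "12345678",                          # CIF/CUI, 8 digits, no "RO" prefix
    "company": "Example Software SRL",         # legal name from the Trade Register
    "status": "activ",                         # activ / suspendat / inactiv / radiat
    "location": ["București"],                 # multi-valued, diacritics accepted
    "website": ["https://www.example.ro"],     # valid URL, no trailing slash
    "career": ["https://www.example.ro/careers"],
}
```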
| Technology | Status | Use Case | Notes |
|---|---|---|---|
| Apache SOLR | ✅ Primary | Production indexing, diacritics RO, complex schemas | Job/Company/Auth models, cron integration |
| OpenSearch | ✅ Primary | SOLR alternative, AWS compatible | Same schema, managed hosting |
| Elasticsearch | ⚠️ Legacy | Existing peviitor scrapers | Compatibility only |
| Typesense | 🚀 MVP/Prototype | Ultra-fast UI search (<50ms) | Typo-tolerant, developer friendly |
- Project is OPEN SOURCE.
- Security and ways-of-working procedures will be part of the project.
- How to connect to it and how to use it will be captured in documentation.
- All pull requests will be documented.
- peviitor_core is not a closed project but an extensible one.
Performance, reliability, recovery from disaster, scalability, and validity are the most valuable benefits this project delivers.
- analyzer: "romanian" preserves diacritics ȘȚĂÂÎ
- search: "Bucuresti" matches "București" automatically
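The diacritics-insensitive matching works like ASCII folding applied during analysis. The SOLR/OpenSearch analyzer does this internally; a minimal Python sketch of the same folding, using Unicode NFD decomposition, shows why "Bucuresti" and "București" end up as the same token:

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Strip diacritics the way an ASCII-folding filter would:
    decompose to base letter + combining mark, then drop the marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# fold_diacritics("București") and "Bucuresti" produce the same indexed token
```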
Purpose: Remove expired job listings automatically
Schedule: Daily @ 02:00 AM EET
Logic:
- DELETE jobs WHERE `expirationdate` < NOW() AND `status`="verified"
- SOLR/OpenSearch: `delete_by_query` with a range query on `expirationdate`
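The delete step above can be sketched against SOLR's update endpoint. This is a minimal illustration, not the project's actual job: the base URL, the `jobs` core name, and the commit strategy are assumptions; the field names come from the schema.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr/jobs"  # assumed host and core name

def build_purge_query() -> str:
    # Range query: expiration date in the past AND job already verified
    return 'expirationdate:[* TO NOW] AND status:"verified"'

def purge_expired_jobs() -> None:
    """Issue a SOLR delete-by-query for expired, verified jobs."""
    body = json.dumps({"delete": {"query": build_purge_query()}}).encode()
    req = urllib.request.Request(
        f"{SOLR_URL}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```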
Purpose: Validate daily that job URLs are still active
Schedule: Daily @ 06:00 AM EET
Workflow:
- SELECT jobs WHERE `status`="verified" AND `date` > 1 day ago
- Issue parallel HTTP HEAD requests (max 1000 concurrent) to each job's `url`
- 404 → DELETE the job immediately
- 200 OK → parse the content for "expirat"/"ocupat"/"închis"/"no longer available"/"filled"
- Closed-job content found → SET `status`="tested", schedule a recheck in 24 h
- Still valid → UPDATE `status`="verified", `vdate`=NOW()

Max batch: 50k jobs/day, prioritize newest jobs first.
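The classification logic of that workflow can be sketched as follows. This is an illustrative sketch, not the project's checker: the helper names (`classify`, `probe`, `check_jobs`), the timeout, and the worker count are assumptions; the closed-job markers and the 404/200 rules come from the workflow above.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Phrases that mean the job is closed, per the workflow above
CLOSED_MARKERS = ("expirat", "ocupat", "închis", "no longer available", "filled")

def classify(status_code: int, body: str) -> str:
    """Map an HTTP probe result onto the job lifecycle."""
    if status_code == 404:
        return "delete"          # page gone: delete the job immediately
    if status_code == 200 and not any(m in body.lower() for m in CLOSED_MARKERS):
        return "verified"        # still live: update vdate = NOW()
    return "tested"              # closed markers or odd status: recheck in 24 h

def probe(url: str) -> str:
    """HEAD-check one job URL (content parsing would need a follow-up GET)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return classify(resp.status, "")
    except urllib.error.HTTPError as exc:
        return classify(exc.code, "")

def check_jobs(urls, workers=100):
    # A thread pool stands in for the "max 1000 concurrent" requirement
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(probe, urls)))
```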
The following components are considered plugins for the peviitor core project:
- BFF API — Backend-for-Frontend API layers that tailor data and endpoints for different clients.
- UI — Web or mobile user interfaces and frontend components.
- Scrapers — Automated data collectors that fetch and normalize job and company data.
- Manual data validator — Tools or interfaces used by humans to validate and correct data.
- Integrations — Connectors to external services (analytics, exporters, auth providers, etc.).