This project leverages Apache Spark and the spark-on-fhir toolkit to flatten complex clinical data (Observations, QuestionnaireResponses) into analytics-ready CSVs. Simultaneously, it generates rich metadata (DCAT, CSVW, SKOS) and publishes it to a FAIR Data Point (FDP).
The pipeline performs the following key operations:
- Extracts raw FHIR resources (`Patient`, `Observation`, `QuestionnaireResponse`, `Questionnaire`).
- Resolves terminology by joining patient answers with the full Questionnaire definitions.
- Transforms data into a wide-format patient profile with one row per patient (see the sketch after this list).
- Generates FAIR metadata:
- DCAT: Catalog, Dataset, and Distribution descriptions.
- CSVW: Schema definitions for the output data.
- SKOS: Concept schemes mapping survey codes to human-readable displays.
- Publishes metadata directly to a FAIR Data Point (FDP) and/or saves it locally as Turtle (`.ttl`) files.
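To illustrate the wide-format transform, here is a minimal Spark sketch of the pivot. It assumes a long-format DataFrame with one row per (patient, code, value) triple; the column names `patientId`, `code`, and `value` are illustrative assumptions, not the adapter's actual schema:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.first

// Minimal sketch of the wide-format transform. Column names (patientId,
// code, value) are assumptions for illustration, not the adapter's schema.
object WideProfileSketch {
  def toWideProfile(longDf: DataFrame): DataFrame =
    longDf
      .groupBy("patientId")                    // one output row per patient
      .pivot("code")                           // one column per Observation/question code
      .agg(first("value", ignoreNulls = true)) // keep the recorded value for each code
}
```

Pivoting over the full code set is what produces the one-row-per-patient layout described in the output section below.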
Before running the application, ensure you have the following environment set up:
- Java 11+
- Apache Spark 3.5.x
- FHIR Server: A running R4 FHIR Server (e.g., OnFhir) containing your source data.
- FDP Server (Optional): A running FAIR Data Point instance if you intend to publish metadata remotely.
- Dependencies:
  - `spark-on-fhir-sdk` (ensure this is available in your Maven repository)
  - `onfhir-feast` (used for metadata component extraction)
Clone the repository and build the "fat JAR" using Maven. This will bundle all necessary Scala dependencies.
```bash
cd stage-fhir-fdp-adapter
mvn -DskipTests clean package
```
You can configure the pipeline using either a JSON file or an Excel spreadsheet. The configuration controls the inputs (FHIR URL), the outputs (DCAT metadata), and the pipeline's behavior.
Edit `config.json` (or `config.xlsx`) to define your source FHIR server and metadata properties.
| Parameter | Description |
|---|---|
| `fhirUrl` | Base URL of the source FHIR server. |
| `fdpUrl` | (Optional) URL of the target FAIR Data Point. |
| `catalogTitle` | Title of the Data Catalog to be created. |
| `datasetTitle` | Title of the specific Dataset. |
| `outputDir` | Local directory to save CSV and TTL files. |
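As a concrete illustration, the snippet below reads such a `config.json` into a typed model using Spark itself. The `PipelineConfig` case class is a hypothetical sketch mirroring the parameter table, not the adapter's actual configuration API:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical config model mirroring the parameter table above;
// the adapter's real configuration classes may differ.
case class PipelineConfig(
  fhirUrl: String,        // e.g. "http://localhost:8080/fhir"
  fdpUrl: Option[String], // null in the JSON skips remote FDP publishing
  catalogTitle: String,
  datasetTitle: String,
  outputDir: String
)

object ConfigSketch {
  def load(spark: SparkSession, path: String = "config.json"): PipelineConfig = {
    import spark.implicits._
    spark.read
      .option("multiLine", "true") // the file is one JSON object, not JSON Lines
      .json(path)
      .as[PipelineConfig]
      .head()
  }
}
```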
The application supports three ways to load this configuration:
- `browser` (default): Serves a web form for filling in the configuration at runtime.
- `json`: Automatically loads a standard `config.json` from the classpath/working directory.
- `excel`: Automatically loads a standard `config.xlsx` from the classpath/working directory.
Use the provided shell script to submit the Spark job. You can customize the job type, output format, and configuration mode via CLI arguments.
```bash
./run-cli.sh --job survey --format csv --runMode browser
```
| Argument | Default | Description |
|---|---|---|
| `--job` | `survey` | The ETL pipeline to run. Currently supports `survey` (Healthy Aging); can be extended for other cohorts. |
| `--format` | `csv` | The output format for the patient data (`csv` or `parquet`). |
| `--runMode` | `browser` | How the app loads configuration (`json`, `excel`, or `browser`). |
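For reference, here is a minimal sketch of how these flags could be parsed into a map seeded with the defaults above; it is illustrative only, not the adapter's actual CLI parser:

```scala
// Illustrative flag parsing seeded with the defaults from the table above;
// the adapter's real CLI handling may differ.
object CliSketch {
  def parseArgs(args: Array[String]): Map[String, String] = {
    val defaults = Map("job" -> "survey", "format" -> "csv", "runMode" -> "browser")
    val parsed = args.sliding(2, 2).collect {
      case Array(flag, value) if flag.startsWith("--") =>
        flag.stripPrefix("--") -> value
    }.toMap
    defaults ++ parsed // explicit flags override the defaults
  }
}
```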
After a successful run, the `outputDir` will contain:
- `patient_profiles/`: A folder containing the extracted data in CSV format (one row per patient, columns for every Observation and survey question).
- `Catalog.ttl`: RDF description of the Data Catalog.
- `Dataset.ttl`: RDF description of the Dataset, linked to the Catalog.
- `Distribution.ttl`: RDF description of the CSV file, linked to the Dataset.
- `CSVW.ttl`: W3C CSV-on-the-Web schema describing columns and data types.
- `Vocabularies.ttl`: SKOS concepts defining the questions and answer options found in the survey.
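As a quick sanity check, the flattened profiles can be read back with Spark; the path below assumes `outputDir` was set to `output` in the configuration:

```scala
import org.apache.spark.sql.SparkSession

// Quick sanity check of the flattened output; the path assumes
// outputDir was set to "output" in the configuration.
object OutputCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("output-check")
      .master("local[*]")
      .getOrCreate()

    val profiles = spark.read
      .option("header", "true")
      .csv("output/patient_profiles")

    profiles.printSchema() // one column per Observation / survey question
    profiles.show(5)       // one row per patient
    spark.stop()
  }
}
```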