This project leverages Apache Spark and the spark-on-fhir toolkit to flatten complex clinical data (Observations, QuestionnaireResponses) into analytics-ready CSVs. Simultaneously, it generates rich metadata (DCAT, CSVW, SKOS) and publishes it to a FAIR Data Point (FDP).
The pipeline performs the following key operations:
- Extracts raw FHIR resources (`Patient`, `Observation`, `QuestionnaireResponse`, `Questionnaire`).
- Resolves terminology by joining patient answers with the full Questionnaire definitions.
- Transforms data into a wide-format patient profile with one row per patient (see the sketch after this list).
- Generates FAIR metadata:
- DCAT: Catalog, Dataset, and Distribution descriptions.
- CSVW: Schema definitions for the output data.
- SKOS: Concept schemes mapping survey codes to human-readable displays.
- Publishes metadata directly to a FAIR Data Point (FDP) and/or saves it locally as Turtle (`.ttl`) files.
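To illustrate the wide-format transform, here is a minimal Spark sketch of the pivot. It assumes a long-format DataFrame with one row per (patient, code, value) triple; the column names `patientId`, `code`, and `value` are illustrative assumptions, not the adapter's actual schema:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.first

// Minimal sketch of the wide-format transform. Column names (patientId,
// code, value) are assumptions for illustration, not the adapter's schema.
object WideProfileSketch {
  def toWideProfile(longDf: DataFrame): DataFrame =
    longDf
      .groupBy("patientId")                    // one output row per patient
      .pivot("code")                           // one column per Observation/question code
      .agg(first("value", ignoreNulls = true)) // keep the recorded value for each code
}
```

Pivoting over the full code set is what produces the one-row-per-patient layout described in the output section below.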
Before running the application, ensure you have the following environment set up:
- Java 11+
- Apache Spark 3.5.x
- FHIR Server: A running R4 FHIR Server (e.g., OnFhir) containing your source data.
- FDP Server (Optional): A running FAIR Data Point instance if you intend to publish metadata remotely.
- Dependencies:
  - `spark-on-fhir-sdk` (ensure this is available in your Maven repository)
  - `onfhir-feast` (used for metadata component extraction)
Clone the repository and build the "fat JAR" using Maven. This will bundle all necessary Scala dependencies.
```bash
cd stage-fhir-fdp-adapter
mvn -DskipTests clean package
```
You can configure the pipeline using either a JSON file or an Excel spreadsheet. The configuration controls the inputs (FHIR URL), the outputs (DCAT metadata), and the pipeline's behavior.
Edit `config.json` (or `config.xlsx`) to define your source FHIR server and metadata properties.
| Parameter | Description |
|---|---|
| `fhirUrl` | Base URL of the source FHIR server. |
| `fdpUrl` | (Optional) URL of the target FAIR Data Point. |
| `catalogTitle` | Title of the Data Catalog to be created. |
| `datasetTitle` | Title of the specific Dataset. |
| `outputDir` | Local directory to save CSV and TTL files. |
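As a concrete illustration, the snippet below reads such a `config.json` into a typed model using Spark itself. The `PipelineConfig` case class is a hypothetical sketch mirroring the parameter table, not the adapter's actual configuration API:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical config model mirroring the parameter table above;
// the adapter's real configuration classes may differ.
case class PipelineConfig(
  fhirUrl: String,        // e.g. "http://localhost:8080/fhir"
  fdpUrl: Option[String], // null in the JSON skips remote FDP publishing
  catalogTitle: String,
  datasetTitle: String,
  outputDir: String
)

object ConfigSketch {
  def load(spark: SparkSession, path: String = "config.json"): PipelineConfig = {
    import spark.implicits._
    spark.read
      .option("multiLine", "true") // the file is one JSON object, not JSON Lines
      .json(path)
      .as[PipelineConfig]
      .head()
  }
}
```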
The application supports three ways to load this configuration:
- `browser` (default): Serves a web form for filling in the configuration at runtime.
- `json`: Automatically loads a standard `config.json` from the classpath/working directory.
- `excel`: Automatically loads a standard `config.xlsx` from the classpath/working directory.
Use the provided shell script to submit the Spark job. You can customize the job type, output format, and configuration mode via CLI arguments.
```bash
./run-cli.sh --job survey --format csv --runMode browser
```
| Argument | Default | Description |
|---|---|---|
| `--job` | `survey` | The ETL pipeline to run. Currently supports `survey` (Healthy Aging); can be extended for other cohorts. |
| `--format` | `csv` | The output format for the patient data (`csv` or `parquet`). |
| `--runMode` | `browser` | How the app loads configuration (`json`, `excel`, or `browser`). |
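For reference, here is a minimal sketch of how these flags could be parsed into a map seeded with the defaults above; it is illustrative only, not the adapter's actual CLI parser:

```scala
// Illustrative flag parsing seeded with the defaults from the table above;
// the adapter's real CLI handling may differ.
object CliSketch {
  def parseArgs(args: Array[String]): Map[String, String] = {
    val defaults = Map("job" -> "survey", "format" -> "csv", "runMode" -> "browser")
    val parsed = args.sliding(2, 2).collect {
      case Array(flag, value) if flag.startsWith("--") =>
        flag.stripPrefix("--") -> value
    }.toMap
    defaults ++ parsed // explicit flags override the defaults
  }
}
```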
After a successful run, the `outputDir` will contain:
- `patient_profiles/`: A folder containing the extracted data in CSV format (one row per patient, columns for every Observation and survey question).
- `Catalog.ttl`: RDF description of the Data Catalog.
- `Dataset.ttl`: RDF description of the Dataset, linked to the Catalog.
- `Distribution.ttl`: RDF description of the CSV file, linked to the Dataset.
- `CSVW.ttl`: W3C CSV-on-the-Web schema describing columns and data types.
- `Vocabularies.ttl`: SKOS concepts defining the questions and answer options found in the survey.
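As a quick sanity check, the flattened profiles can be read back with Spark; the path below assumes `outputDir` was set to `output` in the configuration:

```scala
import org.apache.spark.sql.SparkSession

// Quick sanity check of the flattened output; the path assumes
// outputDir was set to "output" in the configuration.
object OutputCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("output-check")
      .master("local[*]")
      .getOrCreate()

    val profiles = spark.read
      .option("header", "true")
      .csv("output/patient_profiles")

    profiles.printSchema() // one column per Observation / survey question
    profiles.show(5)       // one row per patient
    spark.stop()
  }
}
```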