Ballerina ETL Library

Overview

This package provides APIs for data processing and manipulation that can be combined to build ETL (extract, transform, load) workflows.

The APIs in this package are categorized into the following ETL process stages:

Data Categorization
Data Cleaning
Data Enrichment
Data Filtering
Data Security
Unstructured Data Extraction

Features

Data Categorization

categorizeNumeric: Categorizes a dataset based on a numeric field and specified ranges.
categorizeRegexData: Categorizes a dataset based on a string field using a set of regular expressions.
categorizeSemantic: Categorizes a dataset based on a string field using semantic classification.
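
To make range-based categorization concrete, here is a minimal sketch in plain Ballerina of what bucketing by a numeric field can look like. It does not call the library: `Order`, `bucketByAmount`, and the half-open range convention are assumptions made for illustration, and the actual `categorizeNumeric` signature and output shape may differ.

```ballerina
import ballerina/io;

// Hypothetical record type used only for this illustration.
type Order record {|
    string id;
    float amount;
|};

// Hypothetical helper (not part of ballerina/etl). Consecutive values in
// `bounds` define half-open ranges [bounds[i], bounds[i + 1]).
// Records that fall outside every range are skipped.
function bucketByAmount(Order[] orders, float[] bounds) returns Order[][] {
    Order[][] buckets = [];
    foreach int i in 0 ..< bounds.length() - 1 {
        buckets.push([]);
    }
    foreach Order o in orders {
        foreach int i in 0 ..< bounds.length() - 1 {
            if o.amount >= bounds[i] && o.amount < bounds[i + 1] {
                buckets[i].push(o);
                break;
            }
        }
    }
    return buckets;
}

public function main() {
    Order[] orders = [
        {id: "A1", amount: 12.5},
        {id: "A2", amount: 80.0},
        {id: "A3", amount: 45.0}
    ];
    // Two ranges: [0, 50) and [50, 100).
    Order[][] buckets = bucketByAmount(orders, [0.0, 50.0, 100.0]);
    io:println(buckets);
}
```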

Data Cleaning

groupApproximateDuplicates: Identifies and groups approximate duplicates in a dataset, returning a nested array with unique records first, followed by groups of similar records.
handleWhiteSpaces: Returns a new dataset with all extra whitespace removed from string fields.
removeDuplicates: Returns a new dataset with all duplicate records removed.
removeEmptyValues: Returns a new dataset with all records containing nil or empty string values removed.
removeField: Returns a new dataset with a specified field removed from each record.
replaceText: Returns a new dataset where matches of the given regex pattern in a specified string field are replaced with a new value.
sortData: Returns a new dataset sorted by a specified field in ascending or descending order.
standardizeData: Returns a new dataset with all string values in a specified field standardized to a set of standard values.

Data Enrichment

joinData: Merges two datasets based on a common specified field and returns a new dataset with the merged records.
mergeData: Merges multiple datasets into a single dataset by flattening a nested array of records.
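
As a rough illustration of what "flattening a nested array of records" means, the sketch below concatenates several batches into one array in plain Ballerina. The `flatten` helper and the `Customer` type are hypothetical and only show the idea; the actual `mergeData` signature is not assumed here.

```ballerina
import ballerina/io;

// Hypothetical record type used only for this illustration.
type Customer record {|
    string name;
    string city;
|};

// Hypothetical helper (not part of ballerina/etl): appends every inner
// array, in order, to a single flat array.
function flatten(Customer[][] batches) returns Customer[] {
    Customer[] merged = [];
    foreach Customer[] batch in batches {
        merged.push(...batch);
    }
    return merged;
}

public function main() {
    Customer[][] batches = [
        [{name: "Alice", city: "New York"}],
        [{name: "Bob", city: "Los Angeles"}, {name: "Carol", city: "Chicago"}]
    ];
    io:println(flatten(batches));
}
```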

Data Filtering

filterDataByRatio: Filters a random set of records from a dataset based on a specified ratio.
filterDataByRegex: Filters a dataset based on a regex pattern match.
filterDataByRelativeExp: Filters a dataset based on a relative numeric comparison expression.
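
The idea behind regex-based filtering can be sketched with Ballerina's built-in regular expression support. The example below does not use the library: `Contact`, `keepMatching`, and the sample pattern are assumptions for illustration, and `filterDataByRegex` may take different parameters or return a different shape.

```ballerina
import ballerina/io;

// Hypothetical record type used only for this illustration.
type Contact record {|
    string name;
    string email;
|};

// Hypothetical helper (not part of ballerina/etl): keeps the records whose
// email field matches the whole pattern.
function keepMatching(Contact[] contacts, string:RegExp pattern) returns Contact[] {
    return from Contact c in contacts
        where pattern.isFullMatch(c.email)
        select c;
}

public function main() {
    Contact[] contacts = [
        {name: "Alice", email: "alice@example.com"},
        {name: "Bob", email: "bob@other.org"}
    ];
    Contact[] kept = keepMatching(contacts, re `[a-z.]+@example\.com`);
    io:println(kept);
}
```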

Data Security

decryptData: Returns a new dataset with specified fields decrypted using AES-ECB decryption with a given symmetric key.
encryptData: Returns a new dataset with specified fields encrypted using AES-ECB encryption with a given symmetric key.
maskSensitiveData: Returns a new dataset with PII (Personally Identifiable Information) fields masked using a specified character.
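
For readers unfamiliar with AES-ECB and symmetric keys, the sketch below shows a single-value round trip using the `ballerina/crypto` module directly. It is not how `encryptData` or `decryptData` are implemented or called: the `roundTrip` helper, the sample key, and the sample value are assumptions for illustration, and the way the library encodes ciphertext in the returned dataset is not covered here.

```ballerina
import ballerina/crypto;
import ballerina/io;

// Hypothetical helper (not part of ballerina/etl): encrypts a string with
// AES-ECB and decrypts it again with the same symmetric key.
function roundTrip(string value, byte[] key) returns string|error {
    byte[] cipherText = check crypto:encryptAesEcb(value.toBytes(), key);
    byte[] plainText = check crypto:decryptAesEcb(cipherText, key);
    return string:fromBytes(plainText);
}

public function main() returns error? {
    // Illustrative 16-byte (128-bit) key; AES keys are 16, 24, or 32 bytes.
    byte[] key = "0123456789abcdef".toBytes();
    string restored = check roundTrip("alice@example.com", key);
    io:println(restored == "alice@example.com");
}
```

ECB mode is deterministic: equal plaintext blocks encrypted under the same key produce equal ciphertext blocks, so identical field values remain recognizable as identical after encryption.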

Unstructured Data Extraction

extractFromText: Extracts unstructured data from a string and maps it to a Ballerina record.

Usage

Configurations

The following APIs in this package use OpenAI services and require an OpenAI API key:

categorizeSemantic
extractFromText
groupApproximateDuplicates
maskSensitiveData
standardizeData

Note: Configuration is required only for the APIs listed above. All other APIs in this package work without it.

Setting up the OpenAI API Key

Create an OpenAI account and obtain an API key.
Add the obtained API key and a supported GPT model in the Config.toml file as shown below:

```
[ballerina.etl.modelConfig]
openAiToken = "<OPENAI_API_KEY>"
model = "<GPT_MODEL>"
```

Supported GPT Models

"gpt-4-turbo"
"gpt-4o"
"gpt-4o-mini"

(Optional) Overriding Client Timeout

The default client timeout is set to 60 seconds. This value can be adjusted by specifying the timeout field as shown below:

```
[ballerina.etl.modelConfig]
openAiToken = "<OPENAI_API_KEY>"
model = "<GPT_MODEL>"
timeout = 120.0
```

Dependent Type Support

All APIs in this package support dependent types, so the type of the returned dataset is inferred from the type expected at the call site. For example:

```
import ballerina/etl;
import ballerina/io;

type Customer record {|
    string name;
    string city;
|};

public function main() returns error? {
    Customer[] dataset = [
        {name: "Alice", city: "New York"},
        {name: "Bob", city: "Los Angeles"},
        {name: "Alice", city: "New York"}
    ];
    Customer[] uniqueData = check etl:removeDuplicates(dataset);
    io:println(`Customer Data Without Duplicates : ${uniqueData}`);
}
```
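
To show that the result type is tied to the call site and not to one fixed record type, the sketch below binds the same `etl:removeDuplicates` call to a different record. The `Product` type, the `dedupe` wrapper, and the sample values are hypothetical and used only for illustration.

```ballerina
import ballerina/etl;
import ballerina/io;

// Hypothetical record type used only for this illustration.
type Product record {|
    string sku;
    decimal price;
|};

// Hypothetical wrapper: the result of etl:removeDuplicates is bound to
// Product[], the type expected on the left-hand side.
function dedupe(Product[] catalog) returns Product[]|error {
    Product[] unique = check etl:removeDuplicates(catalog);
    return unique;
}

public function main() returns error? {
    Product[] catalog = [
        {sku: "P-100", price: 19.99},
        {sku: "P-200", price: 5.50},
        {sku: "P-100", price: 19.99}
    ];
    io:println(check dedupe(catalog));
}
```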

Examples

The ballerina/etl package includes the following examples, each covering a different use case:

Customer Data Processing - Processes customer data collected from various sources by extracting relevant information, cleaning and validating fields, enriching with additional metadata, and categorizing the data for downstream applications.
Product Catalog Processing - Consolidates product catalog data from multiple sources by extracting and merging entries, encrypting sensitive fields, classifying products into relevant categories, and storing the structured data securely in a MySQL database for easy access and analysis.
User Feedback Analysis - Handles raw user feedback by extracting and standardizing input, classifying comments based on content and sentiment, and storing the processed feedback for further analysis.

Build from the source

Setting up the prerequisites

Download and install Java SE Development Kit (JDK) version 21. You can download it from either of the following sources:
- Oracle JDK
- OpenJDK
Note: After installation, remember to set the JAVA_HOME environment variable to the directory where JDK was installed.
Download and install Ballerina Swan Lake.
Download and install Docker.

Note: Ensure that the Docker daemon is running before executing any tests.
Export your GitHub personal access token with read package permissions as follows:
```
export packageUser=<Username>
export packagePAT=<Personal access token>
```

Build options

Execute the commands below to build from the source.

To build the package:
```
./gradlew clean build
```
To run the tests:
```
./gradlew clean test
```
To build the package without running the tests:
```
./gradlew clean build -x test
```

To run only specific test groups or test cases:

```
./gradlew clean test -Pgroups=<Comma separated groups/test cases>
```

To debug the package with a remote debugger:
```
./gradlew clean build -Pdebug=<port>
```

To debug with the Ballerina language:

```
./gradlew clean build -PbalJavaDebug=<port>
```

To publish the generated artifacts to the local Ballerina Central repository:
```
./gradlew clean build -PpublishToLocalCentral=true
```
To publish the generated artifacts to the Ballerina Central repository:
```
./gradlew clean build -PpublishToCentral=true
```

Contribute to Ballerina

As an open-source project, Ballerina welcomes contributions from the community.

For more information, go to the contribution guidelines.

Code of conduct

All contributors are encouraged to read the Ballerina Code of Conduct.

Useful links

For usage examples, go to Ballerina By Examples.
Chat live with us via our Discord server.
Post all technical questions on Stack Overflow with the #ballerina tag.
