This package provides a collection of APIs designed for data processing and manipulation, enabling seamless ETL workflows and supporting a variety of use cases.
The APIs in this package are categorized into the following ETL process stages:
- Data Categorization
- Data Cleaning
- Data Enrichment
- Data Filtering
- Data Security
- Unstructured Data Extraction
Data Categorization:
- categorizeNumeric: Categorizes a dataset based on a numeric field and specified ranges.
- categorizeRegexData: Categorizes a dataset based on a string field using a set of regular expressions.
- categorizeSemantic: Categorizes a dataset based on a string field using semantic classification.

Data Cleaning:
- groupApproximateDuplicates: Identifies and groups approximate duplicates in a dataset, returning a nested array with unique records first, followed by groups of similar records.
- handleWhiteSpaces: Returns a new dataset with all extra whitespace removed from string fields.
- removeDuplicates: Returns a new dataset with all duplicate records removed.
- removeEmptyValues: Returns a new dataset with all records containing nil or empty string values removed.
- removeField: Returns a new dataset with a specified field removed from each record.
- replaceText: Returns a new dataset where matches of the given regex pattern in a specified string field are replaced with a new value.
- sortData: Returns a new dataset sorted by a specified field in ascending or descending order.
- standardizeData: Returns a new dataset with all string values in a specified field standardized to a set of standard values.

Data Enrichment:
- joinData: Merges two datasets based on a common specified field and returns a new dataset with the merged records.
- mergeData: Merges multiple datasets into a single dataset by flattening a nested array of records.

Data Filtering:
- filterDataByRatio: Filters a random set of records from a dataset based on a specified ratio.
- filterDataByRegex: Filters a dataset based on a regex pattern match.
- filterDataByRelativeExp: Filters a dataset based on a relative numeric comparison expression.

Data Security:
- decryptData: Returns a new dataset with specified fields decrypted using AES-ECB decryption with a given symmetric key.
- encryptData: Returns a new dataset with specified fields encrypted using AES-ECB encryption with a given symmetric key.
- maskSensitiveData: Returns a new dataset with PII (Personally Identifiable Information) fields masked using a specified character.

Unstructured Data Extraction:
- extractFromText: Extracts unstructured data from a string and maps it to a Ballerina record.
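The encryptData and decryptData entries above refer to AES-ECB with a symmetric key. As a rough illustration of what that round trip looks like for a single value, here is a minimal sketch that calls the ballerina/crypto module directly. It is not the package's implementation, and the key and the sample value are made up for the example.

```ballerina
import ballerina/crypto;
import ballerina/io;

public function main() returns error? {
    // Assumption: a 16-byte (128-bit) key chosen only for this illustration.
    byte[] key = "0123456789abcdef".toBytes();

    // Encrypt one field value with AES-ECB (default padding).
    byte[] cipherText = check crypto:encryptAesEcb("Alice".toBytes(), key);

    // Decrypt with the same symmetric key to recover the original bytes.
    byte[] plainText = check crypto:decryptAesEcb(cipherText, key);

    io:println(check string:fromBytes(plainText));
}
```

Because the key is symmetric, the same value must be supplied for both operations; a different key on the decrypt side would result in an error or unreadable output.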
The following APIs in this package use OpenAI services and require an OpenAI API key:
- categorizeSemantic
- extractFromText
- groupApproximateDuplicates
- maskSensitiveData
- standardizeData
Note: Configuration is required only for the APIs listed above. It is not needed for the use of any other APIs in this package.
- Create an OpenAI account and obtain an API key.
- Add the obtained API key and a supported GPT model in the Config.toml file as shown below:
  [ballerina.etl.modelConfig]
  openAiToken = "<OPENAI_API_KEY>"
  model = "<GPT_MODEL>"
  Supported values for model: "gpt-4-turbo", "gpt-4o", "gpt-4o-mini".
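For readers unfamiliar with how a Config.toml table reaches Ballerina code, the sketch below shows the general mechanism using a configurable record. The ModelConfig type and the modelConfig variable here are hypothetical and declared in the example's own package; the actual declarations inside ballerina/etl may differ.

```ballerina
import ballerina/io;

// Hypothetical record mirroring the keys of the TOML table.
type ModelConfig record {|
    string openAiToken;
    string model;
    // Optional key; omitted keys fall back to this default.
    decimal timeout = 60;
|};

// "= ?" marks the value as required: it must be supplied through Config.toml.
configurable ModelConfig modelConfig = ?;

public function main() {
    // Avoid printing the token itself; only the model name is shown.
    io:println("Using model: ", modelConfig.model);
}
```

In a standalone package like this sketch, the table would be written as [modelConfig] in that package's Config.toml. The [ballerina.etl.modelConfig] form used in this document addresses the variable by the organization and module that own it.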
The default client timeout is set to 60 seconds. This value can be adjusted by specifying the timeout field as shown below:
[ballerina.etl.modelConfig]
openAiToken = "<OPENAI_API_KEY>"
model = "<GPT_MODEL>"
timeout = 120.0

All APIs in this package support dependent types. Here is an example of how to use them:
import ballerina/etl;
import ballerina/io;

type Customer record {|
    string name;
    string city;
|};

public function main() returns error? {
    Customer[] dataset = [
        { name: "Alice", city: "New York" },
        { name: "Bob", city: "Los Angeles" },
        { name: "Alice", city: "New York" }
    ];
    Customer[] uniqueData = check etl:removeDuplicates(dataset);
    io:println(`Customer Data Without Duplicates : ${uniqueData}`);
}

The ballerina/etl package provides practical examples illustrating its usage in various scenarios. Explore these examples, covering different use cases:
- Customer Data Processing - Processes customer data collected from various sources by extracting relevant information, cleaning and validating fields, enriching with additional metadata, and categorizing the data for downstream applications.
- Product Catalog Processing - Consolidates product catalog data from multiple sources by extracting and merging entries, encrypting sensitive fields, classifying products into relevant categories, and storing the structured data securely in a MySQL database for easy access and analysis.
- User Feedback Analysis - Handles raw user feedback by extracting and standardizing input, classifying comments based on content and sentiment, and storing the processed feedback for further analysis.
- Download and install Java SE Development Kit (JDK) version 21. You can download it from either of the following sources:
  Note: After installation, remember to set the JAVA_HOME environment variable to the directory where the JDK was installed.
- Download and install Ballerina Swan Lake.
- Download and install Docker.
  Note: Ensure that the Docker daemon is running before executing any tests.
- Export a GitHub personal access token with read package permissions as follows:
  export packageUser=<Username>
  export packagePAT=<Personal access token>
Execute the commands below to build from the source.
- To build the package:
  ./gradlew clean build
- To run the tests:
  ./gradlew clean test
- To build the package without the tests:
  ./gradlew clean build -x test
- To run tests against different environments:
  ./gradlew clean test -Pgroups=<Comma separated groups/test cases>
- To debug the package with a remote debugger:
  ./gradlew clean build -Pdebug=<port>
- To debug with the Ballerina language:
  ./gradlew clean build -PbalJavaDebug=<port>
- To publish the generated artifacts to the local Ballerina Central repository:
  ./gradlew clean build -PpublishToLocalCentral=true
- To publish the generated artifacts to the Ballerina Central repository:
  ./gradlew clean build -PpublishToCentral=true
As an open-source project, Ballerina welcomes contributions from the community.
For more information, go to the contribution guidelines.
All the contributors are encouraged to read the Ballerina Code of Conduct.
- For example demonstrations of the usage, go to Ballerina By Examples.
- Chat live with us via our Discord server.
- Post all technical questions on Stack Overflow with the #ballerina tag.