
feat: support Spark-compatible string_to_map function #20120

Draft
unknowntpo wants to merge 18 commits into apache:main from unknowntpo:feat-string-to-map-fn

Conversation


@unknowntpo unknowntpo commented Feb 3, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  • Add Spark-compatible string_to_map function in datafusion-spark crate
  • Function signature: string_to_map(text, [pairDelim], [keyValueDelim]) -> Map<String, String>
    • text: The input string
    • pairDelim: Delimiter between key-value pairs (default: ,)
    • keyValueDelim: Delimiter between key and value (default: :)
  • Supports alias str_to_map
  • Located in function/map/ module (returns Map type)

Examples

SELECT string_to_map('a:1,b:2,c:3');
-- {a: 1, b: 2, c: 3}

SELECT string_to_map('a=1;b=2', ';', '=');
-- {a: 1, b: 2}

SELECT str_to_map('key:value');
-- {key: value}

Are these changes tested?

Are there any user-facing changes?

Yes, the string_to_map and str_to_map functions are now available in Spark-compatible SQL.

@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and spark labels on Feb 3, 2026
unknowntpo and others added 6 commits February 3, 2026 10:12
Adds string_to_map (alias: str_to_map) function that creates a map
from a string by splitting on delimiters.

- Supports 1-3 args: text, pair_delim (default ','), key_value_delim (default ':')
- Returns Map<Utf8, Utf8>
- NULL input returns NULL
- Empty string returns {"": NULL} (Spark behavior)
- Missing key_value_delim results in NULL value
- Duplicate keys: last wins (LAST_WIN policy)

Test cases derived from Spark v4.0.0 ComplexTypeSuite.scala.
The function returns Map type so it belongs in the map module.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
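A minimal sketch of the semantics listed in this commit message (a hypothetical standalone helper, not the PR's code; the actual implementation builds Arrow map buffers rather than a HashMap, and preserves entry order where a HashMap does not):

```rust
use std::collections::HashMap;

/// Hypothetical helper illustrating the described parsing rules.
fn parse_string_to_map(
    text: Option<&str>,
    pair_delim: &str,
    kv_delim: &str,
) -> Option<HashMap<String, Option<String>>> {
    let text = text?; // NULL input returns NULL
    let mut map = HashMap::new();
    // "" splits into [""], yielding key "" with a NULL value (Spark behavior)
    for pair in text.split(pair_delim) {
        let mut kv = pair.splitn(2, kv_delim);
        let key = kv.next().unwrap_or("").to_string();
        // Pair without key_value_delim -> NULL value
        let value = kv.next().map(str::to_string);
        map.insert(key, value); // duplicate keys: last wins
    }
    Some(map)
}
```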
Follows the source code move in the previous commit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@unknowntpo unknowntpo force-pushed the feat-string-to-map-fn branch from 7fa0217 to f97c89e on February 3, 2026 02:12
for row_idx in 0..num_rows {
    if text_array.is_null(row_idx) {
        null_buffer[row_idx] = false;
        offsets.push(*offsets.last().unwrap());
Contributor:
Will the last() call return None?

Contributor:
No, offsets is initialized with one element, 0.

Author (unknowntpo), Feb 3, 2026:
No, last() will never return None here. The offsets vector is initialized with vec![0], so it always has at least one element before the loop starts.

I've refactored this and introduced a current_offset variable to avoid confusion.
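A sketch of the described refactor (assumed shape, not the exact diff):

```rust
// Running end-offset of the entries child array; offsets[0] is always 0,
// so the old offsets.last().unwrap() could never panic, but the explicit
// counter makes the invariant visible.
let mut current_offset: i32 = 0;
let mut offsets: Vec<i32> = vec![current_offset];

// Per row: append the row's entries, then record where the row ends.
// current_offset += entries_appended_for_this_row;
// offsets.push(current_offset);
```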

# Test cases derived from Spark test("StringToMap"):
# https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
#
# Note: Duplicate key handling uses LAST_WIN policy (not EXCEPTION which is Spark default)
Contributor:
/// <https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map>
///
/// Creates a map from a string by splitting on delimiters.
/// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>

nit (I copied from the Spark link above): we should keep this consistent with the Spark doc IMO

Suggested change
/// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
/// str_to_map(text[, pairDelim[, keyValueDelim]]) -> Map<String, String>

Author:
You're right, I'll change to str_to_map.

}

fn string_to_map_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
    let text_array = &args[0];
Contributor (dentiny), Feb 3, 2026:
Curious: should we check (or assert, if the signature already guards against bad usage) that the arg count cannot be >= 4? And add a unit test?

Author:
        Self {
            signature: Signature::one_of(
                vec![
                    // string_to_map(text)
                    TypeSignature::String(1),
                    // string_to_map(text, pairDelim)
                    TypeSignature::String(2),
                    // string_to_map(text, pairDelim, keyValueDelim)
                    TypeSignature::String(3),
                ],
                Volatility::Immutable,
            ),
            aliases: vec![String::from("str_to_map")],
        }

The signature makes sure this will not happen, but I've added an assertion for defense (see the sketch below).
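The defensive assertion might look like this (a sketch, assuming args is the slice of input arrays):

```rust
// Sketch: the signature already restricts arity to 1..=3; this assert only
// documents and enforces that invariant.
debug_assert!(
    (1..=3).contains(&args.len()),
    "string_to_map expects between 1 and 3 arguments, got {}",
    args.len()
);
```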


}

/// Extract scalar string value from array (assumes all values are the same)
fn get_scalar_string(array: &ArrayRef) -> Result<String> {
Contributor:
Suggested change
fn get_scalar_string(array: &ArrayRef) -> Result<String> {
fn get_delimeter_scalar_string(array: &ArrayRef) -> Result<String> {

Do you think it matches the intention (since you clearly said it's delim parsing at L216)?

Author:
Good catch, renamed to extract_delimiter_from_string_array with proper testing.

)
})?;

if string_array.len() == 0 {
Contributor:
Curious, in which case would the len be 0? I thought we should assert the len 😲

Author:
Good catch, I've added an assertion here.

// "a:" -> kv = ["a", ""] -> key="a", value=Some("")
// ":1" -> kv = ["", "1"] -> key="", value=Some("1")
let kv: Vec<&str> = pair.splitn(2, &kv_delim).collect();
let key = kv[0];
Contributor:
let mut iter = pair.splitn(2, kv_delim);
let key = iter.next().unwrap_or("");
let value = iter.next(); // already an Option<&str>; None when kv_delim is absent

so we don't need a heap allocation for the vector?

Author:
fixed.


// Split text into key-value pairs using pair_delim.
// Example: "a:1,b:2" with pair_delim="," -> ["a:1", "b:2"]
let pairs: Vec<&str> = text.split(&pair_delim).collect();
Contributor:
for pair in text.split(pair_delim) {

to avoid heap allocation

Author:
fixed.

unknowntpo and others added 11 commits February 3, 2026 18:00
- Replace len==0 check with assert (delimiter array should never be empty)
- Add comment explaining scalar expansion in columnar execution
- Add unit test for delimiter extraction (single, multi-char, expanded scalar)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add multi-row test with default delimiters
- Add multi-row test with custom delimiters (comma and equals)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace `offsets.last().unwrap()` with explicit `current_offset` tracking
- Add table-driven unit tests covering s0-s6 Spark test cases + null input
- Add multi-row test demonstrating Arrow MapArray internal structure
- Import NullBuffer at module level for cleaner code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Documents current behavior and adds TODO for Spark's EXCEPTION default.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
}

fn name(&self) -> &str {
    "string_to_map"
Contributor:
Is there a reference to this alias? As far as I can tell Spark only has str_to_map

Author:
You're right, I'll change to str_to_map.

};

// Process each row
let text_array = text_array
Contributor:
let text_array = as_string_array(text_array)?;

Easier downcasting: https://docs.rs/datafusion/latest/datafusion/common/cast/fn.as_string_array.html

However we need to consider that other string types exist such as LargeUtf8 and Utf8View
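A hedged sketch of handling all three Arrow string types (cast helper names assumed from datafusion_common::cast; rows_as_strings is a hypothetical helper, and a real implementation would likely stay borrowed rather than allocate owned Strings):

```rust
use arrow::array::ArrayRef;
use arrow::datatypes::DataType;
use datafusion_common::cast::{
    as_large_string_array, as_string_array, as_string_view_array,
};
use datafusion_common::{exec_err, Result};

/// Materialize each row as Option<String> regardless of string encoding.
fn rows_as_strings(array: &ArrayRef) -> Result<Vec<Option<String>>> {
    Ok(match array.data_type() {
        DataType::Utf8 => as_string_array(array)?
            .iter()
            .map(|v| v.map(str::to_string))
            .collect(),
        DataType::LargeUtf8 => as_large_string_array(array)?
            .iter()
            .map(|v| v.map(str::to_string))
            .collect(),
        DataType::Utf8View => as_string_view_array(array)?
            .iter()
            .map(|v| v.map(str::to_string))
            .collect(),
        other => return exec_err!("string_to_map: unsupported input type {other}"),
    })
}
```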

"Delimiter array should not be empty"
);

// In columnar execution, scalar delimiter is expanded to array to match batch size.
Contributor:
We can't assume this; for example this is a valid test case that will fail:

query ?
SELECT string_to_map(col1, col2, col3)
FROM (VALUES ('a=1,b=2', ',', '='), ('x#9', ',', '#'), (NULL, ',', '=')) AS t(col1, col2, col3);
----
{a: 1, b: 2}
{x: 9}
NULL
  • Delimiters can vary per row

We should either support only scalar delimiters for now (look at invoke_with_args and how we can work with ColumnarValues directly) or ensure we respect per-row delimiters; a sketch of the scalar-only option follows below.
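A sketch of the scalar-only option (hypothetical helper working on the ColumnarValue arguments that invoke_with_args receives):

```rust
use datafusion_common::{exec_err, Result, ScalarValue};
use datafusion_expr::ColumnarValue;

/// Reject per-row (array) delimiters instead of silently reading row 0.
fn delimiter_scalar(arg: &ColumnarValue) -> Result<String> {
    match arg {
        ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => Ok(s.clone()),
        ColumnarValue::Scalar(ScalarValue::Utf8(None)) => {
            exec_err!("string_to_map: delimiter must not be NULL")
        }
        ColumnarValue::Array(_) => {
            exec_err!("string_to_map: only scalar delimiters are supported")
        }
        _ => exec_err!("string_to_map: delimiter must be a UTF-8 string"),
    }
}
```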

// Test cases derived from Spark ComplexTypeSuite:
// https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
#[test]
fn test_string_to_map_cases() {
Contributor:
Is it possible to move all these test cases to SLTs?

let mut null_buffer = vec![true; num_rows];

for row_idx in 0..num_rows {
    if text_array.is_null(row_idx) {
Contributor:
If we decide to support per-row delimiters we'll need to consider their nullability; could consider using NullBuffer::union to build the final nullbuffer upfront once, though keep in mind we'll have up to 3 input arrays
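A sketch of the NullBuffer::union idea (input parameter names are hypothetical):

```rust
use arrow::array::Array;
use arrow::buffer::NullBuffer;

/// Combine the validity masks of all three inputs once, up front;
/// returns None when no input has any nulls.
fn combined_nulls(
    text: &dyn Array,
    pair_delim: &dyn Array,
    kv_delim: &dyn Array,
) -> Option<NullBuffer> {
    NullBuffer::union(
        NullBuffer::union(text.nulls(), pair_delim.nulls()).as_ref(),
        kv_delim.nulls(),
    )
}
```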

keys_builder.append_value("");
values_builder.append_null();
current_offset += 1;
offsets.push(current_offset);
Contributor:
Have we considered using MapBuilder here?
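A hedged sketch of the MapBuilder suggestion: the builder tracks offsets and the null buffer itself, replacing the hand-rolled bookkeeping above (the rows built here are illustrative, not the PR's test data):

```rust
use arrow::array::{MapArray, MapBuilder, StringBuilder};

fn build_example_map() -> MapArray {
    let mut builder =
        MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

    // A row that parsed into a pair plus a value-less key: {a: 1, b: NULL}
    builder.keys().append_value("a");
    builder.values().append_value("1");
    builder.keys().append_value("b");
    builder.values().append_null();
    builder.append(true).expect("close map row");

    // A NULL input row becomes a null map:
    builder.append(false).expect("append null row");

    builder.finish()
}
```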
