feat: support Spark-compatible string_to_map function #20120

unknowntpo wants to merge 18 commits into apache:main

Conversation
Adds string_to_map (alias: str_to_map) function that creates a map
from a string by splitting on delimiters.
- Supports 1-3 args: text, pair_delim (default ','), key_value_delim (default ':')
- Returns Map<Utf8, Utf8>
- NULL input returns NULL
- Empty string returns {"": NULL} (Spark behavior)
- Missing key_value_delim results in NULL value
- Duplicate keys: last wins (LAST_WIN policy)
Test cases derived from Spark v4.0.0 ComplexTypeSuite.scala.
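To make these rules concrete, here is a minimal, self-contained Rust sketch of the parsing semantics (plain std code, not the PR's Arrow-based implementation; names are illustrative):

    use std::collections::BTreeMap;

    // Sketch of the string_to_map parsing rules: split on pair_delim, then
    // split each pair at the first kv_delim; a missing kv_delim yields a
    // NULL (None) value, and duplicate keys keep the last occurrence.
    fn parse_to_map(text: &str, pair_delim: &str, kv_delim: &str) -> BTreeMap<String, Option<String>> {
        let mut map = BTreeMap::new();
        for pair in text.split(pair_delim) {
            let mut kv = pair.splitn(2, kv_delim);
            let key = kv.next().unwrap_or("").to_string();
            let value = kv.next().map(|v| v.to_string());
            map.insert(key, value); // last write wins for duplicate keys
        }
        map
    }

    fn main() {
        let m = parse_to_map("a:1,b:2", ",", ":");
        assert_eq!(m.get("a"), Some(&Some("1".to_string())));
        // Empty string -> {"": NULL}, matching the Spark behavior above.
        assert_eq!(parse_to_map("", ",", ":").get(""), Some(&None));
        // "a" has no key_value_delim -> value is NULL.
        assert_eq!(parse_to_map("a", ",", ":").get("a"), Some(&None));
        // Duplicate keys: last wins.
        assert_eq!(parse_to_map("a:1,a:2", ",", ":").get("a"), Some(&Some("2".to_string())));
    }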
The function returns Map type so it belongs in the map module.
Follows the source code move in the previous commit.
force-pushed from 7fa0217 to f97c89e
    for row_idx in 0..num_rows {
        if text_array.is_null(row_idx) {
            null_buffer[row_idx] = false;
            offsets.push(*offsets.last().unwrap());
Will the last() call return None?
no, offsets is initialized with one element 0
No, last() will never return None here. The offsets vector is initialized with vec![0], so it always has at least one element before the loop starts.
I've refactored this and introduced a current_offset variable to avoid confusion.
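For readers following the thread, a self-contained sketch of that refactor (illustrative names, not the PR's exact code):

    // Track a running `current_offset` instead of reading back
    // `offsets.last().unwrap()`. `entry_counts` stands in for the number of
    // map entries produced per row; `None` marks a NULL input row.
    fn build_offsets(entry_counts: &[Option<i32>]) -> Vec<i32> {
        let mut offsets = vec![0]; // seeded with 0, so last() was never None
        let mut current_offset = 0;
        for count in entry_counts {
            if let Some(n) = count {
                current_offset += n;
            }
            // NULL rows repeat the current offset, i.e. contribute no entries.
            offsets.push(current_offset);
        }
        offsets
    }

    fn main() {
        // Rows with 2 entries, a NULL row, then 1 entry -> [0, 2, 2, 3].
        assert_eq!(build_offsets(&[Some(2), None, Some(1)]), vec![0, 2, 2, 3]);
    }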
    # Test cases derived from Spark test("StringToMap"):
    # https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
    #
    # Note: Duplicate key handling uses LAST_WIN policy (not EXCEPTION, which is the Spark default)
    /// <https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map>
    ///
    /// Creates a map from a string by splitting on delimiters.
    /// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
nit (I copied this from the Spark link above): we should keep it consistent with the Spark doc, IMO
    - /// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
    + /// str_to_map(text[, pairDelim[, keyValueDelim]]) -> Map<String, String>
You're right, I'll change to str_to_map.
    }

    fn string_to_map_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
        let text_array = &args[0];
Curious: should we check (or assert, if the signature already guards against bad usage) that the arg count cannot be >= 4? And add a unit test?
    Self {
        signature: Signature::one_of(
            vec![
                // string_to_map(text)
                TypeSignature::String(1),
                // string_to_map(text, pairDelim)
                TypeSignature::String(2),
                // string_to_map(text, pairDelim, keyValueDelim)
                TypeSignature::String(3),
            ],
            Volatility::Immutable,
        ),
        aliases: vec![String::from("str_to_map")],
    }

The signature makes sure this can't happen, but I've added an assertion for defense.
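A tiny illustration of that kind of defensive assertion (hypothetical wording; the PR's exact message may differ):

    // Reject argument counts the signature should already have filtered out.
    fn assert_arg_count(len: usize) {
        assert!((1..=3).contains(&len), "string_to_map expects 1 to 3 arguments, got {len}");
    }

    fn main() {
        assert_arg_count(2); // fine; 0 or 4 would panic before any row processing
    }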
    }

    /// Extract scalar string value from array (assumes all values are the same)
    fn get_scalar_string(array: &ArrayRef) -> Result<String> {
    - fn get_scalar_string(array: &ArrayRef) -> Result<String> {
    + fn get_delimeter_scalar_string(array: &ArrayRef) -> Result<String> {

Do you think it matches the intention (since you clearly said it's delim parsing at L216)?
Good catch, renamed to extract_delimiter_from_string_array with proper testing.
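For context, a hedged sketch of what such a helper might look like, assuming the delimiter arrives as a scalar already expanded to an array and using the as_string_array cast from datafusion_common (name, doc comment, and error text are illustrative, not the PR's exact code):

    use arrow::array::{Array, ArrayRef};
    use datafusion_common::cast::as_string_array;
    use datafusion_common::Result;

    /// Extract the delimiter from a scalar-expanded string array.
    fn extract_delimiter_from_string_array(array: &ArrayRef) -> Result<String> {
        let string_array = as_string_array(array)?;
        // A scalar expanded to an array should never be empty; assert rather
        // than silently fall back to a default.
        assert!(!string_array.is_empty(), "Delimiter array should not be empty");
        Ok(string_array.value(0).to_string())
    }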
        )
    })?;

    if string_array.len() == 0 {
Curious: in which case would the len be 0? I thought we should assert the len 😲
Good catch, I've added an assertion here.
| // "a:" -> kv = ["a", ""] -> key="a", value=Some("") | ||
| // ":1" -> kv = ["", "1"] -> key="", value=Some("1") | ||
| let kv: Vec<&str> = pair.splitn(2, &kv_delim).collect(); | ||
| let key = kv[0]; |
    let mut iter = pair.splitn(2, kv_delim);
    let key = iter.next().unwrap_or("");
    let value = iter.next();

so we don't need a heap allocation for the vector?
    // Split text into key-value pairs using pair_delim.
    // Example: "a:1,b:2" with pair_delim="," -> ["a:1", "b:2"]
    let pairs: Vec<&str> = text.split(&pair_delim).collect();
    for pair in text.split(pair_delim) {

to avoid the heap allocation
- Replace len==0 check with an assert (the delimiter array should never be empty)
- Add a comment explaining scalar expansion in columnar execution
- Add a unit test for delimiter extraction (single, multi-char, expanded scalar)

- Add a multi-row test with default delimiters
- Add a multi-row test with custom delimiters (comma and equals)

- Replace `offsets.last().unwrap()` with explicit `current_offset` tracking
- Add table-driven unit tests covering the s0-s6 Spark test cases plus null input
- Add a multi-row test demonstrating the Arrow MapArray internal structure
- Import NullBuffer at module level for cleaner code

Documents current behavior and adds a TODO for Spark's EXCEPTION default.
    }

    fn name(&self) -> &str {
        "string_to_map"
Is there a reference for this alias? As far as I can tell, Spark only has str_to_map.
You're right, I'll change to str_to_map.
    };

    // Process each row
    let text_array = text_array
    let text_array = as_string_array(text_array)?;

Easier downcasting: https://docs.rs/datafusion/latest/datafusion/common/cast/fn.as_string_array.html
However, we need to consider that other string types exist, such as LargeUtf8 and Utf8View.
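To illustrate that point, a hedged sketch that branches over Utf8, LargeUtf8, and Utf8View using the cast helpers in datafusion_common (function name and error text are illustrative, and the per-row String allocation is for clarity rather than efficiency):

    use arrow::array::{Array, ArrayRef};
    use arrow::datatypes::DataType;
    use datafusion_common::cast::{as_large_string_array, as_string_array, as_string_view_array};
    use datafusion_common::{exec_err, Result};

    /// Fetch the text of one row, accepting any of the three string types.
    fn text_value(array: &ArrayRef, row: usize) -> Result<Option<String>> {
        if array.is_null(row) {
            return Ok(None);
        }
        let s = match array.data_type() {
            DataType::Utf8 => as_string_array(array)?.value(row).to_string(),
            DataType::LargeUtf8 => as_large_string_array(array)?.value(row).to_string(),
            DataType::Utf8View => as_string_view_array(array)?.value(row).to_string(),
            other => return exec_err!("string_to_map expects a string argument, got {}", other),
        };
        Ok(Some(s))
    }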
| "Delimiter array should not be empty" | ||
| ); | ||
|
|
||
| // In columnar execution, scalar delimiter is expanded to array to match batch size. |
We can't assume this; for example, this is a valid test case that will fail:

    query ?
    SELECT string_to_map(col1, col2, col3)
    FROM (VALUES ('a=1,b=2', ',', '='), ('x#9', ',', '#'), (NULL, ',', '=')) AS t(col1, col2, col3);
    ----
    {a: 1, b: 2}
    {x: 9}
    NULL

Delimiters can vary per row. We should either choose to support only scalar delimiters for now (look at invoke_with_args and how we can work with ColumnarValues directly) or ensure we respect per-row delimiters.
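A hedged sketch of the first option (scalar-only delimiters), inspecting the ColumnarValue arguments in invoke_with_args before converting anything to arrays; the function name is illustrative:

    use datafusion_common::{exec_err, Result, ScalarValue};
    use datafusion_expr::ColumnarValue;

    /// Accept a delimiter only when it is a non-NULL scalar string.
    fn delimiter_as_scalar(value: &ColumnarValue) -> Result<String> {
        match value {
            ColumnarValue::Scalar(ScalarValue::Utf8(Some(s)))
            | ColumnarValue::Scalar(ScalarValue::LargeUtf8(Some(s)))
            | ColumnarValue::Scalar(ScalarValue::Utf8View(Some(s))) => Ok(s.clone()),
            ColumnarValue::Array(_) => {
                // Per-row delimiters land here; reject them explicitly instead
                // of silently using only the first row's value.
                exec_err!("string_to_map currently supports only scalar delimiters")
            }
            other => exec_err!("unexpected delimiter argument: {:?}", other),
        }
    }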
    // Test cases derived from Spark ComplexTypeSuite:
    // https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
    #[test]
    fn test_string_to_map_cases() {
Is it possible to move all these test cases to SLTs?
    let mut null_buffer = vec![true; num_rows];

    for row_idx in 0..num_rows {
        if text_array.is_null(row_idx) {
If we decide to support per-row delimiters, we'll need to consider their nullability. We could use NullBuffer::union to build the final null buffer upfront once, though keep in mind we'll have up to three input arrays.
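A hedged sketch of that idea using arrow's NullBuffer::union, which returns None when every input is fully valid (function name is illustrative):

    use arrow::array::{Array, ArrayRef};
    use arrow::buffer::NullBuffer;

    /// Fold the validity of up to three input arrays into one NullBuffer:
    /// a row is valid only if it is valid in every input.
    fn combined_validity(arrays: &[ArrayRef]) -> Option<NullBuffer> {
        arrays
            .iter()
            .map(|a| a.nulls().cloned())
            .fold(None, |acc, nulls| NullBuffer::union(acc.as_ref(), nulls.as_ref()))
    }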
    keys_builder.append_value("");
    values_builder.append_null();
    current_offset += 1;
    offsets.push(current_offset);
Which issue does this PR close?

Part of [datafusion-spark] Spark Compatible Functions #15914.

Rationale for this change

string_to_map (alias str_to_map) creates a map by splitting a string into key-value pairs using delimiters.

What changes are included in this PR?

- Adds the string_to_map function to the datafusion-spark crate:
  string_to_map(text, [pairDelim], [keyValueDelim]) -> Map<String, String>
  - text: the input string
  - pairDelim: delimiter between key-value pairs (default: ,)
  - keyValueDelim: delimiter between key and value (default: :)
- Adds the str_to_map alias
- Places the function in the /map/ module (it returns a Map type)

Examples

string_to_map('a:1,b:2') returns {a: 1, b: 2}

Are these changes tested?

Yes: test_files/spark/map/string_to_map.slt, with test cases derived from Spark test("StringToMap").

Are there any user-facing changes?

Yes, the string_to_map and str_to_map functions are now available in Spark-compatible SQL.