feat: support Spark-compatible string_to_map function #20120

unknowntpo wants to merge 18 commits into apache:main

Conversation
Adds string_to_map (alias: str_to_map) function that creates a map
from a string by splitting on delimiters.
- Supports 1-3 args: text, pair_delim (default ','), key_value_delim (default ':')
- Returns Map<Utf8, Utf8>
- NULL input returns NULL
- Empty string returns {"": NULL} (Spark behavior)
- Missing key_value_delim results in NULL value
- Duplicate keys: last wins (LAST_WIN policy)
Test cases derived from Spark v4.0.0 ComplexTypeSuite.scala.
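To make these rules concrete, here is a minimal, self-contained Rust sketch of the parsing semantics (plain std code, not the PR's Arrow-based implementation; names are illustrative):

    use std::collections::BTreeMap;

    // Sketch of the string_to_map parsing rules: split on pair_delim, then
    // split each pair at the first kv_delim; a missing kv_delim yields a
    // NULL (None) value, and duplicate keys keep the last occurrence.
    fn parse_to_map(text: &str, pair_delim: &str, kv_delim: &str) -> BTreeMap<String, Option<String>> {
        let mut map = BTreeMap::new();
        for pair in text.split(pair_delim) {
            let mut kv = pair.splitn(2, kv_delim);
            let key = kv.next().unwrap_or("").to_string();
            let value = kv.next().map(|v| v.to_string());
            map.insert(key, value); // last write wins for duplicate keys
        }
        map
    }

    fn main() {
        let m = parse_to_map("a:1,b:2", ",", ":");
        assert_eq!(m.get("a"), Some(&Some("1".to_string())));
        // Empty string -> {"": NULL}, matching the Spark behavior above.
        assert_eq!(parse_to_map("", ",", ":").get(""), Some(&None));
        // "a" has no key_value_delim -> value is NULL.
        assert_eq!(parse_to_map("a", ",", ":").get("a"), Some(&None));
        // Duplicate keys: last wins.
        assert_eq!(parse_to_map("a:1,a:2", ",", ":").get("a"), Some(&Some("2".to_string())));
    }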
The function returns Map type so it belongs in the map module.
Follows the source code move in the previous commit.
force-pushed from 7fa0217 to f97c89e
    for row_idx in 0..num_rows {
        if text_array.is_null(row_idx) {
            null_buffer[row_idx] = false;
            offsets.push(*offsets.last().unwrap());
Will the last() call return None?
no, offsets is initialized with one element 0
No, last() will never return None here. The offsets vector is initialized with vec![0], so it always has at least one element before the loop starts.
I've refactored this and introduced a current_offset variable to avoid confusion.
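For readers following the thread, a self-contained sketch of that refactor (illustrative names, not the PR's exact code):

    // Track a running `current_offset` instead of reading back
    // `offsets.last().unwrap()`. `entry_counts` stands in for the number of
    // map entries produced per row; `None` marks a NULL input row.
    fn build_offsets(entry_counts: &[Option<i32>]) -> Vec<i32> {
        let mut offsets = vec![0]; // seeded with 0, so last() was never None
        let mut current_offset = 0;
        for count in entry_counts {
            if let Some(n) = count {
                current_offset += n;
            }
            // NULL rows repeat the current offset, i.e. contribute no entries.
            offsets.push(current_offset);
        }
        offsets
    }

    fn main() {
        // Rows with 2 entries, a NULL row, then 1 entry -> [0, 2, 2, 3].
        assert_eq!(build_offsets(&[Some(2), None, Some(1)]), vec![0, 2, 2, 3]);
    }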
    # Test cases derived from Spark test("StringToMap"):
    # https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
    #
    # Note: Duplicate key handling uses LAST_WIN policy (not EXCEPTION, which is the Spark default)
    /// <https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map>
    ///
    /// Creates a map from a string by splitting on delimiters.
    /// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
nit (I copied this from the Spark link above): we should keep it consistent with the Spark doc, IMO
    - /// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
    + /// str_to_map(text[, pairDelim[, keyValueDelim]]) -> Map<String, String>
You're right, I'll change to str_to_map.
    }

    fn string_to_map_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
        let text_array = &args[0];
Curious: should we check (or assert, if the signature already guards against bad usage) that the arg count cannot be >= 4? And add a unit test?
    Self {
        signature: Signature::one_of(
            vec![
                // string_to_map(text)
                TypeSignature::String(1),
                // string_to_map(text, pairDelim)
                TypeSignature::String(2),
                // string_to_map(text, pairDelim, keyValueDelim)
                TypeSignature::String(3),
            ],
            Volatility::Immutable,
        ),
        aliases: vec![String::from("str_to_map")],
    }

The signature makes sure this can't happen, but I've added an assertion for defense.
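A tiny illustration of that kind of defensive assertion (hypothetical wording; the PR's exact message may differ):

    // Reject argument counts the signature should already have filtered out.
    fn assert_arg_count(len: usize) {
        assert!((1..=3).contains(&len), "string_to_map expects 1 to 3 arguments, got {len}");
    }

    fn main() {
        assert_arg_count(2); // fine; 0 or 4 would panic before any row processing
    }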
    }

    /// Extract scalar string value from array (assumes all values are the same)
    fn get_scalar_string(array: &ArrayRef) -> Result<String> {
    - fn get_scalar_string(array: &ArrayRef) -> Result<String> {
    + fn get_delimeter_scalar_string(array: &ArrayRef) -> Result<String> {

Do you think it matches the intention (since you clearly said it's delim parsing at L216)?
Good catch, renamed to extract_delimiter_from_string_array with proper testing.
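For context, a hedged sketch of what such a helper might look like, assuming the delimiter arrives as a scalar already expanded to an array and using the as_string_array cast from datafusion_common (name, doc comment, and error text are illustrative, not the PR's exact code):

    use arrow::array::{Array, ArrayRef};
    use datafusion_common::cast::as_string_array;
    use datafusion_common::Result;

    /// Extract the delimiter from a scalar-expanded string array.
    fn extract_delimiter_from_string_array(array: &ArrayRef) -> Result<String> {
        let string_array = as_string_array(array)?;
        // A scalar expanded to an array should never be empty; assert rather
        // than silently fall back to a default.
        assert!(!string_array.is_empty(), "Delimiter array should not be empty");
        Ok(string_array.value(0).to_string())
    }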
        )
    })?;

    if string_array.len() == 0 {
Curious: in which case would the len be 0? I thought we should assert the len 😲
Good catch, I've added an assertion here.
| // "a:" -> kv = ["a", ""] -> key="a", value=Some("") | ||
| // ":1" -> kv = ["", "1"] -> key="", value=Some("1") | ||
| let kv: Vec<&str> = pair.splitn(2, &kv_delim).collect(); | ||
| let key = kv[0]; |
    let mut iter = pair.splitn(2, kv_delim);
    let key = iter.next().unwrap_or("");
    let value = iter.next();

so we don't need a heap allocation for the vector?
    // Split text into key-value pairs using pair_delim.
    // Example: "a:1,b:2" with pair_delim="," -> ["a:1", "b:2"]
    let pairs: Vec<&str> = text.split(&pair_delim).collect();
    for pair in text.split(pair_delim) {

to avoid the heap allocation
- Replace len==0 check with an assert (the delimiter array should never be empty)
- Add a comment explaining scalar expansion in columnar execution
- Add a unit test for delimiter extraction (single, multi-char, expanded scalar)

- Add a multi-row test with default delimiters
- Add a multi-row test with custom delimiters (comma and equals)

- Replace `offsets.last().unwrap()` with explicit `current_offset` tracking
- Add table-driven unit tests covering the s0-s6 Spark test cases plus null input
- Add a multi-row test demonstrating the Arrow MapArray internal structure
- Import NullBuffer at module level for cleaner code

Documents current behavior and adds a TODO for Spark's EXCEPTION default.
    }

    fn name(&self) -> &str {
        "string_to_map"
Is there a reference for this alias? As far as I can tell, Spark only has str_to_map.
You're right, I'll change to str_to_map.
    };

    // Process each row
    let text_array = text_array
    let text_array = as_string_array(text_array)?;

Easier downcasting: https://docs.rs/datafusion/latest/datafusion/common/cast/fn.as_string_array.html
However, we need to consider that other string types exist, such as LargeUtf8 and Utf8View.
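To illustrate that point, a hedged sketch that branches over Utf8, LargeUtf8, and Utf8View using the cast helpers in datafusion_common (function name and error text are illustrative, and the per-row String allocation is for clarity rather than efficiency):

    use arrow::array::{Array, ArrayRef};
    use arrow::datatypes::DataType;
    use datafusion_common::cast::{as_large_string_array, as_string_array, as_string_view_array};
    use datafusion_common::{exec_err, Result};

    /// Fetch the text of one row, accepting any of the three string types.
    fn text_value(array: &ArrayRef, row: usize) -> Result<Option<String>> {
        if array.is_null(row) {
            return Ok(None);
        }
        let s = match array.data_type() {
            DataType::Utf8 => as_string_array(array)?.value(row).to_string(),
            DataType::LargeUtf8 => as_large_string_array(array)?.value(row).to_string(),
            DataType::Utf8View => as_string_view_array(array)?.value(row).to_string(),
            other => return exec_err!("string_to_map expects a string argument, got {}", other),
        };
        Ok(Some(s))
    }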
| "Delimiter array should not be empty" | ||
| ); | ||
|
|
||
| // In columnar execution, scalar delimiter is expanded to array to match batch size. |
We can't assume this; for example, this is a valid test case that will fail:

    query ?
    SELECT string_to_map(col1, col2, col3)
    FROM (VALUES ('a=1,b=2', ',', '='), ('x#9', ',', '#'), (NULL, ',', '=')) AS t(col1, col2, col3);
    ----
    {a: 1, b: 2}
    {x: 9}
    NULL

Delimiters can vary per row. We should either choose to support only scalar delimiters for now (look at invoke_with_args and how we can work with ColumnarValues directly) or ensure we respect per-row delimiters.
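A hedged sketch of the first option (scalar-only delimiters), inspecting the ColumnarValue arguments in invoke_with_args before converting anything to arrays; the function name is illustrative:

    use datafusion_common::{exec_err, Result, ScalarValue};
    use datafusion_expr::ColumnarValue;

    /// Accept a delimiter only when it is a non-NULL scalar string.
    fn delimiter_as_scalar(value: &ColumnarValue) -> Result<String> {
        match value {
            ColumnarValue::Scalar(ScalarValue::Utf8(Some(s)))
            | ColumnarValue::Scalar(ScalarValue::LargeUtf8(Some(s)))
            | ColumnarValue::Scalar(ScalarValue::Utf8View(Some(s))) => Ok(s.clone()),
            ColumnarValue::Array(_) => {
                // Per-row delimiters land here; reject them explicitly instead
                // of silently using only the first row's value.
                exec_err!("string_to_map currently supports only scalar delimiters")
            }
            other => exec_err!("unexpected delimiter argument: {:?}", other),
        }
    }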
    // Test cases derived from Spark ComplexTypeSuite:
    // https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
    #[test]
    fn test_string_to_map_cases() {
Is it possible to move all these test cases to SLTs?
    let mut null_buffer = vec![true; num_rows];

    for row_idx in 0..num_rows {
        if text_array.is_null(row_idx) {
If we decide to support per-row delimiters, we'll need to consider their nullability. We could use NullBuffer::union to build the final null buffer upfront once, though keep in mind we'll have up to three input arrays.
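A hedged sketch of that idea using arrow's NullBuffer::union, which returns None when every input is fully valid (function name is illustrative):

    use arrow::array::{Array, ArrayRef};
    use arrow::buffer::NullBuffer;

    /// Fold the validity of up to three input arrays into one NullBuffer:
    /// a row is valid only if it is valid in every input.
    fn combined_validity(arrays: &[ArrayRef]) -> Option<NullBuffer> {
        arrays
            .iter()
            .map(|a| a.nulls().cloned())
            .fold(None, |acc, nulls| NullBuffer::union(acc.as_ref(), nulls.as_ref()))
    }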
    keys_builder.append_value("");
    values_builder.append_null();
    current_offset += 1;
    offsets.push(current_offset);
Which issue does this PR close?

Part of [datafusion-spark] Spark Compatible Functions #15914.

Rationale for this change

string_to_map (alias str_to_map) creates a map by splitting a string into key-value pairs using delimiters.

What changes are included in this PR?

- Adds the string_to_map function to the datafusion-spark crate:
  string_to_map(text, [pairDelim], [keyValueDelim]) -> Map<String, String>
  - text: the input string
  - pairDelim: delimiter between key-value pairs (default: ,)
  - keyValueDelim: delimiter between key and value (default: :)
- Adds the str_to_map alias
- Places the function in the /map/ module (it returns a Map type)

Examples

string_to_map('a:1,b:2') returns {a: 1, b: 2}

Are these changes tested?

Yes: test_files/spark/map/string_to_map.slt, with test cases derived from Spark test("StringToMap").

Are there any user-facing changes?

Yes, the string_to_map and str_to_map functions are now available in Spark-compatible SQL.