Skip to content

MAP Type Ergonomics & map_unnest() #327

@jarodmeng

Description

@jarodmeng

Current State

MAP types are currently supported in data parsing (see R/presto.field.R lines 107-129 for MAP type handling), but there are significant usability gaps:

  1. No convenient unnesting function: Users must manually write SQL to unnest MAP columns into key-value pairs
  2. Limited MAP manipulation: While MAP types can be parsed, there's no high-level R function to work with them
  3. No examples or documentation: Lack of clear examples showing how to work with MAP types in RPresto

Current MAP Support:

  • MAP types are parsed correctly in R/presto.field.R (lines 107-129)
  • MAP data is organized in organize_map_type_data() (lines 353-383)
  • MAP keys and values are extracted and stored as lists in R

What's Missing:

  • map_unnest() function to convert MAP columns to key-value pairs
  • Integration with dplyr workflows
  • Documentation and examples

Detailed Improvements

1. Create map_unnest() Helper Function

Purpose: Convert MAP columns into key-value pairs, similar to how presto_unnest() works for arrays.

Function Signature:

map_unnest(data, cols, ..., key_to = NULL, value_to = NULL)

Parameters:

  • data: A tbl_presto object
  • cols: Column(s) to unnest (can be multiple MAP columns)
  • key_to: Name for the key column (default: "{col}_key")
  • value_to: Name for the value column (default: "{col}_value")

Implementation Steps:

  1. Create new file R/map_unnest.R:

    • Follow the pattern from R/presto_unnest.R
    • Create map_unnest() generic and map_unnest.tbl_presto() method
    • Create lazy_map_unnest_query class similar to lazy_unnest_query
  2. SQL Generation:

    • Use MAP_ENTRIES() function (from Feature 2) to convert MAP to ARRAY(ROW(key, value))
    • Generate SQL: CROSS JOIN UNNEST(MAP_ENTRIES(map_col)) AS t(key, value)
    • Handle multiple MAP columns with multiple CROSS JOIN clauses
  3. SQL Example:

    -- Single MAP column:
    SELECT * FROM table_name
    CROSS JOIN UNNEST(MAP_ENTRIES(map_col)) AS t(key, value)
    
    -- Multiple MAP columns:
    SELECT * FROM table_name
    CROSS JOIN UNNEST(MAP_ENTRIES(map_col1)) AS t1(key1, value1),
         UNNEST(MAP_ENTRIES(map_col2)) AS t2(key2, value2)
  4. Implementation Pattern (similar to presto_unnest()):

    map_unnest <- function(data, cols, ..., key_to = NULL, value_to = NULL) {
      UseMethod("map_unnest")
    }
    
    map_unnest.tbl_presto <- function(data, cols, ..., key_to = NULL, value_to = NULL) {
      # Column selection (similar to presto_unnest lines 71-96)
      # Store in lazy_map_unnest_query structure
      # Return modified tbl_presto
    }
    
    # SQL rendering:
    sql_render.map_unnest_query <- function(query, con, ...) {
      # Generate CROSS JOIN UNNEST(MAP_ENTRIES(...)) SQL
    }
  5. Handle Column Naming:

    • If key_to or value_to not provided, use defaults: "{col}_key" and "{col}_value"
    • For multiple columns, generate unique names: "{col1}_key", "{col1}_value", "{col2}_key", etc.
    • Support names_sep parameter (similar to Feature 1) for consistency

Example Usage:

# Single MAP column
tbl(con, "user_preferences") %>%
  map_unnest(preferences) %>%
  collect()
# Result: columns include original columns + preferences_key, preferences_value

# Multiple MAP columns
tbl(con, "data") %>%
  map_unnest(c(map1, map2), key_to = c("k1", "k2"), value_to = c("v1", "v2")) %>%
  collect()

# With dplyr operations
tbl(con, "events") %>%
  map_unnest(metadata) %>%
  filter(value > 100) %>%
  group_by(key) %>%
  summarize(count = n())

Edge Cases to Handle:

  • NULL maps (should produce no rows)
  • Empty maps (should produce no rows)
  • Maps with NULL keys or values
  • Multiple MAP columns with different key/value types
  • Very large maps

2. Enhance MAP Operations Integration

Dependencies: Requires Feature 2 MAP function translations to be complete.

Integration Points:

  1. Ensure MAP_ENTRIES() works correctly (from Feature 2)
  2. Test map_unnest() with all MAP manipulation functions:
    • map_keys(), map_values() for extracting keys/values
    • map_filter() for filtering before unnesting
    • map_concat() for combining maps before unnesting

Example Workflows:

# Filter map before unnesting
tbl(con, "data") %>%
  mutate(filtered_map = map_filter(original_map, (k, v) -> v > 10)) %>%
  map_unnest(filtered_map) %>%
  collect()

# Extract keys only
tbl(con, "data") %>%
  mutate(keys = map_keys(my_map)) %>%
  presto_unnest(keys) %>%
  collect()

3. Add MAP Examples and Documentation

Documentation to Add:

  1. Function Documentation (R/map_unnest.R):

    • Add @description, @param, @return, @examples sections
    • Include examples showing common use cases
    • Document edge cases and limitations
  2. Vignette or Examples:

    • Create examples in existing vignettes or new vignettes/map-types.Rmd
    • Show common MAP operations:
      • Unnesting MAP columns
      • Filtering maps
      • Combining maps
      • Working with nested MAP types (MAP within MAP)
    • Include real-world use cases
  3. README Updates:

    • Add MAP type section to README
    • Show quick examples of map_unnest()

Example Vignette Content:

# Working with MAP Types in RPresto

## Basic MAP Unnesting

# Create a table with MAP column
DBI::dbExecute(con, "
  CREATE TABLE user_scores AS
  SELECT 
    user_id,
    MAP(ARRAY['math', 'science', 'english'], 
        ARRAY[95, 87, 92]) AS scores
  FROM (VALUES (1), (2)) AS t(user_id)
")

# Unnest the MAP
tbl(con, "user_scores") %>%
  map_unnest(scores, key_to = "subject", value_to = "score") %>%
  collect()

## Filtering Maps Before Unnesting

tbl(con, "user_scores") %>%
  mutate(high_scores = map_filter(scores, (k, v) -> v > 90)) %>%
  map_unnest(high_scores) %>%
  collect()

## Working with Multiple MAP Columns

tbl(con, "data") %>%
  map_unnest(c(metadata, tags)) %>%
  collect()

Files to Create/Modify

New Files

  • R/map_unnest.R:
    • Main implementation file
    • Should follow structure similar to R/presto_unnest.R
    • Include:
      • map_unnest() generic
      • map_unnest.tbl_presto() method
      • lazy_map_unnest_query class
      • op_vars.lazy_map_unnest_query() method
      • op_grps.lazy_map_unnest_query() method
      • op_sort.lazy_map_unnest_query() method
      • op_frame.lazy_map_unnest_query() method
      • sql_build.lazy_map_unnest_query() method
      • sql_render.map_unnest_query() method

Modified Files

  • R/dbplyr-sql.R:
    • Ensure MAP_ENTRIES translation exists (from Feature 2)
    • Verify all MAP functions work correctly

Test Files

  • tests/testthat/test-map_unnest.R:
    • Comprehensive test suite for map_unnest()
    • Test single and multiple MAP columns
    • Test with dplyr operations (filter, group_by, summarize, etc.)
    • Test edge cases (NULL maps, empty maps, etc.)
    • Test integration with MAP manipulation functions

Documentation Files

  • vignettes/map-types.Rmd (new):
    • Comprehensive guide to working with MAP types
    • Examples and use cases
  • man/map_unnest.Rd:
    • Function documentation (auto-generated from roxygen)

Implementation Checklist

Phase 1: Core Implementation

  • Create R/map_unnest.R file structure
  • Implement map_unnest() generic
  • Implement map_unnest.tbl_presto() method
  • Create lazy_map_unnest_query class
  • Implement op_vars.lazy_map_unnest_query()
  • Implement op_grps.lazy_map_unnest_query()
  • Implement op_sort.lazy_map_unnest_query()
  • Implement op_frame.lazy_map_unnest_query()
  • Implement sql_build.lazy_map_unnest_query()
  • Implement sql_render.map_unnest_query()

Phase 2: Column Handling

  • Support single MAP column
  • Support multiple MAP columns
  • Implement key_to parameter
  • Implement value_to parameter
  • Handle default column naming
  • Support names_sep parameter (for consistency with Feature 1)

Phase 3: SQL Generation

  • Generate correct SQL for single MAP column
  • Generate correct SQL for multiple MAP columns
  • Handle subqueries correctly (wrap in parentheses when needed)
  • Handle CTEs correctly
  • Ensure proper column quoting

Phase 4: Integration

  • Test with filter()
  • Test with group_by() and summarize()
  • Test with arrange()
  • Test with window functions
  • Test with CTEs
  • Test with MAP manipulation functions (map_keys, map_values, etc.)

Phase 5: Edge Cases

  • Handle NULL maps
  • Handle empty maps
  • Handle maps with NULL keys
  • Handle maps with NULL values
  • Handle very large maps
  • Handle nested MAP types (MAP within MAP)

Phase 6: Testing

  • Write unit tests for basic functionality
  • Write integration tests with live connections
  • Write tests for edge cases
  • Write tests for dplyr integration
  • Ensure all tests pass

Phase 7: Documentation

  • Add roxygen documentation to map_unnest()
  • Add examples to function documentation
  • Create vignettes/map-types.Rmd (or add to existing vignette)
  • Update README with MAP examples
  • Ensure documentation builds correctly

Testing Requirements

Unit Tests

  1. Basic Functionality:

    • Single MAP column unnesting
    • Multiple MAP column unnesting
    • Custom key_to and value_to names
    • Default column naming
  2. SQL Generation:

    • Verify SQL output is correct
    • Test with subqueries
    • Test with CTEs
    • Test column quoting
  3. Edge Cases:

    • NULL maps
    • Empty maps
    • Maps with NULL keys/values
    • Multiple MAP columns with different types

Integration Tests

  1. dplyr Integration:

    • map_unnest() + filter()
    • map_unnest() + group_by() + summarize()
    • map_unnest() + arrange()
    • map_unnest() + window functions
    • map_unnest() + CTEs
  2. MAP Function Integration:

    • map_filter() + map_unnest()
    • map_keys() + presto_unnest()
    • map_values() + presto_unnest()
    • map_concat() + map_unnest()
  3. Real Data Tests:

    • Test with actual Presto tables containing MAP columns
    • Test with various MAP key/value types
    • Test performance with large maps

Test Pattern

test_that("map_unnest works with single MAP column", {
  conn <- setup_live_connection()
  
  # Create test table
  DBI::dbExecute(conn, "
    CREATE TABLE test_map AS
    SELECT MAP(ARRAY['a', 'b'], ARRAY[1, 2]) AS my_map
  ")
  on.exit(DBI::dbRemoveTable(conn, "test_map"), add = TRUE)
  
  # Test unnesting
  result <- tbl(conn, "test_map") %>%
    map_unnest(my_map) %>%
    collect()
  
  expect_equal(nrow(result), 2L)
  expect_equal(names(result), c("my_map", "my_map_key", "my_map_value"))
  expect_equal(sort(result$my_map_key), c("a", "b"))
})

Acceptance Criteria

  1. map_unnest() function exists and works for single MAP columns
  2. map_unnest() supports multiple MAP columns
  3. key_to and value_to parameters work correctly
  4. ✅ Default column naming works as expected
  5. ✅ SQL generation is correct for all cases
  6. ✅ Works with all dplyr operations (filter, group_by, summarize, etc.)
  7. ✅ Works with CTEs and subqueries
  8. ✅ Handles edge cases (NULL, empty maps) correctly
  9. ✅ Integration with MAP manipulation functions works
  10. ✅ Comprehensive test suite passes
  11. ✅ Documentation is complete with examples
  12. ✅ Vignette or examples demonstrate common use cases

Dependencies

  • Feature 2: Must have MAP_ENTRIES translation implemented
  • Feature 2: All MAP function translations should be complete
  • Can be developed in parallel but requires Feature 2 to be complete for full functionality

Related

Part of RPresto Improvement Plan - Feature 3

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions