Add support for Azure Blob Storage and ADLS Gen2 in Hive connector #1
Open
mehradpk wants to merge 565 commits into nishithakbhaskaran:hadoop-upgrade-3.4.1
Conversation
imjalpreet reviewed May 13, 2025
@mehradpk Thank you for the PR. Can you raise a draft PR from your branch in OSS as well? I want to see if there are any test failures.
@nishithakbhaskaran can you take a first pass at reviewing this?
49d3ff3 to 61bd31a
Owner
@mehradpk Changes look good.
e0009df to 908fc4e
9a3edd5 to 75ff57c
1eb9af2 to 444a8f4
444a8f4 to 17b66a1
79b9fb2 to cb0c461
64eff1c to b221385
Codenotify: Notifying subscribers in CODENOTIFY files for diff cb0c461...7b41678.
f98eb88 to 9025dd1
9025dd1 to a65a447
Summary: `finishCpu += operator.getFinishCpu().roundTo(NANOSECONDS);`

`getFinishCpu` has an underlying issue that causes `finishCpu` to overflow, which produces a negative result. The negative value triggers an exception while reporting the query completion event, so stats are not reported. Fix it by setting `finishCpu` to the max value when it overflows.

# Release Note

```
== NO RELEASE NOTE ==
```
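The clamping described above can be sketched as a saturating add. This is only an illustration, not the code from the commit: the class and method names are hypothetical, and it assumes both operands are non-negative nanosecond totals.

```java
// Hypothetical helper; not the actual Presto implementation.
final class SaturatingCpuTotals {
    private SaturatingCpuTotals() {}

    // Adds a delta to a running nanosecond total.
    // Assumption: both values are non-negative, so a negative sum can only
    // mean the long addition wrapped around. In that case clamp to Long.MAX_VALUE.
    static long addSaturated(long totalNanos, long deltaNanos) {
        long sum = totalNanos + deltaNanos;
        if (totalNanos >= 0 && deltaNanos >= 0 && sum < 0) {
            return Long.MAX_VALUE;
        }
        return sum;
    }

    public static void main(String[] args) {
        long finishCpu = Long.MAX_VALUE - 5;
        finishCpu = addSaturated(finishCpu, 10);
        System.out.println(finishCpu == Long.MAX_VALUE);
    }
}
```

A clamped total is no longer exact, but it stays non-negative, which is what the reporting path needs in order not to throw.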
…mit metadata for query event listeners (prestodb#26331)

Summary: Currently, the `Input` and `Output` query metadata classes retain two sources of connector-specific information that can be useful for reporting via an `EventListener`:

```
Optional<Object> connectorInfo;
String serializedCommitOutput;
```

* `connectorInfo` can be cast back to the correct type in an `EventListener` implementation, allowing rich access to the underlying data.
* `serializedCommitOutput`, however, is serialized in a format chosen by the `ConnectorCommitHandle` implementation, which makes it difficult to correctly represent the reporting requirements in an `EventListener` (which may need correlation with data in the `connectorInfo` result).

For example, `HiveCommitHandle` retains the lastDataCommitTime for each partition in a simple array associated with the table name, while the partition names are retained in the `HiveInputInfo` instance carried through in `connectorInfo`. For these times to be mapped back to individual partitions, the entries must be in exactly the same order as the entries in `HiveInputInfo`.

This change replaces the `serializedCommitOutput` property with an `Optional<Object>` instance, providing parity with `connectorInfo` and allowing `EventListener` implementations to cast the commit handle back to the correct type for richer access to the underlying data.

Differential Revision: D84382446

## Release Notes

```
== RELEASE NOTES ==

SPI Changes
* Replaces the ``String serializedCommitOutput`` argument with ``Optional<Object> commitOutput`` in the ``com.facebook.presto.spi.eventlistener.QueryInputMetadata`` and ``com.facebook.presto.spi.eventlistener.QueryOutputMetadata`` constructors
* Adds ``getCommitOutputForRead()`` and ``getCommitOutputForWrite()`` methods to ``ConnectorCommitHandle``, and deprecates the existing ``getSerializedCommitOutputForRead()`` and ``getSerializedCommitOutputForWrite()`` methods
```
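As a rough sketch of what casting an `Optional<Object>` commit output back to a typed value could look like in a listener: `HiveLikeCommitOutput` below is a hypothetical stand-in invented for this example, not a Presto class, and no real SPI types are used.

```java
import java.util.List;
import java.util.Optional;

final class CommitOutputSketch {
    // Hypothetical payload type standing in for a connector-specific commit handle.
    record HiveLikeCommitOutput(List<String> partitions, List<Long> lastDataCommitTimes) {}

    // Returns the typed payload only if the opaque object really is of that type.
    static Optional<HiveLikeCommitOutput> asHiveLike(Optional<Object> commitOutput) {
        return commitOutput
                .filter(HiveLikeCommitOutput.class::isInstance)
                .map(HiveLikeCommitOutput.class::cast);
    }

    public static void main(String[] args) {
        Optional<Object> opaque = Optional.of(
                new HiveLikeCommitOutput(List.of("ds=2025-01-01"), List.of(1735689600L)));
        asHiveLike(opaque).ifPresent(out -> System.out.println(out.partitions()));
    }
}
```

Filtering with `isInstance` before casting keeps a listener safe when several connectors report different payload types.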
…restodb#26557)

Summary: Remove the uninitialized bytes in binaryData to reduce the binary response size.

Differential Revision: D85720910

### RELEASE NOTES ###

```
== RELEASE NOTES ==

General Changes
* Replace the Java standard Base64 encoder with BaseEncoding from Guava
```
…nicode escapes (prestodb#26443) Summary: Modified `ExpressionFormatter.formatStringLiteral()` to preserve common whitespace characters (newlines, tabs, carriage returns) in their literal form rather than converting them to Unicode escape sequences (e.g., `\000A` for newline). This change improves SQL standard compliance and fixes issues with embedded code (like Python UDF) and regex patterns that require proper whitespace handling. Differential Revision: D85380265 ``` == NO RELEASE NOTE == ```
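A simplified sketch of the behavior described above, assuming a formatter that emits a plain quoted literal when every character is printable ASCII or common whitespace and falls back to a `U&'...'` literal with `\XXXX` escapes otherwise. This is not the actual `ExpressionFormatter` code; the class name and the exact escaping rules are assumptions.

```java
// Simplified, hypothetical re-implementation for illustration only.
final class StringLiteralSketch {
    private StringLiteralSketch() {}

    // Printable ASCII plus newline, tab and carriage return are emitted literally.
    private static boolean isLiteralSafe(int codePoint) {
        return (codePoint >= 0x20 && codePoint <= 0x7E)
                || codePoint == '\n' || codePoint == '\t' || codePoint == '\r';
    }

    static String formatStringLiteral(String value) {
        String escaped = value.replace("'", "''");
        if (escaped.codePoints().allMatch(StringLiteralSketch::isLiteralSafe)) {
            return "'" + escaped + "'";
        }
        StringBuilder out = new StringBuilder("U&'");
        escaped.codePoints().forEach(cp -> {
            if (isLiteralSafe(cp)) {
                if (cp == '\\') {
                    out.append('\\'); // a backslash must be doubled inside U&'...'
                }
                out.append((char) cp);
            } else if (cp <= 0xFFFF) {
                out.append('\\').append(String.format("%04X", cp));
            } else {
                out.append("\\+").append(String.format("%06X", cp));
            }
        });
        return out.append('\'').toString();
    }

    public static void main(String[] args) {
        System.out.println(formatStringLiteral("def f():\n\treturn 1"));
    }
}
```

With this rule, an embedded Python body keeps its newlines and tabs verbatim, while other control characters still go through the Unicode escape path.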
## Description Current code will try to add a round robin local exchange below the merge join node, which will break the sorted property of the input. In this PR, we fixed it. ## Motivation and Context Bug fix ## Impact Bug fix ## Test Plan Unit test ## Contributor checklist - [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards). - [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced. - [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality. - [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines). - [ ] Adequate tests were added if applicable. - [ ] CI passed. - [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores). ## Release Notes Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below. ``` == NO RELEASE NOTE == ```
…stodb#26403)

## Summary

This PR introduces sorted exchange functionality to Presto, enabling efficient sort-merge joins by allowing data to be sorted during shuffle operations rather than requiring separate sort steps. This optimization eliminates redundant sorting, reduces memory pressure, and improves query performance for distributed joins and aggregations that require sorted inputs.

## Motivation

Currently, when Presto needs to perform a sort-merge join in a distributed query, it must:

1. Shuffle data across workers (ExchangeNode)
2. Explicitly sort the shuffled data (SortNode)

This approach is inefficient because sorting happens as a separate operation after data movement. By pushing the sort operation into the exchange itself, we can sort data during the shuffle, eliminating the redundant SortNode and improving overall query performance.

## High-Level Changes

1. Core Infrastructure (3161b24)
   - Add `orderingScheme` field to `PlanFragment` class (Java)
   - Add `outputOrderingScheme` field to C++ PlanFragment protocol
   - Implement JSON serialization/deserialization for C++ integration
   - Update `PrestoToVeloxQueryPlan.cpp` to consume ordering scheme and convert to sorting keys
   - Update all `PlanFragment` constructor call sites to support the new field
2. Planner Support (130b14f)
   - Extend `ExchangeNode` to support SORTED partition type
   - Update `BasePlanFragmenter` to populate and propagate orderingScheme between fragments
   - Add `PlanFragmenterUtils` support for sorted exchanges
   - Enhance `PlanPrinter` to display sorted exchange information in EXPLAIN output
3. Optimizer Rule (6951cab)
   - Introduce SortedExchangeRule optimizer that identifies and transforms Sort→Exchange patterns
   - Add `sorted_exchange_enabled` session property (experimental, default: false)
   - Add `experimental.optimizer.sorted-exchange-enabled` configuration property
   - Integrate into optimizer pipeline alongside existing join optimizers
   - Only applies to REMOTE REPARTITION exchanges
   - Validates ordering variables are available in exchange output
4. Spark Integration (960bc93)
   - Update `AbstractPrestoSparkQueryExecution` to handle sorted exchanges
   - Add `MutablePartitionIdOrdering` class to track partition ordering in Spark
   - Update `PrestoSparkRddFactory` to preserve sort order during shuffles
   - Enable Spark-based queries to leverage sorted exchanges

## Plan Transformation Example

Before:

```
SortNode(orderBy: [a, b])
└─ ExchangeNode(type: REPARTITION, scope: REMOTE)
```

After:

```
ExchangeNode(type: REPARTITION, scope: REMOTE, orderingScheme: [a, b])
```

## Configuration

The feature is controlled by:

- Session property: `sorted_exchange_enabled` (experimental, default: false)
- Config property: `experimental.optimizer.sorted-exchange-enabled`

## Testing

- Added TestSortedExchangeRule with test cases covering various scenarios

## Performance Benefits

- Reduced sorting overhead: Eliminates redundant SortNode operations
- Lower memory usage: Avoids buffering data for explicit sorting

## Backward Compatibility

- Feature is disabled by default (experimental flag)
- All existing queries continue to work without modification
- No breaking changes to public APIs
- Graceful degradation when feature is disabled

## Release Notes

Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below.

```
== RELEASE NOTES ==

General Changes
* Add experimental support for sorted exchanges to improve sort-merge join performance. When enabled via the `sorted_exchange_enabled` session property or `experimental.optimizer.sorted-exchange-enabled` configuration property, the query planner will push sort operations into exchange nodes, eliminating redundant sorting steps and reducing memory usage for distributed queries with sort-merge joins. This feature is disabled by default.
```
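The Sort-over-Exchange rewrite described in this commit message can be illustrated with a toy plan model. The node types below are hypothetical stand-ins, not Presto's `SortNode` or `ExchangeNode`, and the sketch omits the check that ordering variables are available in the exchange output.

```java
import java.util.List;
import java.util.Optional;

// Toy plan model; all names here are illustrative assumptions.
final class SortedExchangeSketch {
    interface PlanNode {}

    record TableScan(String table) implements PlanNode {}

    record Exchange(String type, String scope, Optional<List<String>> orderingScheme, PlanNode source)
            implements PlanNode {}

    record Sort(List<String> orderBy, PlanNode source) implements PlanNode {}

    // Folds Sort(Exchange(REMOTE REPARTITION)) into a single ordered Exchange.
    // Any other shape, or a disabled flag, returns the node unchanged.
    static PlanNode rewrite(PlanNode node, boolean sortedExchangeEnabled) {
        if (!sortedExchangeEnabled || !(node instanceof Sort)) {
            return node;
        }
        Sort sort = (Sort) node;
        if (!(sort.source() instanceof Exchange)) {
            return node;
        }
        Exchange exchange = (Exchange) sort.source();
        boolean remoteRepartition =
                exchange.type().equals("REPARTITION") && exchange.scope().equals("REMOTE");
        if (!remoteRepartition || exchange.orderingScheme().isPresent()) {
            return node;
        }
        return new Exchange(exchange.type(), exchange.scope(),
                Optional.of(sort.orderBy()), exchange.source());
    }

    public static void main(String[] args) {
        PlanNode plan = new Sort(List.of("a", "b"),
                new Exchange("REPARTITION", "REMOTE", Optional.empty(), new TableScan("t")));
        System.out.println(rewrite(plan, true));
    }
}
```

Returning the input node untouched when the flag is off mirrors the "disabled by default" behavior stated in the release note.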
Summary: Implement sort key for LocalShuffleWriter

Differential Revision: D86322593
…restodb#27009)

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23. CVE-2025-13465

Commits:

- [`dec55b7`](https://github.com/lodash/lodash/commit/dec55b7a3b382da075e2eac90089b4cd00a26cbb) Bump main to v4.17.23 ([#6088](https://redirect.github.com/lodash/lodash/issues/6088))
- [`19c9251`](https://github.com/lodash/lodash/commit/19c9251b3631d7cf220b43bc757eb33f1084f117) fix: setCacheHas JSDoc return type should be boolean ([#6071](https://redirect.github.com/lodash/lodash/issues/6071))
- [`b5e6729`](https://github.com/lodash/lodash/commit/b5e672995ae26929d111a6e94589f8d03fb8e578) jsdoc: Add -0 and BigInt zeros to _.compact falsey values list ([#6062](https://redirect.github.com/lodash/lodash/issues/6062))
- [`edadd45`](https://github.com/lodash/lodash/commit/edadd452146f7e4bad4ea684e955708931d84d81) Prevent prototype pollution on baseUnset function
- [`4879a7a`](https://github.com/lodash/lodash/commit/4879a7a7d0a4494b0e83c7fa21bcc9fc6e7f1a6d) doc: fix autoLink function, conversion of source links ([#6056](https://redirect.github.com/lodash/lodash/issues/6056))
- [`9648f69`](https://github.com/lodash/lodash/commit/9648f692b0fc7c2f6a7a763d754377200126c2e8) chore: remove `yarn.lock` file ([#6053](https://redirect.github.com/lodash/lodash/issues/6053))
- [`dfa407d`](https://github.com/lodash/lodash/commit/dfa407db0bf5b200f2c7a9e4f06830ceaf074be9) ci: remove legacy configuration files ([#6052](https://redirect.github.com/lodash/lodash/issues/6052))
- [`156e196`](https://github.com/lodash/lodash/commit/156e1965ae78b121a88f81178ab81632304e8d64) feat: add renovate setup ([#6039](https://redirect.github.com/lodash/lodash/issues/6039))
- [`933e106`](https://github.com/lodash/lodash/commit/933e1061b8c344d3fc742cdc400175d5ffc99bce) ci: add pipeline for Bun ([#6023](https://redirect.github.com/lodash/lodash/issues/6023))
- [`072a807`](https://github.com/lodash/lodash/commit/072a807ff7ad8ffc7c1d2c3097266e815d138e20) docs: update links related to Open JS Foundation ([#5968](https://redirect.github.com/lodash/lodash/issues/5968))
- Additional commits viewable in [compare view](https://github.com/lodash/lodash/compare/4.17.21...4.17.23)

```
== RELEASE NOTES ==

Security Changes
* Upgrade lodash from 4.17.21 to 4.17.23 to address `CVE-2025-13465 <https://github.com/advisories/GHSA-xxjr-mmjv-4gpg>`_.
```

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…#26718) ## Description Added new documentation explaining how to use the Presto C++ engine. The documentation provides step-by-step instructions for configuring, and running the Presto C++ worker ## Motivation and Context There was no consolidated or beginner-friendly documentation for Presto C++ in the open-source project. Users often had difficulty understanding how to build and run the C++ worker, what dependencies were required, and how it integrates with a Presto coordinator. ## Impact There is no performance impact. ## Test Plan <!---Please fill in how you tested your change--> ## Contributor checklist - [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards). - [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced. - [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality. - [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines). - [ ] Adequate tests were added if applicable. - [ ] CI passed. - [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores). ## Release Notes ``` == NO RELEASE NOTE == ```
92776b9 to 971afcd
## Description Upgrade postgresql to version 42.7.9 ## Motivation and Context Using a more recent version helps avoid potential vulnerabilities and ensures we aren't relying on outdated or unsupported code. ## Impact <!---Describe any public API or user-facing feature change or any performance impact--> ## Test Plan <!---Please fill in how you tested your change--> ## Contributor checklist - [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards). - [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced. - [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality. - [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines). - [ ] Adequate tests were added if applicable. - [ ] CI passed. - [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores). ## Release Notes Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below. ``` == NO RELEASE NOTE == ```
…ons to sidecar for expression optimization (prestodb#27043) ## Description Avoid sending aggregate and window functions to sidecar for expression optimization. ## Motivation and Context Encountered while investigating prestodb#26920. The bug reported in the issue is different but the general idea is we should avoid sending aggregate and window functions to sidecar as they cannot be constant folded. The failing queries in the issue are added as test cases. ## Impact No impact. ## Test Plan Unit tests, CI. ``` == NO RELEASE NOTE == ```
## Description

This PR adds subfield pushdown optimization for the `cardinality()` function in Presto. When enabled, this optimization allows the query engine to skip reading map keys/values or array elements when only the cardinality (count) of these collections is needed.

This PR contains coordinator-side changes only; the corresponding worker-side changes will be added separately to the C++ worker. Since this feature is not yet fully tested end-to-end with the worker, the session property is disabled by default.

Additionally, this implementation takes a conservative approach to subfield pushdown for cardinality: if a column already has other subfields being accessed (e.g., `features['key']`), we skip adding the structure-only subfield for cardinality to avoid potential correctness issues.

Key Changes:

1. New StructureOnly PathElement (Subfield.java): Introduced a new path element type, represented as `[$]`, that indicates only the structural metadata (size/count) is needed, not the actual content
2. SubfieldTokenizer Update: Added parsing support for the `$` subscript pattern in subfield paths
3. FunctionResolution: Added `isCardinalityFunction()` method to identify cardinality function calls
4. PushdownSubfields Optimizer: Extended the subfield extraction logic to recognize `cardinality()` calls on maps and arrays, generating `[$]` subfield hints that downstream readers can use to skip content
5. Session/Config Properties: Added `pushdown_subfields_for_cardinality` configuration option (disabled by default)

## Motivation and Context

When queries only need to know the size of a map or array (e.g., `SELECT cardinality(features) FROM table` or `WHERE cardinality(tags) > 10`), there's no need to read the keys, the values, or both. This optimization helps reduce shuffles and improve query performance.
## Impact - Performance: Reduces I/O and deserialization overhead for queries using cardinality() on maps/arrays - Backward Compatible: Feature is disabled by default via optimizer.pushdown-subfield-for-cardinality config - No Breaking Changes: Existing behavior is preserved when the feature is disabled - Added a new session property `pushdown-subfield-for-cardinality` ## Test Plan Added comprehensive unit tests in TestHiveLogicalPlanner.java covering: - Simple cardinality pushdown for MAP - Verifies cardinality(x) generates x[$] subfield - Cardinality pushdown for ARRAY - Verifies array cardinality generates correct subfield - Cardinality in WHERE clause - Tests WHERE cardinality(features) > 10 - Cardinality in aggregation - Tests AVG(cardinality(data)) - Multiple cardinalities - Tests multiple cardinality calls in same query - Cardinality with complex expressions - Tests cardinality(tags) * 2 - Cardinality on nested structures - Tests transform(arr_of_maps, m -> cardinality(m)) - Cardinality combined with subscript access - Verifies that when both cardinality(features) and features['key'] are used, the specific subscript takes precedence (avoiding redundant structure-only reads) ## Contributor checklist - [x] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards). - [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced. - [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality. - [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines). 
- [x] Adequate tests were added if applicable. - [ ] CI passed. - [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores). ## Release Notes ``` == NO RELEASE NOTE == ```
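The conservative rule described in this PR (emit a structure-only `[$]` subfield for `cardinality(x)` unless the column already has other subfields) could be sketched roughly as follows. The class, method, and parameter names are hypothetical, and subfields are modeled as plain strings rather than Presto's `Subfield` type.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only; not the PushdownSubfields optimizer itself.
final class CardinalitySubfieldSketch {
    static final String STRUCTURE_ONLY = "[$]";

    private CardinalitySubfieldSketch() {}

    // existingSubfields: subfields already required for the column, e.g. features["key"].
    // Adds column[$] only when pushdown is enabled, cardinality() is called on the
    // column, and no other subfield of that column is being read.
    static Set<String> requiredSubfields(
            String column,
            Set<String> existingSubfields,
            boolean cardinalityCalled,
            boolean pushdownEnabled) {
        Set<String> result = new LinkedHashSet<>(existingSubfields);
        if (pushdownEnabled && cardinalityCalled && existingSubfields.isEmpty()) {
            result.add(column + STRUCTURE_ONLY);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(requiredSubfields("features", Set.of(), true, true));
    }
}
```

Skipping the hint when other subfields exist matches the stated goal of avoiding correctness issues from mixing structure-only and content reads.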
…restodb#27044)

## Description

For remote functions, we sometimes want to limit concurrency to avoid throttling the remote service. This PR adds session properties that set the number of tasks for a remote projection, so the plan looks like:

scan -> remote exchange (with specified number of tasks) -> remote project node -> remote exchange -> output

The remote project runs in a separate stage. There are two session properties:

- `remote_function_fixed_parallelism_task_count` specifies how many tasks to use
- `remote_function_names_for_fixed_parallelism` specifies the pattern of remote function names to match

## Motivation and Context

As in the description.

## Impact

Allows controlling the number of tasks for a remote project node.

## Test Plan

Unit tests and a local end-to-end test.

## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes

Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below.

```
== RELEASE NOTES ==

General Changes
* Add options to control the number of tasks for remote project node
```

## Summary by Sourcery

Add configurable fixed-parallelism support for remote function projections and wire it through planning, partitioning, and session properties.

New Features:
- Introduce session and config properties to control fixed parallelism for selected remote functions via regex-matched names and an optional task count.
- Extend exchange planning to insert bounded round-robin remote exchanges around qualifying remote project nodes based on the configured properties.

Enhancements:
- Augment system partitioning handles and exchange nodes to carry an optional partition count for fixed distributions and honor it when selecting nodes.

Tests:
- Add planner and configuration tests covering regex matching behavior for remote-function fixed parallelism and property mappings for the new optimizer settings.
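A minimal sketch of the matching step, under stated assumptions: the name pattern is treated as a regex that must match the whole function name, and an empty pattern or a non-positive task count means the feature is off. The class and method names are hypothetical; the real planner logic may differ.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Pattern;

// Illustrative only; mirrors the two session properties described above.
final class RemoteFunctionParallelismSketch {
    private RemoteFunctionParallelismSketch() {}

    // Returns the task count to use for a remote projection, or empty when
    // none of the projection's remote functions matches the configured pattern.
    static OptionalInt fixedTaskCount(List<String> remoteFunctionNames, String namePattern, int taskCount) {
        if (namePattern.isEmpty() || taskCount <= 0) {
            return OptionalInt.empty();
        }
        Pattern pattern = Pattern.compile(namePattern);
        boolean anyMatch = remoteFunctionNames.stream()
                .anyMatch(name -> pattern.matcher(name).matches());
        return anyMatch ? OptionalInt.of(taskCount) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(fixedTaskCount(List.of("remote.ml.score"), "remote\\.ml\\..*", 4));
    }
}
```

An empty result would leave the plan as is; a present value corresponds to inserting the bounded remote exchange below the remote project node.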
## Description

The Provisio plugin puts all the native plugins under `native-plugin/`, not `native-plugins/`.

## Motivation and Context

See the attached screenshot:

<img width="416" height="249" alt="Screenshot 2026-01-29 at 10 23 38 AM" src="https://github.com/user-attachments/assets/8edf2856-d61d-4b0c-8d27-20712c0ad044" />

## Impact

No user impact

## Test Plan

Docs only change

## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes

Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below.

```
== NO RELEASE NOTE ==
```
…restodb#27050)

## Description

Due to Iceberg issue apache/iceberg#15128, using a binary type as a partition column may cause partition bounds to be calculated incorrectly in the generated manifest files when data files are deleted. This can lead to incorrect results in subsequent queries. Therefore, we temporarily disable metadata deletion and thorough filter pushdown for varbinary columns. This restriction can be lifted once the Iceberg issue is resolved.

## Motivation and Context

Fix the bug that occurs when varbinary columns are used as partition columns in Iceberg.

## Impact

This change is not visible to users.

## Test Plan

- Newly added test case in `IcebergDistributedTestBase.testPartitionedByVarbinaryType` through `@DataProvider`, which would explicitly fail without this fix.

## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).
## Release Notes ``` == NO RELEASE NOTE == ``` ## Summary by Sourcery Guard Iceberg plan optimization from enforcing metadata constraints on VARBINARY-partitioned columns and strengthen test coverage for varbinary partitioning behavior. Bug Fixes: - Avoid pushing down column constraints into Iceberg partition specs for VARBINARY columns to prevent incorrect metadata-based deletions and query results when varbinary is used as a partition key. Tests: - Extend the varbinary partitioning integration test to cover multiple insert value orderings and updated expected partition counts via a TestNG data provider.
…in AddLocalExchanges (prestodb#26960)

We observed that using the parent preference in AddLocalExchanges can limit parallelism when the partition column of the parent preference has low cardinality. In a setup where a query is allowed to use many cores, limiting parallelism significantly affects query latency. More details can be found in prestodb#26961.

This PR makes three changes:

* It introduces a new feature config `localExchangeParentPreferenceStrategy` that has three values: ALWAYS, NEVER, and AUTOMATIC. The default value is ALWAYS (i.e., the current behavior).
* It makes AddLocalExchanges use the parent preference according to `localExchangeParentPreferenceStrategy`. With ALWAYS, it always uses the parent preference. With NEVER, it never uses the parent preference. With AUTOMATIC, it uses the parent preference only when the estimated cardinality is larger than the task concurrency. (If estimated stats are not available, the parent preference is not used.)
  - Note that the estimated stats are only calculated when `localExchangeParentPreferenceStrategy` is AUTOMATIC.
* It adds unit tests for the new config and for the local exchange change.

## Description
<!---Describe your changes in detail-->

## Motivation and Context
<!---Why is this change required? What problem does it solve?-->
<!---If it fixes an open issue, please link to the issue here.-->

## Impact
<!---Describe any public API or user-facing feature change or any performance impact-->

## Test Plan
<!---Please fill in how you tested your change-->

## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes

Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below.

```
== NO RELEASE NOTE ==
```

## Summary by Sourcery

Introduce a configurable strategy for using parent preferences in AddLocalExchanges and make local exchange partitioning for aggregations cost-aware based on estimated cardinality and task concurrency.

New Features:
- Add a local_exchange_parent_preference_strategy session/feature config to control how local exchanges use parent partitioning preferences with options ALWAYS, NEVER, and AUTOMATIC.
Enhancements: - Update AddLocalExchanges to optionally use stats-based decisions when applying parent partitioning preferences for aggregation local exchanges, leveraging the existing stats calculator. - Wire the stats calculator into AddLocalExchanges through PlanOptimizers to enable precomputation of plan statistics when the AUTOMATIC strategy is selected. Tests: - Add planner tests validating local exchange behavior under ALWAYS, NEVER, and AUTOMATIC parent preference strategies and different task concurrency settings. - Extend FeaturesConfig tests to cover default and explicit mappings for the new local_exchange_parent_preference_strategy config.
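The three strategy values described in this commit message can be read as a small decision function. The sketch below is only an illustration of that rule, not the code in AddLocalExchanges: the class name `ParentPreferenceSketch`, the method `useParentPreference`, and the use of `OptionalDouble` for the estimated cardinality are assumptions made for the example.

```java
import java.util.OptionalDouble;

public class ParentPreferenceSketch {
    // The three values named in the commit message.
    public enum Strategy { ALWAYS, NEVER, AUTOMATIC }

    // Hypothetical helper: decides whether a local exchange should honor the
    // parent's partitioning preference.
    public static boolean useParentPreference(
            Strategy strategy,
            OptionalDouble estimatedCardinality,
            int taskConcurrency) {
        switch (strategy) {
            case ALWAYS:
                return true;
            case NEVER:
                return false;
            case AUTOMATIC:
                // Missing stats mean the preference is not used. A NaN estimate
                // also yields false because NaN > x is always false.
                return estimatedCardinality.isPresent()
                        && estimatedCardinality.getAsDouble() > taskConcurrency;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        // Low-cardinality partition column: 4 distinct values, 16 concurrent drivers.
        System.out.println(useParentPreference(Strategy.AUTOMATIC, OptionalDouble.of(4), 16));    // false
        // High-cardinality partition column.
        System.out.println(useParentPreference(Strategy.AUTOMATIC, OptionalDouble.of(1000), 16)); // true
        // No estimate available.
        System.out.println(useParentPreference(Strategy.AUTOMATIC, OptionalDouble.empty(), 16));  // false
    }
}
```

Under this reading, AUTOMATIC falls back to the NEVER behavior whenever the estimate is missing, which matches the parenthetical note in the description.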
## Description
Remove unused code in the `presto-hive-metastore` module.

## Motivation and Context
Remove unused code in the `presto-hive-metastore` module.

## Impact
Maintenance

## Test Plan
None

## Release Notes
```
== NO RELEASE NOTE ==
```

## Summary by Sourcery
Enhancements:
- Clean up the in-memory caching Hive metastore by removing an unused method for invalidating stale partitions.
## Description
Previously, the Iceberg connector link did not point to a valid page. This change fixes the issue by mapping it to the Iceberg connector documentation page.

## Motivation and Context
The previous documentation link for the Iceberg connector was invalid, which could confuse users trying to navigate to the connector documentation. This change ensures the link points to the correct, valid page.

## Impact
Documentation-only change. No public API, user-facing behavior, or performance impact.

## Test Plan
Verified the updated link points to the correct Iceberg connector documentation page.

## Contributor checklist
- [x] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] Adequate tests were added if applicable.
- [x] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes
```
== NO RELEASE NOTE ==
```
…er writer (prestodb#26989)

## Description
For INSERT/CTAS operations on Iceberg tables with a large number of partitions, the partition count per writer can far exceed 100. In such cases, we may want the operation to succeed rather than fail fast, for example when the data volume is known to be small, or when we are willing to trade speed for lower memory usage by reducing the value of `parquet_writer_block_size` or `orc_optimized_writer_max_stripe_size`.

Currently, the only way to configure this limit is through the connector property `iceberg.max-partitions-per-writer`, which requires a cluster restart to take effect and applies globally to all SQL statements and sessions.

This PR introduces the corresponding Iceberg connector session property `max_partitions_per_writer` to set the maximum number of partitions per writer. This is a lighter and more flexible approach: adjustments take effect immediately, without a restart.

## Motivation and Context
Provide per-session or even per-statement configuration to adjust insert behavior and avoid failures.

## Impact
Users can now set the maximum number of partitions per writer via the `SET SESSION` statement.

## Test Plan
- Newly added test cases show the effect of the session property in CTAS/INSERT statements.

## Release Notes
```
== NO RELEASE NOTE ==
```
```
== NO RELEASE NOTE ==
```

## Summary by Sourcery
Chores:
- Update the Velox submodule reference used by presto-native-execution to the latest desired commit.

---------

Co-authored-by: Ping Liu <lpingbj@cn.ibm.com>
Co-authored-by: Christian Zentgraf <czentgr@us.ibm.com>
…riter (prestodb#27054)

Summary: Session property to control the file size for Presto writers

Differential Revision: D91361183

## Summary by Sourcery
New Features:
- Introduce a NATIVE_MAX_TARGET_FILE_SIZE session property to control when writers roll over to a new output file based on size.

### Release Notes
```
== RELEASE NOTES ==
Prestissimo (Native Execution) Changes
* Add ``native_max_target_file_size`` session property to control the maximum target file size for writers. When a file exceeds this size during writing, the writer will close the current file and start writing to a new file.
```
```
== NO RELEASE NOTE ==
```

## Summary by Sourcery
Chores:
- Update the Velox submodule reference used by presto-native-execution.
Summary: Fix unnecessary copies in the Presto HTTP module:
- Use std::move() for shared_ptr, SSLContextPtr, and callback assignments
- Use const reference for path variable to avoid copy from getPath()

These changes eliminate unnecessary copy operations and improve performance.

```
== NO RELEASE NOTE ==
```

## Summary by Sourcery
Address performance-related cleanups in the Presto HTTP client and server by eliminating unnecessary copies of objects and strings.

Enhancements:
- Move HTTP client and server callbacks, timers, and context objects instead of copying to avoid redundant allocations and ownership transfers.
- Bind the HTTP request path as a const reference rather than copying the string when dispatching request handlers.
…TRY() (prestodb#26976)

Add a session property to control whether the TRY() function can catch errors from remote function execution. This allows users to enable error catching for remote functions on a per-session basis.

Changes:
- Add TRY_CATCH_REMOTE_FUNCTION_ERRORS constant to SystemSessionProperties
- Add isTryCatchRemoteFunctionErrors() to FeaturesConfig with default false
- Add isTryCatchRemoteFunctionErrorsEnabled() getter for session access
- Add unit test for the new config property

```
== NO RELEASE NOTE ==
```
…restodb#27067)

## Description
The new session property allows Partitioned Output Velox operators to flush (return) data eagerly, as soon as it arrives. This matches the default Presto Java behavior of returning results eagerly to the caller while the query is still running (scanning).

## Motivation and Context
For "needle in a haystack" queries run from various UIs, this early return functionality is crucial.

## Test Plan
Existing session property test. Ran the custom build in a Prestissimo cluster to ensure the session property changes query behavior accordingly.

```
== NO RELEASE NOTE ==
```

## Summary by Sourcery
Add a native session property to control eager flushing behavior of partitioned output operators.

New Features:
- Introduce the native_partitioned_output_eager_flush session property to enable eager flushing of PartitionedOutput operator rows in native execution.

Documentation:
- Document the native_partitioned_output_eager_flush session property in the Presto native session properties reference.

Tests:
- Extend session property mapping tests to cover the new native_partitioned_output_eager_flush property.
prestodb#27059)

Summary: The MV query optimizer fails to rewrite queries when the specified table name differs between the MV definition and the incoming query (ex: `base_table` vs `schema.base_table`). This fix resolves table references to schema-qualified names, ensuring consistent table matching regardless of how the table was specified.

Reviewed By: zation99

Differential Revision: D91699496

## Summary by Sourcery
Ensure materialized view query optimization consistently matches base tables regardless of schema qualification in table names.

Bug Fixes:
- Fix materialized view rewrites failing when base tables are referenced with different schema qualifications between the MV definition and the incoming query.

Tests:
- Add coverage to verify materialized view query optimization works when base tables are referenced both with and without schema-qualified names in various query shapes.

## Release Notes
```
== RELEASE NOTES ==
General Changes
* Fix MV query optimizer by correctly resolving table references to schema-qualified names.
```
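The idea behind this fix, comparing table references only after both have been resolved to schema-qualified names, can be shown with a minimal sketch. It is not the optimizer's actual code: the class `QualifiedTableNameSketch` and its methods are hypothetical, and the sketch deliberately ignores catalogs and quoted identifiers.

```java
import java.util.Locale;

public class QualifiedTableNameSketch {
    // Hypothetical helper: prefixes an unqualified table name with the
    // session's schema; names that already contain a schema are kept.
    public static String qualify(String tableName, String sessionSchema) {
        String name = tableName.toLowerCase(Locale.ENGLISH);
        if (name.contains(".")) {
            return name;
        }
        return sessionSchema.toLowerCase(Locale.ENGLISH) + "." + name;
    }

    // Two references match only if their schema-qualified forms are equal.
    public static boolean sameTable(String left, String right, String sessionSchema) {
        return qualify(left, sessionSchema).equals(qualify(right, sessionSchema));
    }

    public static void main(String[] args) {
        // The case from the summary: `base_table` vs `schema.base_table`.
        System.out.println(sameTable("base_table", "schema.base_table", "schema")); // true
        // Same table name in a different schema must not match.
        System.out.println(sameTable("base_table", "other.base_table", "schema"));  // false
    }
}
```

Comparing raw strings would report a mismatch in the first case, which is the failure mode the summary describes.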
…odb#26905)

Summary: Ported the IpPrefix and IpAddress tests in https://github.com/prestodb/presto/blob/master/presto-main-base/src/test/java/com/facebook/presto/operator/scalar/TestIpPrefixFunctions.java to run with the Presto Native engine in presto-native-tests.

This is a continuation of the work to refactor scalar function tests from `presto-main-base` to `presto-main-tests` from this PR: prestodb#26013

Also moved IpPrefixType and IpAddressType into `presto-common` from `presto-main-base` due to some dependency cycles that appeared after refactoring.

== NO RELEASE NOTE ==
## Description
Fix for prestodb#26685
Fix for prestodb#26808
```
== NO RELEASE NOTE ==
```

## Summary by Sourcery
Add configurable shard count for the async data cache and wire it through server initialization.

New Features:
- Introduce a new system config option to control the number of async cache shards with a default value.
- Expose the async cache shard count to the async data cache options during server initialization.

Tests:
- Add unit tests covering default and custom values for the async cache shard count system config.
…6951)

## Description
Fixes Velox to Presto `IN` expression conversion. When the `IN-list` is constant, the Velox expression representation uses a constant expression with an array vector to store the list (see conversion [here](https://github.com/prestodb/presto/blob/4e91f155d0f4704325552fac3807da0efdba6a35/presto-native-execution/presto_cpp/main/types/PrestoToVeloxExpr.cpp#L780)). The Presto `IN` expression expects the values from a constant `IN-list` to be distinct arguments to the `SpecialFormExpression`. The `VeloxToPrestoExpr` is modified accordingly.

## Motivation and Context
Resolves prestodb#26921.

## Impact
Fixes bug with `IN` expression in native expression optimizer.

## Test Plan
Added e2e test.

```
== NO RELEASE NOTE ==
```

## Summary by Sourcery
Fix Velox-to-Presto conversion of IN expressions to correctly construct Presto special form arguments and add coverage for the native expression optimizer.

Bug Fixes:
- Correct Velox IN expression conversion when the IN-list is represented as a constant array so Presto receives individual arguments instead of a single array-typed constant.

Tests:
- Add an end-to-end test ensuring IN expressions are handled correctly by the native expression optimizer in the sidecar plugin test suite.
…restodb#26978)

## Description
Velox now supports the `KHyperLogLog` type (ref: facebookincubator/velox@1165703). Adds support for this type to the `NativeTypeManager`. Also adds `KHyperLogLog` to `StandardTypes` in `presto-common` to avoid a dependency on `presto-main-base` in `presto-native-sidecar-plugin`.

## Motivation and Context
Fix test failure uncovered in `presto-native-tests`. Required for prestodb#23671.

## Impact
Queries with `KHyperLogLog` won't fail on sidecar-enabled Presto C++ deployments.

## Test Plan
Added e2e test.

```
== NO RELEASE NOTE ==
```
Introduce support for Azure storage backends including Azure Blob Storage (using the wasbs:// scheme) and Azure Data Lake Storage Gen2 (using the abfss:// scheme) in the Hive connector.

Key changes:
- Added HiveAzureConfigurationInitializer to inject relevant Azure configurations into Hadoop Configuration
- Introduced HiveAzureConfig to allow catalog-level configuration of Azure properties
- Updated HdfsConfigurationInitializer and HiveConnectorFactory to delegate Azure-specific config setup
- Registered configuration initializer in Hive module

Supports shared key and OAuth2-based authentication.
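To make the two authentication modes concrete, here is a hedged sketch of the kind of Hadoop settings an initializer like this would need to inject. It does not reproduce HiveAzureConfigurationInitializer, whose code and catalog property names are not shown on this page. The class `AzureHadoopSettingsSketch` and its methods are hypothetical, and a plain `Map` stands in for Hadoop's `Configuration` so the example has no dependencies. The `fs.azure.*` keys are the per-account settings documented for the hadoop-azure WASB and ABFS drivers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AzureHadoopSettingsSketch {
    // Shared key authentication: one account key entry per endpoint type.
    public static Map<String, String> sharedKeySettings(String account, String accessKey) {
        Map<String, String> settings = new LinkedHashMap<>();
        // Blob Storage endpoint, used by the wasb/wasbs schemes.
        settings.put("fs.azure.account.key." + account + ".blob.core.windows.net", accessKey);
        // ADLS Gen2 endpoint, used by the abfs/abfss schemes.
        settings.put("fs.azure.account.key." + account + ".dfs.core.windows.net", accessKey);
        return settings;
    }

    // OAuth2 client-credentials authentication for ADLS Gen2.
    public static Map<String, String> oauthSettings(
            String account, String clientId, String clientSecret, String tokenEndpoint) {
        String suffix = "." + account + ".dfs.core.windows.net";
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("fs.azure.account.auth.type" + suffix, "OAuth");
        settings.put("fs.azure.account.oauth.provider.type" + suffix,
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
        settings.put("fs.azure.account.oauth2.client.id" + suffix, clientId);
        settings.put("fs.azure.account.oauth2.client.secret" + suffix, clientSecret);
        settings.put("fs.azure.account.oauth2.client.endpoint" + suffix, tokenEndpoint);
        return settings;
    }

    public static void main(String[] args) {
        // Placeholder values only; real credentials would come from catalog configuration.
        sharedKeySettings("myaccount", "<access-key>")
                .forEach((key, value) -> System.out.println(key + " = " + value));
        oauthSettings("myaccount", "<client-id>", "<client-secret>", "<token-endpoint>")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In a real initializer, each entry would presumably be applied with `Configuration.set(key, value)`. Scoping the keys to one storage account keeps the credentials of different accounts from overriding each other.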
971afcd to
7b41678
Compare
Description
Introduce support for Azure storage backends including Azure Blob Storage (using the wasb:// scheme) and Azure Data Lake Storage Gen2 (using the abfs:// scheme) in the Hive connector.
Key changes:
Supports shared key and OAuth2-based authentication.
Motivation and Context
Several enterprise data lake workloads are hosted on Azure storage platforms. This change allows Presto to directly query data from Azure Blob Storage and ADLS Gen2, bringing Azure compatibility in line with other cloud storage systems like Amazon S3 and Google Cloud Storage.
Impact
No breaking changes to existing Hive catalogs or connectors
Test Plan
Tested through the Hive connector.
Contributor checklist
== RELEASE NOTES ==