Skip to content

Avoid always adding columns as optional on schema evolution#15205

Open
kumarpritam863 wants to merge 13 commits intoapache:mainfrom
kumarpritam863:take_column_optional_value_from_schema
Open

Avoid always adding columns as optional on schema evolution#15205
kumarpritam863 wants to merge 13 commits intoapache:mainfrom
kumarpritam863:take_column_optional_value_from_schema

Conversation

@kumarpritam863
Copy link
Contributor

Problem

When performing schema evolution (adding a new field), the code currently calls addColumn()addInternalColumn(), which always marks the new column as OPTIONAL (i.e. nullable = true), regardless of what the Connect schema says.

This behaviour is inconsistent with table creation:

  • On initial table creation we correctly respect schema.isOptional() (and the related configuration) to decide whether a column should be REQUIRED or OPTIONAL.
  • On schema evolution we ignore that and force the new column to be OPTIONAL.

As a result:

  • If you create the table directly with schema v2 (which already contains the new field), the column is created with the correct nullability.
  • If you create the table with v1 and later evolve to v2, the same column is added as OPTIONAL → different physical schema.

Solution

This PR aligns schema-evolution nullability handling with the logic used during initial table creation.

Now the same code path (respecting ConnectSchema.isOptional() and the configured default for required fields) is used in both cases, eliminating the dual behaviour and guaranteeing that the resulting Parquet files have identical column nullability no matter whether the field was present at creation time or added later via schema evolution.

Result

  • Consistent Parquet schema across creation and evolution paths
  • New columns respect the nullability declared in the Connect schema
  • No more surprise OPTIONAL columns when the source schema says the field is required

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an inconsistency in schema evolution where new columns were always added as optional, regardless of the Connect schema's nullability specification. The fix ensures that column nullability is handled consistently whether fields are present at table creation or added later via schema evolution.

Changes:

  • Modified SchemaUpdate.AddColumn to track nullability via a new isOptional boolean field
  • Updated SchemaUtils.commitSchemaUpdates() to call either addColumn() or addRequiredColumn() based on the field's optionality
  • Added comprehensive test coverage for optional and required column handling in both flat and nested structures

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
SchemaUpdate.java Added isOptional field and parameter to AddColumn class to track nullability
SchemaUtils.java Modified commit logic to respect column optionality when applying schema updates
RecordConverter.java Updated addColumn() calls to pass nullability information based on Connect schema
TestSchemaUpdate.java Added test coverage for required column handling
TestSchemaUtils.java Updated existing tests and added new tests for required column scenarios
TestRecordConverter.java Added comprehensive tests for various column optionality scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant