
Conversation

joannalauu (Contributor) commented Feb 2, 2026

What changed?
Implement the SQL plugin interface and add PostgreSQL support for DomainAudit.

  • Created the domain_audit_log table in the PostgreSQL schema
  • Created the `DomainAuditLogRow` and `DomainAuditLogFilter` data structures in sqlplugin
  • Added the `InsertIntoDomainAuditLog` and `SelectFromDomainAuditLogs` functions to the sqlplugin interfaces
  • Implemented `InsertIntoDomainAuditLog` and `SelectFromDomainAuditLogs` in the PostgreSQL plugin
  • Updated the SQL factory to create `newSQLDomainAuditStore`

Fixes: #7602

Why?
DomainAudit allows all modifications made to a domain (e.g., failovers) to be stored and retrieved. Previously, this was supported only by NoSQL databases (Cassandra, MongoDB, DynamoDB). The SQL plugin is now configured to handle DomainAudit requests, so other SQL databases can also be easily configured to support DomainAudit; this PR adds that support for PostgreSQL.

How did you test it?

  • Unit tests: go test -v ./common/persistence/sql -run "TestCreateDomainAuditLog|TestGetDomainAuditLogs|TestFactoryNewDomainAuditStore|TestInsertIntoDomainAuditLog|TestSelectFromDomainAuditLogs|TestGetDataBlobEncoding|TestGetDataBlobBytes"
  • Integration tests: Ran TestPostgresSQLDomainAuditPersistence on GitHub Actions

Potential risks

  • The PostgreSQL schema is updated with a new table; the change can be rolled back by releasing the previous PostgreSQL schema version

Release notes
Added PostgreSQL support for DomainAudit

Documentation Changes
N/A


Potential Breaking Change

Detailed Description

  • Created the domain_audit_log table in the PostgreSQL schema
  • Created the `DomainAuditLogRow` and `DomainAuditLogFilter` data structures in sqlplugin
  • Added the `InsertIntoDomainAuditLog` and `SelectFromDomainAuditLogs` functions to the sqlplugin interfaces
  • Implemented `InsertIntoDomainAuditLog` and `SelectFromDomainAuditLogs` in the PostgreSQL plugin
  • Updated the SQL factory to create `newSQLDomainAuditStore`

Impact Analysis

  • Backward Compatibility: Schema changes can be rolled back by releasing the previous PostgreSQL schema version
  • Forward Compatibility: The new table doesn't affect the existing schema

Testing Plan

  • Unit Tests: go test -v ./common/persistence/sql -run "TestCreateDomainAuditLog|TestGetDomainAuditLogs|TestFactoryNewDomainAuditStore|TestInsertIntoDomainAuditLog|TestSelectFromDomainAuditLogs|TestGetDataBlobEncoding|TestGetDataBlobBytes"
  • Persistence Tests: Ran TestPostgresSQLDomainAuditPersistence on GitHub Actions
  • Integration Tests: [Do we have integration test covering the change?]
  • Compatibility Tests: [Have we done tests to test the backward and forward compatibility?]

Rollout Plan

  • What is the rollout plan?
  • Does the order of deployment matter?
  • Is it safe to rollback? Does the order of rollback matter? Schema changes can be rolled back by releasing the previous PostgreSQL schema version
  • Is there a kill switch to mitigate the impact immediately? Changing NewDomainAuditStore() to return nil, nil should revert the effects of all changes; see the sketch after this list
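
For illustration, a minimal sketch of that kill switch, assuming the factory method shape described in this PR (the exact signature is an assumption):

func (f *Factory) NewDomainAuditStore() (persistence.DomainAuditStore, error) {
    // Hypothetical kill switch: report the store as unavailable instead of
    // constructing it, so callers skip DomainAudit entirely.
    return nil, nil
}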

Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

c-warren (Contributor) commented Feb 3, 2026

👋 Thanks! I've got this in my plans to review; unfortunately, I've been blocked by some on-call work. Will get to this ASAP!

argIndex := 5

// Handle pagination token
if filter.PageTokenMinCreatedTime != nil && filter.PageTokenMinEventID != nil {
Member:

The logic for handling pagination should be done inside sql_domain_audit_store.go to avoid duplicating it inside each SQL plugin.
You can check the code in sql_execution_store.go to see how we handle pagination that way.

Contributor:

Here is the relevant function:

func (m *sqlExecutionStore) ListConcreteExecutions(
    ctx context.Context,
    request *p.ListConcreteExecutionsRequest,
) (*p.InternalListConcreteExecutionsResponse, error) {
    filter := &sqlplugin.ExecutionsFilter{}
    if len(request.PageToken) > 0 {
        err := gobDeserialize(request.PageToken, &filter)
        if err != nil {
            return nil, &types.InternalServiceError{
                Message: fmt.Sprintf("ListConcreteExecutions failed. Error: %v", err),
            }
        }
    } else {
        filter = &sqlplugin.ExecutionsFilter{
            ShardID:    m.shardID,
            WorkflowID: "",
        }
    }
    filter.Size = request.PageSize
    executions, err := m.db.SelectFromExecutions(ctx, filter)
    if err != nil {
        if err == sql.ErrNoRows {
            return &p.InternalListConcreteExecutionsResponse{}, nil
        }
        return nil, convertCommonErrors(m.db, "ListConcreteExecutions", "", err)
    }
    if len(executions) == 0 {
        return &p.InternalListConcreteExecutionsResponse{}, nil
    }
    lastExecution := executions[len(executions)-1]
    nextFilter := &sqlplugin.ExecutionsFilter{
        ShardID:    m.shardID,
        WorkflowID: lastExecution.WorkflowID,
    }
    token, err := gobSerialize(nextFilter)
    if err != nil {
        return nil, &types.InternalServiceError{
            Message: fmt.Sprintf("ListConcreteExecutions failed. Error: %v", err),
        }
    }
    concreteExecutions, err := m.populateInternalListConcreteExecutions(executions)
    if err != nil {
        return nil, &types.InternalServiceError{
            Message: fmt.Sprintf("ListConcreteExecutions failed. Error: %v", err),
        }
    }
    return &p.InternalListConcreteExecutionsResponse{
        Executions:    concreteExecutions,
        NextPageToken: token,
    }, nil
}

At the SQL store level we perform the pagination (add the page token to the filter, handle the result of the request, and return the next page token), while at the datastore implementation level (e.g. PostgreSQL) we just implement a list query:

func (pdb *db) SelectFromExecutions(ctx context.Context, filter *sqlplugin.ExecutionsFilter) ([]sqlplugin.ExecutionsRow, error) {
    dbShardID := sqlplugin.GetDBShardIDFromHistoryShardID(int(filter.ShardID), pdb.GetTotalNumDBShards())
    var rows []sqlplugin.ExecutionsRow
    var err error
    if len(filter.DomainID) == 0 && filter.Size > 0 {
        err = pdb.driver.SelectContext(ctx, dbShardID, &rows, listExecutionQuery, filter.ShardID, filter.WorkflowID, filter.Size)
        if err != nil {
            return nil, err
        }
    } else {
        var row sqlplugin.ExecutionsRow
        err = pdb.driver.GetContext(ctx, dbShardID, &row, getExecutionQuery, filter.ShardID, filter.DomainID, filter.WorkflowID, filter.RunID)
        if err != nil {
            return nil, err
        }
        rows = append(rows, row)
    }
    return rows, err
}

gitar-bot commented Feb 12, 2026

Code Review ✅ Approved (6 resolved / 6 findings)

Well-structured PostgreSQL DomainAudit implementation with good test coverage. Previous findings have been addressed: nil DataBlob now correctly maps to empty bytes for NOT NULL columns, and SelectContext is used instead of GetContext for multi-row queries. The "SchemaUpdateCqlFiles" manifest key is a project-wide convention, not a bug.

✅ 6 resolved
Bug: Schema NOT NULL on state_before/state_after vs nil DataBlob

📄 common/persistence/sql/sql_domain_audit_store.go:183-188 📄 schema/postgres/cadence/versioned/v0.7/domain_audit_log.sql:5-8 📄 schema/postgres/cadence/schema.sql:281-284
The domain_audit_log table defines state_before BYTEA NOT NULL and state_after BYTEA NOT NULL, but the Go code allows these blobs to be nil. In sql_domain_audit_store.go, getDataBlobBytes() returns nil when the DataBlob is nil. When inserting, a Go nil byte slice is sent as SQL NULL, which will violate the NOT NULL constraint.

This is a real scenario: for domain creation operations, StateBefore will be nil (there's no prior state). The domain_audit_manager.go only sets stateBeforeBlob if request.StateBefore != nil, and similarly for stateAfterBlob.

Fix options:

  1. Change the schema to allow NULL: state_before BYTEA (without NOT NULL), or
  2. Change getDataBlobBytes to return an empty byte slice instead of nil when the blob is nil:
    func getDataBlobBytes(blob *persistence.DataBlob) []byte {
        if blob == nil {
            return []byte{}
        }
        return blob.Data
    }

Option 2 is simpler and doesn't require a schema change. The read path already handles empty byte slices correctly (len(row.StateBefore) > 0 check).

Bug: Test asserts query has no LIMIT but _getDomainAuditLogsQuery has LIMIT 1

📄 common/persistence/sql/sqlplugin/postgres/domain_audit_log_test.go:285 📄 common/persistence/sql/sqlplugin/postgres/domain_audit_log.go:52
The test case "get success with no results" at line 285 asserts assert.NotContains(t, query, "LIMIT"), but the _getDomainAuditLogsQuery used in the PageSize == 0 path clearly contains LIMIT 1 (line 52 of domain_audit_log.go). This assertion will always fail, indicating either:

  1. The test assertion is wrong and should be assert.Contains(t, query, "LIMIT 1"), or
  2. The intent was that the non-paginated path should NOT have a LIMIT clause, which means the _getDomainAuditLogsQuery should not have LIMIT 1.

Since testify/assert (not require) is used, the test will continue executing after this failure rather than stopping, masking the issue.

Bug: GetContext with multi-row query silently drops rows or errors

📄 common/persistence/sql/sqlplugin/postgres/domain_audit_log.go:96-105
When PageSize is 0 (or unset), SelectFromDomainAuditLogs uses pdb.driver.GetContext() with _getDomainAuditLogsQuery, which has no LIMIT clause and can return multiple rows. However, GetContext delegates to sqlx.DB.GetContext, which is designed for single-row results (like QueryRow). If the query returns more than one row, sqlx.Get will return an error like sql: expected 1 destination arguments in Scan, not N. If it returns exactly one row, only that single row is returned, but the intent appears to be returning all matching rows.

This means the "no pagination" path (PageSize == 0) is fundamentally broken — it will either error out or only return a single result when multiple exist.

The fix should use SelectContext (for multi-row results) instead of GetContext, and scan into a slice rather than a single pointer.
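
A minimal sketch of that fix, mirroring the SelectFromExecutions example earlier in this thread; the shard ID and query arguments here are placeholders rather than the actual implementation:

func (pdb *db) SelectFromDomainAuditLogs(ctx context.Context, filter *sqlplugin.DomainAuditLogFilter) ([]*sqlplugin.DomainAuditLogRow, error) {
    // SelectContext scans every matching row into the slice, whereas
    // GetContext only ever scans a single row.
    var rows []*sqlplugin.DomainAuditLogRow
    dbShardID := 0 // placeholder; the real code derives this from the plugin
    err := pdb.driver.SelectContext(ctx, dbShardID, &rows, _getDomainAuditLogsQuery,
        filter.DomainID, filter.MinCreatedTime, filter.MaxCreatedTime)
    if err != nil {
        return nil, err
    }
    return rows, nil
}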

Bug: Inconsistent default time calculation for end time

📄 common/persistence/sql/sqlplugin/postgres/domain_audit_log.go:64-65
The default end time is calculated using time.Unix(0, time.Now().UnixNano()) which is equivalent to just time.Now(). However, the default start time is time.Unix(0, 0) (Unix epoch).

Issue: While this works, the code is unnecessarily complex and could be simplified. More importantly, since created_time < $4 is used (exclusive), if a caller doesn't provide a MaxCreatedTime and wants to include all records up to now, they could miss records created at exactly the same nanosecond the query is executed (though this is extremely rare).

Suggested fix: For clarity and consistency, simplify to:

start := time.Time{} // zero time 
end := time.Now().Add(time.Minute) // slight buffer to ensure "now" records are included

if filter.MinCreatedTime != nil {
    start = *filter.MinCreatedTime
}
if filter.MaxCreatedTime != nil {
    end = *filter.MaxCreatedTime
}

Or consider using <= instead of < for the upper bound comparison in the query.

Quality: MySQL domain audit log methods silently return nil

📄 common/persistence/sql/sqlplugin/mysql/domain_audit_log.go:30-43
The MySQL implementation of InsertIntoDomainAuditLog and SelectFromDomainAuditLogs returns nil, nil without any error, which silently fails. This could lead to data loss since callers would believe the operation succeeded when in fact no data was persisted.

Impact: If MySQL is used as the persistence layer and domain audit logs are enabled, the caller will think the audit log was created successfully, but no data will be stored. This is a silent failure mode that could go undetected in production.

Suggested fix: Return an appropriate error indicating the feature is not yet implemented:

func (mdb *DB) InsertIntoDomainAuditLog(ctx context.Context, row *sqlplugin.DomainAuditLogRow) (sql.Result, error) {
    return nil, fmt.Errorf("domain audit log is not implemented for MySQL")
}

func (mdb *DB) SelectFromDomainAuditLogs(ctx context.Context, filter *sqlplugin.DomainAuditLogFilter) ([]*sqlplugin.DomainAuditLogRow, error) {
    return nil, fmt.Errorf("domain audit log is not implemented for MySQL")
}

Alternatively, consider adding a feature flag or capability check that allows callers to verify if domain audit is supported by the persistence layer.

  • ✅ Manifest uses incorrect key name "SchemaUpdateCqlFiles"

joannalauu requested a review from Shaddoll on February 12, 2026
c-warren (Contributor) left a comment:

Looking good! Thanks for making those changes.

Comment on lines +2 to +3
domain_id BYTEA NOT NULL,
event_id BYTEA NOT NULL,
Contributor:

I'd err on the side of TEXT here so that the IDs are human-readable natively in a database's tooling.

identity TEXT NOT NULL,
identity_type TEXT NOT NULL,
comment TEXT NOT NULL DEFAULT '',
PRIMARY KEY (domain_id, operation_type, created_time, event_id)
Contributor:

We have a composite primary key in Cassandra because its support for secondary indexes is limited (in our implementation, anyway). Because of that, we have to stuff everything we want to query by into the primary key. For this implementation I think it's fine to keep that behaviour, since we will be passing all of these values from the layer above Postgres, but you could equivalently just use the event_id as the primary key so that it is possible to fetch a row purely by event_id.

I'd also recommend adding secondary indexes on created_time and operation_type so that we can filter those records effectively when building other views (e.g. the most recent updates to visibility).
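
For illustration, a sketch of those indexes in the PostgreSQL schema (the index names are hypothetical):

-- Hypothetical secondary indexes supporting time- and operation-based
-- filtering on domain_audit_log. Names are illustrative only.
CREATE INDEX domain_audit_log_created_time_idx ON domain_audit_log (created_time);
CREATE INDEX domain_audit_log_operation_type_idx ON domain_audit_log (operation_type);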

DomainID serialization.UUID
EventID serialization.UUID
StateBefore []byte
StateBeforeEncoding string
Contributor:

We have an encoding type in commonconsts which we should use here; we want to ensure that we're using a valid/supported encoding.
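
For illustration, a sketch of what that could look like on the row struct, using the constants.EncodingType already referenced in the read path later in this review (the field names mirror the snippet above):

// Hypothetical sketch: type the encoding fields with the shared
// EncodingType instead of raw strings, so only valid/supported
// encodings can be stored on the row.
type DomainAuditLogRow struct {
    DomainID            serialization.UUID
    EventID             serialization.UUID
    StateBefore         []byte
    StateBeforeEncoding constants.EncodingType
    StateAfter          []byte
    StateAfterEncoding  constants.EncodingType
}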

StateBefore []byte
StateBeforeEncoding string
StateAfter []byte
StateAfterEncoding string
Contributor:

Same here

Comment on lines +40 to +41
domainID := serialization.MustParseUUID("d1111111-1111-1111-1111-111111111111")
eventID := serialization.MustParseUUID("e1111111-1111-1111-1111-111111111111")
Contributor:

NIT: I recommend generating a unique UUID whenever the test runs, as a best practice. It prevents hardcoding/reusing expected UUIDs and ensures the code path can't be affected by any other tests. In this particular test it doesn't matter much, but it's something to keep in mind.
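
For example, a sketch that generates fresh IDs on each run (the uuid package is an assumption; whichever UUID library the repo already uses would work equally well):

// Hypothetical sketch: generate unique IDs per run instead of hardcoding
// them, so fixtures can't collide across tests.
// Assumes "github.com/google/uuid" is imported as uuid.
domainID := serialization.MustParseUUID(uuid.New().String())
eventID := serialization.MustParseUUID(uuid.New().String())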

Comment on lines +149 to +172
auditLog := &persistence.InternalDomainAuditLog{
    EventID:         row.EventID.String(),
    DomainID:        row.DomainID.String(),
    OperationType:   row.OperationType,
    CreatedTime:     row.CreatedTime,
    LastUpdatedTime: row.LastUpdatedTime,
    Identity:        row.Identity,
    IdentityType:    row.IdentityType,
    Comment:         row.Comment,
}

if len(row.StateBefore) > 0 {
    auditLog.StateBefore = &persistence.DataBlob{
        Encoding: constants.EncodingType(row.StateBeforeEncoding),
        Data:     row.StateBefore,
    }
}

if len(row.StateAfter) > 0 {
    auditLog.StateAfter = &persistence.DataBlob{
        Encoding: constants.EncodingType(row.StateAfterEncoding),
        Data:     row.StateAfter,
    }
}
Contributor:

This would be handy to have as a common deserialization function, for example for whenever we add GetAuditLog(event_id). It'll also make it easier to test deserialization from a PostgreSQL row and any error handling we have.
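
A sketch of such a helper, reusing the mapping from the snippet above verbatim (the function name is hypothetical):

// Hypothetical helper: keep the row-to-internal mapping in one place so it
// can be reused (e.g. by a future GetAuditLog) and unit tested directly.
func domainAuditLogRowToInternal(row *sqlplugin.DomainAuditLogRow) *persistence.InternalDomainAuditLog {
    auditLog := &persistence.InternalDomainAuditLog{
        EventID:         row.EventID.String(),
        DomainID:        row.DomainID.String(),
        OperationType:   row.OperationType,
        CreatedTime:     row.CreatedTime,
        LastUpdatedTime: row.LastUpdatedTime,
        Identity:        row.Identity,
        IdentityType:    row.IdentityType,
        Comment:         row.Comment,
    }
    if len(row.StateBefore) > 0 {
        auditLog.StateBefore = &persistence.DataBlob{
            Encoding: constants.EncodingType(row.StateBeforeEncoding),
            Data:     row.StateBefore,
        }
    }
    if len(row.StateAfter) > 0 {
        auditLog.StateAfter = &persistence.DataBlob{
            Encoding: constants.EncodingType(row.StateAfterEncoding),
            Data:     row.StateAfter,
        }
    }
    return auditLog
}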
