
Conversation

@dorianvp
Member

@dorianvp dorianvp commented Oct 8, 2025

On top of #592.

Motivation

#210.

This PR also specifies the algorithm used to generate the UTXO set snapshot hash.

Solution

Implement gettxoutsetinfo.

PR Checklist

  • The PR name is suitable for the release notes.
  • The solution is tested.
  • The documentation is up to date.

@dorianvp dorianvp self-assigned this Oct 8, 2025
@dorianvp dorianvp linked an issue Oct 8, 2025 that may be closed by this pull request
@dorianvp dorianvp force-pushed the feat/rpc-gettxoutsetinfo branch from 876bba8 to 1a4a340 Compare October 10, 2025 03:21
@dorianvp
Member Author

This PR is mostly ready. I've only left a couple of review comments about upgrading some types, but functionally it is ready for review.

@dorianvp dorianvp marked this pull request as ready for review October 14, 2025 05:04
@dorianvp dorianvp requested review from AloeareV and idky137 October 14, 2025 05:05
@zancas
Member

zancas commented Oct 16, 2025

I see several equivalence tests; how about some non-equivalence tests proving sensitivity to variance in the txout sets?
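Such a non-equivalence test could look like this minimal sketch, where `snapshot_hash` is a stand-in digest for illustration only, not the PR's actual `hash_serialized` implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in digest over (txid, vout) outpoints; illustrative only.
fn snapshot_hash(set: &[([u8; 32], u32)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    set.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Two UTXO sets that differ only in which output index is unspent:
    let a = [([1u8; 32], 0u32)];
    let b = [([1u8; 32], 1u32)];
    // A non-equivalence test asserts the digests differ.
    assert_ne!(snapshot_hash(&a), snapshot_hash(&b));
}
```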

- `UTXO set`: a finite multimap keyed by outpoints `(txid, vout)` to outputs `(value_zat, scriptPubKey)`, where:

- `txid` is a 32-byte transaction hash (internal byte order).
- `vout` is a 32-bit output index (0-based).
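The definition above might be modeled like this (illustrative types only; the names below are not taken from the PR):

```rust
use std::collections::BTreeMap;

// Illustrative outpoint key: the full (txid, vout) pair.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct OutPoint {
    txid: [u8; 32], // 32-byte transaction hash, internal byte order
    vout: u32,      // 0-based output index
}

#[derive(Clone, PartialEq, Debug)]
struct Output {
    value_zat: u64,
    script_pub_key: Vec<u8>,
}

// Keying by the full outpoint lets one transaction contribute several
// entries, matching the "multimap keyed by outpoints" wording.
type UtxoSet = BTreeMap<OutPoint, Output>;

fn main() {
    let mut set = UtxoSet::new();
    let txid = [7u8; 32];
    set.insert(OutPoint { txid, vout: 0 }, Output { value_zat: 1, script_pub_key: vec![] });
    set.insert(OutPoint { txid, vout: 1 }, Output { value_zat: 1, script_pub_key: vec![] });
    assert_eq!(set.len(), 2); // same txid, distinct outpoints
}
```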
Member Author

@dorianvp dorianvp Oct 18, 2025

Leaving a note here:

If we serialize per unspent as txid || value || script and a transaction contains two outputs with identical (value, script), then two different UTXO sets that differ only by which index is unspent will serialize to the same bytes (and hash).
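A sketch of why the output index belongs in the per-UTXO serialization (`serialize_utxo` is a hypothetical helper, not PR code):

```rust
// Hypothetical per-UTXO serialization that includes the output index,
// removing the ambiguity described above.
fn serialize_utxo(txid: &[u8; 32], vout: u32, value: u64, script: &[u8]) -> Vec<u8> {
    let mut bytes = txid.to_vec();
    bytes.extend_from_slice(&vout.to_le_bytes()); // disambiguates identical outputs
    bytes.extend_from_slice(&value.to_le_bytes());
    bytes.extend_from_slice(script);
    bytes
}

fn main() {
    let txid = [0xabu8; 32];
    // Two outputs of one transaction with identical (value, script):
    let a = serialize_utxo(&txid, 0, 5_000, &[0x51]);
    let b = serialize_utxo(&txid, 1, 5_000, &[0x51]);
    // Without the vout field these byte strings would be equal.
    assert_ne!(a, b);
}
```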

Contributor

vout is a misleading name for this; I would call it output_index because vout is generally used to refer to the vector of outputs of a transaction; I was confused by this name before I got to this line.


The snapshot **MUST** be ordered as follows, independent of the node’s in-memory layout:

1. Sort by `txid` ascending, comparing the raw 32-byte values as unsigned bytes.
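The ordering rule can be expressed directly in Rust; note the tie-break on the output index is an assumption here, since only rule 1 is quoted:

```rust
fn main() {
    // Outpoints in arbitrary in-memory order.
    let mut outpoints: Vec<([u8; 32], u32)> = vec![
        ([0xff; 32], 0),
        ([0x00; 32], 1),
        ([0x00; 32], 0),
    ];
    // Rust compares byte arrays lexicographically as unsigned bytes, which is
    // exactly the required txid ordering; ties fall back to the output index.
    outpoints.sort_by(|a, b| a.0.cmp(&b.0).then(a.1.cmp(&b.1)));
    assert_eq!(outpoints[0], ([0x00; 32], 0));
    assert_eq!(outpoints[1], ([0x00; 32], 1));
    assert_eq!(outpoints[2], ([0xff; 32], 0));
}
```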
Contributor

This seems like a bad serialization, because it requires recomputation over the entire UTXO set whenever a new block is received. The UTXO set can be very large; it would be much better to choose a snapshot protocol where snapshot hashes can incrementally build on the snapshot hash prior to the addition of a new block.
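One shape such an incremental scheme could take, as a toy sketch only (a real design would use a cryptographic construction such as MuHash rather than the standard library's hasher):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy incremental set digest: XOR of per-element hashes. Adding or removing
// a UTXO updates the digest in O(1), independent of set size and order.
fn elem_hash(e: &(u64, u32)) -> u64 {
    let mut h = DefaultHasher::new();
    e.hash(&mut h);
    h.finish()
}

fn main() {
    let mut digest = 0u64;
    // A new block adds two UTXOs:
    digest ^= elem_hash(&(1, 0));
    digest ^= elem_hash(&(2, 1));
    // Spending (removing) a UTXO re-XORs its hash out:
    digest ^= elem_hash(&(1, 0));
    assert_eq!(digest, elem_hash(&(2, 1)));
}
```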

Contributor

If the UTXO set were stored in a B-tree data structure that internally kept Merkle hashes at the nodes, then it might be okay to use the Merkle root of that data structure for the snapshot identifier. It would need to be the case that the fanout of the B-tree and the insertion semantics were well-specified to ensure that everyone uses the same hashing approach.

One possibility that would allow for this to work as-specified would be to use a separate B-tree (implementing a set, rather than a map) for producing the hashes; since the txid commits to the effects of each transaction, one could build the snapshot identifier alongside the actual data, but building that identifier in parallel would have a risk of data inconsistencies with the primary store.

In general, I feel like the UTXO set would be best represented as a persistent data structure with good amortized append costs.

@arya2 arya2 Oct 28, 2025

The UTXO set can be very large; it would be much better to choose a snapshot protocol where snapshot hashes can incrementally build on the snapshot hash prior to the addition of a new block.

sparse-merkle-tree could be useful here, Zaino could:

  • Implement Value for a struct representing the transaction output data to which hash_serialized is committing,
  • Implement StoreReadOps/StoreWriteOps for on-disk storage of the tree,
  • Update the tree and a cache with the other fields in TxOutSetInfo when Zaino is indexing blocks, and
  • Return the cached TxOutSetInfo from the RPC method.

@arya2 arya2 left a comment

There are some performance and concurrency issues with this design. It would be better to update a cached TxOutSetInfo as Zaino indexes blocks and to use a merkle tree to update the hash_serialized field.


let mut state = state.clone();

let zebra_state::ReadResponse::Tip { 0: tip } = state

Nitpick: Why destructure it this way rather than as a tuple struct?

Suggested change
let zebra_state::ReadResponse::Tip { 0: tip } = state
let zebra_state::ReadResponse::Tip(tip) = state

}
}

/// Fetches all UTXOs from the state service, and returns them in a map.

Loading every UTXO into memory shouldn't be necessary and could require more memory than is available on some systems.
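A streaming alternative could fold each UTXO into the hasher as it is read, keeping memory use constant; this is a sketch with a stand-in iterator, not the PR's code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    // Stand-in for a paginated read of (txid, vout) pairs from the state service.
    let utxo_stream = (0u32..1_000).map(|i| ([i as u8; 32], i));

    // Fold each UTXO into the hasher as it arrives: O(1) memory,
    // instead of materializing the whole set in a map first.
    let mut hasher = DefaultHasher::new();
    let mut count = 0u64;
    for (txid, vout) in utxo_stream {
        (txid, vout).hash(&mut hasher);
        count += 1;
    }
    assert_eq!(count, 1_000);
    println!("digest = {:x}", hasher.finish());
}
```

This only works as-is because the stand-in stream yields UTXOs in the canonical order; an unordered source would still need sorting or an order-independent digest.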


## Motivation

Different nodes (e.g., `zcashd`, Zebra, indexers) may expose distinct internals or storage layouts. Operators often need a cheap way to verify “we’re looking at the same unspent set” without transporting the entire set. A canonical, versioned snapshot hash solves this.

I see that zcashd had this method returning a hash_serialized already, but, why is it not enough to check that the block hashes match?


## Inputs

To compute the snapshot hash, the implementation needs:

Why include anything other than the UTXOs as inputs in the snapshot hash? Shouldn't we already know that we're looking at the same UTXO set if the best block hashes match?



Comment on lines +893 to +894
let blk = state
.call(ReadRequest::Block(HashOrHeight::Height(Height(h))))

It could be faster to read just the unspent outputs by iterating over `TransactionLocation { height: 0, index: 0 }..={ height: target_height, index: OutputIndex::MAX }` in the `utxo_by_out_loc` column family, and then read any remaining blocks to add any UTXOs from blocks that were committed during the iteration.

};

for tx in block.transactions.clone() {
let txid = tx.hash();

It may be faster to look up the transaction hashes in the `tx_loc_by_hash` column family.

async fn get_txout_set_info(&self) -> Result<GetTxOutSetInfo, Self::Error> {
let txouts = Self::get_txout_set(&self.read_state_service).await.unwrap();

let best_block_hash = self.get_best_blockhash().await.unwrap().hash();

get_txout_set() may not (and on Mainnet, generally won't) read transactions from all blocks up to the best block hash being read here. It would be better to return the tip hash from get_txout_set().

@dorianvp
Member Author

dorianvp commented Nov 3, 2025

Hey folks, thanks for reviewing this! I'll be addressing these comments soon.

@zancas
Member

zancas commented Nov 4, 2025

@dorianvp this RPC involves specification work that is scheduled for the next milestone. I am requesting to move this RPC to that milestone.

@zancas zancas assigned AloeareV and unassigned dorianvp Nov 7, 2025


Development

Successfully merging this pull request may close these issues.

1.14: Add Zcash RPC gettxoutsetinfo

5 participants