Skip to content

WIP: Functional Cached Schema Visitors#1871

Open
Milyardo wants to merge 3 commits intodisneystreaming:series/0.19from
Milyardo:zpowers/pureCaching
Open

WIP: Functional Cached Schema Visitors#1871
Milyardo wants to merge 3 commits intodisneystreaming:series/0.19from
Milyardo:zpowers/pureCaching

Conversation

@Milyardo
Copy link

@Milyardo Milyardo commented Jan 14, 2026

This is a WIP PR to share an approach to traversing schemas in a more principled way. The current status quo with schema visitors have a few problems that we look to address:

1.) Address bugs around the caching of targets of compilation via side effects
2.) Improve stack safety by not allowing visitors to recursively call themselves

To these ends this PR implements traversal in 2 passes. The first pass is a corecursive process that unfolds a schema into a rose tree of schema nodes, terminating recursion in the schemas by ending traversal when a node is already been seen in current path from the root node. The second pass traverses the finite tree of schemas nodes in post order, providing lazy references to compilation targets to the visitor.

The first pass is implemented in TreeSchemaVisitor.unfoldToTree. This method recurses from the root node down into child nodes as long a child node has not been seen before. The following recursive tree defined in smithy:

union Tree {
  tree: TreeNode
  leaf: LeafNode
}

structure TreeNode {
  @required
  left: Tree
  @required
  right: Tree
}

structure LeafNode {
  @required
  value: Integer
}

will result in a tree that looks like this:

────Tree: LazySchema
    └── Tree: UnionSchema
        ├── TreeNode: BijectionSchema
        │   └── TreeNode: LazySchema
        │       └── TreeNode: StructSchema
        │           ├── Tree: LazySchema
        │           │   └── Tree: UnionSchema
        │           │       ├── TreeNode: BijectionSchema
        │           │       └── LeafNode: BijectionSchema
        │           │           └── LeafNode: StructSchema
        │           │               └── Integer: PrimitiveSchema
        │           └── Tree: LazySchema
        │               └── Tree: UnionSchema
        │                   ├── TreeNode: BijectionSchema
        │                   └── LeafNode: BijectionSchema
        │                       └── LeafNode: StructSchema
        │                           └── Integer: PrimitiveSchema
        └── LeafNode: BijectionSchema
            └── LeafNode: StructSchema
                └── Integer: PrimitiveSchema

The second pass is implemented in TreeSchemaVisitor.cachedPostOrder where during this post order traversal, an monadic cache effect is provided to the compilation function to fetch references to compilation targets the current target may depend on. Users are meant to implement the VisitorF trait instead of interacting which the cache directly, where VisitorF.apply finds any dependencies for end users and fetches references in the cache for them.

VisitorF is a an effectful visitor with a similar interface to existing visitors, however instead of being given a Schema for any child nodes, the visitor is given lazy references to targets instead. This makes it so schemas don't have to recursively call themselves in order to obtain dependencies. Example Hash and Show visitors have been implemented in TreeVisitorHash and TreeVisitorShow.

Below is a list of work to still be done in this PR:

  • Implement methods to compose VisitorFs
  • Implement more Visitors
  • Figure out what to do with older visitor implementations

@CLAassistant
Copy link

CLAassistant commented Jan 14, 2026

CLA assistant check
All committers have signed the CLA.

shapeId: ShapeId,
hints: Hints,
tag: CollectionTag[C],
member: Lazy[G[A]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a lot of cases, schemas visitors delegate to other schema visitors. We want to give implementors the freedom to do that, without having to bake the complexity into the G data structure.

I think removing the def apply(schema: Schema[A) : G[A] to prevent implementors from making an uncontrolled recursive call is smart, but at the same time the interface you propose constrains the implementation patterns too much.

To get a sense of the problem : you should try and implement Document encoders and/or decoders against your POC. The implementation of the map component needs to delegate to a key encoder/decoder, and therefore is a good example that should help validate the ergonomics of your proposed solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my preference would be to have the Cache type not carry the "codec" type parameter, and expose it to the implementers. It'd look like this :

trait Cache[F[_]]{
    def getOrCompute[Codec[_], A](visitor: Visitor[F, Codec], schema: Schema[A]) : F[Codec[A]] 
    def getOrComputeRecursive[Codec[_], A](visitor: Visitor[F, Codec], schema: Lazy[Schema[A]], makeCodec: Lazy[Codec[A]] => Codec[A]) : F[Codec[A]] 
}

And then the Visitor[F] would look like :

// NB, Zipper is Smithy4s' equivalent to cat's Applicative
abstract class Visitor[F[_] : Cache : Zipper, Codec[_]]{
   def list[A](shapeId: ShapeId, hints: Hints, member: Schema[A]) : F[Codec[List[A]]]
   // other methods mirroring the Schema ADT 
}

NOT EXPOSING A recursive apply like the current SchemaVisitor. This would nearly prohibit the implementors calling un-managed recursion, by really encouraging them to use the provided implicit Cache, passing this: Visitor to its methods (or another visitor should delegation be needed). The implementation entries stored in the Cache's internal map would be keyed by (Visitor, Schema), thus giving the flexibility to have visitors that would be able to delegate to other ones whilst making sure that in the same F program, we'd get a single computation for every codec needed within a service (or even across services)

* A Scala encoding of a rose tree, a tree data structure with a variable and unbounded number of branches per node.
* @see https://en.wikipedia.org/wiki/Rose_tree
*/
sealed trait Tree[+A] {
Copy link
Contributor

@Baccata Baccata Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used in anywhere outside of tests ? I'd rather not expand the API of core unless absolutely necessary

I don't know this is necessarily a good fit for the problem, conceptually speaking, in that it seems you'd be recomputing a tree for each new root. Like, if you have the following smithy service :

trait Foo {
   def op1(a: A) 
   def op2(b: B) 
   def op3 (c: C) 
}

where

case class B(a: A)
case class C(b: B) 

don't you end up recomputing trees for A a few times ? When computing codecs, datatypes shouldn't be considered in isolation, but rather as part of a larger whole. A type may be the root of a tree in the context of an operation, and a sub node in another.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rose trees are intermediate data structures that define how we recurse through a Schema they are constructed and immediately discarded after compilation, and some kind of data structure is necessary by the nature of the problem. If we were to solve this problem using recursion schemes Tree would be a our fix point.

Originally I had taken steps to avoid the construction of a intermediate data structure and basically implemented our own version of hylo. I found in trying to do so, I was basically defining a new library for recursion schemes, was implementing a bunch of machinery to be to avoid a problem that I wasn't sure existed in the first place.

So, I avoided the problem by making the recursive and corecursive traversal explicit, and made things easier for people follow along who weren't well versed in recursion schemes. In practice the allocations don't seem to make much a difference in compile times, but I plan on testing it against the entire schema registry at Disney sometime soon.

Copy link
Contributor

@Baccata Baccata Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and some kind of data structure is necessary by the nature of the problem

That's not necessarily true. I get how you'd reach this conclusion from the point of view that schema compilation is always akin to an hylomorphism, but modelling it this way has negative implications over the ergonomics available to implementors when delegating to other compilation is necessary. Ironically the initial design of smithy4s schema visitors was similar to what you're proposing here (granted, we weren't trying to think really hard about caching back then), but it ended up being much more annoying than the current status quo.

So the middle ground I'm after would be one that would allow some level of control from the user to do something with a member schema received as an input to the visitor method, without giving them control over the general recursion, and what I'm suggesting here (or better, a free-applicative version of that) would allow, I think, to get us that sweet spot.

case BijectionSchema(s, bijection) =>
underlying(BijectionSchema(this(s), bijection))
case LazySchema(suspend) =>
underlying(LazySchema(suspend.map(this.apply)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 very good catch !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants