Conversation
It isn't used publicly anywhere, and it adds more API surface area to worry about. This is still backwards compatible, as self._len is still saved/restored byt pickle the same way, now we just access it directly instead of going via __len__().
We iterate through the input more than once, so an Iterable is insufficient.
Before, we never checked for this, so when we called list.index() we just got back the first instance. I swapped us over to a lookup table because that makes more sense, but I wanted to also add this check because if someone had duplicate names, this would quitely change the behavior underneath someone: before it gave the first index, now it gives the last.
These variables 1. you can't currently create them because they inherit from Variable, not FieldVariable, so they don't appear in the datamodel.VARIABLE_CLASSES lookup table 2. You SHOULDN'T be able to instantiate these variables directly from a variable definition
Before the flow was: 1. create all the primary variables EXCEPT for the interactions 2. Expand those. 3. go back through and NOW create the InteractionVariables I found this confusing. We parse the var definitions twice in separate places, and InteractionVariables are a special case in the first pass. The other point of this is to make it so that there is one list of Variable instances which is the single source of truth. Once we do this then we can turn all the other private instance variables of DataModel into @functools.cached_property's, which will make it much easier to ensure pickle compatibility in the future, because only one variable needs to get saved and restored
| all_variables.append(variable_object) | ||
|
|
||
| if only_custom: | ||
| no_blocking_variables = all( |
There was a problem hiding this comment.
Continuing from #1098 (comment)
I think it would be better if this was more duck-typed though, such as
no_blocking_variables = not any(len(v.predicates) for v in variables).
That would be a slight change though, since Variable.predicates is filled with an ExistsPredicate for any var def with the has missing flag. Perhaps this is OK? But I don't think so.
I think really there needs to be a more official API where variables declare the predicates they export.
There was a problem hiding this comment.
the code that add the ExistsPredicate should only really be on FieldPredicates as it depends on there being a field.
https://github.com/dedupeio/dedupe/blob/main/dedupe/variables/base.py#L37
if we move that code, then we could do pretty much what you are saying.
no_blocking_variables = not any(len(v.predicates) for v in variables)we might also want to have a check like
only_exists = all(isinstance(pred, ExistsPredicate) for v in variables for pred in v.predicates)
if only_exists:
warning.warn('You are only using the Exists Predicate which is usually pretty bad')| all_variables += interactions(variable_definitions, self.primary_variables) | ||
| variables = typify_variables(variable_definitions) | ||
| non_interactions: list[FieldVariable] = [ | ||
| v for v in variables if not isinstance(v, InteractionType) # type: ignore[misc] |
There was a problem hiding this comment.
This is another case where I think a Variable should have a more official API for if they export if they are an Interaction or not. Maybe should think of a better name for this too. Atomic vs derived? leaf vs nonleaf? base vs derived?
| no_blocking_variables = all( | ||
| isinstance(v, (CustomType, InteractionType)) for v in variables | ||
| ) | ||
| if no_blocking_variables: |
There was a problem hiding this comment.
Perhaps this check should happen in DataModel.init, and not in this utility function?
There was a problem hiding this comment.
it seems of a piece with the "Incorrect variable specification: variable" error?
| var_d = {var.name: var for var in primary_variables} | ||
|
|
||
| def _expanded_interactions(variables: list[Variable]) -> list[InteractionType]: | ||
| field_vars = {var.name: var for var in variables if isinstance(var, FieldVariable)} |
There was a problem hiding this comment.
Again, this should feels like it needs a more official API, or should be responsibility of the InteractionType
There was a problem hiding this comment.
i can think of a few designs here, but this code is actually doing two things for us, both of which are kind of important.
- isolating the interaction features for us, so we can put them in the right place in the sequence of features.
- actually resolving the feature indices.
seems like we kind of need to do both.
| all_variables.append(variable_object) | ||
|
|
||
| if only_custom: | ||
| no_blocking_variables = all( |
There was a problem hiding this comment.
the code that add the ExistsPredicate should only really be on FieldPredicates as it depends on there being a field.
https://github.com/dedupeio/dedupe/blob/main/dedupe/variables/base.py#L37
if we move that code, then we could do pretty much what you are saying.
no_blocking_variables = not any(len(v.predicates) for v in variables)we might also want to have a check like
only_exists = all(isinstance(pred, ExistsPredicate) for v in variables for pred in v.predicates)
if only_exists:
warning.warn('You are only using the Exists Predicate which is usually pretty bad')| no_blocking_variables = all( | ||
| isinstance(v, (CustomType, InteractionType)) for v in variables | ||
| ) | ||
| if no_blocking_variables: |
There was a problem hiding this comment.
it seems of a piece with the "Incorrect variable specification: variable" error?
| result.extend(variable.higher_vars) | ||
| else: | ||
| result.append(variable) | ||
| return result |
There was a problem hiding this comment.
what if we added an iter method to Variable so that we could just do something like.
features = []
for variable in variables:
for feature in variable:
features.append(feature)
There was a problem hiding this comment.
I think this is a good idea in general, but I would want to name it to differentiate between predicates and features, since Variables supply both. Perhaps Variable.features?
| var_d = {var.name: var for var in primary_variables} | ||
|
|
||
| def _expanded_interactions(variables: list[Variable]) -> list[InteractionType]: | ||
| field_vars = {var.name: var for var in variables if isinstance(var, FieldVariable)} |
There was a problem hiding this comment.
i can think of a few designs here, but this code is actually doing two things for us, both of which are kind of important.
- isolating the interaction features for us, so we can put them in the right place in the sequence of features.
- actually resolving the feature indices.
seems like we kind of need to do both.
|
|
||
|
|
||
| class DerivedType(Variable): | ||
| type = "Derived" |
There was a problem hiding this comment.
I might be wrong but my understanding is:
This would only be used if someone passes in a variable definition like {"type": "Derived"}, and then we would create an instance of DerivedType, but that doesn't make sense because DerivedType is an abstract base class and shouldn't ever be created directly.
|
|
||
|
|
||
| class MissingDataType(Variable): | ||
| type = "MissingData" |
There was a problem hiding this comment.
Same reason as for DerivedType
|
thanks for working through this @NickCrews. this is some of the first code i wrote for project and there's a lot i didn't know! |
Split off from #1098
This makes it easier to create Variable instances from variable definitions, because Variables are responsible for both predicates and distance functions. So, if we split DataModel into these two separate parts, we're going to need to do something like
so it would be important that we can create our variable instances once, and then pass those around to the other parts.