Skip to content

The cost is not optimal #89

@aeft

Description

@aeft
from_tree = json.build_tree(["de", 1])
to_tree = json.build_tree([2])

print(from_tree.diff(to_tree).edited_cost())

for edit in from_tree.get_all_edits(to_tree):
    print(edit)

output:

4                                                                                                                                                                                                       
Insert(to_insert=IntegerNode(2), insert_into=ListNode((StringNode('de'), IntegerNode(1))))                                                                                                              
Remove(StringNode('de'), remove_from=ListNode((StringNode('de'), IntegerNode(1))))
Remove(IntegerNode(1), remove_from=ListNode((StringNode('de'), IntegerNode(1))))

But the minimal cost should be 3.
output:

3                                                                                                                                                                                                       
Remove(StringNode('de'), remove_from=ListNode((StringNode('de'), IntegerNode(1))))                                                                                                                      
Match(match_from=IntegerNode(1), match_to=IntegerNode(2), cost=1)

I found the problem is in EditDistance._best_match:

dcost = (self.costs[row - 1][col - 1], self.path_costs[row - 1][col - 1])
lcost = (self.costs[row][col - 1], self.path_costs[row][col - 1])
ucost = (self.costs[row - 1][col], self.path_costs[row - 1][col])
diag_is_best = dcost <= lcost and dcost <= ucost
if diag_is_best:
    make_distinct(self.edit_matrix[row][col], self.edit_matrix[row][0], self.edit_matrix[0][col])
if diag_is_best and \ 
        self.edit_matrix[row][col].bounds() < self.edit_matrix[row][0].bounds() and \
        self.edit_matrix[row][col].bounds() < self.edit_matrix[0][col].bounds():  # condition1
    brow, bcol, edit = row - 1, col - 1, self.edit_matrix[row][col]
elif ucost <= dcost:  # condition2
    brow, bcol, edit = row - 1, col, self.edit_matrix[row][0]
else: # condition3
    brow, bcol, edit = row, col - 1, self.edit_matrix[0][col]
self.path_costs[row][col] = self.path_costs[brow][bcol] + 1
self.costs[row][col] = self.costs[brow][bcol] + edit.bounds().upper_bound
# In the above example, when we call _best_match(1, 1).
# value information:
# self.edit_matrix[row][col].bounds() = (2,2)
# self.edit_matrix[0][col].bounds() = (2,2)
# ucost = (2, 1)
# dcost = (0, 0)
# lcost = (1, 1)

brow = row, bcol = col-1 (condition3 success) because self.edit_matrix[row][col].bounds() == self.edit_matrix[0][col].bounds() (condition1 fail) and ucost > dcost (condition2 fail).
But actually self.costs[row][col-1] + self.edit_matrix[0][col].bounds().upper_bound > self.costs[row-1][col-1] + self.edit_matrix[row][col].bounds().upper_bound. The optimal self.costs[row][col] should be self.costs[row-1][col-1] + self.edit_matrix[row][col].bounds().upper_bound.

Here is a fix (success in the above example, may need further testing):

make_distinct(self.edit_matrix[row][col], self.edit_matrix[row][0], self.edit_matrix[0][col])
dcost = (self.costs[row - 1][col - 1] + self.edit_matrix[row][col].bounds().upper_bound, self.path_costs[row - 1][col - 1])
lcost = (self.costs[row][col - 1] + self.edit_matrix[0][col].bounds().upper_bound, self.path_costs[row][col - 1])
ucost = (self.costs[row - 1][col] + self.edit_matrix[row][0].bounds().upper_bound, self.path_costs[row - 1][col])
if dcost <= lcost and dcost <= ucost:
    brow, bcol, edit = row - 1, col - 1, self.edit_matrix[row][col]
elif ucost <= dcost and ucost <= lcost:
    brow, bcol, edit = row - 1, col, self.edit_matrix[row][0]
else:
    brow, bcol, edit = row, col - 1, self.edit_matrix[0][col]
self.path_costs[row][col] = self.path_costs[brow][bcol] + 1
self.costs[row][col] = self.costs[brow][bcol] + edit.bounds().upper_bound

In summary, we should use the final real cost to guide transition. Like the canonical implementation of the Levenshtein distance metric:

dist[row][col] = min(dist[row - 1][col] + 1,
                                 dist[row][col - 1] + 1,
                                 dist[row - 1][col - 1] + cost)

If this is a bug, I would be happy to provide a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions