Conversation
…es, there is no difference in the logic
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
| // row at index 'x-1' of dfA was removed | ||
| xPrev + 1 == x && yPrev + 1 != y -> { | ||
| comparisonDf = comparisonDf.concat( | ||
| dataFrameOf |
There was a problem hiding this comment.
strange formatting, can you lint this file?
There was a problem hiding this comment.
also probably best to add argument names to ComparisonDescription, "magic" argumentless constants are often a source of bugs :)
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
| internal class ComparisonDescription( | ||
| val rowAtIndex: Int, | ||
| val of: String, | ||
| val wasRemoved: Boolean?, |
There was a problem hiding this comment.
this results in true or null... what's wrong with false? Same below
There was a problem hiding this comment.
so... rows can either be removed or inserted, right? nothing else. Let's turn these two booleans in an enum as well then
There was a problem hiding this comment.
I choose for null instead of false to remark that if wasInserted is true, wasRemoved column does not even make sense (and vive-versa).
There was a problem hiding this comment.
I see; well, in that case an enum makes more sense I think
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
| val of: String, | ||
| val wasRemoved: Boolean?, | ||
| val wasInserted: Boolean?, | ||
| val afterRow: Int?, |
There was a problem hiding this comment.
maybe call this insertedAfterRow or something more expressive. That explains why it's null if wasRemoved == true
There was a problem hiding this comment.
Correct me if I'm wrong, but honestly, this seems a thing AI would do wrong. If you're using AI, that's okay; however, you remain the one responsible for the code.
At first you had:
val wasRemoved: Boolean?
val wasInserted: Boolean?
val afterRow: Int?I remarked here: #1556 (comment) both values had two options: either null or true. This is an odd thing to do with booleans which, as you know, have 2 values already: true or false, however, you explained why, so that makes slightly more sense now :)
What I don't understand is why you now have two nullable RowOfComparisons:
val wasRemoved: RowOfComparison?
val wasInserted: RowOfComparison?What would these even contain? A row is either 'removed' (WAS_REMOVED) or 'inserted' (WAS_INSERTED_AFTER_ROW), right? So why not just make one column called modification: RowOfComparison?
Next, I meant you to rename afterRow to insertedAfterRow, not wasInserted.
There was a problem hiding this comment.
Honestly, no AI at all was used in this code. I misunderstood #1556 (comment),
I thought that I had to create an enum whose constants had an exact correspondance with boolean values true and false (so that the schema could remain the same) .
However, now I understood what You meant :), I proceed to correct.
There was a problem hiding this comment.
Then I take back what I said :) Thanks for the explanation!
| val wasRemoved: Boolean?, | ||
| val wasInserted: Boolean?, | ||
| val afterRow: Int?, | ||
| ) : DataRowSchema |
There was a problem hiding this comment.
I wonder if we could include the modified DataRow<*> in the resulting DataFrame as well. That could make it a bit easier
There was a problem hiding this comment.
Myers difference algorithm exploits the idea of comparison in a boolean sense,
a member (row) of the data structure (df) is equal or not-equal to another member.
In this logic a modified row is represented by two DataRow<ComparisonDescription> ,
One row is like: row n of dfA was removed and the other: row m of dfB was inserted after row n-1 (of dfA).
There was a problem hiding this comment.
Imo representing explicitly modified rows means customing Myers Alg logic by introducing a 3-possible-output non boolean logic of comparison: Equal, Non Equal, Similar. Similar means that compared row differ for a limited number of elements (that may be proportional to row's length).
In a lower abstraction level, 'similar' looks like a 'flagged' diagonal move in the edit script graph
-> list of Pair would be no more enough to represent the result, we would need a list of 3-ple.
Anyway, imo, this slution has the following weakness: the result is less neutral because the previous definition of 'Similarity' can't determine with certainty wheter a row was actually modified.
There was a problem hiding this comment.
I'm not (yet) asking for in-row differences. Simply for adding the original row to a new column when that row was "Not Equal", so either "removed" or "inserted". A ComparisonDescription like row n of dfA was removed doesn't really help me if I can't see what "row n of dfA" actually contains. Similar to row m of dfB.
Does that make sense?
There was a problem hiding this comment.
Yes, it would make the comparison output more independent. I can try to implement it.
There was a problem hiding this comment.
Yes, it would make the comparison output more independent. I can try to implement it.
Done, i added a DataRow<T> column to ComparisonDescription<T> representing the content of the row.
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
|
If there is anything I can do, I am at full disposal :) |
Sure! However, in the team we currently give more priority to fixing bugs for the 1.0 release rather than adding new features like this. This will likely be a 1.1 thing, so it has lower priority for us; we'll likely revisit it later. However, if you want, you could explore how DataFrame users would actually use this new functionality. Your PR only provides internal implementation code, after all. Similar to |
I think that comparing dataframes could be very usefull when the goal is to monitor something in the time. |
Yeah I think that makes sense :) maybe give it a try with a draft API and a notebook using it (you can publish DF to maven Local and run it in a notebook using: USE { repositories(mavenLocal()) }
%use dataframe(v=1.0.0-dev)) |
FIXES #658
It makes possible to compare DataFrame by exploiting Myers difference algotithm whose cost is O((M+N)*D) .
M is length of dfA, N is length of dfB, D is length of shortest edit script to get B from A.
Returns a DataFrame< ComparisonDescription >,
ComparisonDescription is a schema created specifically for this use case.
It comes with a proper test case.
About Myers difference algotithm:
https://neil.fraser.name/writing/diff/myers.pdf