New feature: Get top n columns by michaelkonstantinou · Pull Request #55 · delftdata/valentine

michaelkonstantinou · 2023-07-11T01:21:53Z

Resolves #52

As stated in issue #52, it would be useful to be able to get the top n most similar columns when analyzing the data. Since the issue is still open, I decided to add this feature myself, as I could use it during my data preprocessing.

Solution

This pull request adds two new methods to the metrics.py file:

get_top_n_columns, which returns a complete dictionary of the top n columns for each column in the two datasets
For instance: {('Table1', 'Authors'): ['Authors', 'AccessList']}
get_top_n_columns_for_column, which could be more useful in my opinion: it returns the top columns for one specific column
For instance: What are the top two matches for column 'Access' of table 1?
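To illustrate the return shape described above, here is a minimal sketch, not the code from this pull request. It assumes the matches are a dictionary keyed by pairs of (table, column) tuples with a float similarity score as the value; the function name top_n_column_names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (table name, column name)


def top_n_column_names(
    matches: Dict[Tuple[Column, Column], float], n: int
) -> Dict[Column, List[str]]:
    """For every column on either side of a match, list its n best partners."""
    partners = defaultdict(list)
    for (left, right), score in matches.items():
        # Record the pairing from both points of view.
        partners[left].append((score, right[1]))
        partners[right].append((score, left[1]))
    return {
        column: [
            name
            for _, name in sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]
        ]
        for column, scored in partners.items()
    }


# Hypothetical scores, for illustration only.
matches = {
    (("Table1", "Authors"), ("Table2", "Authors")): 0.9,
    (("Table1", "Authors"), ("Table2", "AccessList")): 0.4,
    (("Table1", "Authors"), ("Table2", "Year")): 0.1,
    (("Table1", "Access"), ("Table2", "AccessList")): 0.7,
}
top = top_n_column_names(matches, n=2)
print(top[("Table1", "Authors")])
```

Asking for a single column, as get_top_n_columns_for_column does, then amounts to looking up one key in this dictionary.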

I am not quite sure what exactly the OP wanted or what the team would prefer, but at least a starting point is established, and it can easily be modified if more information should be added (e.g. a float score next to each column name).

Additional changes

Added a new example to demonstrate the new feature. It uses a different algorithm, though, because COMA also compares the column names, so in this case its results might not be very informative.

Notes

I didn't find an existing test case for metrics, which is why I didn't add one
The code style complies with PEP 8

I hope this is useful. Let me know if you prefer any changes or any additional functionality.

Archer6621 · 2023-09-18T19:54:34Z

Hello @Mikhail-Konstantinou, first of all, thank you for your contribution. It's great to see contributions being made from the outside.

Overall the code looks good!

I have a couple of comments for you to take into consideration:

I feel like the two methods have significant overlap in functionality. It would make sense if get_top_n_columns used get_top_n_columns_for_column somehow. Another option is to give get_top_n_columns a keyword argument that allows you to specify a list of columns of df1 to use for the top n in df2 (and by default have it pick all columns), so you could get rid of the second method, which has an overly long name :)
Maybe it's nice to have a list of dicts, with column name as key and score as value, instead of just a list of column names. Doing this gives insight into the distribution of the scores. I think this is also what you suggested with the "add float value" remark.

EDIT: After a second look I dropped some of my comments, so I've adjusted the post.


michaelkonstantinou · 2023-10-22T19:10:58Z

@Archer6621

Hello and thanks for your input. I believe the final changes address both of the suggestions you mentioned.

Indeed, get_top_n_columns_for_column is long and not needed anymore. I refactored get_top_n_columns to accept a list of keys. If no keys are given, it returns all columns by default, as you suggested. However, I changed it a bit: instead of choosing only which columns from df1 you want, you can choose columns from either df1 or df2. I believe the latter is stronger, more flexible and cleaner.
Yes, it now returns a list of dicts. The column name is the key and the score is the value.
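The refactored behaviour described above could look roughly like the following sketch. It is an illustration under assumptions, not the merged implementation: the matches are assumed to be a dictionary keyed by pairs of (table, column) tuples with float scores, and the name top_n_with_scores, its keys parameter, and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Column = Tuple[str, str]  # (table name, column name)
Matches = Dict[Tuple[Column, Column], float]


def top_n_with_scores(
    matches: Matches, n: int, keys: Optional[List[Column]] = None
) -> Dict[Column, List[Dict[str, float]]]:
    """Return, per column, a list of {partner column name: score} dicts.

    If keys is None, every column from either table is included;
    otherwise only the listed (table, column) keys are reported.
    """
    wanted = None if keys is None else set(keys)
    result: Dict[Column, List[Dict[str, float]]] = {}
    for (left, right), score in matches.items():
        # Each match is visible from both of its columns.
        for column, partner in ((left, right), (right, left)):
            if wanted is None or column in wanted:
                result.setdefault(column, []).append({partner[1]: score})
    for scored in result.values():
        # Highest score first, then keep only the first n entries.
        scored.sort(key=lambda entry: next(iter(entry.values())), reverse=True)
        del scored[n:]
    return result


# Hypothetical scores, for illustration only.
sample_matches = {
    (("Table1", "Authors"), ("Table2", "Authors")): 0.9,
    (("Table1", "Authors"), ("Table2", "AccessList")): 0.4,
    (("Table1", "Authors"), ("Table2", "Year")): 0.1,
    (("Table1", "Access"), ("Table2", "AccessList")): 0.7,
}
selected = top_n_with_scores(sample_matches, n=2, keys=[("Table2", "AccessList")])
all_columns = top_n_with_scores(sample_matches, n=2)
print(selected)
```

Because the keys are full (table, column) tuples, the same argument can select columns from either table, which is the flexibility argued for above.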

PS. I checked the conflicting files that GitHub complains about, and the conflicts are not related to this function. I believe you can merge it easily by selecting the line of code you think is correct.

michaelkonstantinou added 2 commits July 11, 2023 03:01

Added: Return top n columns metrics

43c2986

Updated: Comments

6c8ac55

michaelkonstantinou added 2 commits October 22, 2023 19:19

Added: Score value in get_top_n_columns result

ae8442d

Refactored: get_top_n_columns accepts keys and get_top_n_columns_for_…

4976891

…column has been deleted

Archer6621 mentioned this pull request Jan 24, 2024

API Refactor - MatcherResults and metrics #70

Merged

michaelkonstantinou wants to merge 4 commits into delftdata:master from michaelkonstantinou:master
