Author reconstruction for 'GitHub' user#53
Conversation
This allows for reconstruction of correct commit author if user is github Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also added one comment for clarity Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
|
This PR should also be ready for review. I was unable to test my changes in python 2, but since we are porting to python 3 in the near future I hope this is not an issue. I do also have the tested python 3 code locally, which could make @maxloeffler work a bit easier. |
There was a problem hiding this comment.
Pull Request Overview
This PR addresses issue #52 by implementing author reconstruction for events triggered by the 'GitHub' user. The main purpose is to capture and preserve commit author information in event data so that the actual commit author can be reconstructed during post-processing, even when the event appears to be triggered by the generic 'GitHub' user.
- Adds commit author information to
event_info_2field forcommit_addedevents - Implements connected events filtering and matching logic to handle related issue connections
- Updates author post-processing to use the preserved author information as a fallback when commit data is unavailable
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| issue_processing/issue_processing.py | Adds author preservation logic, connected events handling, and updates user processing to include commit author information |
| author_postprocessing/author_postprocessing.py | Implements fallback author reconstruction using preserved author names when commit hash lookup fails |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
bockthom
left a comment
There was a problem hiding this comment.
Could you please address the following issues, as well as the issues previously pointed out by Copilot, as well as the other unresolved one regarding necessary comments from a previous review of mine?
Thank you!
|
And the copyright header in |
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
also save merge commits reconstruction of connected events is done by first saving all connected events that occured at the same time. Then, it is possible to match connected events iff: - half of the involved issues are equal, meaning that one issue is connected to multiple others - half rounded up of the involved isses are equal, meaning that we have one external connected event and then the previous case with the remaining issues Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
since data is modified in-place, return of input data is not needed Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
ALso add commit hash if closed by commit Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also rename 'new feature' to 'feature' Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also remove duplicates from type list Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
using empty line reserved for jira components Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also added copyright header Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also minor fixes and removal of math.ceil Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
comments now each have a boolean field that describes whether the comment contains a suggestion or not Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
dicts for reconstructing connected events are now better explained and the comments do not disruot the workflow in the run function anymore Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
includes: - updated comments - spelling mistake - fix for potential crash if script is used on old data Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
author postprocessing now also contains a list of known copilot use names that can be extended to unify more different copilot users Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
the events 'copilot_work_started' and 'copilot_work_finished' now always have the standard copilot user data Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Method doc updated to reflect new functionality Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
previously, the creator of the issues was falsely matched to the connected event instead of the user triggering the event Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
| ## | ||
| # GLOBAL VARIABLES | ||
| ## | ||
|
|
||
| # global variable containing all known copilot users and the name and mail adress copilot users will be assigned | ||
| known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"} | ||
| copilot_unified_name = "Copilot" | ||
| copilot_unified_email = "copilot@example.com" | ||
|
|
There was a problem hiding this comment.
Can't we move this to a common file? At the moment, the "Copilot" user is used twice: Here and in issue_processing.py.
We could create a new directory called "GitHub_user_utils", and a file "GitHub_user_utils.py" where we store this information. Then we could just import these global variables (those that are actually used in a script) in each of the scripts:
from GitHub_user_utils import known_copilot_users (in issue_processing)
from GitHub_user_utils import known_copilot_users, copilot_unified_name, copilot_unified_email'' (in author_postprocessing)
Maybe we can also move the github_user and github_email constants (see line 111 below here in this file) and the helper function is_github_noreply_author there, not to have different user information distributed among different files.
| if unify_copilot_users and event[9] in known_copilot_users: | ||
| event[9] = copilot_unified_name | ||
| event[10] = copilot_unified_email |
There was a problem hiding this comment.
(1) Can we add a log statement here to state that we unify the copilot users? Similar to the ones removing GitHub user?
(2) Is this the only place where copilot users can occur? Shouldn't we do this to the authors_list and commits_list as well when copilot creates commits? Not sure about the other data sources as well, please check.
|
One more issue: The anonymization script cannot handle our manually introduced Copilot user as it is not part of the authors.list file: Solution: Add Copilot to the authors.list if it is not present there (during author_postprocessing). However, we might also think about whether we ignore it in anonymization. Please check whether we ignore the GitHub user in anonymization or not. |
| ## | ||
|
|
||
| # global variable containing all known copilot users and the name and mail adress copilot users will be assigned | ||
| known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"} |
There was a problem hiding this comment.
This is not sufficient: The ID service may remove special characters such as [ or ]. Thus, we might not be able to merge them correctly.
However, I also want this to be as independent as possible from the ID service., which is why I want to keep the current spelling variant that contains [ and ].
Suggestion 1: Add both variants: "copilot-pull-request-reviewer[bot]", "copilot-pull-request-reviewerbot"
Suggestion 2: As these square brackets are commonly used, we might only add it once to the list and then add a function that automatically generates the variants without [ and ] in addition. Then known_copilot_users should be the outcome of this function.
There was a problem hiding this comment.
We also need to perform a similar fix in bots processing here:
codeface-extraction/bot_processing/bot_processing.py
Lines 195 to 199 in b4fc74e
Here we don't consider the "bot" prefix well: user[0] is "copilot-swe-agent" or "codecov", but the corresponding key in user buffer is "copilot-swe-agentbot" or "codecovbot", respectively. So, we should check if user[0] or user[0] + "bot", or user[0] + "[bot]" is in the the user buffer...
| # unify events to use a single copilot user for all events triggered by a known copilot user | ||
| if unify_copilot_users and event[9] in known_copilot_users: | ||
| event[9] = copilot_unified_name | ||
| event[10] = copilot_unified_email |
There was a problem hiding this comment.
This is not enough!
(i) Think about "commit_added" events, for which the "copilot-swe-agentbot" is part of column 13. [Maybe this is a general oversight: Did we handle this case properly in both author post processing and issue processing? I guess this is newly introduced in this PR, so we should check this.]
(ii) Think about "mentioned" or "subscribed" events, for which the "copilot-swe-agentbot" is part of columns 12 (name) and 13 (email)
This pr adresses issue #52
Fixes this issue by adding the correct author to field 'event_info_2' in the issue data. This allows for reconstruction of the commit's author during author postprocessing, if the user that envoked the event is the 'GitHub' user.
Since parts of the changes take place in functions relying on codeface, testing all the changes is currently not possible.