Author reconstruction for 'GitHub' user #53
base: master
Changes from all commits
65f0a87
2e67f0d
c4f4af5
e77b009
c28b138
488626e
2894b0d
7ae5079
f44a8b7
e46af3f
cfaba71
fae4c47
89f0f01
73d5f64
befbee3
23f0dd6
eb53c79
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,6 +14,7 @@ | |
| # | ||
| # Copyright 2015-2017 by Claus Hunsen <hunsen@fim.uni-passau.de> | ||
| # Copyright 2020-2022 by Thomas Bock <bockthom@cs.uni-saarland.de> | ||
| # Copyright 2025-2026 by Leo Sendelbach <s8lesend@stud.uni-saarland.de> | ||
| # All Rights Reserved. | ||
| """ | ||
| This file is able to disambiguate authors after the extraction from the Codeface database was performed. A manually | ||
|
|
@@ -50,6 +51,15 @@ | |
| from csv_writer import csv_writer | ||
|
|
||
|
|
||
| ## | ||
| # GLOBAL VARIABLES | ||
| ## | ||
|
|
||
| # global variable containing all known copilot users and the name and mail address copilot users will be assigned | ||
| known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"} | ||
| copilot_unified_name = "Copilot" | ||
| copilot_unified_email = "copilot@example.com" | ||
|
|
||
|
Comment on lines +54 to +62
Collaborator
Can't we move this to a common file? At the moment, the "Copilot" user is used twice: Here and in issue_processing.py. We could create a new directory called "GitHub_user_utils", and a file "GitHub_user_utils.py" where we store this information. Then we could just import these global variables (those that are actually used in a script) in each of the scripts:
Maybe we can also move the |
||
| ## | ||
| # RUN POSTPROCESSING | ||
| ## | ||
|
|
@@ -78,7 +88,7 @@ def perform_data_backup(results_path, results_path_backup): | |
| copy(current_file, backup_file) | ||
|
|
||
|
|
||
| def fix_github_browser_commits(data_path, issues_github_list, commits_list, authors_list, emails_list, bots_list): | ||
| def fix_github_browser_commits(data_path, issues_github_list, commits_list, authors_list, emails_list, bots_list, unify_copilot_users=True): | ||
| """ | ||
| Replace the author "GitHub <noreply@github.com>" in both commit and GitHub issue data by the correct author. | ||
| The author "GitHub <noreply@github.com>" is automatically inserted as the committer of a commit that is made when | ||
|
|
@@ -89,14 +99,15 @@ def fix_github_browser_commits(data_path, issues_github_list, commits_list, auth | |
| "GitHub <noreply@github.com>" are removed. Also "mentioned" or "subscribed" events in the GitHub issue data which | ||
| reference the author "GitHub <noreply@github.com>" are removed from the GitHub issue data. In addition, remove the | ||
| author "GitHub <noreply@github.com>" also from the author data and bot data and remove e-mails that have been sent | ||
| by this author. | ||
| by this author. This method also unifies all known copilot users into a single user if desired. | ||
|
|
||
| :param data_path: the path to the project data that is to be fixed | ||
| :param issues_github_list: file name of the github issue data | ||
| :param commits_list: file name of the corresponding commit data | ||
| :param authors_list: file name of the corresponding author data | ||
| :param emails_list: file name of the corresponding email data | ||
| :param bots_list: file name of the corresponding bot data | ||
| :param unify_copilot_users: whether to unify known copilot users into a single user | ||
| """ | ||
| github_user = "GitHub" | ||
| github_email = "noreply@github.com" | ||
|
|
@@ -178,20 +189,30 @@ def is_github_noreply_author(name, email): | |
| commit_data_file = path.join(data_path, commits_list) | ||
| commit_data = csv_writer.read_from_csv(commit_data_file) | ||
| commit_hash_to_author = {commit[7]: commit[2:4] for commit in commit_data} | ||
|
|
||
| author_name_to_data = {author[1]: author[1:3] for author in author_data_new} | ||
| issue_data_new = [] | ||
|
|
||
| for event in issue_data: | ||
| # unify events to use a single copilot user for all events triggered by a known copilot user | ||
| if unify_copilot_users and event[9] in known_copilot_users: | ||
| event[9] = copilot_unified_name | ||
| event[10] = copilot_unified_email | ||
|
Comment on lines +197 to +199
Collaborator
(1) Can we add a log statement here to state that we unify the copilot users? Similar to the ones removing GitHub user?
(2) Is this the only place where copilot users can occur? Shouldn't we do this to the authors_list and commits_list as well when copilot creates commits? Not sure about the other data sources as well, please check.
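Both points could be covered by one small helper that rewrites a name/email column pair and logs how many rows it touched. The following is only a sketch: `unify_copilot_rows` is a hypothetical name, the three globals are copied from the diff, and the column indices in the usage note are assumptions read off the `event[9]`/`event[10]`, `commit[2:4]` and `author[1:3]` accesses in this function.

```python
import logging

log = logging.getLogger(__name__)

# Values copied from the diff under review.
known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
copilot_unified_name = "Copilot"
copilot_unified_email = "copilot@example.com"


def unify_copilot_rows(rows, name_col, email_col, label):
    """Hypothetical helper: rewrite known Copilot users in place and log the count.

    :param rows: list of mutable rows (lists), e.g., as read from a CSV file
    :param name_col: index of the column holding the user name
    :param email_col: index of the column holding the e-mail address
    :param label: short description of the data source, used only for logging
    :return: the number of rows that were rewritten
    """
    changed = 0
    for row in rows:
        if row[name_col] in known_copilot_users:
            row[name_col] = copilot_unified_name
            row[email_col] = copilot_unified_email
            changed += 1
    log.info("Unified %d Copilot entries in the %s data.", changed, label)
    return changed
```

Assuming those column positions hold, the same helper could be called as `unify_copilot_rows(issue_data, 9, 10, "issue")`, `unify_copilot_rows(commit_data, 2, 3, "commit")` and `unify_copilot_rows(author_data_new, 1, 2, "author")`. Unifying the author data may leave several rows with the same name, so a deduplication step would probably be needed afterwards.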
Comment on lines +196 to +199
Collaborator
This is not enough!
(i) Think about "commit_added" events, for which the "copilot-swe-agentbot" is part of column 13. [Maybe this is a general oversight: Did we handle this case properly in both author post processing and issue processing? I guess this is newly introduced in this PR, so we should check this.]
(ii) Think about "mentioned" or "subscribed" events, for which the "copilot-swe-agentbot" is part of columns 12 (name) and 13 (email)
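A rough sketch of how cases (i) and (ii) might be handled is given below. The function name is hypothetical, and the event type strings are assumed to match the names used in this comment. For "commit_added" events the name in column 13 is assumed to be wrapped in double quotes (the existing code strips them with `event[13][1:-1]`); for "mentioned" and "subscribed" events columns 12 and 13 are assumed to be unquoted.

```python
def unify_copilot_in_event_info(event, known_users, unified_name, unified_email):
    """Hypothetical helper: unify Copilot users in the event info columns.

    Column 8 is the event type; columns 12 and 13 are event info 1 and 2.
    The event (a mutable list) is modified in place and also returned.
    """
    event_type = event[8]
    if event_type == "commit_added":
        # column 13 holds the author name surrounded by double quotes
        if event[13].strip('"') in known_users:
            event[13] = '"' + unified_name + '"'
    elif event_type in ("mentioned", "subscribed"):
        # column 12 holds the referenced name, column 13 the referenced e-mail
        if event[12] in known_users:
            event[12] = unified_name
            event[13] = unified_email
    return event
```

This would run inside the existing `for event in issue_data` loop, next to the rewrite of columns 9 and 10. If the "mentioned" columns turn out to be quoted as well, the comparison would need the same quote handling as the "commit_added" branch.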
||
|
|
||
| # replace author if necessary | ||
| if is_github_noreply_author(event[9], event[10]) and event[8] == commit_added_event: | ||
| # extract commit hash from event info 1 | ||
| commit_hash = event[12] | ||
|
|
||
| # extract author name from event info 2 while cutting excess '"' | ||
| name = event[13][1:-1] | ||
|
||
| # extract commit author from commit data, if available | ||
| if commit_hash in commit_hash_to_author: | ||
| event[9] = commit_hash_to_author[commit_hash][0] | ||
| event[10] = commit_hash_to_author[commit_hash][1] | ||
| issue_data_new.append(event) | ||
| elif name in author_name_to_data: | ||
| event[9] = author_name_to_data[name][0] | ||
| event[10] = author_name_to_data[name][1] | ||
| issue_data_new.append(event) | ||
| else: | ||
| # the added commit is not part of the commit data. In most cases, this is due to merge commits | ||
| # appearing in another pull request, as Codeface does not keep track of merge commits. As we | ||
|
|
||
This is not sufficient: The ID service may remove special characters such as [ or ]. Thus, we might not be able to merge them correctly.
However, I also want this to be as independent as possible from the ID service, which is why I want to keep the current spelling variant that contains [ and ].
Suggestion 1: Add both variants: "copilot-pull-request-reviewer[bot]", "copilot-pull-request-reviewerbot"
Suggestion 2: As these square brackets are commonly used, we might add each name only once to the list and then add a function that automatically generates the variants without [ and ] in addition. Then known_copilot_users should be the outcome of this function.
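Suggestion 2 could look roughly like this. It is a minimal sketch: `with_bracketless_variants` is a hypothetical name, and the input set is the one currently used in the diff.

```python
def with_bracketless_variants(names):
    """Hypothetical helper: return the given names plus their variants without '[' and ']'."""
    variants = set(names)
    for name in names:
        # e.g., "copilot-pull-request-reviewer[bot]" -> "copilot-pull-request-reviewerbot"
        variants.add(name.replace("[", "").replace("]", ""))
    return variants


known_copilot_users = with_bracketless_variants(
    {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
)
```

Names without brackets map to themselves, so the result holds the three original spellings plus "copilot-pull-request-reviewerbot".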
We also need to perform a similar fix in bots processing here:
codeface-extraction/bot_processing/bot_processing.py
Lines 195 to 199 in b4fc74e
Here we don't handle the "bot" suffix well: user[0] is "copilot-swe-agent" or "codecov", but the corresponding key in the user buffer is "copilot-swe-agentbot" or "codecovbot", respectively. So, we should check whether user[0], user[0] + "bot", or user[0] + "[bot]" is in the user buffer.
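The lookup described here might be sketched as follows. The helper name is hypothetical, and the user buffer is assumed to behave like a dictionary keyed by user name; the actual structure in bot_processing.py may differ.

```python
def find_user_buffer_key(name, user_buffer):
    """Hypothetical helper: return the key under which 'name' is stored, or None.

    Tries the plain name first, then the name with a "bot" suffix,
    then the name with a "[bot]" suffix.
    """
    for candidate in (name, name + "bot", name + "[bot]"):
        if candidate in user_buffer:
            return candidate
    return None
```

A caller could then replace a plain `user[0] in user_buffer` check with `find_user_buffer_key(user[0], user_buffer) is not None` and use the returned key for the actual access.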