29 changes: 25 additions & 4 deletions author_postprocessing/author_postprocessing.py
@@ -14,6 +14,7 @@
#
# Copyright 2015-2017 by Claus Hunsen <hunsen@fim.uni-passau.de>
# Copyright 2020-2022 by Thomas Bock <bockthom@cs.uni-saarland.de>
# Copyright 2025-2026 by Leo Sendelbach <s8lesend@stud.uni-saarland.de>
# All Rights Reserved.
"""
This file is able to disambiguate authors after the extraction from the Codeface database was performed. A manually
@@ -50,6 +51,15 @@
from csv_writer import csv_writer


##
# GLOBAL VARIABLES
##

# global variable containing all known copilot users and the name and mail address copilot users will be assigned
known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
Collaborator

This is not sufficient: The ID service may remove special characters such as [ or ]. Thus, we might not be able to merge them correctly.

However, I also want this to be as independent as possible from the ID service, which is why I want to keep the current spelling variant that contains [ and ].

Suggestion 1: Add both variants: "copilot-pull-request-reviewer[bot]", "copilot-pull-request-reviewerbot"

Suggestion 2: As these square brackets are commonly used, we might only add it once to the list and then add a function that automatically generates the variants without [ and ] in addition. Then known_copilot_users should be the outcome of this function.
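
Suggestion 2 could be sketched roughly as follows. The helper name with_bracketless_variants and the set base_copilot_users are hypothetical and not part of this PR. The bracketed spelling "copilot-swe-agent[bot]" is also an assumption, since the current set only lists the variant without brackets.

```python
def with_bracketless_variants(names):
    """Return the given names plus, for each name, a variant without '[' and ']'."""
    variants = set(names)
    for name in names:
        # the ID service may strip square brackets, so also accept that spelling
        variants.add(name.replace("[", "").replace("]", ""))
    return variants


# hypothetical base list: only the bracketed spelling is maintained by hand
base_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agent[bot]"}

# known_copilot_users is the outcome of the helper, as proposed above
known_copilot_users = with_bracketless_variants(base_copilot_users)
```

With this, the hand-written list stays independent of the ID service, while names that have already lost their brackets are still matched.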

Collaborator

We also need to perform a similar fix in bots processing here:

if user[0] in user_buffer.keys():
    bot_reduced["user"] = user_buffer[user[0]]
    bot_reduced["prediction"] = user[-1]
    bot_data_reduced.append(bot_reduced)
else:

Here we don't handle the "bot" suffix well: user[0] is "copilot-swe-agent" or "codecov", but the corresponding key in the user buffer is "copilot-swe-agentbot" or "codecovbot", respectively. So, we should check whether user[0], user[0] + "bot", or user[0] + "[bot]" is in the user buffer...
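
A minimal sketch of such a lookup, assuming user_buffer is a dictionary keyed by user name. The helper name find_buffered_user and the sample buffer contents are hypothetical and only serve as illustration.

```python
def find_buffered_user(name, user_buffer):
    """Look up a bot name in the user buffer, also trying the "bot" and "[bot]" suffixes.

    Returns the buffered user, or None if no spelling variant is found.
    """
    for candidate in (name, name + "bot", name + "[bot]"):
        if candidate in user_buffer:
            return user_buffer[candidate]
    return None


# hypothetical buffer contents: key is the user name, value is the buffered user data
user_buffer = {
    "codecovbot": ("codecovbot", "codecov@example.com"),
    "copilot-swe-agentbot": ("copilot-swe-agentbot", "copilot@example.com"),
}

buffered_user = find_buffered_user("codecov", user_buffer)
```

In the bots processing, the condition `user[0] in user_buffer.keys()` could then become a check of whether this lookup returns something other than None.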

copilot_unified_name = "Copilot"
copilot_unified_email = "copilot@example.com"

Comment on lines +54 to +62
Collaborator

Can't we move this to a common file? At the moment, the "Copilot" user is used twice: Here and in issue_processing.py.

We could create a new directory called "GitHub_user_utils", and a file "GitHub_user_utils.py" where we store this information. Then we could just import these global variables (those that are actually used in a script) in each of the scripts:

from GitHub_user_utils import known_copilot_users (in issue_processing)
from GitHub_user_utils import known_copilot_users, copilot_unified_name, copilot_unified_email (in author_postprocessing)

Maybe we can also move the github_user and github_email constants (see line 111 below here in this file) and the helper function is_github_noreply_author there, not to have different user information distributed among different files.
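
The shared file could look roughly like the sketch below. The constants are copied from this diff, but the body of is_github_noreply_author is not visible here, so the comparison shown is an assumption about what the helper does.

```python
# GitHub_user_utils.py (proposed shared module, sketch only)

# all known copilot users and the name and mail address they will be assigned
known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
copilot_unified_name = "Copilot"
copilot_unified_email = "copilot@example.com"

# author that GitHub inserts automatically as committer
github_user = "GitHub"
github_email = "noreply@github.com"


def is_github_noreply_author(name, email):
    """Check whether name and e-mail belong to "GitHub <noreply@github.com>".

    Assumed implementation: a plain comparison against the two constants above.
    """
    return name == github_user and email == github_email
```

Each script would then import only the names it actually uses, as in the two import lines above.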

##
# RUN POSTPROCESSING
##
@@ -78,7 +88,7 @@ def perform_data_backup(results_path, results_path_backup):
copy(current_file, backup_file)


-def fix_github_browser_commits(data_path, issues_github_list, commits_list, authors_list, emails_list, bots_list):
+def fix_github_browser_commits(data_path, issues_github_list, commits_list, authors_list, emails_list, bots_list, unify_copilot_users=True):
"""
Replace the author "GitHub <noreply@github.com>" in both commit and GitHub issue data by the correct author.
The author "GitHub <noreply@github.com>" is automatically inserted as the committer of a commit that is made when
@@ -89,14 +99,15 @@ def fix_github_browser_commits(data_path, issues_github_list, commits_list, auth
"GitHub <noreply@github.com>" are removed. Also "mentioned" or "subscribed" events in the GitHub issue data which
reference the author "GitHub <noreply@github.com>" are removed from the GitHub issue data. In addition, remove the
author "GitHub <noreply@github.com>" also from the author data and bot data and remove e-mails that have been sent
-by this author.
+by this author. This method also unifies all known copilot users into a single user if desired.

:param data_path: the path to the project data that is to be fixed
:param issues_github_list: file name of the github issue data
:param commits_list: file name of the corresponding commit data
:param authors_list: file name of the corresponding author data
:param emails_list: file name of the corresponding email data
:param bots_list: file name of the corresponding bot data
:param unify_copilot_users: whether to unify known copilot users into a single user
"""
github_user = "GitHub"
github_email = "noreply@github.com"
@@ -178,20 +189,30 @@ def is_github_noreply_author(name, email):
commit_data_file = path.join(data_path, commits_list)
commit_data = csv_writer.read_from_csv(commit_data_file)
commit_hash_to_author = {commit[7]: commit[2:4] for commit in commit_data}

author_name_to_data = {author[1]: author[1:3] for author in author_data_new}
issue_data_new = []

for event in issue_data:
    # unify events to use a single copilot user for all events triggered by a known copilot user
    if unify_copilot_users and event[9] in known_copilot_users:
        event[9] = copilot_unified_name
        event[10] = copilot_unified_email
Comment on lines +197 to +199
Collaborator

(1) Can we add a log statement here to state that we unify the copilot users? Similar to the ones removing GitHub user?

(2) Is this the only place where copilot users can occur? Shouldn't we do this to the authors_list and commits_list as well when copilot creates commits? Not sure about the other data sources as well, please check.
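
For (1), a log statement could be added along these lines. This sketch uses Python's standard logging module and a hypothetical helper unify_copilot_events; the logger setup that author_postprocessing.py actually uses is not visible in this diff, so treat both as assumptions.

```python
import logging

log = logging.getLogger(__name__)

known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
copilot_unified_name = "Copilot"
copilot_unified_email = "copilot@example.com"


def unify_copilot_events(issue_data):
    """Rewrite author name (column 9) and e-mail (column 10) of copilot events in place.

    Returns the number of events that were unified.
    """
    unified = 0
    for event in issue_data:
        if event[9] in known_copilot_users:
            event[9] = copilot_unified_name
            event[10] = copilot_unified_email
            unified += 1
    # one summary line instead of one log line per event
    log.info("Unified %d events of known copilot users into '%s <%s>'",
             unified, copilot_unified_name, copilot_unified_email)
    return unified
```

Logging a single summary keeps the output short even for projects with many copilot events.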

Comment on lines +196 to +199
Collaborator

This is not enough!

(i) Think about "commit_added" events, for which the "copilot-swe-agentbot" is part of column 13. [Maybe this is a general oversight: Did we handle this case properly in both author post processing and issue processing? I guess this is newly introduced in this PR, so we should check this.]

(ii) Think about "mentioned" or "subscribed" events, for which the "copilot-swe-agentbot" is part of columns 12 (name) and 13 (email)
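
A rough sketch of how (i) and (ii) might be covered, under several assumptions: the event type sits in column 8 with the literal values "commit_added", "mentioned", and "subscribed"; column 13 of "commit_added" events holds the name wrapped in '"' (as the existing code that cuts the excess '"' suggests); and columns 12 and 13 of "mentioned" or "subscribed" events hold the plain name and e-mail. The helper name unify_event_info is hypothetical.

```python
known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
copilot_unified_name = "Copilot"
copilot_unified_email = "copilot@example.com"


def unify_event_info(event):
    """Unify copilot users that appear in the event info columns 12 and 13 (in place)."""
    event_type = event[8]
    if event_type == "commit_added":
        # (i) column 13 holds the commit author's name, assumed to be wrapped in '"'
        if event[13].strip('"') in known_copilot_users:
            event[13] = '"' + copilot_unified_name + '"'
    elif event_type in ("mentioned", "subscribed"):
        # (ii) column 12 holds the referenced name, column 13 the referenced e-mail
        if event[12] in known_copilot_users:
            event[12] = copilot_unified_name
            event[13] = copilot_unified_email
    return event
```

The same handling would probably be needed in issue_processing.py, which is a further argument for sharing the copilot constants between both scripts.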


    # replace author if necessary
    if is_github_noreply_author(event[9], event[10]) and event[8] == commit_added_event:
        # extract commit hash from event info 1
        commit_hash = event[12]

        # extract author name from event info 2 while cutting excess '"'
        name = event[13][1:-1]
        # extract commit author from commit data, if available
        if commit_hash in commit_hash_to_author:
            event[9] = commit_hash_to_author[commit_hash][0]
            event[10] = commit_hash_to_author[commit_hash][1]
            issue_data_new.append(event)
        elif name in author_name_to_data:
            event[9] = author_name_to_data[name][0]
            event[10] = author_name_to_data[name][1]
            issue_data_new.append(event)
        else:
            # the added commit is not part of the commit data. In most cases, this is due to merge commits
            # appearing in another pull request, as Codeface does not keep track of merge commits. As we