Skip to content

Author reconstruction for 'GitHub' user#53

Open
Leo-Send wants to merge 17 commits intose-sic:masterfrom
Leo-Send:master
Open

Author reconstruction for 'GitHub' user#53
Leo-Send wants to merge 17 commits intose-sic:masterfrom
Leo-Send:master

Conversation

@Leo-Send
Copy link

@Leo-Send Leo-Send commented Aug 25, 2025

This pr adresses issue #52

Fixes this issue by adding the correct author to field 'event_info_2' in the issue data. This allows for reconstruction of the commit's author during author postprocessing, if the user that envoked the event is the 'GitHub' user.

Since parts of the changes take place in functions relying on codeface, testing all the changes is currently not possible.

This allows for reconstruction of correct commit author if user is
github

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@Leo-Send Leo-Send changed the title Add commit author of 'commit_added' events to event info Author reconstruction for 'GitHub' user Aug 25, 2025
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @Leo-Send.
I don't have run your changes yet, but here are already two comments:

Could you please update your copyright header in the files you have changed?

also added one comment for clarity

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@Leo-Send
Copy link
Author

This PR should also be ready for review. I was unable to test my changes in python 2, but since we are porting to python 3 in the near future I hope this is not an issue. I do also have the tested python 3 code locally, which could make @maxloeffler work a bit easier.

@bockthom bockthom requested a review from Copilot September 30, 2025 05:49
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses issue #52 by implementing author reconstruction for events triggered by the 'GitHub' user. The main purpose is to capture and preserve commit author information in event data so that the actual commit author can be reconstructed during post-processing, even when the event appears to be triggered by the generic 'GitHub' user.

  • Adds commit author information to event_info_2 field for commit_added events
  • Implements connected events filtering and matching logic to handle related issue connections
  • Updates author post-processing to use the preserved author information as a fallback when commit data is unavailable

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
issue_processing/issue_processing.py Adds author preservation logic, connected events handling, and updates user processing to include commit author information
author_postprocessing/author_postprocessing.py Implements fallback author reconstruction using preserved author names when commit hash lookup fails

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please address the following issues, as well as the issues previously pointed out by Copilot, as well as the other unresolved one regarding necessary comments from a previous review of mine?

Thank you!

@bockthom bockthom requested a review from Copilot October 21, 2025 01:24
@bockthom
Copy link
Collaborator

And the copyright header in issue_processing/jira_issue_processing.py needs to be updated as well.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

also save merge commits
reconstruction of connected events is done by first saving all connected
events that occured at the same time. Then, it is possible to match
connected events iff:
- half of the involved issues are equal, meaning that one issue is
  connected to multiple others
- half rounded up of the involved isses are equal, meaning that we have
  one external connected event and then the previous case with the
remaining issues

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
since data is modified in-place, return of input data is not needed

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
ALso add commit hash if closed by commit

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also rename 'new feature' to 'feature'

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also remove duplicates from type list

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
using empty line reserved for jira components

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also added copyright header

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also minor fixes and removal of math.ceil

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
comments now each have a boolean field that describes whether the
comment contains a suggestion or not

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
dicts for reconstructing connected events are now better explained and
the comments do not disruot the workflow in the run function anymore

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@bockthom bockthom requested a review from Copilot November 4, 2025 08:03
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

includes:
- updated comments
- spelling mistake
- fix for potential crash if script is used on old data

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
author postprocessing now also contains a list of known copilot use
names that can be extended to unify more different copilot users

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
the events 'copilot_work_started' and 'copilot_work_finished' now always
have the standard copilot user data

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Method doc updated to reflect new functionality

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
previously, the creator of the issues was falsely matched to the
connected event instead of the user triggering the event

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Comment on lines +54 to +62
##
# GLOBAL VARIABLES
##

# global variable containing all known copilot users and the name and mail adress copilot users will be assigned
known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
copilot_unified_name = "Copilot"
copilot_unified_email = "copilot@example.com"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we move this to a common file? At the moment, the "Copilot" user is used twice: Here and in issue_processing.py.

We could create a new directory called "GitHub_user_utils", and a file "GitHub_user_utils.py" where we store this information. Then we could just import these global variables (those that are actually used in a script) in each of the scripts:

from GitHub_user_utils import known_copilot_users (in issue_processing)
from GitHub_user_utils import known_copilot_users, copilot_unified_name, copilot_unified_email'' (in author_postprocessing)

Maybe we can also move the github_user and github_email constants (see line 111 below here in this file) and the helper function is_github_noreply_author there, not to have different user information distributed among different files.

Comment on lines +197 to +199
if unify_copilot_users and event[9] in known_copilot_users:
event[9] = copilot_unified_name
event[10] = copilot_unified_email
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) Can we add a log statement here to state that we unify the copilot users? Similar to the ones removing GitHub user?

(2) Is this the only place where copilot users can occur? Shouldn't we do this to the authors_list and commits_list as well when copilot creates commits? Not sure about the other data sources as well, please check.

@bockthom
Copy link
Collaborator

bockthom commented Feb 2, 2026

One more issue:

The anonymization script cannot handle our manually introduced Copilot user as it is not part of the authors.list file:

File "anonymization/anonymization.py", line 236, in run_anonymization
    new_author = author_to_anonymized_author[(issue_event[9], issue_event[10])]
KeyError: ('Copilot', 'copilot@example.com')

Solution: Add Copilot to the authors.list if it is not present there (during author_postprocessing).

However, we might also think about whether we ignore it in anonymization. Please check whether we ignore the GitHub user in anonymization or not.

##

# global variable containing all known copilot users and the name and mail adress copilot users will be assigned
known_copilot_users = {"Copilot", "copilot-pull-request-reviewer[bot]", "copilot-swe-agentbot"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not sufficient: The ID service may remove special characters such as [ or ]. Thus, we might not be able to merge them correctly.

However, I also want this to be as independent as possible from the ID service., which is why I want to keep the current spelling variant that contains [ and ].

Suggestion 1: Add both variants: "copilot-pull-request-reviewer[bot]", "copilot-pull-request-reviewerbot"

Suggestion 2: As these square brackets are commonly used, we might only add it once to the list and then add a function that automatically generates the variants without [ and ] in addition. Then known_copilot_users should be the outcome of this function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to perform a similar fix in bots processing here:

if user[0] in user_buffer.keys():
bot_reduced["user"] = user_buffer[user[0]]
bot_reduced["prediction"] = user[-1]
bot_data_reduced.append(bot_reduced)
else:

Here we don't consider the "bot" prefix well: user[0] is "copilot-swe-agent" or "codecov", but the corresponding key in user buffer is "copilot-swe-agentbot" or "codecovbot", respectively. So, we should check if user[0] or user[0] + "bot", or user[0] + "[bot]" is in the the user buffer...

Comment on lines +196 to +199
# unify events to use a single copilot user for all events triggered by a known copilot user
if unify_copilot_users and event[9] in known_copilot_users:
event[9] = copilot_unified_name
event[10] = copilot_unified_email
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not enough!

(i) Think about "commit_added" events, for which the "copilot-swe-agentbot" is part of column 13. [Maybe this is a general oversight: Did we handle this case properly in both author post processing and issue processing? I guess this is newly introduced in this PR, so we should check this.]

(ii) Think about "mentioned" or "subscribed" events, for which the "copilot-swe-agentbot" is part of columns 12 (name) and 13 (email)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants