Skip to content

Bump to python3 and remove codeface dependency where possible#54

Open
maxloeffler wants to merge 13 commits intose-sic:masterfrom
maxloeffler:master
Open

Bump to python3 and remove codeface dependency where possible#54
maxloeffler wants to merge 13 commits intose-sic:masterfrom
maxloeffler:master

Conversation

@maxloeffler
Copy link
Contributor

Changes:

  • Make the entire codebase compatible with python3 using the python module modernize.
  • Remove the dependency on a running codeface instance by encapsulating all necessary codeface fragments into the codebase.
  • Remove the dependency on codeface ID databases by adding an alternative ID manager service that operates on .csv files. The codeface ID database will still be required to perform the extraction process.

Open Ties:

  • How should I deal with the copyright header in the included codeface fragments?
  • Due to a lack of data I could not yet test all of codeface-extraction's functionality.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maxloeffler for helping us with bumping the code to python3 and getting partly independent of Codeface.

Regarding the files taken from Codeface: Can we remove the functions that we don't need? For example, PersonInfo.py contains classes like RelationWeight or RelationWeights or functions like addReceiveRelation - stuff that we don't use here. Can we get rid of all that stuff (in best case, via rebase) and only keep the stuff we need? (Only exception: For idManager and dbmanager we might need to keep everything for completeness reasons).

Regarding the copyright headers: They are entirely outdated in Codeface, nobody has ever updated them (even not me). Technically, we would need to figure out who has contributed to the lines that we take, but not sure. When we take it, it becomes part of codeface-extraction. Maybe we should add a link to the corresponding file in codeface and state that this code originates from there.

And for the dbmanager we should mark what is the original code and what is the new csv-manager you have added.

This may need further discussion.

Comment on lines 694 to 696
# fix encoding for name and e-mail address
if user["name"] is not None:
name = str(user["name"]).encode("utf-8")
name = user["name"]
else:
name = username
mail = str(user["email"]).encode("utf-8")
name = username.decode("utf-8")
mail = user["email"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this change. Can you please explain this to me @maxloeffler?
Why do we need to call username.decode("utf-8")? And why can we be sure that user["email"] and user["name"] are already in utf-8?
(not only here, same for jira_issue_processing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I can't be sure whether this solution is generally applicable, thats another reason why it would be nice to have some more testing data. The reason is that with the data I have, user["name"] is already of type str here. Calling encode on it turns it into bytes and this leads to the incorrect names with leading bs in the output csv file that we discussed in the last meetings.

What fortifies me in my expression that this encoding shift is correct is that directly after the lines you commented on, there is a call get_user_string(name, mail) in which concats name and mail into one user_string. For this concatenation to work as expected name and mail must be a str not bytes.

Comment on lines 617 to 613
name = str(user["username"]).encode("utf-8")
mail = str(user["email"]).encode("utf-8") # empty
name = username.decode("utf-8")
mail = user["email"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the "empty" comment. Email might still be empty in this case

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Using the csvIdManager, we can use all functionality but extraction
without a codeface installation. The csvIdManager then replaces the
MySQL-based IdManager from codeface.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
@bockthom
Copy link
Collaborator

You did not add your name to the copyright header in a number of files which you have changed just to convert everything from Python2 to Python3. Technically speaking, the Python-version-related changes are changes to the file, so you should add your name to all files in which at least one line has changed.

Or do you have a different opinion on that @maxloeffler @hechtlC? Let's discuss this!

@bockthom bockthom requested a review from Copilot September 30, 2025 05:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request modernizes the codebase to be compatible with Python 3 and reduces dependencies on the codeface framework. The main changes include replacing Python 2 syntax with Python 3 equivalents, updating string handling for Unicode compatibility, and incorporating necessary codeface components directly into the project.

Key changes:

  • Modernized Python 2 to Python 3 syntax (imports, string handling, iterator methods)
  • Removed direct codeface dependencies by creating codeface_utils package with essential components
  • Updated CSV file handling and string encoding for Python 3 compatibility

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
mbox_parsing/mbox_parsing.py Updated imports, logging, and string handling for Python 3
issue_processing/jira_issue_processing.py Modernized syntax, imports, and ID manager usage
issue_processing/issue_processing.py Updated imports, string handling, and database interaction patterns
csv_writer/csv_writer.py Simplified CSV writing for Python 3 with proper encoding
codeface_utils/*.py New utility modules extracted from codeface framework
codeface_extraction/*.py Updated imports and string handling
bot_processing/bot_processing.py Modernized imports and logging calls
author_postprocessing/author_postprocessing.py Updated imports and syntax
anonymization/anonymization.py Modernized imports and dictionary operations

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 47 to 49
importlib.reload(sys)


Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reload(sys) call is unnecessary and potentially problematic in Python 3. The sys.setdefaultencoding method that this was likely trying to enable doesn't exist in Python 3, and reloading the sys module can cause unexpected behavior.

Suggested change
importlib.reload(sys)

Copilot uses AI. Check for mistakes.
Comment on lines 730 to 732
user["email"] = person.getEmail() # column "email1"
user["name"] = person.getName() # column "name"
user["id"] = person.getID() # column "id"
Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code expects person to be a PersonInfo object with getter methods, but the database query likely returns a dictionary. This will cause AttributeError when trying to call .getEmail(), .getName(), and .getID() methods on a dictionary.

Suggested change
user["email"] = person.getEmail() # column "email1"
user["name"] = person.getName() # column "name"
user["id"] = person.getID() # column "id"
user["email"] = person["email"] # column "email1"
user["name"] = person["name"] # column "name"
user["id"] = person["id"] # column "id"

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice catch, there was indeed an inconsistency in the return types in getPersonFromID between the dbIdManager and the csvIdManager. Through making it consistent, I could remove another codeface file, namely PersonInfo.

Comment on lines +742 to +747
new_text = re.sub(r"\\ufff.", " ", new_text).encode("unicode-escape")

# remove all kinds of control characters and emojis
# see: https://www.fileformat.info/info/unicode/category/index.htm
new_text = u"".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in new_text.decode("unicode-escape"))
new_text = "".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in new_text.decode("unicode-escape"))
Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of operations is incorrect. The regex substitution should be applied after decoding from unicode-escape, not before encoding. This will not properly remove the unicode specials characters as intended.

Copilot uses AI. Check for mistakes.
return str(name.decode('utf-8'))
except LookupError:
# Encoding not found, return string as is
return name
Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unreachable code after the except blocks will never execute, making the function incomplete in some error cases.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure what Copilot wants to tell us here. It is true that the except cases are not generic enough to cover all types of exceptions but they cover the expected ones and when an unexpected exception gets raised, then it should probably fly through. Also I am not sure which code it refers to here as "unreachable".

Also this section has been like this prior to my PR.

Comment on lines 31 to 32
if type(column) is str:
lineres += (column.encode("utf-8"),)
Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The __encode function is still being used but is no longer needed in Python 3 since strings are Unicode by default. The encoding to UTF-8 bytes will cause issues when writing to a text file opened with UTF-8 encoding.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the function is not being used anymore, so let me remove it.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Comment on lines 729 to 730
person = idservice.getPersonFromDB(idx)
user = dict()
user["email"] = person.getEmail() # column "email1"
user["name"] = person.getName() # column "name"
user["id"] = person.getID() # column "id"
user = idservice.getPersonFromDB(idx)
user["email"] = user.pop("email1")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I am wondering whether this returns the correct user dict. I guess user will have more information than user previously had, as previously it had only the fields email, name, and id. But now it has all fields that are present in the person object that is obtained form the database.

This might impact the amount of data that is being held in memory, and also could lead to problems later on when writing the data to disk if not only the relevant three fields are selected. Could you please check what happens with all the useless attributes, are they used/removed later on? (This is not a request to change this already, please check that first and reply before changing this.) Applies similarly also to jira_issue_processing.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thats something that I didn't even consider. While newly added users should never have any data fields but email, id, and name that may not be the case when using prefilled codeface databases. I cannot practically test this but the definition of the person table that is being queried by the dbIdManager here indicates the potential presence of more columns. With your approval, I would revert the code sequence to exclude additional columns, e.g.:

person = idservice.getPersonFromDB(idx)
user = {
    "name": person["name"],
    "email": person["email1"],
    "id": person["id"]
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking. I agree with your suggestion.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
If useCsv is not set in the config file, set its default to False.
Without setting a default, it would always end up in a KeyError
when accessing the non-existing entry.

Signed-off-by: Thomas Bock <bockthom@cmu.edu>
Configure the new logger to log at INFO level (otherwise nothing would be
logged in a normal run), and format the log output similarly to how it was
formatted previously in python2.

Signed-off-by: Thomas Bock <bockthom@cmu.edu>
@bockthom
Copy link
Collaborator

bockthom commented Feb 2, 2026

While trying the python3 version in action for the very first time, I figured out the following two problems:

  1. The newly introduced config variable useCsv did not have a default value. When this variable is missing in the database config file, the scripts ran into a KeyError as the useCsv key was not present in the config dictionary. I fixed this by setting this entry to False if no such key is present. This way we can run the script as usual without the need to change the config files.
  2. The new logger did not log anything. Reason: The default log level of the new logger that @maxloeffler has added is WARN. However, we want to get the INFO level logs as previously for the python2 version. To do so, I added a setup function to codeface_util/utils.py that format the log output and defaults to INFO level. And I now call this setup function from all main scripts when initializing the logger.

I have already added two commits to this PR that fix these two issues.

I will continue adding commits to this PR (i.e., @maxloeffler's master branch) if I encounter any additional issues that were introduced in this PR in case they need to be fixed prior to merging this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants