Bump to python3 and remove codeface dependency where possible by maxloeffler · Pull Request #54 · se-sic/codeface-extraction

maxloeffler · 2025-08-30T16:20:30Z

Changes:

Make the entire codebase compatible with python3 using the python module modernize.
Remove the dependency on a running codeface instance by encapsulating all necessary codeface fragments into the codebase.
Remove the dependency on codeface ID databases by adding an alternative ID manager service that operates on .csv files. The codeface ID database will still be required to perform the extraction process.

Open Ties:

How should I deal with the copyright header in the included codeface fragments?
Due to a lack of data I could not yet test all of codeface-extraction's functionality.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

bockthom

Thanks @maxloeffler for helping us with bumping the code to python3 and getting partly independent of Codeface.

Regarding the files taken from Codeface: Can we remove the functions that we don't need? For example, PersonInfo.py contains classes like RelationWeight or RelationWeights or functions like addReceiveRelation - stuff that we don't use here. Can we get rid of all that stuff (in best case, via rebase) and only keep the stuff we need? (Only exception: For idManager and dbmanager we might need to keep everything for completeness reasons).

Regarding the copyright headers: They are entirely outdated in Codeface, nobody has ever updated them (even not me). Technically, we would need to figure out who has contributed to the lines that we take, but not sure. When we take it, it becomes part of codeface-extraction. Maybe we should add a link to the corresponding file in codeface and state that this code originates from there.

And for the dbmanager we should mark what is the original code and what is the new csv-manager you have added.

This may need further discussion.

bockthom · 2025-09-08T01:43:09Z

issue_processing/issue_processing.py

        # fix encoding for name and e-mail address
        if user["name"] is not None:
-            name = str(user["name"]).encode("utf-8")
+            name = user["name"]
        else:
-            name = username
-        mail = str(user["email"]).encode("utf-8")
+            name = username.decode("utf-8")
+        mail = user["email"]


I don't get this change. Can you please explain this to me @maxloeffler?
Why do we need to call username.decode("utf-8")? And why can we be sure that user["email"] and user["name"] are already in utf-8?
(not only here, same for jira_issue_processing)

To be honest, I can't be sure whether this solution is generally applicable, thats another reason why it would be nice to have some more testing data. The reason is that with the data I have, user["name"] is already of type str here. Calling encode on it turns it into bytes and this leads to the incorrect names with leading bs in the output csv file that we discussed in the last meetings.

What fortifies me in my expression that this encoding shift is correct is that directly after the lines you commented on, there is a call get_user_string(name, mail) in which concats name and mail into one user_string. For this concatenation to work as expected name and mail must be a str not bytes.

bockthom · 2025-09-08T01:44:12Z

issue_processing/jira_issue_processing.py

-            name = str(user["username"]).encode("utf-8")
-        mail = str(user["email"]).encode("utf-8")  # empty
+            name = username.decode("utf-8")
+        mail = user["email"]


Please keep the "empty" comment. Email might still be empty in this case

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Using the csvIdManager, we can use all functionality but extraction without a codeface installation. The csvIdManager then replaces the MySQL-based IdManager from codeface. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

bockthom · 2025-09-30T05:55:47Z

You did not add your name to the copyright header in a number of files which you have changed just to convert everything from Python2 to Python3. Technically speaking, the Python-version-related changes are changes to the file, so you should add your name to all files in which at least one line has changed.

Or do you have a different opinion on that @maxloeffler @hechtlC? Let's discuss this!

Copilot

Pull Request Overview

This pull request modernizes the codebase to be compatible with Python 3 and reduces dependencies on the codeface framework. The main changes include replacing Python 2 syntax with Python 3 equivalents, updating string handling for Unicode compatibility, and incorporating necessary codeface components directly into the project.

Key changes:

Modernized Python 2 to Python 3 syntax (imports, string handling, iterator methods)
Removed direct codeface dependencies by creating codeface_utils package with essential components
Updated CSV file handling and string encoding for Python 3 compatibility

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
mbox_parsing/mbox_parsing.py	Updated imports, logging, and string handling for Python 3
issue_processing/jira_issue_processing.py	Modernized syntax, imports, and ID manager usage
issue_processing/issue_processing.py	Updated imports, string handling, and database interaction patterns
csv_writer/csv_writer.py	Simplified CSV writing for Python 3 with proper encoding
codeface_utils/*.py	New utility modules extracted from codeface framework
codeface_extraction/*.py	Updated imports and string handling
bot_processing/bot_processing.py	Modernized imports and logging calls
author_postprocessing/author_postprocessing.py	Updated imports and syntax
anonymization/anonymization.py	Modernized imports and dictionary operations

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-09-30T05:57:50Z

issue_processing/jira_issue_processing.py

+importlib.reload(sys)
+
+


The reload(sys) call is unnecessary and potentially problematic in Python 3. The sys.setdefaultencoding method that this was likely trying to enable doesn't exist in Python 3, and reloading the sys module can cause unexpected behavior.

Suggested change

importlib.reload(sys)

Copilot · 2025-09-30T05:57:50Z

issue_processing/issue_processing.py

+        user["email"] = person.getEmail()  # column "email1"
+        user["name"] = person.getName()  # column "name"
+        user["id"] = person.getID()  # column "id"


The code expects person to be a PersonInfo object with getter methods, but the database query likely returns a dictionary. This will cause AttributeError when trying to call .getEmail(), .getName(), and .getID() methods on a dictionary.

Suggested change

user["email"] = person.getEmail() # column "email1"

user["name"] = person.getName() # column "name"

user["id"] = person.getID() # column "id"

user["email"] = person["email"] # column "email1"

user["name"] = person["name"] # column "name"

user["id"] = person["id"] # column "id"

Very nice catch, there was indeed an inconsistency in the return types in getPersonFromID between the dbIdManager and the csvIdManager. Through making it consistent, I could remove another codeface file, namely PersonInfo.

Copilot · 2025-09-30T05:57:50Z

codeface_extraction/extractions.py

+    new_text = re.sub(r"\\ufff.", " ", new_text).encode("unicode-escape")

    # remove all kinds of control characters and emojis
    # see: https://www.fileformat.info/info/unicode/category/index.htm
-    new_text = u"".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in new_text.decode("unicode-escape"))
+    new_text = "".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in new_text.decode("unicode-escape"))


The order of operations is incorrect. The regex substitution should be applied after decoding from unicode-escape, not before encoding. This will not properly remove the unicode specials characters as intended.

Copilot · 2025-09-30T05:57:51Z

codeface_extraction/extractions.py

+        return str(name.decode('utf-8'))
    except LookupError:
        # Encoding not found, return string as is
        return name


This unreachable code after the except blocks will never execute, making the function incomplete in some error cases.

I am unsure what Copilot wants to tell us here. It is true that the except cases are not generic enough to cover all types of exceptions but they cover the expected ones and when an unexpected exception gets raised, then it should probably fly through. Also I am not sure which code it refers to here as "unreachable".

Also this section has been like this prior to my PR.

Copilot · 2025-09-30T05:57:51Z

csv_writer/csv_writer.py

+        if type(column) is str:
            lineres += (column.encode("utf-8"),)


The __encode function is still being used but is no longer needed in Python 3 since strings are Unicode by default. The encoding to UTF-8 bytes will cause issues when writing to a text file opened with UTF-8 encoding.

Actually, the function is not being used anymore, so let me remove it.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

bockthom · 2025-09-30T18:00:17Z

issue_processing/issue_processing.py

-        person = idservice.getPersonFromDB(idx)
-        user = dict()
-        user["email"] = person.getEmail()  # column "email1"
-        user["name"] = person.getName()  # column "name"
-        user["id"] = person.getID()  # column "id"
+        user = idservice.getPersonFromDB(idx)
+        user["email"] = user.pop("email1")


Hm, I am wondering whether this returns the correct user dict. I guess user will have more information than user previously had, as previously it had only the fields email, name, and id. But now it has all fields that are present in the person object that is obtained form the database.

This might impact the amount of data that is being held in memory, and also could lead to problems later on when writing the data to disk if not only the relevant three fields are selected. Could you please check what happens with all the useless attributes, are they used/removed later on? (This is not a request to change this already, please check that first and reply before changing this.) Applies similarly also to jira_issue_processing.py

Oh thats something that I didn't even consider. While newly added users should never have any data fields but email, id, and name that may not be the case when using prefilled codeface databases. I cannot practically test this but the definition of the person table that is being queried by the dbIdManager here indicates the potential presence of more columns. With your approval, I would revert the code sequence to exclude additional columns, e.g.:

person = idservice.getPersonFromDB(idx) user = { "name": person["name"], "email": person["email1"], "id": person["id"] }

Thanks for checking. I agree with your suggestion.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

If useCsv is not set in the config file, set its default to False. Without setting a default, it would always end up in a KeyError when accessing the non-existing entry. Signed-off-by: Thomas Bock <bockthom@cmu.edu>

Configure the new logger to log at INFO level (otherwise nothing would be logged in a normal run), and format the log output similarly to how it was formatted previously in python2. Signed-off-by: Thomas Bock <bockthom@cmu.edu>

bockthom · 2026-02-02T05:34:36Z

While trying the python3 version in action for the very first time, I figured out the following two problems:

The newly introduced config variable useCsv did not have a default value. When this variable is missing in the database config file, the scripts ran into a KeyError as the useCsv key was not present in the config dictionary. I fixed this by setting this entry to False if no such key is present. This way we can run the script as usual without the need to change the config files.
The new logger did not log anything. Reason: The default log level of the new logger that @maxloeffler has added is WARN. However, we want to get the INFO level logs as previously for the python2 version. To do so, I added a setup function to codeface_util/utils.py that format the log output and defaults to INFO level. And I now call this setup function from all main scripts when initializing the logger.

I have already added two commits to this PR that fix these two issues.

I will continue adding commits to this PR (i.e., @maxloeffler's master branch) if I encounter any additional issues that were introduced in this PR in case they need to be fixed prior to merging this PR.

maxloeffler added 3 commits August 4, 2025 17:50

Automatically convert to python3 (using 2to3)

fd0a8ed

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Import codeface files required for extraction

d64832e

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Remove dependency on codeface

96c4208

bockthom reviewed Sep 8, 2025

View reviewed changes

maxloeffler added 3 commits September 8, 2025 10:46

Rename folder for codeface fragments

454efa3

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Introduce CSV-based IdManager class

3472384

Using the csvIdManager, we can use all functionality but extraction without a codeface installation. The csvIdManager then replaces the MySQL-based IdManager from codeface. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Adapt usage of codeface fragments in all modules

202a188

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler force-pushed the master branch from 8dc3596 to ab21443 Compare September 8, 2025 08:51

Increase consistency with pyhton3 conventions

9df79c0

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler force-pushed the master branch from ab21443 to 14e9fda Compare September 18, 2025 17:18

bockthom requested a review from Copilot September 30, 2025 05:56

Copilot AI reviewed Sep 30, 2025

View reviewed changes

maxloeffler force-pushed the master branch from 14e9fda to 08f61a7 Compare September 30, 2025 11:48

maxloeffler added 3 commits September 30, 2025 14:06

Update copyright headers

e9372fe

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Remove unused code artifacts

eef2c8b

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Ensure string represenation of all user data in 'get_id_and_update_user'

f6bc6a2

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler force-pushed the master branch from 08f61a7 to 64239b2 Compare September 30, 2025 12:06

bockthom reviewed Sep 30, 2025

View reviewed changes

Use the same return format in all implementations of 'getPersonFromDB'

41822da

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler force-pushed the master branch from 64239b2 to 41822da Compare October 2, 2025 19:07

bockthom added 2 commits February 2, 2026 06:10

Properly initialize new config entry 'useCsv'

c3272fe

If useCsv is not set in the config file, set its default to False. Without setting a default, it would always end up in a KeyError when accessing the non-existing entry. Signed-off-by: Thomas Bock <bockthom@cmu.edu>

Setup logger to log INFO logs and format log out

cf23199

Configure the new logger to log at INFO level (otherwise nothing would be logged in a normal run), and format the log output similarly to how it was formatted previously in python2. Signed-off-by: Thomas Bock <bockthom@cmu.edu>

Conversation

maxloeffler commented Aug 30, 2025

Changes:

Open Ties:

Uh oh!

bockthom left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bockthom commented Sep 30, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bockthom commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants