Bump to python3 and remove codeface dependency where possible#54
Bump to python3 and remove codeface dependency where possible#54maxloeffler wants to merge 13 commits intose-sic:masterfrom
Conversation
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
bockthom
left a comment
There was a problem hiding this comment.
Thanks @maxloeffler for helping us with bumping the code to python3 and getting partly independent of Codeface.
Regarding the files taken from Codeface: Can we remove the functions that we don't need? For example, PersonInfo.py contains classes like RelationWeight or RelationWeights or functions like addReceiveRelation - stuff that we don't use here. Can we get rid of all that stuff (in best case, via rebase) and only keep the stuff we need? (Only exception: For idManager and dbmanager we might need to keep everything for completeness reasons).
Regarding the copyright headers: They are entirely outdated in Codeface, nobody has ever updated them (even not me). Technically, we would need to figure out who has contributed to the lines that we take, but not sure. When we take it, it becomes part of codeface-extraction. Maybe we should add a link to the corresponding file in codeface and state that this code originates from there.
And for the dbmanager we should mark what is the original code and what is the new csv-manager you have added.
This may need further discussion.
issue_processing/issue_processing.py
Outdated
| # fix encoding for name and e-mail address | ||
| if user["name"] is not None: | ||
| name = str(user["name"]).encode("utf-8") | ||
| name = user["name"] | ||
| else: | ||
| name = username | ||
| mail = str(user["email"]).encode("utf-8") | ||
| name = username.decode("utf-8") | ||
| mail = user["email"] |
There was a problem hiding this comment.
I don't get this change. Can you please explain this to me @maxloeffler?
Why do we need to call username.decode("utf-8")? And why can we be sure that user["email"] and user["name"] are already in utf-8?
(not only here, same for jira_issue_processing)
There was a problem hiding this comment.
To be honest, I can't be sure whether this solution is generally applicable, thats another reason why it would be nice to have some more testing data. The reason is that with the data I have, user["name"] is already of type str here. Calling encode on it turns it into bytes and this leads to the incorrect names with leading bs in the output csv file that we discussed in the last meetings.
What fortifies me in my expression that this encoding shift is correct is that directly after the lines you commented on, there is a call get_user_string(name, mail) in which concats name and mail into one user_string. For this concatenation to work as expected name and mail must be a str not bytes.
| name = str(user["username"]).encode("utf-8") | ||
| mail = str(user["email"]).encode("utf-8") # empty | ||
| name = username.decode("utf-8") | ||
| mail = user["email"] |
There was a problem hiding this comment.
Please keep the "empty" comment. Email might still be empty in this case
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Using the csvIdManager, we can use all functionality but extraction without a codeface installation. The csvIdManager then replaces the MySQL-based IdManager from codeface. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
|
You did not add your name to the copyright header in a number of files which you have changed just to convert everything from Python2 to Python3. Technically speaking, the Python-version-related changes are changes to the file, so you should add your name to all files in which at least one line has changed. Or do you have a different opinion on that @maxloeffler @hechtlC? Let's discuss this! |
There was a problem hiding this comment.
Pull Request Overview
This pull request modernizes the codebase to be compatible with Python 3 and reduces dependencies on the codeface framework. The main changes include replacing Python 2 syntax with Python 3 equivalents, updating string handling for Unicode compatibility, and incorporating necessary codeface components directly into the project.
Key changes:
- Modernized Python 2 to Python 3 syntax (imports, string handling, iterator methods)
- Removed direct codeface dependencies by creating
codeface_utilspackage with essential components - Updated CSV file handling and string encoding for Python 3 compatibility
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| mbox_parsing/mbox_parsing.py | Updated imports, logging, and string handling for Python 3 |
| issue_processing/jira_issue_processing.py | Modernized syntax, imports, and ID manager usage |
| issue_processing/issue_processing.py | Updated imports, string handling, and database interaction patterns |
| csv_writer/csv_writer.py | Simplified CSV writing for Python 3 with proper encoding |
| codeface_utils/*.py | New utility modules extracted from codeface framework |
| codeface_extraction/*.py | Updated imports and string handling |
| bot_processing/bot_processing.py | Modernized imports and logging calls |
| author_postprocessing/author_postprocessing.py | Updated imports and syntax |
| anonymization/anonymization.py | Modernized imports and dictionary operations |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| importlib.reload(sys) | ||
|
|
||
|
|
There was a problem hiding this comment.
The reload(sys) call is unnecessary and potentially problematic in Python 3. The sys.setdefaultencoding method that this was likely trying to enable doesn't exist in Python 3, and reloading the sys module can cause unexpected behavior.
| importlib.reload(sys) |
issue_processing/issue_processing.py
Outdated
| user["email"] = person.getEmail() # column "email1" | ||
| user["name"] = person.getName() # column "name" | ||
| user["id"] = person.getID() # column "id" |
There was a problem hiding this comment.
The code expects person to be a PersonInfo object with getter methods, but the database query likely returns a dictionary. This will cause AttributeError when trying to call .getEmail(), .getName(), and .getID() methods on a dictionary.
| user["email"] = person.getEmail() # column "email1" | |
| user["name"] = person.getName() # column "name" | |
| user["id"] = person.getID() # column "id" | |
| user["email"] = person["email"] # column "email1" | |
| user["name"] = person["name"] # column "name" | |
| user["id"] = person["id"] # column "id" |
There was a problem hiding this comment.
Very nice catch, there was indeed an inconsistency in the return types in getPersonFromID between the dbIdManager and the csvIdManager. Through making it consistent, I could remove another codeface file, namely PersonInfo.
| new_text = re.sub(r"\\ufff.", " ", new_text).encode("unicode-escape") | ||
|
|
||
| # remove all kinds of control characters and emojis | ||
| # see: https://www.fileformat.info/info/unicode/category/index.htm | ||
| new_text = u"".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in new_text.decode("unicode-escape")) | ||
| new_text = "".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in new_text.decode("unicode-escape")) |
There was a problem hiding this comment.
The order of operations is incorrect. The regex substitution should be applied after decoding from unicode-escape, not before encoding. This will not properly remove the unicode specials characters as intended.
| return str(name.decode('utf-8')) | ||
| except LookupError: | ||
| # Encoding not found, return string as is | ||
| return name |
There was a problem hiding this comment.
This unreachable code after the except blocks will never execute, making the function incomplete in some error cases.
There was a problem hiding this comment.
I am unsure what Copilot wants to tell us here. It is true that the except cases are not generic enough to cover all types of exceptions but they cover the expected ones and when an unexpected exception gets raised, then it should probably fly through. Also I am not sure which code it refers to here as "unreachable".
Also this section has been like this prior to my PR.
csv_writer/csv_writer.py
Outdated
| if type(column) is str: | ||
| lineres += (column.encode("utf-8"),) |
There was a problem hiding this comment.
The __encode function is still being used but is no longer needed in Python 3 since strings are Unicode by default. The encoding to UTF-8 bytes will cause issues when writing to a text file opened with UTF-8 encoding.
There was a problem hiding this comment.
Actually, the function is not being used anymore, so let me remove it.
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
issue_processing/issue_processing.py
Outdated
| person = idservice.getPersonFromDB(idx) | ||
| user = dict() | ||
| user["email"] = person.getEmail() # column "email1" | ||
| user["name"] = person.getName() # column "name" | ||
| user["id"] = person.getID() # column "id" | ||
| user = idservice.getPersonFromDB(idx) | ||
| user["email"] = user.pop("email1") |
There was a problem hiding this comment.
Hm, I am wondering whether this returns the correct user dict. I guess user will have more information than user previously had, as previously it had only the fields email, name, and id. But now it has all fields that are present in the person object that is obtained form the database.
This might impact the amount of data that is being held in memory, and also could lead to problems later on when writing the data to disk if not only the relevant three fields are selected. Could you please check what happens with all the useless attributes, are they used/removed later on? (This is not a request to change this already, please check that first and reply before changing this.) Applies similarly also to jira_issue_processing.py
There was a problem hiding this comment.
Oh thats something that I didn't even consider. While newly added users should never have any data fields but email, id, and name that may not be the case when using prefilled codeface databases. I cannot practically test this but the definition of the person table that is being queried by the dbIdManager here indicates the potential presence of more columns. With your approval, I would revert the code sequence to exclude additional columns, e.g.:
person = idservice.getPersonFromDB(idx)
user = {
"name": person["name"],
"email": person["email1"],
"id": person["id"]
}There was a problem hiding this comment.
Thanks for checking. I agree with your suggestion.
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
If useCsv is not set in the config file, set its default to False. Without setting a default, it would always end up in a KeyError when accessing the non-existing entry. Signed-off-by: Thomas Bock <bockthom@cmu.edu>
Configure the new logger to log at INFO level (otherwise nothing would be logged in a normal run), and format the log output similarly to how it was formatted previously in python2. Signed-off-by: Thomas Bock <bockthom@cmu.edu>
|
While trying the python3 version in action for the very first time, I figured out the following two problems:
I have already added two commits to this PR that fix these two issues. I will continue adding commits to this PR (i.e., @maxloeffler's master branch) if I encounter any additional issues that were introduced in this PR in case they need to be fixed prior to merging this PR. |
Changes:
python3using the python modulemodernize.codefacefragments into the codebase.codefaceID databases by adding an alternative ID manager service that operates on .csv files. The codeface ID database will still be required to perform the extraction process.Open Ties:
codefacefragments?codeface-extraction's functionality.