Inconsistent names when running (several) issue extractions #24

While working on #23, I noticed that problems with names can occur when issue data is extracted from several sources.

Let us consider the following example:

- Firstly, extract GitHub issues.
- Secondly, extract Jira issues.
- Finally, run the main extraction (author, commit, e-mail data).

In this case, we may run into the following problem:

In the GitHub-issue extraction, we update a name in the database whenever a person matches an already existing e-mail address.

In the Jira-issue extraction, we do the same. By doing this, we overwrite the previously updated name originating from the GitHub data. That is, the already written `issues_github.list` now contains names that no longer match the database.

In the final extraction, we fix name encodings. This could potentially threaten the validity of the previously extracted `issues_github.list` and `issues_jira.list`. (That should not be the case, as we expect the name to be updated in both issue extractions, but we should keep in mind that, in general, extracted names can differ from those stored in the database.)
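The overwrite described above can be reproduced with a minimal sketch. Everything in it is hypothetical and only for illustration: the in-memory dictionary stands in for the database, and the function names, row layout, and sample persons are assumptions, not the actual extraction code.

```python
# Hypothetical stand-in for the person table: e-mail -> (person_id, name).

def get_or_update_person(db, name, email):
    """Return the id for `email`; on a match, overwrite the stored name."""
    if email in db:
        person_id, _old_name = db[email]
    else:
        person_id = len(db) + 1
    db[email] = (person_id, name)
    return person_id


def extract_issues(db, raw_issues):
    """Return rows (issue_id, person_id, name) as they would be written to a list file."""
    rows = []
    for issue_id, name, email in raw_issues:
        person_id = get_or_update_person(db, name, email)
        rows.append((issue_id, person_id, name))
    return rows


db = {}
# First run: the GitHub data knows the person only by a login-like name.
github_rows = extract_issues(db, [("GH-1", "jdoe", "jane@example.org")])
# Second run: the Jira data carries a different name for the same e-mail address.
jira_rows = extract_issues(db, [("JIRA-7", "Jane Doe", "jane@example.org")])

# Same person id in both outputs, but the name written during the first run
# no longer matches what the database holds after the second run.
stale = github_rows[0][2] != db["jane@example.org"][1]
```

After both runs, `stale` is true: the two outputs share one person id but carry different names for it.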

**TLDR:** The names of different issue-data extractions are not consistent with those of the author/commit/e-mail data extractions.

After talking to @clhunsen, we came up with a possible solution:

While running the issue extractions, we temporarily store persons' ids, names, and e-mail addresses in `buffer_db`. After each issue extraction has finished, we dump this buffer to disk. Once the final extraction (author, commit, e-mail data) has run, the file `authors.list` contains the correct name together with the id and e-mail address of each person. So, after `authors.list` has been generated, we have to re-run both issue extractions **without** updating the database again. To achieve that, we re-use the previously dumped `buffer_db`, update its names and e-mail addresses according to `authors.list`, and, finally, re-dump the issues.list files.

In the end, we should use identical names for persons with identical ids in all data sources.
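A rough sketch of this two-pass idea follows. It rests on several assumptions that are not confirmed by this issue: the buffer is dumped as JSON, `authors.list` is taken to have one `id;name;email` entry per line, and all function names and the row layout are hypothetical.

```python
import json
import os
import tempfile


def dump_buffer(buffer_db, path):
    """Write the person buffer to disk (JSON is an assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(buffer_db, f)


def load_buffer(path):
    """Read a previously dumped person buffer."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def parse_authors(lines, sep=";"):
    """Parse assumed 'id;name;email' lines into {id: (name, email)}."""
    authors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        person_id, name, email = line.split(sep, 2)
        authors[person_id] = (name, email)
    return authors


def apply_authors(buffer_db, authors):
    """Return a copy of the buffer with names and e-mails taken from the authors data."""
    updated = {}
    for person_id, person in buffer_db.items():
        if person_id in authors:
            name, email = authors[person_id]
            updated[person_id] = {"name": name, "email": email}
        else:
            updated[person_id] = dict(person)  # unknown id: keep as extracted
    return updated


def rewrite_issue_rows(rows, buffer_db):
    """Re-create issue rows from the buffer only; the database is not touched."""
    return [(issue_id, pid, buffer_db[pid]["name"], buffer_db[pid]["email"])
            for issue_id, pid in rows]


# First pass: the issue extraction fills the buffer and dumps it.
buffer_db = {"1": {"name": "jdoe", "email": "jane@example.org"},
             "2": {"name": "Bob", "email": "bob@example.org"}}
issue_rows = [("GH-1", "1"), ("JIRA-7", "1"), ("JIRA-8", "2")]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "buffer_db.json")
    dump_buffer(buffer_db, path)
    restored = load_buffer(path)

# Second pass: once the authors data exists, fix the names and write the rows again.
authors = parse_authors(["1;Jane Doe;jane@example.org"])
final_rows = rewrite_issue_rows(issue_rows, apply_authors(restored, authors))
```

Because the second pass reads only the dumped buffer and the authors data, every row with the same person id ends up with the same name, regardless of the order in which the issue extractions ran.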

---

Independently of the above-mentioned issue, we should somehow check how often persons occurring in the issue data get matched with persons occurring in the commit/e-mail data.
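One simple way to quantify this, assuming that a match means sharing an e-mail address, is the share of issue persons whose address also occurs in the commit/e-mail data. The function name and the sample addresses below are hypothetical.

```python
def match_rate(issue_emails, commit_mail_emails):
    """Fraction of distinct issue persons whose e-mail address also occurs
    in the commit/e-mail data (0.0 if there are no issue persons)."""
    issue_set = set(issue_emails)
    if not issue_set:
        return 0.0
    matched = issue_set & set(commit_mail_emails)
    return len(matched) / len(issue_set)


issue_emails = ["a@example.org", "b@example.org", "c@example.org", "d@example.org"]
commit_mail_emails = ["a@example.org", "b@example.org", "x@example.org"]
rate = match_rate(issue_emails, commit_mail_emails)  # 2 of 4 issue persons match
```

A low rate would suggest that most issue persons are never matched, so the name overwriting described above affects only a few persons; a high rate would make the consistency fix more urgent.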
