Inconsistent names when running (several) issue extractions #24

@bockthom

While working on #23, I noticed that names can become inconsistent when several issue extractions are run.

Let us consider the following example:

  • First, extract GitHub issues.
  • Second, extract Jira issues.
  • Finally, run the main extraction (author, commit, and e-mail data).

In this case, we may run into the following problem:

In the GitHub issue extraction, we update a name in the database whenever a person matches an already existing e-mail address.

The Jira issue extraction does the same: it updates a name in the database whenever a person matches an already existing e-mail address. In doing so, it overwrites the name previously set from the GitHub data. As a result, the already written issues_github.list contains names that no longer agree with the database.

In the final extraction, we fix name encodings. This could threaten the validity of the previously extracted issues_github.list and issues_jira.list. (This should not happen, since we expect the name to be updated in both issue extractions, but we should keep in mind that, in general, extracted names can differ from those stored in the database.)

TL;DR: The names in the different issue-data extractions are not consistent with those in the author/commit/e-mail data extraction.
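The overwrite can be reproduced with a small sketch. It is only an illustration: `person_db` is a hypothetical in-memory stand-in for the person table, `extract_issues` is an invented helper rather than the real extraction code, and the names and e-mail address are made up.

```python
# Hypothetical stand-in for the person table: e-mail address -> name.
person_db = {"jane@example.org": "Jane Doe"}


def extract_issues(db, source_persons):
    """Store each person's name under their e-mail address and return the dumped rows.

    For an already known e-mail address this overwrites the stored name,
    which mirrors the update-on-match behavior described above.
    """
    rows = []
    for name, email in source_persons:
        db[email] = name
        rows.append((name, email))
    return rows


# Step 1: the GitHub issue extraction writes its list.
issues_github = extract_issues(person_db, [("jane-gh", "jane@example.org")])
# Step 2: the Jira issue extraction overwrites the name in the database.
issues_jira = extract_issues(person_db, [("Jane D.", "jane@example.org")])

# Rows of the first dump that no longer agree with the database.
stale = [(name, email) for name, email in issues_github if person_db[email] != name]
```

After both runs, `stale` holds the GitHub row whose name was replaced by the Jira run, which is exactly the inconsistency described above.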

After talking to @clhunsen, we came up with a possible solution:

During an issue extraction, we temporarily store persons' ids, names, and e-mail addresses in buffer_db. When the extraction finishes, we dump this buffer to disk. Once the main extraction (author, commit, e-mail data) has run, authors.list contains the correct name together with the id and e-mail address of each person. So, after authors.list has been generated, we have to re-run both issue extractions without updating the database again. To achieve that, we re-use the previously dumped buffer_db, update its names and e-mail addresses according to authors.list, and finally re-dump the issues.list files.
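A minimal sketch of this post-processing step, under stated assumptions: the buffer is dumped as JSON keyed by person id, authors.list is assumed to hold semicolon-separated `id;name;e-mail` lines, and the helper names `load_authors`, `reconcile`, and `redump_issues` are hypothetical and not taken from the existing code.

```python
import csv
import io
import json


def load_authors(text):
    """Parse authors.list content; the 'id;name;e-mail' layout is an assumption."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return {int(pid): (name, email) for pid, name, email in reader}


def reconcile(buffer, authors):
    """Return a copy of the buffer whose names and e-mails follow authors.list.

    Persons missing from authors.list keep their buffered values.
    """
    fixed = {}
    for pid, person in buffer.items():
        name, email = authors.get(int(pid), (person["name"], person["email"]))
        fixed[pid] = {"name": name, "email": email}
    return fixed


def redump_issues(issues, buffer):
    """Rebuild issue rows from person ids only, without touching the database."""
    return [(issue_id, buffer[pid]["name"], buffer[pid]["email"])
            for issue_id, pid in issues]


# Buffer as dumped after the issue extraction (the JSON round trip makes ids strings).
dumped = json.dumps({1: {"name": "jane-gh", "email": "jane@example.org"}})
buffer = json.loads(dumped)

# Content of authors.list after the main extraction (made-up example data).
authors = load_authors("1;Jane Doe;jane@example.org\n")

# Issue rows reference persons by id, so names can be replaced afterwards.
issues = [(101, "1"), (102, "1")]
rows = redump_issues(issues, reconcile(buffer, authors))
```

Keying everything by person id is what makes the second pass possible without another database update: only the id has to stay stable between the first dump and the re-dump.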

In the end, we should use identical names for persons with identical ids in all data sources.


Independently of the issue described above, we should check how often persons occurring in the issue data get matched with persons occurring in the commit/e-mail data.
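One possible way to quantify this, sketched under the assumption that both data sources identify persons by the same ids; `match_rate` is a hypothetical helper and the ids are made up.

```python
def match_rate(issue_person_ids, commit_mail_person_ids):
    """Fraction of distinct issue persons that also occur in the commit/e-mail data."""
    issue_ids = set(issue_person_ids)
    if not issue_ids:
        return 0.0  # no issue persons, so nothing can be matched
    matched = issue_ids & set(commit_mail_person_ids)
    return len(matched) / len(issue_ids)


# Persons 2 and 4 occur in both data sources, persons 1 and 3 only in the issues.
rate = match_rate([1, 2, 3, 4], [2, 4, 5])
```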
