Inconsistent names when running (several) issue extractions #24

bockthom · 2019-06-04T15:35:03Z

While working on #23, I noticed that names can become inconsistent when several issue-data extractions are run.

Let us consider the following example:

Firstly, extract github issues.
Secondly, extract jira issues.
Finally, run the extraction of author, commit, and e-mail data.

In this case, we may run into the following problem:

In the GitHub issue extraction, we update a name in the database whenever a person matches an already existing e-mail address.

In the Jira issue extraction, we do the same: we update a name in the database whenever a person matches an already existing e-mail address. This overwrites the name that was previously set from the GitHub data. As a result, issues_github.list contains names that no longer match the database.

In the final extraction, we fix name encodings. This could potentially threaten the validity of the previously extracted issues_github.list and issues_jira.list. (That should not be the case, as we expect the name to be updated in both issue extractions, but we should keep in mind that, in general, extracted names can differ from those stored in the database.)
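The overwrite described above can be reproduced with a small sketch. Everything in it is an assumption made for illustration: the dictionary `persons` stands in for the person table of the database, `extract_issues` is a hypothetical helper rather than the tool's real function, and the example names and addresses are made up.

```python
# Hypothetical in-memory stand-in for the person table: e-mail -> (id, name).
persons = {"jane@example.org": (1, "Jane Doe")}


def extract_issues(source_persons, persons):
    """Update names for known e-mail addresses and return the dumped rows."""
    dumped = []
    for name, email in source_persons:
        if email in persons:
            person_id, _ = persons[email]
        else:
            person_id = len(persons) + 1
        # The stored name is overwritten with the name from this data source.
        persons[email] = (person_id, name)
        dumped.append((person_id, name, email))
    return dumped


# First the GitHub extraction, then the Jira extraction (assumed order).
issues_github = extract_issues([("jdoe", "jane@example.org")], persons)
issues_jira = extract_issues([("Jane D.", "jane@example.org")], persons)

# Rows of the GitHub dump whose name no longer matches the database.
stale = [row for row in issues_github if persons[row[2]][1] != row[1]]
```

After both runs, the GitHub dump still carries the name it wrote itself, while the database holds the name from the Jira run, so `stale` is non-empty.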

TLDR: The names of different issue-data extractions are not consistent with those of the author/commit/e-mail data extractions.

After talking to @clhunsen, we came up with a possible solution:

When running issue extractions, we temporarily store persons' ids, names, and e-mail addresses in buffer_db. After an issue extraction has finished, we dump this buffer to disk. Once the final extraction (author, commit, e-mail data) has run, the file authors.list contains the correct name together with the id and e-mail address of each person. So, after authors.list has been generated, we have to re-run both issue extractions without updating the database again. To achieve that, we re-use the previously dumped buffer_db, update the names and e-mail addresses in it according to authors.list, and finally dump the issues.list files again.
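A minimal sketch of the proposed reconciliation step, assuming that both the dumped buffer and authors.list can be read as (id, name, e-mail) triples. The function `reconcile` and the example data are hypothetical; the actual file formats and the structure of buffer_db are not specified here.

```python
def reconcile(buffered_persons, authors):
    """Replace buffered names and e-mail addresses by those from authors.list.

    Both arguments are iterables of (id, name, email) triples (an assumption).
    """
    by_id = {person_id: (name, email) for person_id, name, email in authors}
    reconciled = []
    for person_id, name, email in buffered_persons:
        # Keep the buffered values if the id does not occur in authors.list.
        new_name, new_email = by_id.get(person_id, (name, email))
        reconciled.append((person_id, new_name, new_email))
    return reconciled


# Made-up example data: one person known to authors.list, one unknown.
authors = [(1, "Jane Doe", "jane@example.org")]
buffer_github = [(1, "jdoe", "jane@example.org"), (7, "bot", "bot@example.org")]

fixed = reconcile(buffer_github, authors)
```

Matching on the id rather than on the name or e-mail address keeps the step independent of whichever extraction last touched the database. How to treat ids that are missing from authors.list is a design choice; this sketch leaves such entries unchanged.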

In the end, we should use identical names for persons with identical ids in all data sources.

Independently of the issue described above, we should also check how often persons occurring in the issue data get matched with persons occurring in the commit/e-mail data.
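One possible way to quantify this, sketched under the assumption that person ids are shared across data sources, is the fraction of distinct issue persons whose id also occurs in the commit/e-mail data. The function name `match_rate` and the example ids are hypothetical.

```python
def match_rate(issue_person_ids, commit_mail_person_ids):
    """Fraction of distinct issue persons that also occur in commit/e-mail data."""
    issue_ids = set(issue_person_ids)
    if not issue_ids:
        return 0.0
    matched = issue_ids & set(commit_mail_person_ids)
    return len(matched) / len(issue_ids)


# Made-up ids: persons 1 and 3 occur in both data sources.
rate = match_rate([1, 2, 3, 4], [1, 3, 9])
```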

bockthom added the bug label Jun 4, 2019

bockthom mentioned this issue Jun 8, 2021

Integration of bot data #37

Closed
