Inconsistent names when running (several) issue extractions #24

Open
bockthom opened this issue Jun 4, 2019 · 0 comments

bockthom commented Jun 4, 2019

While working on #23, I noticed that names can become inconsistent when several issue extractions are run.

Let us consider the following example:

  • First, extract GitHub issues.
  • Second, extract Jira issues.
  • Finally, run the extraction of author, commit, and e-mail data.

In this case, we may run into the following problem:

In the GitHub issue extraction, we update a name in the database whenever we match an already existing e-mail address.

In the Jira issue extraction, we do the same. By doing this, we overwrite the name that was previously updated from the GitHub data. As a result, issues_github.list contains names that no longer match the database.

In the final extraction, we fix name encodings. This could invalidate the names in the previously extracted issues_github.list and issues_jira.list. (This should not happen, since we expect the name to be updated in both issue extractions, but we should keep in mind that, in general, extracted names can differ from those stored in the database.)

TL;DR: The names in the different issue-data extractions are not consistent with those in the author/commit/e-mail data extraction.
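
To make the overwrite concrete, here is a minimal sketch in Python. It does not use the real extraction code: the dictionary `person_db`, the function `extract_issues`, and the sample persons are all hypothetical stand-ins, assuming only that a person is matched by e-mail address and that the stored name is updated on every match.

```python
# Hypothetical in-memory stand-in for the person table: e-mail -> name.
person_db = {}


def extract_issues(persons, db):
    """Return the (name, email) rows an extraction would write to its list
    file, updating the name stored in db whenever the e-mail is matched."""
    rows = []
    for name, email in persons:
        db[email] = name  # insert a new person or overwrite the stored name
        rows.append((db[email], email))
    return rows


# First the GitHub run, then the Jira run, for the same e-mail address.
github_rows = extract_issues([("J. Doe", "jdoe@example.org")], person_db)
jira_rows = extract_issues([("John Doe", "jdoe@example.org")], person_db)

# The GitHub rows keep the name they were written with, while the
# database now holds the name from the Jira run.
print(github_rows[0][0])
print(person_db["jdoe@example.org"])
```

The rows written by the first run are never revisited, so they drift away from whatever the database holds after later runs.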

After talking to @clhunsen, we came up with a possible solution:

When running an issue extraction, we temporarily store persons' ids, names, and e-mail addresses in buffer_db. After the extraction has finished, we dump this buffer to disk. Once the final extraction (author, commit, e-mail data) has run, the file authors.list contains the correct name together with the id and e-mail address of each person. So, after authors.list has been generated, we re-run both issue extractions without updating the database again: we re-use the previously dumped buffer_db, update its names and e-mail addresses according to authors.list, and, finally, dump the issues.list files again.

In the end, persons with identical ids should have identical names in all data sources.
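
The re-mapping step could look roughly like the following sketch. Everything here is an assumption for illustration: the `id;name;email` layout of authors.list, the dictionary shape of the buffered rows, and the helper names `load_authors` and `reconcile` are not taken from the existing code.

```python
import csv
import io


def load_authors(authors_file):
    """Map person id -> (name, email).
    Assumed layout: one 'id;name;email' record per line."""
    reader = csv.reader(authors_file, delimiter=";")
    return {row[0]: (row[1], row[2]) for row in reader if row}


def reconcile(buffer_rows, authors):
    """Replace name and e-mail of each buffered row with the values from
    authors, matched on person id. Rows with unknown ids stay unchanged."""
    result = []
    for row in buffer_rows:
        name, email = authors.get(row["id"], (row["name"], row["email"]))
        result.append({**row, "name": name, "email": email})
    return result


# Hypothetical sample data standing in for authors.list and the dumped buffer.
authors_list = io.StringIO("1;John Doe;jdoe@example.org\n")
buffered = [
    {"id": "1", "name": "J. Doe", "email": "jdoe@example.org", "issue": "42"},
    {"id": "2", "name": "Jane Roe", "email": "jane@example.org", "issue": "43"},
]

fixed = reconcile(buffered, load_authors(authors_list))
for row in fixed:
    print(row["id"], row["name"], row["email"], row["issue"])
```

Matching on the id rather than on the name or e-mail address means the database is not touched again, which is the point of the second pass.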


Independently of the issue described above, we should check how often persons occurring in the issue data get matched with persons occurring in the commit/e-mail data.
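
One simple way to measure this, sketched under the assumption that both data sources expose person ids that can be compared directly (the function `match_rate` and the sample ids are hypothetical):

```python
def match_rate(issue_person_ids, other_person_ids):
    """Fraction of distinct persons in the issue data whose id also
    occurs in the commit/e-mail data."""
    issue_ids = set(issue_person_ids)
    if not issue_ids:
        return 0.0
    return len(issue_ids & set(other_person_ids)) / len(issue_ids)


# Hypothetical ids: persons "2" and "4" occur in both data sources.
rate = match_rate(["1", "2", "3", "4"], ["2", "4", "5"])
print(rate)
```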
