While working on #23, I noticed that names can become inconsistent when issue data is extracted from several sources.
Let us consider the following example:
First, extract the GitHub issues.
Second, extract the Jira issues.
Finally, run the extraction of author, commit, and e-mail data.
In this case, we may run into the following problem:
In the GitHub-issue extraction, we update a name in the database whenever we match an already existing e-mail address.
In the Jira-issue extraction, we do the same: we update a name in the database whenever we match an already existing e-mail address. By doing this, we overwrite the name that was previously set from the GitHub data. As a result, issues_github.list contains wrong names, that is, names that no longer match those in the database.
In the final extraction, we fix name encodings. This could potentially threaten the validity of the previously extracted issues_github.list and issues_jira.list. (That should not be the case, as we expect to update the name in both issue extractions, but we should keep in mind that, in general, extracted names can differ from those stored in the database.)
TL;DR: The names in the different issue-data extractions are not consistent with those in the author/commit/e-mail data extraction.
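To make the overwrite concrete, here is a minimal sketch of the scenario. It is not the actual extraction code: the dictionary `database`, the function `extract_issues`, and the sample names are hypothetical stand-ins, and the real database is assumed to behave like a simple e-mail-to-name mapping.

```python
# Hypothetical in-memory stand-in for the person table: e-mail -> name.
database = {"jane@example.org": "Jane Doe"}


def extract_issues(persons, database):
    """Store each person and return the rows that would go into the list file."""
    rows = []
    for name, email in persons:
        # Insert the person, or overwrite the stored name on an e-mail match.
        database[email] = name
        rows.append((name, email))
    return rows


# The two issue extractions run one after the other on the same database.
issues_github = extract_issues([("jdoe", "jane@example.org")], database)
issues_jira = extract_issues([("Jane D.", "jane@example.org")], database)

# The GitHub rows still carry the name that the Jira run has since overwritten.
github_name = issues_github[0][0]
database_name = database["jane@example.org"]
```

After both runs, `github_name` and `database_name` differ, which is exactly the inconsistency described above.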
After talking to @clhunsen, we came up with a possible solution:
During the issue extractions, we temporarily store the persons' ids, names, and e-mail addresses in buffer_db. After an issue extraction finishes, we dump this buffer to disk. Once the final extraction (author, commit, e-mail data) has run, the file authors.list contains the correct name together with the id and e-mail address of each person. So, after authors.list has been generated, we re-run both issue extractions without updating the database again. To achieve that, we re-use the previously dumped buffer_db, update the names and e-mail addresses according to authors.list, and, finally, re-dump the issues.list files.
In the end, we should use identical names for persons with identical ids in all data sources.
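The re-run step could look roughly like the sketch below. It rests on assumptions that the issue does not confirm: authors.list is taken to be a semicolon-separated text of `id;name;email` lines, the dumped buffer_db is modelled as a list of dictionaries, and the helpers `load_authors` and `reconcile` are hypothetical names.

```python
import csv
import io


def load_authors(text):
    """Parse assumed 'id;name;email' lines into {id: (name, email)}."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return {row[0]: (row[1], row[2]) for row in reader if row}


def reconcile(buffered_persons, authors):
    """Replace buffered names and e-mails with the authors.list values per id.

    Persons whose id does not occur in authors.list keep their buffered data.
    """
    result = []
    for person in buffered_persons:
        fallback = (person["name"], person["email"])
        name, email = authors.get(person["id"], fallback)
        result.append({"id": person["id"], "name": name, "email": email})
    return result


# Sample data standing in for authors.list and the dumped buffer_db.
authors = load_authors("1;Jane Doe;jane@example.org\n")
buffer_db = [
    {"id": "1", "name": "jdoe", "email": "jane@example.org"},
    {"id": "2", "name": "Unmatched User", "email": "user@example.org"},
]
reconciled = reconcile(buffer_db, authors)
```

The reconciled entries would then be used to re-dump the issues.list files. The database is never touched in this step, so the names in all lists follow authors.list.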
Independently of the issue described above, we should also check how often persons occurring in the issue data get matched with persons occurring in the commit/e-mail data...
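One simple way to quantify this is the share of issue persons whose id also occurs in the commit/e-mail data. The sketch below assumes that person ids are comparable across the data sources. The function `match_rate` and the sample ids are hypothetical.

```python
def match_rate(issue_person_ids, commit_mail_person_ids):
    """Return the fraction of distinct issue persons also found in commit/e-mail data."""
    issue_ids = set(issue_person_ids)
    if not issue_ids:
        return 0.0  # no issue persons, so there is nothing to match
    matched = issue_ids & set(commit_mail_person_ids)
    return len(matched) / len(issue_ids)


# Two of the four issue persons ("1" and "3") also occur in the commit/e-mail data.
rate = match_rate(["1", "2", "3", "4"], ["1", "3", "9"])
```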