Always return CampaignID and make it unique across database
#136
Comments
I'm a bit lost here! I'm having trouble seeing what the issue is. Sorry if I'm missing the obvious. |
Good, I'll investigate #135 further.
This is not about duplicate records. It is about avoiding unexpected results when users are left joining data:
campaigns AS c
LEFT JOIN samples AS s
  ON c.campaignID = s.campaignID
LEFT JOIN positions AS p
  ON s.sampleID = p.sampleID
LEFT JOIN observations AS o
  ON p.positionID = o.positionID
One would expect to create a fully denormalized (flat) table with all data this way. However, it is very likely that more rows will be created than expected, because identifiers are only unique within a certain set. Note that the current data in the database might not reflect this issue. So a better query would be:
campaigns AS c
LEFT JOIN samples AS s
  ON c.dataRightsHolder = s.dataRightsHolder
  AND c.campaignID = s.campaignID
LEFT JOIN positions AS p
  ON s.dataRightsHolder = p.dataRightsHolder
  AND s.campaignID = p.campaignID
  AND s.sampleID = p.sampleID
LEFT JOIN observations AS o
  ON p.dataRightsHolder = o.dataRightsHolder
  AND p.campaignID = o.campaignID
  AND p.sampleID = o.sampleID
  AND p.positionID = o.positionID
Please correct me if I'm wrong. |
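The row inflation described above can be reproduced with a small in-memory SQLite database. This is only a sketch: the table layout is reduced to campaigns and samples, and the data (two data rights holders, INBO and JNCC, both using a campaignID `C1`) is made up for illustration and not taken from the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaigns (dataRightsHolder TEXT, campaignID TEXT);
CREATE TABLE samples (dataRightsHolder TEXT, campaignID TEXT, sampleID TEXT);
-- Hypothetical data: two data rights holders reuse the same campaignID.
INSERT INTO campaigns VALUES ('INBO', 'C1'), ('JNCC', 'C1');
INSERT INTO samples VALUES ('INBO', 'C1', 'S1'), ('JNCC', 'C1', 'S1');
""")

# Join on the direct parent identifier only.
naive_rows = conn.execute("""
SELECT c.dataRightsHolder, s.dataRightsHolder, s.sampleID
FROM campaigns AS c
LEFT JOIN samples AS s
  ON c.campaignID = s.campaignID
""").fetchall()

# Join on the data rights holder and the parent identifier.
strict_rows = conn.execute("""
SELECT c.dataRightsHolder, s.dataRightsHolder, s.sampleID
FROM campaigns AS c
LEFT JOIN samples AS s
  ON c.dataRightsHolder = s.dataRightsHolder
  AND c.campaignID = s.campaignID
""").fetchall()

print(len(naive_rows))   # 4: each campaign matches both samples
print(len(strict_rows))  # 2: each campaign matches only its own sample
```

The first query pairs INBO's campaign with JNCC's sample and vice versa, which is the unexpected result; the second query is only possible if every table carries all the identifiers.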
I am not against adding all parent identifiers to the download tables... But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier) I guess the point is that they will be less inclined to do so? |
Indeed. The issue is that currently they cannot make the correct joins, because not all identifiers are provided. |
@peterdesmet I agree with you. https://esas.ices.dk/API/Download?DownloadID=973d90f7-6f33-4452-bb6b-5801768e89e7 The problem currently is that if two countries decide to use the same Campaign, there is a problem joining the files! |
I need to think about this a bit more. One of the issues with the internal IDs is that one cannot query for them. Also, in the API you would expect:
all_campaigns = getCampaigns
all_obs = empty
for (campaign in all_campaigns) {
  obs = getObservations?campaignID=campaign # This can return observations from multiple campaigns
  all_obs = all_obs + obs
}
all_obs # This can contain duplicates
Maybe we need stricter rules on the identifiers, e.g. a data rights holder cannot reuse a |
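The duplication in the loop above can be simulated without calling the real API. In this sketch `get_campaigns` and `get_observations` are hypothetical in-memory stand-ins (they are not the actual endpoints), and the records are invented; it only assumes that filtering on campaignID alone returns observations from every data rights holder that used that campaignID.

```python
# Hypothetical records; two data rights holders share campaignID "C1".
OBSERVATIONS = [
    {"dataRightsHolder": "INBO", "campaignID": "C1", "observationID": "O1"},
    {"dataRightsHolder": "JNCC", "campaignID": "C1", "observationID": "O1"},
]


def get_campaigns():
    """Stand-in for the campaigns endpoint: one entry per holder/campaign pair."""
    pairs = []
    for obs in OBSERVATIONS:
        pair = (obs["dataRightsHolder"], obs["campaignID"])
        if pair not in pairs:
            pairs.append(pair)
    return [{"dataRightsHolder": h, "campaignID": c} for h, c in pairs]


def get_observations(campaign_id):
    """Stand-in for the observations endpoint, filtered on campaignID only."""
    return [obs for obs in OBSERVATIONS if obs["campaignID"] == campaign_id]


all_obs = []
for campaign in get_campaigns():
    # Each call returns the observations of both holders.
    all_obs += get_observations(campaign["campaignID"])

# Deduplicate on the full key (holder, campaign, observation).
unique_obs = {
    (o["dataRightsHolder"], o["campaignID"], o["observationID"]): o
    for o in all_obs
}

print(len(all_obs))     # 4: every observation was fetched twice
print(len(unique_obs))  # 2
```

Deduplicating on the client works here only because the simulated records carry the data rights holder; a campaignID that is unique across the database would make the workaround unnecessary.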
@nicolasvanermen how comfortable are you with requiring people to upload data with a |
I think this makes perfect sense, as this is the case in the database already. An acronym prefix makes sense as well. |
Current situation in the database
Requirements
Given that, I would suggest (@nicolasvanermen can you confirm if you agree):
The result is that people can safely join tables, as long as they include
campaigns AS c
LEFT JOIN samples AS s
  ON s.campaignID = c.campaignID
LEFT JOIN positions AS p
  ON p.sampleID = s.sampleID
  AND p.campaignID = c.campaignID
LEFT JOIN observations AS o
  ON o.positionID = p.positionID
  AND o.campaignID = c.campaignID
These changes require no migration of data (i.e. all current data already meet the requirements).
Implementation (@Osanna123 @cmspinto)
@nicolasvanermen I noticed you checked off some things in my issue. Does that mean that you prefer |
@peterdesmet sorry to solve this issue only now. So if two organizations submit the same campaignID, that will not be an issue (for people that use the API). About the download, changes are done: |
@cmspinto yes, I would still like the changes to the API 1) so users or implementations don't have to rely on the
Note that we won't allow this at submission. |
This is not a preference, rather a pragmatic approval. Of course it would be ideal that all keys are unique within the database... But in that case we should propose a specific 'system', for example through a data provider specific number or letter combination as prefix for each key...? |
@nicolasvanermen, the platform is not affected by the users uploading the same CampaignID in distinct downloads. The QC check is here: That CampaingID was already used by another organization, please use another one. All changes are in place now. |
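For readers who want to see what such a screening rule amounts to, here is a minimal sketch. It is not the actual QC implementation: the function name, the `existing` mapping, and the sample values are all hypothetical, and it only assumes that a campaignID belongs to the first organization that used it.

```python
def campaign_id_is_allowed(existing, data_rights_holder, campaign_id):
    """Return True if the holder may submit this campaignID.

    `existing` is a hypothetical mapping of campaignID -> the data rights
    holder that already uses it in the database.
    """
    owner = existing.get(campaign_id)
    # Allowed when the campaignID is new or already belongs to this holder.
    return owner is None or owner == data_rights_holder


# Hypothetical state: INBO already submitted campaign "C1".
existing = {"C1": "INBO"}

print(campaign_id_is_allowed(existing, "INBO", "C1"))  # True: same holder
print(campaign_id_is_allowed(existing, "JNCC", "C1"))  # False: used by INBO
print(campaign_id_is_allowed(existing, "JNCC", "C2"))  # True: unused
```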
@cmspinto great to see that the check is in place and that @nicolasvanermen as discussed between us, let's use the pragmatic approach and only require |
OK, I will check this, by uploading a record of JNCC with a campaignID from INBO... |
The duplicate campaignID is filtered out OK, but the message says: This seems too uninformative... Can the message be aligned with the check description ("That CampaingID was already used by another organization, please use another one.")? |
All keys are now returned in the download, can this issue be closed @peterdesmet ? |
Yes, they are also all returned by the API and all issues are resolved at #136 (comment). I will update the documentation and close this issue. |
Now clarified in the documentation: 80cc9a7 @cmspinto or @Osanna123 is it possible to update some of the descriptions in http://datsu.ices.dk/web/selRep.aspx?Dataset=148 to reflect the changes in 80cc9a7? |
Typo in the screening message: "That CampaignID..." (instead of CampaingID). And while you are at it... "This CampaignID..." sounds better :-) |
Thanks, only remaining item is the documentation fix mentioned in #136 (comment) |
Can we close this one? |
No, there’s a remaining action item for you. The DATSU definitions should be updated. See #136 (comment) |
I'm not sure this is currently an issue in the database, but it might become one:
Identifier requirements
campaignID: should only be unique within a DataRightsHolder
sampleID: should only be unique within a campaignID
positionID: should only be unique within a sampleID
observationID: should only be unique within a positionID
Note that even within a single file submission, identifiers are not expected to be unique, only within their respective parent. @nicholasvanermen @cmspinto can you confirm this?
Therefore:
Are all different records.
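A minimal sketch of the nesting described above, using made-up identifier values: records that share the same observationID still differ once every parent identifier is included in the key. The `ObservationKey` type is hypothetical and exists only for this illustration.

```python
from typing import NamedTuple


class ObservationKey(NamedTuple):
    """Full key of an observation: its own ID plus all parent identifiers."""
    dataRightsHolder: str
    campaignID: str
    sampleID: str
    positionID: str
    observationID: str


# Hypothetical records that all reuse observationID "O1".
records = [
    ObservationKey("INBO", "C1", "S1", "P1", "O1"),
    ObservationKey("INBO", "C1", "S1", "P2", "O1"),  # other position
    ObservationKey("INBO", "C1", "S2", "P1", "O1"),  # other sample
    ObservationKey("JNCC", "C1", "S1", "P1", "O1"),  # other data rights holder
]

full_keys = set(records)
local_ids = {r.observationID for r in records}

print(len(full_keys))  # 4: distinct when all parents are part of the key
print(len(local_ids))  # 1: indistinguishable by observationID alone
```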
The issue
The CSV files that are returned only list the direct parent identifier. So, if one would download the above data:
If the user created joins between the files, they would get incorrect data. There is no way for the user to prevent this.
Solution 1 (does not work)
It is tempting to think that adding a fileSubmissionID (below 230 and 240) to each CSV file solves the issue, but it does not:
Solution 2
I think the only way to solve it is to add the data rights holder and all the parent identifiers to all the CSV files. This is quite verbose, but at least it is straightforward and it prevents incorrect joins:
API
In the API, there is already a way to prevent incorrect joins with:
However, I would also add the dataRightsHolder and all parent identifiers to the API, so the same joins can be used irrespective of CSV or API use.
@nicolasvanermen @cmspinto thoughts?