Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always return CampaignID and make it unique across database #136

Open
peterdesmet opened this issue Oct 28, 2022 · 26 comments
Open

Always return CampaignID and make it unique across database #136

peterdesmet opened this issue Oct 28, 2022 · 26 comments
Assignees
Labels
3.Database help wanted Extra attention is needed

Comments

@peterdesmet
Copy link
Collaborator

I'm not sure this is currently an issue in the database, but it might become one:

Identifier requirements

  • campaignID: should only be unique within a DataRightsHolder
  • sampleID: should only be unique within a campaignID
  • positionID: should only be unique within a sampleID
  • observationID: should only be unique within a positionID

Note that even within a single file submissions, identifiers are not expected to be unique, only within their respective parent. @nicholasvanermen @cmspinto can you confirm this?

Therefore:

dataRightsHolder,campaignID,sampleID,positionID,observationID
A               ,C1        ,S1      ,P1        ,O1
B               ,C1        ,S1      ,P1        ,O1
B               ,C1        ,S2      ,P1        ,O1
B               ,C2        ,S1      ,P1        ,O1

Are all different records.

The issue

The CSV files that are returned, only list the direct parent identifier. So, if one would download the above data:

# campaigns.csv
A,C1
B,C1
B,C2

# samples.csv
C1,S1
C1,S1 <- looks like a duplicate, but is from a different Data Rights Holder
C1,S2
C2,S1

## positions.csv
S1,P1
S1,P1 <- looks like a duplicate, but is from a different Data Rights Holder
S2,P1
S1,P1 <- looks like a duplicate, but is from a different campaign

## observations.csv
P1,O1
P1,O1 <- looks like a duplicate, but is from a different Data Rights Holder
P1,O1 <- looks like a duplicate, but is from a different campaign and sample
P1,O1 <- looks like a duplicate, but is from a different sample

If the user would create joins between the file, they would get incorrect data. There is no way for the user to prevent this.

Solution 1 (does not work)

It is tempting to think that adding a fileSubmissionID (below 230 and 240) to each CSV file solves the issue, but it does not:

# campaigns.csv
239,A,C1
240,B,C1
240,B,C2

# samples.csv
239,C1,S1
240,C1,S1 <- no longer a duplicate
240,C1,S2
240,C2,S1

## positions.csv
239,S1,P1
240,S1,P1 <- no longer a duplicate
240,S2,P1
240,S1,P1 <- still looks like a duplicate, but is from a different campaign

## observations.csv
239,P1,O1
240,P1,O1 <- no longer a duplicate
240,P1,O1 <- still looks like a duplicate, but is from a different campaign and sample
240,P1,O1 <- still looks like a duplicate, but is from a different sample

Solution 2

I think the only way to solve it is to add the data rightsholder and all the parent identifiers in all the CSV files. This is quite verbose, but at least straightforward and it solves incorrect joins:

# campaigns.csv
A,C1
B,C1
B,C2

# samples.csv
A,C1,S1
B,C1,S1 <- no longer a duplicate
B,C1,S2
B,C2,S1

## positions.csv
A,C1,S1,P1
B,C1,S1,P1 <- no longer a duplicate
B,C1,S2,P1
B,C2,S1,P1 <- no longer a duplicate

## observations.csv
A,C1,S1,P1,O1
B,C1,S1,P1,O1 <- no longer a duplicate
B,C1,S2,P1,O1 <- no longer a duplicate
B,C2,S1,P1,O1 <- no longer a duplicate

API

In the API, there is already a way to prevent incorrect joins with:

"tblUploadID": 234,
"tblCampaignID": 209487,
...

However, I would also add the dataRightsHolder and all parent identifiers to the API, so the same joins can be used irrespective of CSV or API use.

@nicolasvanermen @cmspinto thoughts?

@cmspinto
Copy link
Collaborator

I'm a bit lost here!
I just checked and it seems to me there are no duplicates in the database.
Is there any output (download or API) that has duplicated records?

I'm having a bit of trouble to see what is the issue here?
When we output the DatarightsHolder we put it using the character '~'

Sorry, if I'm missing the obvious.

@peterdesmet
Copy link
Collaborator Author

I just checked and it seems to me there are no duplicates in the database.

Good, I'll investigate #135 further

This is not about duplicate records. It is about avoiding unexpected results when users are left joining data:

campaigns AS c
  LEFT JOIN samples AS s
    ON c.campaignID = s.campaignID
  LEFT JOIN positions AS p
    ON s.sampleID = p.sampleID
  LEFT JOIN observations AS o
    ON p.positionID = o.positionID

One would expect to create a full denormalized (flat) table with all data this way. However, it is very likely more rows will be created than expected, because identifiers are only unique within a certain set. Note that the current data in the database might not reflect this issue.

So a better query would be:

campaigns AS c
  LEFT JOIN samples AS s
    ON c.dataRightsHolder = s.dataRightsHolder
    AND c.campaignID = s.campaignID
  LEFT JOIN positions AS p
    ON s.dataRightsHolder = p.dataRightsHolder
    AND s.campaignID = p.campaignID
    AND s.sampleID = p.sampleID
  LEFT JOIN observations AS o
    ON p.dataRightsHolder = o.dataRightsHolder
    AND p.campaignID = o.campaignID
    AND p.sampleID = o.sampleID
    ON p.positionID = o.positionID

Please correct me if I'm wrong.

@nicolasvanermen
Copy link
Collaborator

I am not against adding all parent identifiers to the download tables... But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier) I guess the point is that they will be less inclined to do so?

@peterdesmet
Copy link
Collaborator Author

But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier)

Indeed. The issue is that currently the cannot might the correct joins, because not all identifiers are provided.

@cmspinto
Copy link
Collaborator

cmspinto commented Nov 2, 2022

@peterdesmet I agree with you.
The download is incomplete!
I can add the internal ID's of the database to all the files. That should make it clear and enough?
So make it closer to what is the API.

https://esas.ices.dk/API/Download?DownloadID=973d90f7-6f33-4452-bb6b-5801768e89e7

the problem currently is that if two countries decide to have the same Campaign then there is a problem to join the files!

@peterdesmet
Copy link
Collaborator Author

I need to think about this a bit more. One of the issues with the internal IDs is that one cannot query for them.

Also, in the API you would expect:

all_campaigns = getCampaigns
all_obs = empty
for (campaign in all_campaigns) {
  obs = getObservations?campaignID=campaign # This can return observations from multiple campaigns
  all_obs = all_obs + obs
)
obs # This can contain duplicates

Maybe we need stricter rules on the identifiers, e.g. a data rights holder cannot reuse a campaignID and all sampleID, positionID and observationID within a campaign need to be unique. I'll have a look in the data and see what is feasable.

@peterdesmet
Copy link
Collaborator Author

@nicolasvanermen how comfortable are you with requiring people to upload data with a campaignID that is unique for the whole database? This was previously done by reserving "blocks" of identifiers, but could also be done by prepending identifiers (in future submissions) e.g. INBO:123.

@nicolasvanermen
Copy link
Collaborator

I think this makes perfect sense, as this is the case in the database already. An acronym prefix makes sense as well.

@peterdesmet
Copy link
Collaborator Author

peterdesmet commented Nov 8, 2022

Current situation in the database

  • campaigns: 24888 records, all campaignID are unique
  • samples: 25309 records, all sampleID are unique
  • positions: 1656728 records, all positionID are unique
  • observations: 2957702 records, not all observationID are unique (291,225 duplicates, such as 100001)

Requirements

Given that, I would suggest (@nicolasvanermen can you confirm if you agree):

  • That campaignID is unique across the whole database (see Always return CampaignID and make it unique across database #136 (comment))
  • That sampleID, positionID and observationID are unique within a campaignID even though most are currently unique across a wider scope. The alternative is to ask that sampleID and positionID are also unique across the whole database (currently the case), which would make it easier for users, but that's maybe too big of an ask for providers?

The result is that people can safely join tables, as long as they include campaignID in their join:

campaigns AS c
  LEFT JOIN samples AS S
    ON s.campaignID = c.campaignID
  LEFT JOIN positions AS p
    ON p.sampleID = s.sampleID
    AND p.campaignID = c.campaignID
  LEFT JOIN observations AS o
    ON o.positionID = p.positionID
    AND o.campaignID = c.campaignID

These changes require no migration of data (i.e. all current data already meet the requirements).

Implementation (@Osanna123 @cmspinto)

  • When uploading, QA should return an error if a campaignID is submitted that already exists in the database in combination with a different dataRightsHolder. Note campaignID in combination with the same dataRightsHolder still indicates a resubmission.
  • When uploading, QA should return an error if sampleID, positionID or observationID are not unique in combination with the campaignID. Note that this can be verified within the file (does not depend on data in the database).
  • API:
    • getCampaignRecords: no changes (campaignID)
    • getSampleRecords: no changes (campaignID, sampleID)
    • getPositionRecords: add campaignID to results, right before sampleID (campaignID, sampleID, positionID). It is already a query parameter.
    • getObservationRecords: add campaignID (and for good measure sampleID) to results, right before positionID (campaignID, sampleID, positionID, observationID). Both are already query parameters.
  • CSV download:
    • campaigns.csv: no changes (campaignID)
    • samples.csv: no changes (campaignID, sampleID)
    • positions.csv: add campaignID to results, right before sampleID (campaignID, sampleID, positionID)
    • observations.csv: add campaignID (and for good measure sampleID) to results, right before positionID (campaignID, sampleID, positionID, observationID)

@peterdesmet peterdesmet changed the title Should we always return DataRightsHolder and CampaignID Always return CampaignID and make it unique across database Nov 16, 2022
@peterdesmet
Copy link
Collaborator Author

@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer sampleID and positionID to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?

@cmspinto
Copy link
Collaborator

@peterdesmet sorry to solve this issue only now.
A quick comment, in the webAPI there is no issue, because the records have the identifier:
image
image

so if two organizations submit the same campaignID, that will not be an issue (for people that use the API)
Do you still want us to add CampaignID to the results?

About the download, changes are done:

image

@peterdesmet
Copy link
Collaborator Author

@cmspinto yes, I would still like the changes to the API 1) so users or implementations don't have to rely on the tbl fields and 2) so the API result is almost interchangeable with CSV result

so if two organizations submit the same campaignID, that will not be an issue (for people that use the API)

Note that we won't allow this at submission.

@nicolasvanermen
Copy link
Collaborator

@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer sampleID and positionID to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?

This is not a preference, rather a pragmatic approval. Of course it would be ideal that all keys are unique within the database... But in that case we should propose a specific 'system', for example through a data provider specific number or letter combination as prefix for each key...?

@cmspinto
Copy link
Collaborator

cmspinto commented Nov 23, 2022

@nicolasvanermen, the platform is not affected by the users uploading the same CampaignID in distinct downloads.
I think we are doing this only to make sure that when users download data they link distinct files correctly.

The QC check is here: That CampaingID was already used by another organization, please use another one.

All changes are in place now.

@peterdesmet
Copy link
Collaborator Author

@cmspinto great to see that the check is in place and that CampaignIDs are now required to be unique across the database.

@nicolasvanermen as discussed between us, let's use the pragmatic approach and only require CampaignIDs to be unique across the database. Other IDs have to be unique within a CampaignIDs. All joins will have to take CampaignID into account and should therefore be included in all CSV and API results.

@nicolasvanermen
Copy link
Collaborator

OK, I will check this, by uploading a record of JNCC with a campaignID from INBO...

@nicolasvanermen
Copy link
Collaborator

The duplicate campaignID is filtered out OK, but the message says:
Cross field check (condition not met)

This seems to uninformative... Can the message be aligned with the check description ("That CampaingID was already used by another organization, please use another one.")?

@cmspinto
Copy link
Collaborator

Done: http://datsu.ices.dk/web/chkDetails.aspx?DatasetID=148&RecordID=103&groupFilterID=1981

@nicolasvanermen
Copy link
Collaborator

All keys are now returned in the download, can this issue be closed @peterdesmet ?

@peterdesmet
Copy link
Collaborator Author

Yes, they are also all returned by the API and all issues are resolved at #136 (comment). I will update the documentation and close this issue.

@peterdesmet
Copy link
Collaborator Author

Now clarified in the documentation: 80cc9a7

@cmspinto or @Osanna123 is it possible to update some of the descriptions in http://datsu.ices.dk/web/selRep.aspx?Dataset=148 to reflect the changes in 80cc9a7?

@nicolasvanermen
Copy link
Collaborator

Typo in the screening message: "That CampaignID..." (instead of CampaingID)

And while your are at it... "This CampaignID..." sounds better :-)

@cmspinto
Copy link
Collaborator

cmspinto commented Dec 2, 2022

Done: http://datsu.ices.dk/web/rptChk.aspx?Dataset=148

@peterdesmet
Copy link
Collaborator Author

Thanks, only remaining item is the documentation fix mentioned in #136 (comment)

@cmspinto
Copy link
Collaborator

Can we close this one?

@peterdesmet
Copy link
Collaborator Author

No, there’s a remaining action item for you. The DATSU definitions should be updated. See #136 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.Database help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants