Always return `CampaignID` and make it unique across database #136

peterdesmet · 2022-10-28T09:50:13Z

I'm not sure this is currently an issue in the database, but it might become one:

Identifier requirements

campaignID: should only be unique within a DataRightsHolder
sampleID: should only be unique within a campaignID
positionID: should only be unique within a sampleID
observationID: should only be unique within a positionID

Note that even within a single file submissions, identifiers are not expected to be unique, only within their respective parent. @nicholasvanermen @cmspinto can you confirm this?

Therefore:

dataRightsHolder,campaignID,sampleID,positionID,observationID
A               ,C1        ,S1      ,P1        ,O1
B               ,C1        ,S1      ,P1        ,O1
B               ,C1        ,S2      ,P1        ,O1
B               ,C2        ,S1      ,P1        ,O1

Are all different records.

The issue

The CSV files that are returned, only list the direct parent identifier. So, if one would download the above data:

# campaigns.csv
A,C1
B,C1
B,C2

# samples.csv
C1,S1
C1,S1 <- looks like a duplicate, but is from a different Data Rights Holder
C1,S2
C2,S1

## positions.csv
S1,P1
S1,P1 <- looks like a duplicate, but is from a different Data Rights Holder
S2,P1
S1,P1 <- looks like a duplicate, but is from a different campaign

## observations.csv
P1,O1
P1,O1 <- looks like a duplicate, but is from a different Data Rights Holder
P1,O1 <- looks like a duplicate, but is from a different campaign and sample
P1,O1 <- looks like a duplicate, but is from a different sample

If the user would create joins between the file, they would get incorrect data. There is no way for the user to prevent this.

Solution 1 (does not work)

It is tempting to think that adding a fileSubmissionID (below 230 and 240) to each CSV file solves the issue, but it does not:

# campaigns.csv
239,A,C1
240,B,C1
240,B,C2

# samples.csv
239,C1,S1
240,C1,S1 <- no longer a duplicate
240,C1,S2
240,C2,S1

## positions.csv
239,S1,P1
240,S1,P1 <- no longer a duplicate
240,S2,P1
240,S1,P1 <- still looks like a duplicate, but is from a different campaign

## observations.csv
239,P1,O1
240,P1,O1 <- no longer a duplicate
240,P1,O1 <- still looks like a duplicate, but is from a different campaign and sample
240,P1,O1 <- still looks like a duplicate, but is from a different sample

Solution 2

I think the only way to solve it is to add the data rightsholder and all the parent identifiers in all the CSV files. This is quite verbose, but at least straightforward and it solves incorrect joins:

# campaigns.csv
A,C1
B,C1
B,C2

# samples.csv
A,C1,S1
B,C1,S1 <- no longer a duplicate
B,C1,S2
B,C2,S1

## positions.csv
A,C1,S1,P1
B,C1,S1,P1 <- no longer a duplicate
B,C1,S2,P1
B,C2,S1,P1 <- no longer a duplicate

## observations.csv
A,C1,S1,P1,O1
B,C1,S1,P1,O1 <- no longer a duplicate
B,C1,S2,P1,O1 <- no longer a duplicate
B,C2,S1,P1,O1 <- no longer a duplicate

API

In the API, there is already a way to prevent incorrect joins with:

"tblUploadID": 234,
"tblCampaignID": 209487,
...

However, I would also add the dataRightsHolder and all parent identifiers to the API, so the same joins can be used irrespective of CSV or API use.

@nicolasvanermen @cmspinto thoughts?

The text was updated successfully, but these errors were encountered:

cmspinto · 2022-10-28T10:20:44Z

I'm a bit lost here!
I just checked and it seems to me there are no duplicates in the database.
Is there any output (download or API) that has duplicated records?

I'm having a bit of trouble to see what is the issue here?
When we output the DatarightsHolder we put it using the character '~'

Sorry, if I'm missing the obvious.

peterdesmet · 2022-10-28T11:20:40Z

I just checked and it seems to me there are no duplicates in the database.

Good, I'll investigate #135 further

This is not about duplicate records. It is about avoiding unexpected results when users are left joining data:

campaigns AS c
  LEFT JOIN samples AS s
    ON c.campaignID = s.campaignID
  LEFT JOIN positions AS p
    ON s.sampleID = p.sampleID
  LEFT JOIN observations AS o
    ON p.positionID = o.positionID

One would expect to create a full denormalized (flat) table with all data this way. However, it is very likely more rows will be created than expected, because identifiers are only unique within a certain set. Note that the current data in the database might not reflect this issue.

So a better query would be:

campaigns AS c
  LEFT JOIN samples AS s
    ON c.dataRightsHolder = s.dataRightsHolder
    AND c.campaignID = s.campaignID
  LEFT JOIN positions AS p
    ON s.dataRightsHolder = p.dataRightsHolder
    AND s.campaignID = p.campaignID
    AND s.sampleID = p.sampleID
  LEFT JOIN observations AS o
    ON p.dataRightsHolder = o.dataRightsHolder
    AND p.campaignID = o.campaignID
    AND p.sampleID = o.sampleID
    ON p.positionID = o.positionID

Please correct me if I'm wrong.

nicolasvanermen · 2022-10-31T09:07:28Z

I am not against adding all parent identifiers to the download tables... But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier) I guess the point is that they will be less inclined to do so?

peterdesmet · 2022-10-31T10:53:39Z

But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier)

Indeed. The issue is that currently the cannot might the correct joins, because not all identifiers are provided.

cmspinto · 2022-11-02T10:53:32Z

@peterdesmet I agree with you.
The download is incomplete!
I can add the internal ID's of the database to all the files. That should make it clear and enough?
So make it closer to what is the API.

https://esas.ices.dk/API/Download?DownloadID=973d90f7-6f33-4452-bb6b-5801768e89e7

the problem currently is that if two countries decide to have the same Campaign then there is a problem to join the files!

peterdesmet · 2022-11-03T08:49:26Z

I need to think about this a bit more. One of the issues with the internal IDs is that one cannot query for them.

Also, in the API you would expect:

all_campaigns = getCampaigns
all_obs = empty
for (campaign in all_campaigns) {
  obs = getObservations?campaignID=campaign # This can return observations from multiple campaigns
  all_obs = all_obs + obs
)
obs # This can contain duplicates

Maybe we need stricter rules on the identifiers, e.g. a data rights holder cannot reuse a campaignID and all sampleID, positionID and observationID within a campaign need to be unique. I'll have a look in the data and see what is feasable.

peterdesmet · 2022-11-07T15:36:08Z

@nicolasvanermen how comfortable are you with requiring people to upload data with a campaignID that is unique for the whole database? This was previously done by reserving "blocks" of identifiers, but could also be done by prepending identifiers (in future submissions) e.g. INBO:123.

nicolasvanermen · 2022-11-08T07:34:49Z

I think this makes perfect sense, as this is the case in the database already. An acronym prefix makes sense as well.

peterdesmet · 2022-11-08T12:08:29Z

Current situation in the database

campaigns: 24888 records, all campaignID are unique
samples: 25309 records, all sampleID are unique
positions: 1656728 records, all positionID are unique
observations: 2957702 records, not all observationID are unique (291,225 duplicates, such as 100001)

Requirements

Given that, I would suggest (@nicolasvanermen can you confirm if you agree):

That campaignID is unique across the whole database (see Always return CampaignID and make it unique across database #136 (comment))
That sampleID, positionID and observationID are unique within a campaignID even though most are currently unique across a wider scope. The alternative is to ask that sampleID and positionID are also unique across the whole database (currently the case), which would make it easier for users, but that's maybe too big of an ask for providers?

The result is that people can safely join tables, as long as they include campaignID in their join:

campaigns AS c
  LEFT JOIN samples AS S
    ON s.campaignID = c.campaignID
  LEFT JOIN positions AS p
    ON p.sampleID = s.sampleID
    AND p.campaignID = c.campaignID
  LEFT JOIN observations AS o
    ON o.positionID = p.positionID
    AND o.campaignID = c.campaignID

These changes require no migration of data (i.e. all current data already meet the requirements).

Implementation (@Osanna123 @cmspinto)

peterdesmet · 2022-11-17T13:59:02Z

@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer sampleID and positionID to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?

cmspinto · 2022-11-21T10:07:24Z

@peterdesmet sorry to solve this issue only now.
A quick comment, in the webAPI there is no issue, because the records have the identifier:

so if two organizations submit the same campaignID, that will not be an issue (for people that use the API)
Do you still want us to add CampaignID to the results?

About the download, changes are done:

peterdesmet · 2022-11-21T10:15:07Z

@cmspinto yes, I would still like the changes to the API 1) so users or implementations don't have to rely on the tbl fields and 2) so the API result is almost interchangeable with CSV result

so if two organizations submit the same campaignID, that will not be an issue (for people that use the API)

Note that we won't allow this at submission.

nicolasvanermen · 2022-11-21T13:47:06Z

@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer sampleID and positionID to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?

This is not a preference, rather a pragmatic approval. Of course it would be ideal that all keys are unique within the database... But in that case we should propose a specific 'system', for example through a data provider specific number or letter combination as prefix for each key...?

cmspinto · 2022-11-23T09:09:24Z

@nicolasvanermen, the platform is not affected by the users uploading the same CampaignID in distinct downloads.
I think we are doing this only to make sure that when users download data they link distinct files correctly.

The QC check is here: That CampaingID was already used by another organization, please use another one.

All changes are in place now.

peterdesmet · 2022-11-23T09:16:03Z

@cmspinto great to see that the check is in place and that CampaignIDs are now required to be unique across the database.

@nicolasvanermen as discussed between us, let's use the pragmatic approach and only require CampaignIDs to be unique across the database. Other IDs have to be unique within a CampaignIDs. All joins will have to take CampaignID into account and should therefore be included in all CSV and API results.

nicolasvanermen · 2022-11-23T10:30:39Z

OK, I will check this, by uploading a record of JNCC with a campaignID from INBO...

nicolasvanermen · 2022-11-23T10:49:34Z

The duplicate campaignID is filtered out OK, but the message says:
Cross field check (condition not met)

This seems to uninformative... Can the message be aligned with the check description ("That CampaingID was already used by another organization, please use another one.")?

cmspinto · 2022-11-24T08:42:46Z

Done: http://datsu.ices.dk/web/chkDetails.aspx?DatasetID=148&RecordID=103&groupFilterID=1981

nicolasvanermen · 2022-11-24T09:06:39Z

All keys are now returned in the download, can this issue be closed @peterdesmet ?

peterdesmet · 2022-11-24T09:59:00Z

Yes, they are also all returned by the API and all issues are resolved at #136 (comment). I will update the documentation and close this issue.

peterdesmet · 2022-11-24T11:21:41Z

Now clarified in the documentation: 80cc9a7

@cmspinto or @Osanna123 is it possible to update some of the descriptions in http://datsu.ices.dk/web/selRep.aspx?Dataset=148 to reflect the changes in 80cc9a7?

nicolasvanermen · 2022-12-01T06:01:47Z

Typo in the screening message: "That CampaignID..." (instead of CampaingID)

And while your are at it... "This CampaignID..." sounds better :-)

cmspinto · 2022-12-02T09:10:14Z

Done: http://datsu.ices.dk/web/rptChk.aspx?Dataset=148

peterdesmet · 2022-12-02T10:41:54Z

Thanks, only remaining item is the documentation fix mentioned in #136 (comment)

cmspinto · 2023-01-26T12:23:25Z

Can we close this one?

peterdesmet · 2023-01-26T21:10:25Z

No, there’s a remaining action item for you. The DATSU definitions should be updated. See #136 (comment)

peterdesmet assigned nicolasvanermen and cmspinto Oct 28, 2022

cmspinto added help wanted Extra attention is needed 3.Database labels Oct 28, 2022

cmspinto assigned Osanna123 Oct 28, 2022

peterdesmet changed the title ~~Should we always return DataRightsHolder and CampaignID~~ Always return CampaignID and make it unique across database Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always return `CampaignID` and make it unique across database #136

Always return `CampaignID` and make it unique across database #136

peterdesmet commented Oct 28, 2022

cmspinto commented Oct 28, 2022

peterdesmet commented Oct 28, 2022

nicolasvanermen commented Oct 31, 2022

peterdesmet commented Oct 31, 2022

cmspinto commented Nov 2, 2022 •

edited

Loading

peterdesmet commented Nov 3, 2022

peterdesmet commented Nov 7, 2022

nicolasvanermen commented Nov 8, 2022

peterdesmet commented Nov 8, 2022 •

edited by cmspinto

Loading

peterdesmet commented Nov 17, 2022

cmspinto commented Nov 21, 2022

peterdesmet commented Nov 21, 2022

nicolasvanermen commented Nov 21, 2022

cmspinto commented Nov 23, 2022 •

edited

Loading

peterdesmet commented Nov 23, 2022

nicolasvanermen commented Nov 23, 2022

nicolasvanermen commented Nov 23, 2022

cmspinto commented Nov 24, 2022

nicolasvanermen commented Nov 24, 2022

peterdesmet commented Nov 24, 2022

peterdesmet commented Nov 24, 2022

nicolasvanermen commented Dec 1, 2022

cmspinto commented Dec 2, 2022

peterdesmet commented Dec 2, 2022

cmspinto commented Jan 26, 2023

peterdesmet commented Jan 26, 2023

Always return CampaignID and make it unique across database #136

Always return CampaignID and make it unique across database #136

Comments

peterdesmet commented Oct 28, 2022

Identifier requirements

The issue

Solution 1 (does not work)

Solution 2

API

cmspinto commented Oct 28, 2022

peterdesmet commented Oct 28, 2022

nicolasvanermen commented Oct 31, 2022

peterdesmet commented Oct 31, 2022

cmspinto commented Nov 2, 2022 • edited Loading

peterdesmet commented Nov 3, 2022

peterdesmet commented Nov 7, 2022

nicolasvanermen commented Nov 8, 2022

peterdesmet commented Nov 8, 2022 • edited by cmspinto Loading

Current situation in the database

Requirements

Implementation (@Osanna123 @cmspinto)

peterdesmet commented Nov 17, 2022

cmspinto commented Nov 21, 2022

peterdesmet commented Nov 21, 2022

nicolasvanermen commented Nov 21, 2022

cmspinto commented Nov 23, 2022 • edited Loading

peterdesmet commented Nov 23, 2022

nicolasvanermen commented Nov 23, 2022

nicolasvanermen commented Nov 23, 2022

cmspinto commented Nov 24, 2022

nicolasvanermen commented Nov 24, 2022

peterdesmet commented Nov 24, 2022

peterdesmet commented Nov 24, 2022

nicolasvanermen commented Dec 1, 2022

cmspinto commented Dec 2, 2022

peterdesmet commented Dec 2, 2022

cmspinto commented Jan 26, 2023

peterdesmet commented Jan 26, 2023

Always return `CampaignID` and make it unique across database #136

Always return `CampaignID` and make it unique across database #136

cmspinto commented Nov 2, 2022 •

edited

Loading

peterdesmet commented Nov 8, 2022 •

edited by cmspinto

Loading

cmspinto commented Nov 23, 2022 •

edited

Loading