ODS harvesting via simple URL not working #6962

Closed
tkohr opened this issue Mar 31, 2023 · 6 comments
@tkohr
Contributor

tkohr commented Mar 31, 2023

Describe the bug
Harvesting the following ODS catalog via the simple URL harvester (which works on version 4.2.2) no longer seems to work. I suspect this is related to the change whereby the recordIdPath input now expects a path such as /datasets/datasetid (from the document root?). Or am I just specifying the wrong path? In version 4.2.2, only the property key datasetid was specified here.

To Reproduce
Steps to reproduce the behavior:

  1. Go in the admin UI to the harvester settings
  2. Add a new harvester of type simple URL with the following params
  3. Save and harvest

Expected behavior
Harvest ~208 records from the catalog.

Log file
harvester_simpleUrl_MEL_ODS_GN_main_202303301528.log

Desktop (please complete the following information):

  • Browser: Chromium, Version 111.0.5563.110 (Build officiel) snap (64 bits)
  • GeoNetwork Version: main
  • Server Application: Jetty with Java 8
@jahow
Contributor

jahow commented Apr 4, 2023

Related to the breaking change introduced in #6677

@fxprunayre
Member

I'm definitely not an ODS expert, but it looks like the dataset id path differs depending on whether you request API version 1 or version 2.

image

In the PR you're pointing at, ODS API v2 support was added (a3db440), and it should have preserved compatibility with the version 1 API.

So before 4.2.3, harvesting an ODS API v2 endpoint did not work.

Running the following harvester config on main

{"@id":"228","@type":"simpleurl","owner":["1"],"ownerGroup":["2"],"ownerUser":["undefined"],"site":{"name":"6962","uuid":"d3e54543-097d-4bb8-bfe6-0fa9c04bb73d","account":{"use":false,"username":[],"password":[]},"url":"https://opendata.lillemetropole.fr/api/datasets/1.0/search?refine.publisher=M%C3%A9tropole+Europ%C3%A9enne+de+Lille&start=0&rows=20","icon":"blank.png","loopElement":"/datasets","numberOfRecordPath":"/nhits","recordIdPath":"/datasetid","pageSizeParam":"rows","pageFromParam":"start","toISOConversion":"schema:iso19115-3.2018:convert/fromJsonOpenDataSoft"},"content":{"validate":"NOVALIDATION","importxslt":"none","batchEdits":"[]"},"options":{"every":"0 0 0 ? * *","oneRunOnly":false,"overrideUuid":"SKIP","status":"active"},"privileges":[{"@id":"1","operation":[{"@name":"view"},{"@name":"dynamic"},{"@name":"download"}]}],"ifRecordExistAppendPrivileges":false,"info":{"lastRun":"2023-05-05T05:25:19.923Z","running":false,"result":{"added":"224","atomicDatasetRecords":"0","badFormat":"0","collectionDatasetRecords":"0","datasetUuidExist":"0","privilegesAppendedOnExistingRecord":"0","doesNotValidate":"0","xpathFilterExcluded":"0","duplicatedResource":"0","fragmentsMatched":"0","fragmentsReturned":"0","fragmentsUnknownSchema":"0","incompatible":"0","recordsBuilt":"0","recordsUpdated":"0","removed":"0","serviceRecords":"0","subtemplatesAdded":"0","subtemplatesRemoved":"0","subtemplatesUpdated":"0","total":"224","unchanged":"0","unknownSchema":"0","unretrievable":"0","updated":"0","thumbnails":"0","thumbnailsFailed":"0"}}}

for the v1 API collects 224 records,

and running

{"@id":"373","@type":"simpleurl","owner":["1"],"ownerGroup":["2"],"ownerUser":["undefined"],"site":{"name":"6962 v2","uuid":"cc6c2ae1-34a8-4ac6-bd19-8df33098f61b","account":{"use":false,"username":[],"password":[]},"url":"https://opendata.lillemetropole.fr/api/explore/v2.0/catalog/datasets?rows=100","icon":"blank.png","loopElement":"/datasets","numberOfRecordPath":"/nhits","recordIdPath":"/dataset/dataset_id","pageSizeParam":"rows","pageFromParam":"start","toISOConversion":"schema:iso19115-3.2018:convert/fromJsonOpenDataSoft"},"content":{"validate":"NOVALIDATION","importxslt":"none","batchEdits":"[]"},"options":{"every":"0 0 0 ? * *","oneRunOnly":false,"overrideUuid":"SKIP","status":"active"},"privileges":[{"@id":"1","operation":[{"@name":"view"},{"@name":"dynamic"},{"@name":"download"}]}],"ifRecordExistAppendPrivileges":false,"info":{"lastRun":"2023-05-05T05:46:25.882Z","running":false,"result":{"added":"10","atomicDatasetRecords":"0","badFormat":"0","collectionDatasetRecords":"0","datasetUuidExist":"0","privilegesAppendedOnExistingRecord":"0","doesNotValidate":"0","xpathFilterExcluded":"0","duplicatedResource":"0","fragmentsMatched":"0","fragmentsReturned":"0","fragmentsUnknownSchema":"0","incompatible":"0","recordsBuilt":"0","recordsUpdated":"0","removed":"1","serviceRecords":"0","subtemplatesAdded":"0","subtemplatesRemoved":"0","subtemplatesUpdated":"0","total":"10","unchanged":"0","unknownSchema":"0","unretrievable":"0","updated":"0","thumbnails":"0","thumbnailsFailed":"0"}}}

collects 100 records.

So this seems fine to me, no?

@fxprunayre fxprunayre added this to the 4.2.3 milestone May 5, 2023
@fxprunayre
Member

So your issue was related to

String uuid = this.extractUuidFromIdentifier(record.get(params.recordIdPath).asText());

which only works if the property you need is a direct property of the loopElement node. That is not the case for every JSON harvester, and not for ODS API v2. So it was indeed changed to

String uuid = this.extractUuidFromIdentifier(record.at(params.recordIdPath).asText());

This explains why your config in 4.2.2 did not work in 4.2.3. By the way, a quite clear error is reported in the harvester log

2023-05-05T13:42:40,976 ERROR [geonetwork.harvester] -
 Failed to collect record UUID at path datasetid. 
 Error is: Invalid input:
 JSON Pointer expression must start with '/': "datasetid"
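
The difference between the two lookups can be illustrated with a small sketch. It does not use Jackson; it only imitates, over nested maps, the behavior described in this thread: a direct child lookup by property name versus a JSON Pointer lookup that must start with '/'. The class and method names (PointerSketch, get, at) and the sample values are assumptions made for illustration, and the pointer handling is deliberately simplified (no array indices, no escape sequences).

```java
import java.util.Collections;
import java.util.Map;

public class PointerSketch {

    // Direct child lookup: the argument is treated as a literal property name,
    // so it can only reach properties sitting directly on the given node.
    static Object get(Map<String, ?> node, String field) {
        return node.get(field);
    }

    // Simplified JSON Pointer lookup over nested maps.
    // Assumption: no array indices and no ~0/~1 escapes are handled here.
    static Object at(Object node, String pointer) {
        if (pointer.isEmpty()) {
            return node; // the empty pointer refers to the node itself
        }
        if (!pointer.startsWith("/")) {
            throw new IllegalArgumentException(
                "JSON Pointer expression must start with '/': \"" + pointer + "\"");
        }
        Object current = node;
        for (String segment : pointer.substring(1).split("/")) {
            if (!(current instanceof Map)) {
                return null; // the path goes deeper than the data does
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical records shaped like one element of the loop element.
        Map<String, Object> v1Record =
            Collections.singletonMap("datasetid", "example-v1");
        Map<String, Object> v2Record =
            Collections.singletonMap("dataset",
                Collections.singletonMap("dataset_id", "example-v2"));

        System.out.println(get(v1Record, "datasetid"));           // example-v1
        System.out.println(at(v1Record, "/datasetid"));           // example-v1
        System.out.println(get(v2Record, "dataset/dataset_id"));  // null
        System.out.println(at(v2Record, "/dataset/dataset_id"));  // example-v2

        try {
            at(v1Record, "datasetid"); // old-style value without the leading '/'
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The last call reproduces the situation in the log above: a 4.2.2-style value without the leading '/' is rejected before any lookup takes place.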

@tkohr
Contributor Author

tkohr commented May 5, 2023

Thanks for looking into this @fxprunayre. Indeed, in the end, it's just the missing / that breaks the ODS config from GN 4.2.2 to > 4.2.2.

I didn't notice that the mentioned PR was using ODS v2, which has a different hierarchy and keys (dataset_id rather than V1's datasetid). That obscured the problem a little, despite the rather clear error message.
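
For anyone carrying a harvester configuration over from 4.2.2, the fix amounts to prefixing the old property key with '/'. A minimal sketch of that adjustment follows; the class and the helper toJsonPointer are hypothetical and not part of GeoNetwork, and the sketch assumes the old value was a plain property key.

```java
public class RecordIdPathMigration {

    // Hypothetical helper (not a GeoNetwork API): turns a 4.2.2-style
    // property key into the '/'-prefixed form expected afterwards.
    static String toJsonPointer(String recordIdPath) {
        if (recordIdPath == null || recordIdPath.isEmpty()) {
            return recordIdPath; // nothing to convert
        }
        return recordIdPath.startsWith("/") ? recordIdPath : "/" + recordIdPath;
    }

    public static void main(String[] args) {
        System.out.println(toJsonPointer("datasetid"));           // /datasetid
        System.out.println(toJsonPointer("/dataset/dataset_id")); // /dataset/dataset_id
    }
}
```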

@jahow
Contributor

jahow commented May 5, 2023

Just to clarify, this had nothing to do with ODS API v2 (which we don't use). It was an error on our side; it does work with the new format for the record id pointer.

Thanks @fxprunayre

@jahow jahow closed this as completed May 5, 2023
@tkohr
Contributor Author

tkohr commented May 9, 2023

FYI, I opened geonetwork/doc#240 regarding this.
