Upload data to GCS when adding to CKAN #28

Open
2 tasks
phargogh opened this issue Feb 13, 2024 · 2 comments

Comments

@phargogh
Member

Use case: public data that should be publicly accessible from the catalog.

Assuming:

  • We are creating a new CKAN package that is not already present on CKAN
  • The data are located on the local filesystem
  • The data do not already have a URL in their metadata
  • The data should be publicly available once they are published on CKAN
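The assumptions above could be checked before any upload starts. What follows is only a rough sketch, not the project's actual script: `ckan_package_exists` and `check_preconditions` are hypothetical names, the metadata URLs are assumed to have already been read into a plain dict, and the existence check relies on CKAN's `package_show` action answering with HTTP 404 for an unknown package (a private package may answer 403 to an anonymous request, which this sketch does not handle).

```python
import json
import urllib.error
import urllib.request
from pathlib import Path
from urllib.parse import quote


def ckan_package_exists(ckan_url, name, opener=urllib.request.urlopen):
    """Return True if CKAN's package_show action finds a package called `name`."""
    url = f"{ckan_url.rstrip('/')}/api/3/action/package_show?id={quote(name)}"
    try:
        with opener(url) as response:
            return bool(json.load(response).get("success"))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # CKAN reports unknown packages as "Not found"
            return False
        raise


def check_preconditions(package_exists, metadata_urls):
    """Collect human-readable reasons why the upload should not proceed.

    `metadata_urls` is a hypothetical mapping of local file path to the URL
    already recorded in that file's metadata (None or "" when absent).
    """
    problems = []
    if package_exists:
        problems.append("package already exists on CKAN")
    for path, url in metadata_urls.items():
        if not Path(path).is_file():
            problems.append(f"{path}: not found on the local filesystem")
        if url:
            problems.append(f"{path}: metadata already has a URL ({url})")
    return problems
```

An empty list from `check_preconditions` would mean the upload can go ahead; anything else can be printed and the script stopped before GCS or CKAN are touched.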

Then, when running the upload script to create the package on ckan:

  • Each file is uploaded to GCS
  • The metadata is rewritten to include the new download URL and saved as both MCF and ISO XML
  • The package is created on CKAN
  • The various resources are created on CKAN
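As a rough illustration of the sequence above, here is a minimal sketch, not the actual upload script. It assumes the `google-cloud-storage` client library for the upload and CKAN's `package_create` and `resource_create` API actions for the catalog side. The bucket name, object prefix and function names are placeholders, since the bucket layout is still undecided, and the metadata rewrite step is left as a comment because it depends on the metadata tooling in use. The returned URL is only publicly reachable if the bucket or object allows public reads.

```python
import json
import urllib.request
from pathlib import Path
from urllib.parse import quote


def public_gcs_url(bucket_name, object_name):
    """Build the public HTTPS URL for an object in a GCS bucket."""
    return f"https://storage.googleapis.com/{bucket_name}/{quote(object_name)}"


def upload_to_gcs(local_path, bucket_name, object_name):
    """Upload one local file to GCS (needs google-cloud-storage and credentials)."""
    from google.cloud import storage  # imported lazily; only needed for real uploads

    blob = storage.Client().bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(str(local_path))
    return public_gcs_url(bucket_name, object_name)


def ckan_action(ckan_url, api_token, action, payload):
    """POST a JSON payload to a CKAN API action and return its `result`."""
    request = urllib.request.Request(
        f"{ckan_url.rstrip('/')}/api/3/action/{action}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_token},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["result"]


def publish(paths, package, bucket_name, prefix, upload, call_ckan):
    """Upload files, then create the CKAN package and one resource per file."""
    urls = {}
    for path in paths:
        # Placeholder layout: <prefix>/<file name>. The real convention is undecided.
        object_name = f"{prefix.strip('/')}/{Path(path).name}"
        urls[str(path)] = upload(path, bucket_name, object_name)

    # The metadata rewrite (MCF and ISO XML) would happen here, using `urls`.

    created = call_ckan("package_create", package)
    for path, url in urls.items():
        call_ckan(
            "resource_create",
            {"package_id": created["id"], "name": Path(path).name, "url": url},
        )
    return urls
```

Passing `upload` and `call_ckan` in as arguments keeps the ordering logic testable without network access. A real run might pass `upload_to_gcs` and `functools.partial(ckan_action, ckan_url, api_token)`.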

At this time, we still need to determine:

  • Where the uploaded data should live (which bucket, directory naming conventions, etc.)
  • Whether this should be distinct from any other storage directory layouts we may have for other purposes.
@phargogh
Member Author

Look into https://github.com/open-data/ckanext-cloudstorage for a possible alternative that integrates with the existing FileStore API. Maybe this would handle most of the hard work?

@phargogh
Member Author

ckanext-cloudstorage is not currently viable

I looked into ckanext-cloudstorage and, unfortunately, the package is not a viable solution for us at this time because:

  1. There are 2 closely related repos that might be usable, https://github.com/open-data/ckanext-cloudstorage and https://github.com/TkTech/ckanext-cloudstorage, with slightly different maintenance states.
  2. The package only supports python2 (though a python3 PR is in progress: [WIP] Py3 TkTech/ckanext-cloudstorage#42).
  3. The package does not yet support GCS (PR in progress: introduce Google Cloud Bucket support TkTech/ckanext-cloudstorage#52).

So our best option is to manage file uploads to GCS ourselves and then post the URLs to CKAN as needed. Access controls on the data would then be handled by the storage backend (GCS, GDrive, etc.) rather than by CKAN, but this is probably safer anyway per Stanford's risk classifications.
