introduce Google Cloud Bucket support #52

Open · wants to merge 23 commits into base: master

Commits
5c37379
pgrade libcloud to 2.8.3
ccancellieri Oct 9, 2021
cbbbef3
introduce Google Cloud Bucket support #51
ccancellieri Oct 9, 2021
7cc515e
add internal proxy for private files
ccancellieri Oct 9, 2021
12d2c36
introduce Google Cloud Bucket signed temporary url
ccancellieri Oct 9, 2021
4932a3b
better document google installation and configuration
ccancellieri Oct 9, 2021
567c2d2
safer fail fast check
ccancellieri Oct 9, 2021
0428a6d
Merge remote-tracking branch 'origin/patch-1' into google-cloud-support
Dec 8, 2023
4aade5a
Merge remote-tracking branch 'origin/proxy-support' into google-cloud…
Dec 8, 2023
b01a103
fix for safe url
Dec 12, 2023
1495f36
Merge pull request #1 from martialo12/fix/safe_url
ccancellieri Dec 12, 2023
3e14699
update requirements
Dec 12, 2023
8185f70
Merge pull request #2 from martialo12/fix/requirements
ccancellieri Dec 12, 2023
84fa374
update generate_signed_url (#3)
martialo12 Dec 18, 2023
01a47dc
Feat/gcp group and bucket (#4)
martialo12 Jan 25, 2024
a24be3c
Feat/gcp group and bucket (#6)
martialo12 Jan 25, 2024
a3ea3cb
Fix/refacto etl (#7)
martialo12 Jan 26, 2024
d261ad5
retrieve org members and org description for single organization (#8)
martialo12 Jan 29, 2024
59058c9
add better exceptions handling (#9)
martialo12 Jan 30, 2024
207d9a8
Fix/handle exception (#10)
martialo12 Jan 30, 2024
83cde58
fix upload
ccancellieri Jan 31, 2024
5113d6f
Fix/bucket bug (#11)
martialo12 Jan 31, 2024
b6e8758
Fix/bucket bug (#12)
martialo12 Jan 31, 2024
7239818
handle 404 status code (#13)
martialo12 Feb 1, 2024
132 changes: 99 additions & 33 deletions README.md
@@ -1,24 +1,53 @@
# ckanext-cloudstorage

Implements support for using S3, Azure, or any of 15 different storage
providers supported by [libcloud][] to [CKAN][].
`ckanext-cloudstorage` is a plugin for CKAN that enhances its capabilities by enabling the use of various cloud storage services. It supports integration with over 15 storage providers, including Amazon S3, Google Cloud Storage, and Azure, via [libcloud][]. This flexibility allows [CKAN][] to leverage the robustness and scalability of these cloud storage solutions.

# Setup
## Features

- **Google Storage bucket integration**: Upload files to and download files from Google Cloud Platform (GCP) bucket storage.
- **GCP group management**: Administer groups in GCP Workspace, including creating and deleting groups and adding and removing members from them.
- **Manage IAM permissions**: Set IAM permissions for GCP storage buckets and configure group permissions, allowing for effective management of access control to these storage resources.

Most libcloud-based providers should work out of the box, but only those listed
below have been tested:

| Provider | Uploads | Downloads | Secure URLs (private resources) |
| --- | --- | --- | --- |
| Google Bucket | YES | YES | YES (if `google-auth` and `six>=1.5` are installed) |


## Prerequisites

- Python 2.7
- Google Workspace domain with admin access
- Service account with domain-wide delegation and the necessary permissions


## Installation

Fork the [repository](https://github.com/ccancellieri/ckanext-cloudstorage), clone it to your local machine, and switch to the `google-cloud-support` branch.


## Setup
After installing `ckanext-cloudstorage`, add it to your list of plugins in
your `.ini`:

```bash
ckan.plugins = stats cloudstorage
```

If you haven't already, setup [CKAN file storage][ckanstorage] or the file
upload button will not appear.

Every driver takes two options, regardless of which one you use. Both
the name of the driver and the name of the container/bucket are
case-sensitive:

ckanext.cloudstorage.driver = AZURE_BLOBS
```bash
ckanext.cloudstorage.driver = GOOGLE_STORAGE
ckanext.cloudstorage.container_name = demo
```

You can find a list of driver names [here][storage] (see the `Provider Constant` column).
@@ -27,20 +56,11 @@ Each driver takes its own setup options. See the [libcloud][] documentation.
These options are passed in using `driver_options`, which is a Python dict.
For most drivers, this is all you need:

```bash
ckanext.cloudstorage.driver_options = {"key": "<your public key>", "secret": "<your secret key>"}
```
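Note that the `.ini` file stores `driver_options` as text; a minimal sketch (illustrative only, not the extension's actual parsing code) of turning that string into the dict libcloud expects:

```python
import ast

# The config value arrives as the textual form of a Python dict;
# ast.literal_eval parses Python literals without executing code.
raw = '{"key": "<your public key>", "secret": "<your secret key>"}'
driver_options = ast.literal_eval(raw)
```

`ast.literal_eval` is preferable to `eval` here because it rejects anything that is not a plain literal.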

# Support

Most libcloud-based providers should work out of the box, but only those listed
below have been tested:

| Provider | Uploads | Downloads | Secure URLs (private resources) |
| --- | --- | --- | --- |
| Azure | YES | YES | YES (if `azure-storage` is installed) |
| AWS S3 | YES | YES | YES (if `boto` is installed) |
| Rackspace | YES | YES | No |

# What are "Secure URLs"?
### What are "Secure URLs"?

"Secure URLs" are a method of preventing access to private resources. By
default, anyone that figures out the URL to your resource on your storage
@@ -49,9 +69,9 @@ instead let ckanext-cloudstorage generate temporary, one-use URLs to download
the resource. This means that the normal CKAN-provided access restrictions can
apply to resources with no further effort on your part, but still get all the
benefits of your CDN/blob storage.

ckanext.cloudstorage.use_secure_urls = 1

```bash
ckanext.cloudstorage.use_secure_urls = True
```
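Conceptually, a secure URL embeds an expiry time plus a signature that only the server can produce. A minimal sketch of the idea (illustrative only — GCS and Azure each use their own signing schemes, and this is not the extension's code):

```python
import hashlib
import hmac
import time

def make_temporary_url(base_url, secret, lifetime_seconds=300):
    # Sign (url + expiry) so the link stops working after the deadline
    # and the expiry parameter cannot be tampered with.
    expires = int(time.time()) + lifetime_seconds
    payload = "{}{}".format(base_url, expires).encode("utf-8")
    sig = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return "{}?expires={}&sig={}".format(base_url, expires, sig)
```

The storage provider (which shares the secret) recomputes the signature on each request and rejects expired or tampered links.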
This option also enables multipart uploads, but you need to create the database tables
first. Run the following command from the extension folder:
`paster cloudstorage initdb -c /etc/ckan/default/production.ini`
@@ -60,37 +80,83 @@ With that feature you can use the `cloudstorage_clean_multipart` action, which is available
only to sysadmins. After executing it, all unfinished multipart uploads older than 7 days
will be aborted. You can configure this lifetime, for example:

```bash
ckanext.cloudstorage.max_multipart_lifetime = 7
```
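The lifetime is a number of days; the cleanup cutoff can be pictured as (a sketch, not the extension's code):

```python
from datetime import datetime, timedelta

# Multipart uploads started before this cutoff count as expired
# and are aborted by cloudstorage_clean_multipart.
max_multipart_lifetime = 7  # days, matching the default above
cutoff = datetime.utcnow() - timedelta(days=max_multipart_lifetime)
```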

## Install the required dependencies

From the `ckanext-cloudstorage` folder, activate your virtual environment and run the command below:

```bash
pip install -r requirements.txt
```

## Migrating From FileStorage

If you already have resources that have been uploaded and saved using CKAN's
built-in FileStorage, cloudstorage provides an easy migration command.
Simply setup cloudstorage as explained above, enable the plugin, and run the
migrate command. Provide the path to your resources on-disk (the
`ckan.storage_path` setting in your CKAN `.ini` + `/resources`), and
`ckan.storage_path` setting in your CKAN `.ini`), and
cloudstorage will take care of the rest. Ex:

paster cloudstorage migrate <path to files> -c ../ckan/development.ini
Before running the ETL script, make sure you have set these config values:

```bash
ckanext.cloudstorage.service_account_key_path = {PATH_TO_SECRET_KEY_FILE}
ckanext.cloudstorage.gcp_base_url = {GCP_BASE_URL}
ckan.site_url = {SITE_URL}
ckan.root_path = {ROOT_PATH}
ckan.storage_path = {STORAGE_PATH}
ckanext.cloudstorage.prefix = {PREFIX}
ckanext.cloudstorage.domain = {DOMAIN}
```

From the `ckanext-cloudstorage` folder, execute this command:

```bash
cd ckanext/cloudstorage/etl
```

and then from `etl` folder run the command below:

```bash
python etl_run.py organization_name ckan_api_key configuration_file
```
- Replace `organization_name` with the actual name of the organization you want to process.
- Replace `ckan_api_key` with a sysadmin API key of your CKAN instance.
- Replace `configuration_file` with the path to your `production.ini` file.
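The three positional arguments can be sketched as follows (a hypothetical argparse layout — the real `etl_run.py` may parse them differently):

```python
import argparse

# Hypothetical mirror of the etl_run.py command line shown above.
parser = argparse.ArgumentParser(
    description="Migrate one organization's resources to GCP bucket storage")
parser.add_argument("organization_name", help="organization to process")
parser.add_argument("ckan_api_key", help="sysadmin API key of the CKAN instance")
parser.add_argument("configuration_file", help="path to production.ini")

args = parser.parse_args(
    ["my-org", "0000-aaaa-bbbb", "/etc/ckan/default/production.ini"])
```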


## Notes

1. You should disable public listing on the cloud service provider you're
using, if supported.
2. Currently, only resources are supported. This means that things like group
and organization images still use CKAN's local file storage.
3. Make sure the VM instance has the correct scopes. If not, use the command below to set the right scopes:

```bash
gcloud beta compute instances set-scopes [INSTANCE_NAME] --scopes=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/devstorage.full_control [--zone=[ZONE]]
```

Then restart the VM instance to allow the changes to take effect.

4. Check that the scopes have been applied correctly by using the command below:

```bash
gcloud compute instances describe [INSTANCE_NAME] --format='get(serviceAccounts[].scopes[])'
```

# FAQ

- *DataViews aren't showing my data!* - did you set up CORS rules properly on
your hosting service? ckanext-cloudstorage can try to fix them for you automatically,
run:

  paster cloudstorage fix-cors <list of your domains> -c=<CKAN config>

- *Help! I can't seem to get it working!* - send me a mail! [email protected]
## License
This project is licensed under the MIT License - see the LICENSE file for details.

[libcloud]: https://libcloud.apache.org/
[ckan]: http://ckan.org/
[storage]: https://libcloud.readthedocs.io/en/latest/storage/supported_providers.html
[ckanstorage]: http://docs.ckan.org/en/latest/maintaining/filestore.html#setup-file-uploads
## Acknowledgements
- [Google APIs Client Library for Python](https://github.com/googleapis/google-api-python-client)
- [libcloud](https://libcloud.apache.org/)
- [ckan](http://ckan.org/)
- [storage](https://libcloud.readthedocs.io/en/latest/storage/supported_providers.html)
- [ckanstorage](http://docs.ckan.org/en/latest/maintaining/filestore.html#setup-file-uploads)
37 changes: 37 additions & 0 deletions ckanext/cloudstorage/authorization.py
@@ -0,0 +1,37 @@
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession


class AuthorizedSessionError(Exception):
"""Custom exception for authorized-session failures."""
pass


def create_id_token_and_auth_session(service_account_json_file, target_audience="https://groups.fao.org"):
"""
Generates an ID token using a GCP service account and makes a POST request.

This function creates an ID token using Google Cloud Platform service account credentials,
and returns an authorized session for making HTTP requests, particularly POST requests.

:param service_account_json_file: Path to the service account key file in JSON format.
It contains credentials for the service account.
:param target_audience: The intended audience (URL) for the ID token. This specifies
the target service or API that the token is intended for.

:return: An instance of `AuthorizedSession` with ID token credentials. This session
can be used for authenticated HTTP requests to the specified target audience.
"""
# Load the service account credentials and create an ID token
try:
credentials = service_account.IDTokenCredentials.from_service_account_file(
service_account_json_file,
target_audience=target_audience
)

# Create an authorized session using the credentials
auth_session = AuthorizedSession(credentials)

return auth_session
except Exception as e:
raise AuthorizedSessionError("Error creating authorized session: {}".format(e))
141 changes: 141 additions & 0 deletions ckanext/cloudstorage/bucket.py
@@ -0,0 +1,141 @@
import logging
import os

from google.cloud import storage
from google.cloud.exceptions import NotFound, GoogleCloudError


# Configure logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


class UploadError(Exception):
"""Custom exception for upload failures."""
pass


class BucketError(Exception):
"""Custom exception for bucket operation failures."""
pass


def create_bucket(bucket_name, cloud_storage=None):
"""
Create a Google Cloud Storage bucket and optionally update CloudStorage instance.

Args:
bucket_name (str): The name of the bucket to be created.
cloud_storage (CloudStorage, optional): Instance to update with the new bucket name.

Raises:
BucketError: If the bucket cannot be created.
"""
try:
storage_client = storage.Client()
bucket = storage_client.create_bucket(bucket_name)
log.info("Bucket {} created".format(bucket.name))

if cloud_storage:
from ckanext.cloudstorage.storage import CloudStorage
if isinstance(cloud_storage, CloudStorage):
cloud_storage.container_name = bucket_name

except Exception as e:
log.error("Error creating bucket: {}".format(e))
raise BucketError("Error creating bucket: {}".format(e))


def check_err_response_from_gcp(response, err_msg):
if "error" in response:
log.error("{}: {}".format(err_msg, response))
raise Exception(response["error"])
return response


def add_group_iam_permissions(bucket_name, group_email):
"""
Grant read and list permissions to a group for a specific Google Cloud Storage bucket.

Args:
bucket_name (str): Name of the Google Cloud Storage bucket.
group_email (str): Email address of the group to grant permissions.

Raises:
RuntimeError: If the bucket cannot be retrieved.
"""
storage_client = storage.Client()
try:
# Attempt to get the bucket
bucket = storage_client.get_bucket(bucket_name)
except NotFound:
# This block will execute if the bucket is not found
raise RuntimeError("Bucket '{}' not found.".format(bucket_name))
except Exception as e:
# This block will execute for any other exceptions
raise RuntimeError(
"An error occurred getting bucket info: {}".format(e))

policy = bucket.get_iam_policy()
response = check_err_response_from_gcp(policy, "Error getting IAM policy")
log.info("IAM policy {}".format(response))

viewer_role = "roles/storage.objectViewer"
policy[viewer_role].add("group:" + group_email)
response = bucket.set_iam_policy(policy)
response = check_err_response_from_gcp(
response, "Error modifying bucket IAM policy")
log.info("Read and list permissions granted to group {} on bucket {}: IAM Policy is now:\n{}"
.format(group_email, bucket_name, response))

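The binding change applied above can be pictured with a plain dict of sets standing in for the google-cloud `Policy` object (a standalone sketch, not the real API):

```python
# Sketch: an IAM policy maps roles to member identifiers; granting read
# access means adding a "group:" member under the objectViewer role.
viewer_role = "roles/storage.objectViewer"
policy = {viewer_role: set()}
policy[viewer_role].add("group:" + "data-team@example.org")
```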

def upload_to_gcp_bucket(
bucket_name,
destination_blob_name,
source_file_name,
cloud_storage=None,
group_email=None
):
"""
Uploads a file to the bucket.

:param bucket_name: Name of your bucket.
:param destination_blob_name: Blob name to use for the uploaded file.
:param source_file_name: File to upload.
:param cloud_storage: Optional CloudStorage instance to update if the bucket is created.
:param group_email: Optional group email granted read access on a newly created bucket.
"""
storage_client = storage.Client()

try:
# Ensure the source file exists
if not os.path.exists(source_file_name):
raise IOError(
"The source file {} does not exist.".format(source_file_name))

# Try to get the bucket, create it if it does not exist
try:
bucket = storage_client.get_bucket(bucket_name)
except NotFound:
log.warning("Bucket {} does not exist, creating it.".format(bucket_name))
create_bucket(bucket_name, cloud_storage)
bucket = storage_client.get_bucket(bucket_name)
add_group_iam_permissions(bucket_name, group_email)

# Create a blob object
blob = bucket.blob(destination_blob_name)

# Attempt to upload the file
blob.upload_from_filename(source_file_name)
except IOError as e:
# The explicit IOError raised above has no errno set, so log and
# re-raise every IOError rather than only errno == 2 (file not found).
log.error("File error during upload: {}".format(e))
raise
except GoogleCloudError as e:
# Handle Google Cloud specific exceptions
log.error("An error occurred with Google Cloud Storage: {}".format(e))
raise UploadError("Failed to upload {} to {}/{}".format(
source_file_name, bucket_name, destination_blob_name))
except Exception as e:
# Handle any other exceptions
log.error("An unexpected error occurred: {}".format(e))
raise UploadError("Unexpected error during upload: {}".format(e))