Background sync for datasets from CommCare HQ #41
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,9 +6,14 @@ This is a Python package that integrates Superset and CommCare HQ. | |
Local Development | ||
----------------- | ||
|
||
Follow below instructions. | ||
### Preparing CommCare HQ | ||
|
||
### Setup env | ||
The 'User configurable reports UI' feature flag must be enabled for the | ||
domain in CommCare HQ, even if the data sources to be imported were | ||
created by Report Builder, not a UCR. | ||
|
||
|
||
### Setting up a dev environment | ||
|
||
While doing development on top of this integration, it's useful to | ||
install this via the `pip install -e` option so that any changes made get reflected | ||
|
@@ -51,11 +56,12 @@ directly without another `pip install`. | |
Read through the initialization instructions at | ||
https://superset.apache.org/docs/installation/installing-superset-from-scratch/#installing-and-initializing-superset. | ||
|
||
Create the database. These instructions assume that PostgreSQL is | ||
running on localhost, and that its user is "commcarehq". Adapt | ||
accordingly: | ||
Create a database for Superset, and a database for storing data from | ||
CommCare HQ. Adapt the username and database names to suit your | ||
environment. | ||
```bash | ||
$ createdb -h localhost -p 5432 -U commcarehq superset_meta | ||
$ createdb -h localhost -p 5432 -U postgres superset | ||
Review comment: did we rename
||
$ createdb -h localhost -p 5432 -U postgres superset_hq_data | ||
``` | ||
|
||
Set the following environment variables: | ||
|
@@ -64,10 +70,17 @@ $ export FLASK_APP=superset | |
$ export SUPERSET_CONFIG_PATH=/path/to/superset_config.py | ||
``` | ||
|
||
Initialize the database. Create an administrator. Create default roles | ||
Set this environment variable to allow OAuth 2.0 authentication with | ||
CommCare HQ over insecure HTTP. (DO NOT USE THIS IN PRODUCTION.) | ||
```bash | ||
$ export AUTHLIB_INSECURE_TRANSPORT=1 | ||
Review comment: Is this and
||
``` | ||
|
||
Initialize the databases. Create an administrator. Create default roles | ||
and permissions: | ||
```bash | ||
$ superset db upgrade | ||
$ superset db upgrade --directory hq_superset/migrations/ | ||
$ superset fab create-admin | ||
$ superset load_examples # (Optional) | ||
$ superset init | ||
|
@@ -78,31 +91,34 @@ You should now be able to run superset using the `superset run` command: | |
```bash | ||
$ superset run -p 8088 --with-threads --reload --debugger | ||
``` | ||
However, OAuth login does not work yet as hq-superset needs a Postgres | ||
database created to store CommCare HQ data. | ||
|
||
You can now log in as a CommCare HQ web user. | ||
|
||
In order for CommCare HQ to sync data source changes, you will need to | ||
allow OAuth 2.0 authentication over insecure HTTP. (DO NOT USE THIS IN | ||
PRODUCTION.) Set this environment variable in your CommCare HQ Django | ||
server. (Yes, it's "OAUTHLIB" this time, not "AUTHLIB" as before.) | ||
```bash | ||
$ export OAUTHLIB_INSECURE_TRANSPORT=1 | ||
``` | ||
|
||
|
||
### Create a Postgres Database Connection for storing HQ data | ||
### Logging in as a local admin user | ||
|
||
- Create a Postgres database. e.g. | ||
```bash | ||
$ createdb -h localhost -p 5432 -U commcarehq hq_data | ||
``` | ||
- Log into Superset as the admin user created in the Superset | ||
installation and initialization. Note that you will need to update | ||
`AUTH_TYPE = AUTH_DB` to log in as admin user. `AUTH_TYPE` should be | ||
otherwise set to `AUTH_OAUTH`. | ||
|
||
- Go to 'Data' -> 'Databases' or http://127.0.0.1:8088/databaseview/list/ | ||
- Create a database connection by clicking '+ DATABASE' button at the top. | ||
- The name of the DISPLAY NAME should be 'HQ Data' exactly, as this is | ||
the name by which this codebase refers to the Postgres DB. | ||
There might be situations where you need to log into Superset as a local | ||
admin user, for example, to add a database connection. To enable local | ||
user authentication, in `superset_config.py`, set | ||
`AUTH_TYPE = AUTH_DB`. | ||
|
||
OAuth integration should now be working. You can log in as a CommCare | ||
HQ web user. | ||
Doing this will prevent CommCare HQ users from logging in, so it should | ||
only be done in production environments when CommCare Analytics is not | ||
in use. | ||
|
||
To return to allowing CommCare HQ users to log in, set it back to | ||
`AUTH_TYPE = AUTH_OAUTH`. | ||
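
One way to avoid editing `superset_config.py` by hand each time is to derive the setting from an environment variable. The sketch below is only an illustration: the variable name `HQ_SUPERSET_LOCAL_ADMIN_LOGIN` and the helper `auth_type_name` are hypothetical and not part of hq_superset. It returns the name of the constant rather than the constant itself, so it runs without Flask-AppBuilder installed.

```python
import os

# Hypothetical environment variable name; not defined by hq_superset.
LOCAL_ADMIN_ENV_VAR = "HQ_SUPERSET_LOCAL_ADMIN_LOGIN"


def auth_type_name(environ=os.environ):
    """Return the name of the AUTH_TYPE constant to use.

    Defaults to OAuth so that CommCare HQ users can log in unless
    local admin login is explicitly requested.
    """
    flag = environ.get(LOCAL_ADMIN_ENV_VAR, "").strip().lower()
    return "AUTH_DB" if flag in {"1", "true", "yes"} else "AUTH_OAUTH"


# In superset_config.py the name would be mapped to the real constant:
#   from flask_appbuilder.security.manager import AUTH_DB, AUTH_OAUTH
#   AUTH_TYPE = {"AUTH_DB": AUTH_DB, "AUTH_OAUTH": AUTH_OAUTH}[auth_type_name()]
```

Defaulting to `AUTH_OAUTH` keeps the behaviour described above: local login has to be switched on deliberately, and removing the variable restores CommCare HQ login.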
|
||
|
||
### Importing UCRs using Redis and Celery | ||
|
||
### Importing UCRs using Redis and Celery | ||
|
||
Celery is used to import UCRs that are larger than | ||
`hq_superset.views.ASYNC_DATASOURCE_IMPORT_LIMIT_IN_BYTES`. If you need | ||
|
@@ -137,6 +153,41 @@ code you want to test will need to be in a module whose dependencies | |
don't include Superset. | ||
|
||
|
||
### Creating a migration | ||
|
||
You will need to create an Alembic migration for any new SQLAlchemy | ||
models that you add. The Superset CLI should allow you to do this: | ||
|
||
```shell | ||
$ superset db revision --autogenerate -m "Add table for Foo model" | ||
``` | ||
|
||
However, problems with this approach have occurred in the past. You | ||
might have more success by using Alembic directly. You will need to | ||
modify the configuration a little to do this: | ||
|
||
1. Copy the "HQ_DATA" database URI from `superset_config.py`. | ||
|
||
2. Paste it as the value of `sqlalchemy.url` in | ||
`hq_superset/migrations/alembic.ini`. | ||
|
||
3. Edit `env.py` and comment out the following lines: | ||
``` | ||
hq_data_uri = current_app.config['SQLALCHEMY_BINDS'][HQ_DATA] | ||
decoded_uri = urllib.parse.unquote(hq_data_uri) | ||
config.set_main_option('sqlalchemy.url', decoded_uri) | ||
``` | ||
|
||
Those changes will allow Alembic to connect to the "HQ Data" database | ||
without the need to instantiate Superset's Flask app. You can now | ||
Review comment: and then one should uncomment them back afterwards?
||
autogenerate your new table with: | ||
|
||
```shell | ||
$ cd hq_superset/migrations/ | ||
$ alembic revision --autogenerate -m "Add table for Foo model" | ||
Review comment: Nice!
||
``` | ||
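
For reference, the three `env.py` lines in step 3 read the bound database URI from the Flask config and pass it through `urllib.parse.unquote` before handing it to Alembic. The sketch below only shows what that decoding step does to a percent-encoded URI; the credentials are made up, and the database name is taken from the `createdb` example earlier in this README.

```python
from urllib.parse import quote, unquote

# Hypothetical credentials, for illustration only.
password = "p@ss/word"

# Percent-encode the password so that "@" and "/" do not break the URI.
encoded_uri = (
    f"postgresql://postgres:{quote(password, safe='')}"
    "@localhost:5432/superset_hq_data"
)

# This mirrors the unquote() call in env.py: percent escapes are decoded.
decoded_uri = unquote(encoded_uri)

print(encoded_uri)
print(decoded_uri)
```

When pasting a URI into `alembic.ini` by hand, it is worth checking which of the two forms your configuration expects.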
|
||
|
||
Upgrading Superset | ||
------------------ | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,19 +9,26 @@ def flask_app_mutator(app): | |
# return | ||
from superset.extensions import appbuilder | ||
|
||
from . import hq_domain, views | ||
from . import api, hq_domain, oauth2_server, views | ||
|
||
appbuilder.add_view(views.HQDatasourceView, 'Update HQ Datasource', menu_cond=lambda *_: False) | ||
appbuilder.add_view(views.SelectDomainView, 'Select a Domain', menu_cond=lambda *_: False) | ||
app.before_request_funcs.setdefault(None, []).append( | ||
hq_domain.before_request_hook | ||
) | ||
app.after_request_funcs.setdefault(None, []).append( | ||
hq_domain.after_request_hook | ||
) | ||
appbuilder.add_api(api.OAuth) | ||
appbuilder.add_api(api.DataSetChangeAPI) | ||
oauth2_server.config_oauth2(app) | ||
|
||
app.before_request(hq_domain.before_request_hook) | ||
app.after_request(hq_domain.after_request_hook) | ||
app.strict_slashes = False | ||
override_jinja2_template_loader(app) | ||
|
||
# A proxy (maybe) is changing the URL scheme from "https" to "http" | ||
# on commcare-analytics-staging.dimagi.com, which breaks the OAuth | ||
# 2.0 secure transport check despite transport being over HTTPS. I | ||
# hate to do this, but werkzeug.contrib.fixers.ProxyFix didn't fix | ||
# it. So I've run out of better options. (Norman 2024-03-13) | ||
os.environ['AUTHLIB_INSECURE_TRANSPORT'] = '1' | ||
Review comment: Hey @kaapstorm Good find on the proxy redirection. I can make peace with this change to unblock QA/UAT. It's a positive that this is on analytics server that we have control/access to and not superset. I was trying to find where in nginx conf, this is getting set and compare that with what we have for HQ though HQ nginx template has a lot happening. Couple of things I noticed that could be faulty
@sravfeyn assuming you have more insight when this was setup.
Reply: I took a quick look at the nginx config, and realised it will take me a bit longer to change and test it. I'll merge this PR, and I'll create a follow up ticket to see how we can configure nginx so that we can use werkzeug's "ProxyFix" class (or not have to use it at all) instead of using this environment variable.
Reply: Okay @kaapstorm, thanks for taking a look. I am fine with this merge till this stays on staging and is not released to production environment until the secure fix is done.
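
As background for the follow-up ticket mentioned above, here is a minimal sketch of the idea behind a proxy-fix middleware: trust the `X-Forwarded-Proto` header set by the reverse proxy and restore the original URL scheme in the WSGI environment. The class name is hypothetical and this is not hq_superset code. Whether it would resolve the staging issue is not established here, since that depends on nginx actually sending the header. Note also that `werkzeug.contrib.fixers` was removed in Werkzeug 1.0; in current releases the class lives at `werkzeug.middleware.proxy_fix.ProxyFix`.

```python
class ForwardedProtoMiddleware:
    """Hypothetical minimal WSGI middleware, for illustration only.

    Only safe when the app is reachable exclusively through a trusted
    proxy, because clients could otherwise spoof the header.
    """

    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        # With chained proxies the header may hold a comma-separated
        # list; this sketch uses the first value.
        proto = environ.get("HTTP_X_FORWARDED_PROTO", "")
        proto = proto.split(",")[0].strip().lower()
        if proto in ("http", "https"):
            environ["wsgi.url_scheme"] = proto
        return self.wsgi_app(environ, start_response)


# Usage with a Flask app would look like:
#   app.wsgi_app = ForwardedProtoMiddleware(app.wsgi_app)
```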
||
|
||
|
||
def override_jinja2_template_loader(app): | ||
# Allow loading templates from the templates directory in this project as well | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
import json | ||
from http import HTTPStatus | ||
|
||
from flask import jsonify, request | ||
from flask_appbuilder.api import BaseApi, expose | ||
from sqlalchemy.orm.exc import NoResultFound | ||
from superset.superset_typing import FlaskResponse | ||
from superset.views.base import ( | ||
handle_api_exception, | ||
json_error_response, | ||
json_success, | ||
) | ||
|
||
from .models import DataSetChange | ||
from .oauth2_server import authorization, require_oauth | ||
|
||
|
||
class OAuth(BaseApi): | ||
|
||
|
||
def __init__(self): | ||
super().__init__() | ||
self.route_base = "/oauth" | ||
|
||
@expose("/token", methods=('POST',)) | ||
def issue_access_token(self): | ||
try: | ||
response = authorization.create_token_response() | ||
except NoResultFound: | ||
return jsonify({"error": "Invalid client"}), 401 | ||
|
||
if response.status_code >= 400: | ||
return response | ||
|
||
data = json.loads(response.data.decode("utf-8")) | ||
return jsonify(data) | ||
|
||
|
||
class DataSetChangeAPI(BaseApi): | ||
""" | ||
Accepts changes to datasets from CommCare HQ data forwarding | ||
""" | ||
|
||
MAX_REQUEST_LENGTH = 10 * 1024 * 1024 # reject JSON requests > 10MB | ||
Review comment: @kaapstorm How was this limit determined?
Reply: A wild guess as to how big a "very big" request might be? I'm open to better ways to determine this value.
||
|
||
def __init__(self): | ||
self.route_base = '/commcarehq_dataset' | ||
self.default_view = 'post_dataset_change' | ||
super().__init__() | ||
|
||
@expose('/change/', methods=('POST',)) | ||
@handle_api_exception | ||
@require_oauth() | ||
def post_dataset_change(self) -> FlaskResponse: | ||
Review comment: @kaapstorm Quickly remind me again, is this endpoint going to be hit for every row change in a data source? (it seems like it considering that the
Reply: This endpoint will be hit for every change for a doc_id, where a doc_id originally refers to a case or a form. Sometimes one doc_id can result in multiple rows, e.g. a data source definition that pulls out a question inside a repeat group.
||
if request.content_length > self.MAX_REQUEST_LENGTH: | ||
return json_error_response( | ||
HTTPStatus.REQUEST_ENTITY_TOO_LARGE.description, | ||
status=HTTPStatus.REQUEST_ENTITY_TOO_LARGE.value, | ||
) | ||
|
||
try: | ||
request_json = json.loads(request.get_data(as_text=True)) | ||
change = DataSetChange(**request_json) | ||
change.update_dataset() | ||
return json_success('Dataset updated') | ||
except json.JSONDecodeError: | ||
return json_error_response( | ||
'Invalid JSON syntax', | ||
status=HTTPStatus.BAD_REQUEST.value, | ||
) |
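
A note on the size check in `post_dataset_change`: in Flask/Werkzeug, `request.content_length` can be `None` when the client sends no `Content-Length` header, and comparing `None` with an integer raises `TypeError` in Python 3. The sketch below shows one possible guard as a standalone function. The function name and the choice to answer 411 for a missing header are assumptions, not the PR's behaviour.

```python
from http import HTTPStatus

MAX_REQUEST_LENGTH = 10 * 1024 * 1024  # same 10MB limit as in the diff


def check_request_length(content_length, limit=MAX_REQUEST_LENGTH):
    """Return None if the request may proceed, else (message, status).

    `content_length` is what `request.content_length` would return,
    which may be None if the header is absent.
    """
    if content_length is None:
        status = HTTPStatus.LENGTH_REQUIRED
        return status.description, status.value
    if content_length > limit:
        status = HTTPStatus.REQUEST_ENTITY_TOO_LARGE
        return status.description, status.value
    return None
```

In the view, a non-`None` result could be passed straight to `json_error_response(message, status=status)`.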
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# The name of the database for storing data related to CommCare HQ | ||
HQ_DATABASE_NAME = "HQ Data" | ||
|
||
OAUTH2_DATABASE_NAME = "oauth2-server-data" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
import superset | ||
from hq_superset.oauth import get_valid_cchq_oauth_token | ||
|
||
|
||
class HQRequest: | ||
|
||
def __init__(self, url): | ||
self.url = url | ||
Review comment: I like this Request class for CommCareHQ. nit: passing the url to the request object makes it tricky to use it for another request. So, may be passing the url as an argument to the
||
|
||
@property | ||
def oauth_token(self): | ||
return get_valid_cchq_oauth_token() | ||
|
||
@property | ||
def commcare_provider(self): | ||
return superset.appbuilder.sm.oauth_remotes["commcare"] | ||
|
||
@property | ||
def api_base_url(self): | ||
return self.commcare_provider.api_base_url | ||
|
||
@property | ||
def absolute_url(self): | ||
return f"{self.api_base_url}{self.url}" | ||
|
||
def get(self): | ||
return self.commcare_provider.get(self.url, token=self.oauth_token) | ||
|
||
def post(self, data): | ||
return self.commcare_provider.post(self.url, data=data, token=self.oauth_token) |
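
The review comment on `__init__` is cut off, but one reading of it is that the URL could be passed per call rather than stored on the object. The sketch below illustrates that variant under stated assumptions: `PerCallHQRequest` and `FakeProvider` are hypothetical names, and the provider and token getter are injected so that the example runs without Superset or a real OAuth remote.

```python
class PerCallHQRequest:
    """Hypothetical variant of HQRequest that takes the URL per call."""

    def __init__(self, provider, get_token):
        self.provider = provider    # e.g. the "commcare" OAuth remote
        self.get_token = get_token  # e.g. get_valid_cchq_oauth_token

    def get(self, url):
        return self.provider.get(url, token=self.get_token())

    def post(self, url, data):
        return self.provider.post(url, data=data, token=self.get_token())


class FakeProvider:
    """Stand-in for the OAuth remote; records calls instead of sending."""

    def __init__(self):
        self.calls = []

    def get(self, url, token=None):
        self.calls.append(("GET", url))
        return "ok"

    def post(self, url, data=None, token=None):
        self.calls.append(("POST", url))
        return "ok"


provider = FakeProvider()
hq = PerCallHQRequest(provider, get_token=lambda: {"access_token": "dummy"})
hq.get("a/demo/api/v0.5/ucr_data_source/")
hq.post("a/demo/configurable_reports/data_sources/subscribe/abc123/", data={})
```

With this shape, one request object can serve several endpoints, at the cost of losing the `absolute_url` property tied to a single URL.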
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
""" | ||
Functions that return URLs on CommCare HQ | ||
""" | ||
|
||
|
||
def datasource_details(domain, datasource_id): | ||
Review comment: nit: I like that these functions are under
||
return f"a/{domain}/api/v0.5/ucr_data_source/{datasource_id}/" | ||
|
||
|
||
def datasource_list(domain): | ||
return f"a/{domain}/api/v0.5/ucr_data_source/" | ||
|
||
|
||
def datasource_export(domain, datasource_id): | ||
return ( | ||
f"a/{domain}/configurable_reports/data_sources/export/{datasource_id}/" | ||
"?format=csv" | ||
) | ||
|
||
|
||
def datasource_subscribe(domain, datasource_id): | ||
return ( | ||
f"a/{domain}/configurable_reports/data_sources/subscribe/" | ||
f"{datasource_id}/" | ||
) |
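
A short usage sketch for these helpers follows. Two of the functions are copied from the diff above so that the block runs on its own; the domain, data source ID, and base URL are made-up values. The returned paths have no leading slash, so when they are joined by plain concatenation (as `HQRequest.absolute_url` does) the base URL needs to end with one.

```python
def datasource_list(domain):
    return f"a/{domain}/api/v0.5/ucr_data_source/"


def datasource_subscribe(domain, datasource_id):
    return (
        f"a/{domain}/configurable_reports/data_sources/subscribe/"
        f"{datasource_id}/"
    )


# Hypothetical values, for illustration only.
api_base_url = "https://hq.example.com/"
list_url = f"{api_base_url}{datasource_list('demo-domain')}"
subscribe_url = f"{api_base_url}{datasource_subscribe('demo-domain', 'abc123')}"

print(list_url)
print(subscribe_url)
```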
Review comment: Maybe more of an HQ discussion, but should we not have a different FF for this workflow so users don't have to be concerned about the User Configurable Reports UI feature flag until they actually want to use the UCRs?
Reply: I think we can definitely pull CommCare Analytics into the current discussion around UCR feature flags.