Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Background sync for datasets from CommCare HQ #41

Merged
merged 28 commits into from
Mar 18, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
55aeeed
Prefactor: Fix import
kaapstorm Feb 29, 2024
c02e528
Prefactor: Raise DatabaseMissing
kaapstorm Feb 29, 2024
8a97e4e
Prefactor: Tweaks to `refresh_hq_datasource()`
kaapstorm Feb 29, 2024
c2acbf9
Add DataSetChangeAPI
kaapstorm Feb 29, 2024
9765307
HQRequest helper class
kaapstorm Feb 29, 2024
c05c4da
Add OAuth 2.0 server
kaapstorm Feb 29, 2024
cffc0db
Add migration for OAuth 2.0 server models
kaapstorm Feb 29, 2024
f0276dd
Rename endpoint
kaapstorm Mar 4, 2024
9f4b053
Move `create_domain_client()` into a function
kaapstorm Mar 4, 2024
d30cc79
Generate secret in its own function
kaapstorm Mar 4, 2024
ba6286d
Use `current_app.url_for()`
kaapstorm Mar 4, 2024
875ff55
Fix migration
kaapstorm Mar 4, 2024
0974df5
Delete Alembic-generated README
kaapstorm Mar 4, 2024
3ea55a6
Raise TableMissing exception
kaapstorm Mar 4, 2024
aedcb42
Sort `DOMAIN_EXCLUDED_VIEWS`
kaapstorm Mar 4, 2024
af4878f
README: Setting local user authentication
kaapstorm Mar 5, 2024
2c0c766
Docstring to explain how deletes are represented
kaapstorm Mar 6, 2024
ceb8f80
Add comment to README and config about AUTH_TYPE
kaapstorm Mar 6, 2024
a017594
Clean up unused setting
kaapstorm Mar 6, 2024
163f8c9
Merge branch 'master' into nh/bg_sync
kaapstorm Mar 6, 2024
ab8d89f
pass scheme to url_for
mkangia Mar 8, 2024
029fd9d
Get URL scheme from request
kaapstorm Mar 11, 2024
9042fb6
Determine URL scheme from `request.server`
kaapstorm Mar 11, 2024
d5830d9
More detail in exception message
kaapstorm Mar 13, 2024
96858c4
Debug logging for authlib
kaapstorm Mar 13, 2024
35d447e
What is `request`?
kaapstorm Mar 13, 2024
4c8977a
There is a method for that
kaapstorm Mar 13, 2024
49d575a
Set AUTHLIB_INSECURE_TRANSPORT (Urgh!)
kaapstorm Mar 13, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 76 additions & 25 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,14 @@ This is a Python package that integrates Superset and CommCare HQ.
Local Development
-----------------

Follow below instructions.
### Preparing CommCare HQ

### Setup env
The 'User configurable reports UI' feature flag must be enabled for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more of an HQ discussion, but should we not have a different FF for this workflow so users don't have to be concerned about the User Configurable Reports UI feature flag until they actually want to user the UCRs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can definitely pull CommCare Analytics into the current discussion around UCR feature flags.

domain in CommCare HQ, even if the data sources to be imported were
created by Report Builder, not a UCR.


### Setting up a dev environment

While doing development on top of this integration, it's useful to
install this via `pip -e` option so that any changes made get reflected
Expand Down Expand Up @@ -51,11 +56,12 @@ directly without another `pip install`.
Read through the initialization instructions at
https://superset.apache.org/docs/installation/installing-superset-from-scratch/#installing-and-initializing-superset.

Create the database. These instructions assume that PostgreSQL is
running on localhost, and that its user is "commcarehq". Adapt
accordingly:
Create a database for Superset, and a database for storing data from
CommCare HQ. Adapt the username and database names to suit your
environment.
```bash
$ createdb -h localhost -p 5432 -U commcarehq superset_meta
$ createdb -h localhost -p 5432 -U postgres superset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we rename superset_meta to superset?

$ createdb -h localhost -p 5432 -U postgres superset_hq_data
```

Set the following environment variables:
Expand All @@ -64,10 +70,17 @@ $ export FLASK_APP=superset
$ export SUPERSET_CONFIG_PATH=/path/to/superset_config.py
```

Initialize the database. Create an administrator. Create default roles
Set this environment variable to allow OAuth 2.0 authentication with
CommCare HQ over insecure HTTP. (DO NOT USE THIS IN PRODUCTION.)
```bash
$ export AUTHLIB_INSECURE_TRANSPORT=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this and OAUTHLIB_INSECURE_TRANSPORT needed if we are using OAuth with CommCareHQ environment available locally?
If we are providing this option to communicate with live CommCareHQ environments, we probably should not.

```

Initialize the databases. Create an administrator. Create default roles
and permissions:
```bash
$ superset db upgrade
$ superset db upgrade --directory hq_superset/migrations/
$ superset fab create-admin
$ superset load_examples # (Optional)
$ superset init
Expand All @@ -78,31 +91,34 @@ You should now be able to run superset using the `superset run` command:
```bash
$ superset run -p 8088 --with-threads --reload --debugger
```
However, OAuth login does not work yet as hq-superset needs a Postgres
database created to store CommCare HQ data.

You can now log in as a CommCare HQ web user.

In order for CommCare HQ to sync data source changes, you will need to
allow OAuth 2.0 authentication over insecure HTTP. (DO NOT USE THIS IN
PRODUCTION.) Set this environment variable in your CommCare HQ Django
server. (Yes, it's "OAUTHLIB" this time, not "AUTHLIB" as before.)
```bash
$ export OAUTHLIB_INSECURE_TRANSPORT=1
```


### Create a Postgres Database Connection for storing HQ data
### Logging in as a local admin user

- Create a Postgres database. e.g.
```bash
$ createdb -h localhost -p 5432 -U commcarehq hq_data
```
- Log into Superset as the admin user created in the Superset
installation and initialization. Note that you will need to update
`AUTH_TYPE = AUTH_DB` to log in as admin user. `AUTH_TYPE` should be
otherwise set to `AUTH_OAUTH`.
mkangia marked this conversation as resolved.
Show resolved Hide resolved
- Go to 'Data' -> 'Databases' or http://127.0.0.1:8088/databaseview/list/
- Create a database connection by clicking '+ DATABASE' button at the top.
- The name of the DISPLAY NAME should be 'HQ Data' exactly, as this is
the name by which this codebase refers to the Postgres DB.
There might be situations where you need to log into Superset as a local
admin user, for example, to add a database connection. To enable local
user authentication, in `superset_config.py`, set
`AUTH_TYPE = AUTH_DB`.

OAuth integration should now be working. You can log in as a CommCare
HQ web user.
Doing this will prevent CommCare HQ users from logging in, so it should
only be done in production environments when CommCare Analytics is not
in use.

To return to allowing CommCare HQ users to log in, set it back to
`AUTH_TYPE = AUTH_OAUTH`.
mkangia marked this conversation as resolved.
Show resolved Hide resolved

### Importing UCRs using Redis and Celery

### Importing UCRs using Redis and Celery

Celery is used to import UCRs that are larger than
`hq_superset.views.ASYNC_DATASOURCE_IMPORT_LIMIT_IN_BYTES`. If you need
Expand Down Expand Up @@ -137,6 +153,41 @@ code you want to test will need to be in a module whose dependencies
don't include Superset.


### Creating a migration

You will need to create an Alembic migration for any new SQLAlchemy
models that you add. The Superset CLI should allow you to do this:

```shell
$ superset db revision --autogenerate -m "Add table for Foo model"
```

However, problems with this approach have occurred in the past. You
might have more success by using Alembic directly. You will need to
modify the configuration a little to do this:

1. Copy the "HQ_DATA" database URI from `superset_config.py`.

2. Paste it as the value of `sqlalchemy.url` in
`hq_superset/migrations/alembic.ini`.

3. Edit `env.py` and comment out the following lines:
```
hq_data_uri = current_app.config['SQLALCHEMY_BINDS'][HQ_DATA]
decoded_uri = urllib.parse.unquote(hq_data_uri)
config.set_main_option('sqlalchemy.url', decoded_uri)
```

Those changes will allow Alembic to connect to the "HD Data" database
without the need to instantiate Superset's Flask app. You can now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then one should uncomment them back afterwards?

autogenerate your new table with:

```shell
$ cd hq_superset/migrations/
$ alembic revision --autogenerate -m "Add table for Foo model"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

```


Upgrading Superset
------------------

Expand Down
21 changes: 14 additions & 7 deletions hq_superset/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,19 +9,26 @@ def flask_app_mutator(app):
# return
from superset.extensions import appbuilder

from . import hq_domain, views
from . import api, hq_domain, oauth2_server, views

appbuilder.add_view(views.HQDatasourceView, 'Update HQ Datasource', menu_cond=lambda *_: False)
appbuilder.add_view(views.SelectDomainView, 'Select a Domain', menu_cond=lambda *_: False)
app.before_request_funcs.setdefault(None, []).append(
hq_domain.before_request_hook
)
app.after_request_funcs.setdefault(None, []).append(
hq_domain.after_request_hook
)
appbuilder.add_api(api.OAuth)
appbuilder.add_api(api.DataSetChangeAPI)
oauth2_server.config_oauth2(app)

app.before_request(hq_domain.before_request_hook)
app.after_request(hq_domain.after_request_hook)
app.strict_slashes = False
override_jinja2_template_loader(app)

# A proxy (maybe) is changing the URL scheme from "https" to "http"
# on commcare-analytics-staging.dimagi.com, which breaks the OAuth
# 2.0 secure transport check despite transport being over HTTPS. I
# hate to do this, but werkzeug.contrib.fixers.ProxyFix didn't fix
# it. So I've run out of better options. (Norman 2024-03-13)
os.environ['AUTHLIB_INSECURE_TRANSPORT'] = '1'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kaapstorm

Good find on the proxy redirection.

I can make peace with this change to unblock QA/UAT.

It's a positive that this is on analytics server that we have control/access to and not superset.

I was trying to find where in nginx conf, this is getting set and compare that with what we have for HQ though HQ nginx template has a lot happening.

Couple of things I noticed that could be faulty

  1. we have two "server" blocks with the same server name in nginx conf. I am not sure if that is expected/correct
  2. I don't know what the following block does
if (!-f $request_filename) {
      proxy_pass http://superset;
      break;
    }

@sravfeyn assuming you have more insight when this was setup.
Can you see something faulty with the current nginx conf or point out what could redirecting urls to http?
I don't know what this block does

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick look at the nginx config, and realised it will take me a bit longer to change and test it. I'll merge this PR, and I'll create a follow up ticket to see how we can configure nginx so that we can use werkzeug's "ProxyFix" class (or not have to use it at all) instead of using this environment variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay @kaapstorm, thanks for taking a look.

I am fine with this merge till this stays on staging and is not released to production environment until the secure fix is done.



def override_jinja2_template_loader(app):
# Allow loading templates from the templates directory in this project as well
Expand Down
69 changes: 69 additions & 0 deletions hq_superset/api.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
import json
from http import HTTPStatus

from flask import jsonify, request
from flask_appbuilder.api import BaseApi, expose
from sqlalchemy.orm.exc import NoResultFound
from superset.superset_typing import FlaskResponse
from superset.views.base import (
handle_api_exception,
json_error_response,
json_success,
)

from .models import DataSetChange
from .oauth2_server import authorization, require_oauth


class OAuth(BaseApi):
mkangia marked this conversation as resolved.
Show resolved Hide resolved

def __init__(self):
super().__init__()
self.route_base = "/oauth"

@expose("/token", methods=('POST',))
def issue_access_token(self):
try:
response = authorization.create_token_response()
except NoResultFound:
return jsonify({"error": "Invalid client"}), 401

if response.status_code >= 400:
return response

data = json.loads(response.data.decode("utf-8"))
return jsonify(data)


class DataSetChangeAPI(BaseApi):
"""
Accepts changes to datasets from CommCare HQ data forwarding
"""

MAX_REQUEST_LENGTH = 10 * 1024 * 1024 # reject JSON requests > 10MB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaapstorm How was this limit determined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A wild guess as to how big a "very big" request might be? I'm open to better ways to determine this value.


def __init__(self):
self.route_base = '/commcarehq_dataset'
self.default_view = 'post_dataset_change'
super().__init__()

@expose('/change/', methods=('POST',))
@handle_api_exception
@require_oauth()
def post_dataset_change(self) -> FlaskResponse:
Copy link
Contributor

@Charl1996 Charl1996 Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaapstorm Quickly remind me again, is this endpoint going to be hit for every row change in a data source? (it seems like it considering that the DataSetChange has a doc_id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint will be hit for every change for a doc_id, where a doc_id originally refers to a case or a form. Sometimes one doc_id can result in multiple rows, e.g. a data source definition that pulls out a question inside a repeat group.

if request.content_length > self.MAX_REQUEST_LENGTH:
return json_error_response(
HTTPStatus.REQUEST_ENTITY_TOO_LARGE.description,
status=HTTPStatus.REQUEST_ENTITY_TOO_LARGE.value,
)

try:
request_json = json.loads(request.get_data(as_text=True))
change = DataSetChange(**request_json)
change.update_dataset()
return json_success('Dataset updated')
except json.JSONDecodeError:
return json_error_response(
'Invalid JSON syntax',
status=HTTPStatus.BAD_REQUEST.value,
)
4 changes: 4 additions & 0 deletions hq_superset/const.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# The name of the database for storing data related to CommCare HQ
HQ_DATABASE_NAME = "HQ Data"

OAUTH2_DATABASE_NAME = "oauth2-server-data"
8 changes: 6 additions & 2 deletions hq_superset/exceptions.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
class DatabaseMissing(Exception):
pass


class HQAPIException(Exception):
pass

Expand All @@ -6,5 +10,5 @@ class OAuthSessionExpired(Exception):
pass


class UnitTestingOnly(Exception):
pass
class TableMissing(Exception):
pass
25 changes: 15 additions & 10 deletions hq_superset/hq_domain.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,15 +19,17 @@ def after_request_hook(response):


DOMAIN_EXCLUDED_VIEWS = [
"AuthOAuthView.login",
"AuthOAuthView.logout",
"AuthOAuthView.oauth_authorized",
"AuthDBView.logout",
"AuthDBView.login",
"SelectDomainView.list",
"SelectDomainView.select",
"appbuilder.static",
"static",
'AuthDBView.login',
'AuthDBView.logout',
'AuthOAuthView.login',
'AuthOAuthView.logout',
'AuthOAuthView.oauth_authorized',
'DataSetChangeAPI.post_dataset_change',
'OAuth.issue_access_token',
'SelectDomainView.list',
'SelectDomainView.select',
'appbuilder.static',
'static',
]


Expand All @@ -39,7 +41,10 @@ def is_user_admin():
def ensure_domain_selected():
# Check if a hq_domain cookie is set
# Ensure necessary roles, permissions and DB schemas are created for the domain
if is_user_admin() or (request.url_rule and request.url_rule.endpoint in DOMAIN_EXCLUDED_VIEWS):
if is_user_admin() or (
request.url_rule
and request.url_rule.endpoint in DOMAIN_EXCLUDED_VIEWS
):
return
hq_domain = request.cookies.get('hq_domain')
valid_domains = user_domains()
Expand Down
30 changes: 30 additions & 0 deletions hq_superset/hq_requests.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
import superset
from hq_superset.oauth import get_valid_cchq_oauth_token


class HQRequest:

def __init__(self, url):
self.url = url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this Request class for CommCareHQ.

nit: passing the url to the request object makes it tricky to use it for another request. So, may be passing the url as an argument to the get and post would make it possible to reuse existing request objects since all other context would remain the same. One could update the url for request which seems not a good thing to do.


@property
def oauth_token(self):
return get_valid_cchq_oauth_token()

@property
def commcare_provider(self):
return superset.appbuilder.sm.oauth_remotes["commcare"]

@property
def api_base_url(self):
return self.commcare_provider.api_base_url

@property
def absolute_url(self):
return f"{self.api_base_url}{self.url}"

def get(self):
return self.commcare_provider.get(self.url, token=self.oauth_token)

def post(self, data):
return self.commcare_provider.post(self.url, data=data, token=self.oauth_token)
25 changes: 25 additions & 0 deletions hq_superset/hq_url.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
"""
Functions that return URLs on CommCare HQ
"""


def datasource_details(domain, datasource_id):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I like that these functions are under urls.py though it would be helpful to name these as something_url so that when they are used elsewhere, its clear that the function returns a URL.

return f"a/{domain}/api/v0.5/ucr_data_source/{datasource_id}/"


def datasource_list(domain):
return f"a/{domain}/api/v0.5/ucr_data_source/"


def datasource_export(domain, datasource_id):
return (
f"a/{domain}/configurable_reports/data_sources/export/{datasource_id}/"
"?format=csv"
)


def datasource_subscribe(domain, datasource_id):
return (
f"a/{domain}/configurable_reports/data_sources/subscribe/"
f"{datasource_id}/"
)
Loading
Loading