
Commit

Merge pull request #39 from OCHA-DAP/feature/HDX-10069_scanning_for_c…
…srf_token

HDX-10119 Add "scan" functionality for analytical and maintenance operations on all of HDX
IanHopkinson authored Sep 14, 2024
2 parents 5d97ea4 + d2a4e89 commit d9b253e
Showing 13 changed files with 816 additions and 336 deletions.
4 changes: 3 additions & 1 deletion .vscode/settings.json
Original file line number Diff line number Diff line change
Expand Up @@ -15,5 +15,7 @@
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"python.analysis.typeCheckingMode": "basic",
"python.analysis.typeCheckingMode": "basic",
"makefile.configureOnOpen": false,

}
36 changes: 4 additions & 32 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@ This toolkit provides a commandline interface to the [Humanitarian Data Exchange
list List datasets in HDX
print Print datasets in HDX to the terminal
quickcharts Upload QuickChart JSON description to HDX
remove_extras_key Remove extras key from a dataset
scan Scan all of HDX and perform an action
showcase Upload showcase to HDX
update Update datasets in HDX
update_resource Update a resource in HDX
Expand Down Expand Up @@ -50,39 +52,9 @@ hdx-cli-toolkit:

## Usage

The `hdx-toolkit` is built using the Python `click` library. Details of the currently implemented commands can be revealed by running `hdx-toolkit --help`:

```
$ hdx-toolkit --help
Usage: hdx-toolkit [OPTIONS] COMMAND [ARGS]...
Tools for Commandline interactions with HDX
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
configuration Print configuration information to terminal
download Download dataset resources from HDX
get_organization_metadata Get an organization id and other metadata
get_user_metadata Get user id and other metadata
list List datasets in HDX
print Print datasets in HDX to the terminal
quickcharts Upload QuickChart JSON description to HDX
remove_extras_key Remove extras key from a dataset
showcase Upload showcase to HDX
update Update datasets in HDX
update_resource Update a resource in HDX
```

And details of the arguments for a command can be found using:

```shell
hdx-toolkit [COMMAND] --help
```
The `hdx-toolkit` is built using the Python `click` library. Details of the currently implemented commands can be shown by running `hdx-toolkit --help`, and details of the arguments for a command can be found using `hdx-toolkit [COMMAND] --help`.

A detailed walk through of commands can be found in the [DEMO.md](DEMO.md) file
A detailed guide can be found in the [USERGUIDE.md](USERGUIDE.md) file

## Contributions

Expand Down
71 changes: 60 additions & 11 deletions DEMO.md → USERGUIDE.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,25 @@
# Demo Script
# User Guide

## Motivations
## Overview

The original motivations for developing this tool were as follows:
1. A request from DPT to do a bulk quarantine action which was laborious to do manually;
2. A requirement to grab various pieces of HDX data as text for developing pipelines (organization, maintainer ids, datasets as JSON, lists of datasets for organizations...);
3. A one stop shop for "how do I do this?" both for HDX and more generally, including GitHub Actions, Pytest fixtures, mocks, Click CLI.
This toolkit provides a commandline interface to the [Humanitarian Data Exchange](https://data.humdata.org/) (HDX) to allow for bulk modification operations and other administrative activities such as getting `id` values for users and organizations. It is useful for those managing HDX and developers building data pipelines for HDX. The currently supported commands are as follows:

In use it has been found to service many DPT requirements unaltered or with minor modifications.
```
configuration Print configuration information to terminal
download Download dataset resources from HDX
get_organization_metadata Get an organization id and other metadata
get_user_metadata Get user id and other metadata
list List datasets in HDX
print Print datasets in HDX to the terminal
quickcharts Upload QuickChart JSON description to HDX
remove_extras_key Remove extras key from a dataset
scan Scan all of HDX and perform an action
showcase Upload showcase to HDX
update Update datasets in HDX
update_resource Update a resource in HDX
```

## Installation (from READ.md)
## Installation (from README.md)
`hdx-cli-toolkit` is a Python application published to the PyPI package repository, therefore it can be installed easily with:

```pip install hdx_cli_toolkit```
Expand All @@ -32,7 +42,7 @@ hdx-cli-toolkit:
user_agent: hdx_cli_toolkit_ih
```

## Walkthrough
## Getting Help

Once installed we can get help for the commands available in the `hdx-toolkit` using:

Expand All @@ -45,6 +55,8 @@ Or for a specific command:
hdx-toolkit list --help
```

## HDX Configuration

Understanding the `Configuration` used by `hdx-python-api` can be challenging for new users, so the `configuration` command will echo the relevant local values (censoring any secrets):

```
Expand All @@ -60,6 +72,8 @@ or `grep` to find particular tags.

The `configuration` command will check the `stage` and `prod` API keys it holds are valid.

## List and Update

The `list` and `update` commands are designed to be used together, using `list` to check what a potentially destructive `update` will do, and then simply repeating the same commandline with `list` replaced with `update`. This commandline selects a single dataset, `mali-healthsites`:

```shell
Expand Down Expand Up @@ -121,6 +135,8 @@ hdx-toolkit list --query=archived:true --key=owner_org
```
There is a guide to the CKAN query language [here](https://github.com/OCHA-DAP/hdx-ckan/blob/dev/ckanext-hdx_theme/docs/search/package_search.rst).

## Organization and User metadata

Another pain point for me is getting an organization id; the `get_organization_metadata` command fixes this. We can get just the id from an organization name. Note that wildcards are implicit in the organization specification, since this is how the CKAN API works:

```shell
Expand Down Expand Up @@ -163,6 +179,8 @@ hdx-toolkit print --dataset_filter=wfp-food-prices-for-nigeria --with_extras

This adds resources under a `resources` key which includes a `quickcharts` key and showcases under a `showcases` key. These new keys mean that the output JSON cannot be created directly in HDX. The `fs_check_info` and `hxl_preview_config` keys which previously contained a JSON object serialised as a single string are expanded as dictionaries so that they are printed out in an easy to read format.
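
As a rough illustration of the expansion described above, the sketch below parses values that hold a JSON object serialised as a string into dictionaries. It is not the toolkit's own code: the function name `expand_serialised_keys` and the sample resource are hypothetical, and only the key names `fs_check_info` and `hxl_preview_config` come from this guide.

```python
import json


def expand_serialised_keys(record: dict, keys=("fs_check_info", "hxl_preview_config")) -> dict:
    """Return a copy of record with JSON-string values under `keys` parsed into Python objects."""
    expanded = dict(record)
    for key in keys:
        value = expanded.get(key)
        if isinstance(value, str):
            try:
                expanded[key] = json.loads(value)
            except json.JSONDecodeError:
                # Leave values that are not valid JSON untouched
                pass
    return expanded


# Hypothetical resource record, for illustration only
resource = {"name": "example.csv", "hxl_preview_config": '{"configVersion": 5, "bites": []}'}
print(json.dumps(expand_serialised_keys(resource), indent=2))
```

The original record is left unchanged, so the single-string form remains available if it is needed later.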

## Quick Charts

A Quick Chart can be uploaded from a JSON file using a commandline like the one below, where `dataset_filter` specifies a single dataset and `resource_name` specifies the resource to which the Quick Chart is attached:

```
Expand All @@ -171,6 +189,8 @@ A Quick Chart can be uploaded from a JSON file using a commandline like where th

The `hdx_hxl_preview_file_path` points to a JSON format file with the key `hxl_preview_config` which contains the Quick Chart definition. This file is converted to a single string via a temporary yaml file so should be easily readable. Quick Chart recipe documentation can be found [here](https://github.com/OCHA-DAP/hxl-recipes?tab=readme-ov-file). There is an example file in the `hdx-cli-toolkit` [repo](https://github.com/OCHA-DAP/hdx-cli-toolkit/blob/main/tests/fixtures/quickchart-flood.json).

## Showcases

A showcase can be uploaded from attributes found in either a CSV format file like this:
```
dataset_name,timestamp,attribute,value,secondary_value
Expand Down Expand Up @@ -217,21 +237,45 @@ hdx-toolkit update_resource --dataset_name=hdx_cli_toolkit_test --resource_name=

Without the `--live` flag no update on HDX is made.

## Downloading Data
The resources of a dataset can be downloaded with a commandline like:

```shell
hdx-toolkit download --dataset=bangladesh-bgd-attacks-on-protection --resource_filter=* --hdx_site=stage
```
By default, files are downloaded to a subdirectory `output`, and a file is not downloaded again if it already exists.
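
The skip-if-present behaviour can be sketched as follows. This is a minimal illustration under stated assumptions rather than the toolkit's implementation: `download_if_absent` and its `fetch` callable are hypothetical names, and the `fetch` argument stands in for whatever actually retrieves the resource bytes.

```python
import tempfile
from pathlib import Path
from typing import Callable


def download_if_absent(filename: str, fetch: Callable[[], bytes], output_dir: str = "output") -> bool:
    """Write fetch() to output_dir/filename unless the file exists; return True if written."""
    target = Path(output_dir) / filename
    if target.exists():
        # An existing file is never overwritten and fetch() is not called
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch())
    return True


# Demonstrate against a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    print(download_if_absent("example.csv", lambda: b"x,y\n1,2\n", output_dir=tmp))
    print(download_if_absent("example.csv", lambda: b"ignored", output_dir=tmp))
```

Checking for the file before calling `fetch` means a repeated run costs no network traffic for resources that are already present.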

## Scan
The `scan` command takes the dataset and resource information returned by the CKAN `package_search` endpoint for all the datasets in HDX and then applies an action to them. The downloaded information can be cached and reloaded from a specified JSON file. This is useful because the full catalogue is approximately 865MB and takes 10 minutes to download.
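
The caching idea can be sketched like this. It is an assumption-laden illustration, not the `scan` implementation: `load_or_fetch_snapshot`, `cache_path` and `fake_fetch` are hypothetical names, and the response shape (`result.count`, `result.results`) is the one returned by `package_search`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch_snapshot(cache_path: Path, fetch: Callable[[], dict]) -> dict:
    """Load the snapshot from cache_path if present, otherwise fetch it and write the cache."""
    if cache_path.exists():
        with cache_path.open(encoding="utf-8") as cache_file:
            return json.load(cache_file)
    snapshot = fetch()  # the slow step: a full catalogue download in the real tool
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as cache_file:
        json.dump(snapshot, cache_file)
    return snapshot


def fake_fetch() -> dict:
    # Hypothetical stand-in for the package_search download
    return {"result": {"count": 1, "results": [{"name": "example-dataset"}]}}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hdx-snapshot.json"
    first = load_or_fetch_snapshot(path, fake_fetch)  # fetches and writes the cache
    second = load_or_fetch_snapshot(path, fake_fetch)  # reads the cache instead
    print(first == second)
```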

The supported actions are:
1. `survey` - count the number of occurrences of a key or list of keys across
datasets in HDX
2. `distribution` - calculate the histogram of values for a key across
datasets in HDX
3. `delete_key` - delete occurrences of a key across all datasets in HDX; this
is currently configured so that it only accepts "extras" and
"resources._csrf_token" as valid keys to delete
4. `list` - replicates the list command, providing a table of datasets with values
of selected keys
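
To make the `survey` and `distribution` actions concrete, here is a simplified sketch using `collections.Counter` on a few made-up dataset records. It handles top-level keys only, whereas the real command also accepts dotted keys such as `resources._csrf_token`. The function names and sample data are illustrative assumptions.

```python
from collections import Counter

# Made-up records standing in for the "results" list of a package_search response
datasets = [
    {"name": "dataset-a", "data_update_frequency": "30", "extras": []},
    {"name": "dataset-b", "data_update_frequency": "365"},
    {"name": "dataset-c", "data_update_frequency": "30"},
]


def survey(records: list, keys: list) -> Counter:
    """Count how many records contain each of the given top-level keys."""
    counts = Counter()
    for record in records:
        for key in keys:
            if key in record:
                counts[key] += 1
    return counts


def distribution(records: list, key: str) -> Counter:
    """Build a histogram of the values found under one top-level key."""
    return Counter(record[key] for record in records if key in record)


print(survey(datasets, ["extras", "data_update_frequency"]))
print(distribution(datasets, "data_update_frequency"))
```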

Examples of invocations of the scan command are as follows:
```
hdx-toolkit scan --hdx_site="stage" --action=survey --key=resources._csrf_token --output_path=output/2024-08-25-hdx-snapshot.json --verbose
hdx-toolkit scan --hdx_site="stage" --action=distribution --key=data_update_frequency
hdx-toolkit scan --hdx_site="stage" --input_path=output/2024-08-24-hdx-snapshot.json --action=delete_key --key=extras --verbose
hdx-toolkit scan --hdx_site="stage" --action=list --key=organization.name,data_update_frequency --rows=100
```
## Miscellaneous

There is an issue with some datasets where a key, `extras` is found which is not valid, it prevents
the dataset being updated. The `extras` key be removed from a set of datasets with a commandline
like:
the dataset being updated. The `extras` key can be removed from a set of datasets with the `remove_extras_key` command:

```
hdx-toolkit remove_extras_key --organization=healthsites --dataset_filter=*al*-healthsites --hdx_site=stage --output_path=temp.csv
```


## Future Work

Potential new features can be found in the [GitHub issue tracker](https://github.com/OCHA-DAP/hdx-cli-toolkit/issues)
Expand Down Expand Up @@ -260,4 +304,9 @@ hdx-toolkit showcase --showcase_name=climada-litpop-showcase --hdx_site=stage --
hdx-toolkit update_resource --dataset_name=hdx_cli_toolkit_test --resource_name="test_resource_1" --hdx_site=stage --resource_file_path=test-2.csv --live
hdx-toolkit download --dataset=bangladesh-bgd-attacks-on-protection --hdx_site=stage
hdx-toolkit remove_extras_key --organization=healthsites --dataset_filter=*al*-healthsites --hdx_site=stage --output_path=temp.csv
hdx-toolkit scan --hdx_site="stage" --action=survey --key=resources._csrf_token --output_path=output/2024-08-25-hdx-snapshot.json --verbose
hdx-toolkit scan --hdx_site="stage" --action=distribution --key=data_update_frequency
hdx-toolkit scan --hdx_site="stage" --input_path=output/2024-08-24-hdx-snapshot.json --action=delete_key --key=extras --verbose
hdx-toolkit scan --hdx_site="stage" --action=list --key=organization.name,data_update_frequency --rows=100
hdx-toolkit scan --hdx_site="stage" --action=list --key=data_update_frequency --input_path=output/2024-08-24-hdx-snapshot.json --result_path=output/2024-09-03-scan-results.csv
```
3 changes: 2 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[project]
name = "hdx_cli_toolkit"
version = "2024.8.1"
version = "2024.8.2"
description = "HDX CLI tool kit for commandline interaction with HDX"
readme = {file = "README.md", content-type = "text/markdown"}
license = {file = "LICENSE"}
Expand All @@ -13,6 +13,7 @@ authors = [
dependencies = [
"hdx-python-api==6.3.1",
"hdx-python-country",
"ckanapi",
"quantulum3[classifier]", # This stops the UserWarning but has a number of large dependencies
"click",
"hatch",
Expand Down
142 changes: 142 additions & 0 deletions src/hdx_cli_toolkit/ckan_utilities.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
#!/usr/bin/env python
# encoding: utf-8

import json
import urllib3

from collections import Counter

import ckanapi

from hdx_cli_toolkit.hdx_utilities import get_hdx_url_and_key, configure_hdx_connection
from hdx_cli_toolkit.utilities import query_dict

DEFAULT_ROW_LIMIT = 100


def fetch_data_from_ckan_package_search(
    query_url: str, query: dict, hdx_api_key: str, fetch_all: bool = False
) -> dict:
    headers = {
        "Authorization": hdx_api_key,
        "Content-Type": "application/json",
    }

    start = 0
    if "start" not in query.keys():
        query["start"] = start
    if "rows" not in query.keys():
        query["rows"] = DEFAULT_ROW_LIMIT
    payload = json.dumps(query)
    i = 1
    print(f"{i}. Querying {query_url} with {payload}", flush=True)
    response = urllib3.request("POST", query_url, headers=headers, json=query, timeout=20)
    full_response_json = json.loads(response.data)
    n_expected_result = full_response_json["result"]["count"]

    result_length = len(full_response_json["result"]["results"])

    if fetch_all:
        if result_length != n_expected_result:
            while result_length != 0:
                i += 1
                start += query["rows"]
                query["start"] = start
                payload = json.dumps(query)
                print(f"{i}. Querying {query_url} with {payload}", flush=True)
                new_response = urllib3.request(
                    "POST", query_url, headers=headers, json=query, timeout=20
                )
                new_response_json = json.loads(new_response.data)
                result_length = len(new_response_json["result"]["results"])
                full_response_json["result"]["results"].extend(
                    new_response_json["result"]["results"]
                )
        else:
            print(
                f"CKAN API returned all results ({result_length}) on first page of 100", flush=True
            )
        assert n_expected_result == len(full_response_json["result"]["results"])

    return full_response_json


def scan_survey(response: dict, key: str, verbose: bool = False) -> Counter:
    key_occurence_counter = Counter()
    list_of_keys = key.split(",")

    for dataset in response["result"]["results"]:
        output_row = {"dataset_name": dataset["name"]}
        for key_ in list_of_keys:
            output_row[key_] = f"{key_} key absent"
        output_rows = query_dict(list_of_keys, dataset, output_row)

        for row in output_rows:
            for key_ in list_of_keys:
                if "key absent" not in str(row[key_]):
                    key_occurence_counter[key_] += 1
                    if verbose:
                        if key_ != "resources.name":
                            comment = f"has {key_}"
                            if key_.startswith("resources.") and "resources.name" in row.keys():
                                print(
                                    f"{dataset['name']} Resource:{row['resources.name']} {comment}",
                                    flush=True,
                                )
                            else:
                                print(f"{dataset['name']} {comment}", flush=True)

    return key_occurence_counter


def scan_delete_key(
    response: dict, key: str, hdx_site: str = "stage", verbose: bool = False
) -> Counter:
    # Does not use query_dict because we want this to be as controlled as possible
    configure_hdx_connection(hdx_site, verbose=True)
    hdx_site_url, hdx_api_key, user_agent = get_hdx_url_and_key(hdx_site=hdx_site)
    ckan = ckanapi.RemoteCKAN(
        hdx_site_url,
        apikey=hdx_api_key,
        user_agent=user_agent,
    )

    key_occurence_counter = Counter()
    for i, dataset in enumerate(response["result"]["results"]):
        if key.startswith("resources."):
            resource_key = key.split(".")[1]
            for resource in dataset["resources"]:
                if resource_key in resource.keys():
                    resource.pop(resource_key)
                    # Check the bare resource key; the dotted `key` never appears in a resource
                    assert resource_key not in resource.keys()
                    ckan.action.resource_update(**resource)
                    key_occurence_counter[key] += 1
                    if verbose:
                        comment = f"has {key} - deleted"
                        print(dataset["name"], flush=True)
                        print(f"\t{resource['name']} {comment}", flush=True)
        else:
            if key in dataset.keys():
                dataset.pop(key)
                assert key not in dataset.keys()
                hdx_site_url, hdx_api_key, user_agent = get_hdx_url_and_key(hdx_site=hdx_site)
                ckan.action.package_update(**dataset)
                key_occurence_counter[key] += 1
                if verbose:
                    comment = f"has {key} - deleted"
                    print(f"{dataset['name']} {comment}", flush=True)

    return key_occurence_counter


def scan_distribution(response: dict, key: str, verbose: bool = False) -> Counter:
    value_occurence_counter = Counter()

    for i, dataset in enumerate(response["result"]["results"]):
        output_row = {key: ""}
        output_rows = query_dict([key], dataset, output_row)
        for row in output_rows:
            if "key absent" not in str(row[key]):
                value_occurence_counter[row[key]] += 1

    return value_occurence_counter