Merge pull request #39 from OCHA-DAP/feature/HDX-10069_scanning_for_c…

…srf_token HDX-10119 Add "scan" functionality for analytical and maintenance operations on all of HDX
OCHA-DAP · Sep 14, 2024 · d9b253e · d9b253e
2 parents 5d97ea4 + d2a4e89
commit d9b253e
Showing 13 changed files with 816 additions and 336 deletions.
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -15,5 +15,7 @@
   ],
   "python.testing.unittestEnabled": false,
   "python.testing.pytestEnabled": true,
-  "python.analysis.typeCheckingMode": "basic", 
+  "python.analysis.typeCheckingMode": "basic",
+  "makefile.configureOnOpen": false,
+
 }
diff --git a/README.md b/README.md
@@ -12,6 +12,8 @@ This toolkit provides a commandline interface to the [Humanitarian Data Exchange
   list                       List datasets in HDX
   print                      Print datasets in HDX to the terminal
   quickcharts                Upload QuickChart JSON description to HDX
+  remove_extras_key          Remove extras key from a dataset
+  scan                       Scan all of HDX and perform an action
   showcase                   Upload showcase to HDX
   update                     Update datasets in HDX
   update_resource            Update a resource in HDX
@@ -50,39 +52,9 @@ hdx-cli-toolkit:
 
 ## Usage
 
-The `hdx-toolkit` is built using the Python `click` library. Details of the currently implemented commands can be revealed by running `hdx-toolkit --help`:
-
-```
-$ hdx-toolkit --help
-Usage: hdx-toolkit [OPTIONS] COMMAND [ARGS]...
-
-  Tools for Commandline interactions with HDX
-
-Options:
-  --version  Show the version and exit.
-  --help  Show this message and exit.
-
-Commands:
-  configuration              Print configuration information to terminal
-  download                   Download dataset resources from HDX
-  get_organization_metadata  Get an organization id and other metadata
-  get_user_metadata          Get user id and other metadata
-  list                       List datasets in HDX
-  print                      Print datasets in HDX to the terminal
-  quickcharts                Upload QuickChart JSON description to HDX
-  remove_extras_key          Remove extras key from a dataset
-  showcase                   Upload showcase to HDX
-  update                     Update datasets in HDX
-  update_resource            Update a resource in HDX
-```
-
-And details of the arguments for a command can be found using:
-
-```shell
-hdx-toolkit [COMMAND] --help
-```
+The `hdx-toolkit` is built using the Python `click` library. Details of the currently implemented commands can be revealed by running `hdx-toolkit --help`, and details of the arguments for a command can be found using `hdx-toolkit [COMMAND] --help`
 
-A detailed walk through of commands can be found in the [DEMO.md](DEMO.md) file
+A detailed guide can be found in the [USERGUIDE.md](USERGUIDE.md) file
 
 ## Contributions
 

diff --git a/DEMO.md → USERGUIDE.md b/DEMO.md → USERGUIDE.md
@@ -1,15 +1,25 @@
-# Demo Script
+# User Guide
 
-## Motivations
+## Overview
 
-The original motivations for developing this tool were as follows:
-1. A request from DPT to do a bulk quarantine action which was laborious to do manually;
-2. A requirement to grab various pieces of HDX data as text for developing pipelines (organization, maintainer ids, datasets as JSON, lists of datasets for organizations...);
-3. A one stop shop for "how do I do this?" both for HDX and more generally, including GitHub Actions, Pytest fixtures, mocks, Click CLI.
+This toolkit provides a commandline interface to the [Humanitarian Data Exchange](https://data.humdata.org/) (HDX) to allow for bulk modification operations and other administrative activities such as getting `id` values for users and organizations. It is useful for those managing HDX and developers building data pipelines for HDX. The currently supported commands are as follows:
 
-In use it has been found to service many DPT requirements unaltered or with minor modifications.
+```
+  configuration              Print configuration information to terminal
+  download                   Download dataset resources from HDX
+  get_organization_metadata  Get an organization id and other metadata
+  get_user_metadata          Get user id and other metadata
+  list                       List datasets in HDX
+  print                      Print datasets in HDX to the terminal
+  quickcharts                Upload QuickChart JSON description to HDX
+  remove_extras_key          Remove extras key from a dataset
+  scan                       Scan all of HDX and perform an action
+  showcase                   Upload showcase to HDX
+  update                     Update datasets in HDX
+  update_resource            Update a resource in HDX
+```
 
-## Installation (from READ.md)
+## Installation (from README.md)
 `hdx-cli-toolkit` is a Python application published to the PyPI package repository, therefore it can be installed easily with:
 
 ```pip install hdx_cli_toolkit```
@@ -32,7 +42,7 @@ hdx-cli-toolkit:
     user_agent: hdx_cli_toolkit_ih
 ```
 
-## Walkthrough
+## Getting Help
 
 Once installed we can get help for the commands available in the `hdx-toolkit` using:
 
@@ -45,6 +55,8 @@ Or for a specific command:
 hdx-toolkit list --help
 ```
 
+## HDX Configuration
+
 Understanding the `Configuration` used by `hdx-python-api` can be challenging for new users, so the `configuration` command will echo the relevant local values (censoring any secrets):
 
 ```
@@ -60,6 +72,8 @@ or `grep` to find particular tags.
 
 The `configuration` command will check the `stage` and `prod` API keys it holds are valid. 
 
+## List and Update
+
 The `list` and `update` commands are designed to be used together, using `list` to check what a potentially destructive `update` will do, and then simply repeating the same commandline with `list` replaced with `update`. This commandline selects a single dataset, `mali-healthsites`:
 
 ```shell
@@ -121,6 +135,8 @@ hdx-toolkit list --query=archived:true --key=owner_org
 ```
 There is a guide to the CKAN query language [here](https://github.com/OCHA-DAP/hdx-ckan/blob/dev/ckanext-hdx_theme/docs/search/package_search.rst).
 
+## Organization and User metadata
+
 Another pain point for me is getting an organization id, the `get_organization_metadata` command fixes this. We can just get the id with an organization name, note wildcards are implicit in the organization specification since this is how the CKAN API works:
 
 ```shell
@@ -163,6 +179,8 @@ hdx-toolkit print --dataset_filter=wfp-food-prices-for-nigeria --with_extras
 
 This adds resources under a `resources` key which includes a `quickcharts` key and showcases under a `showcases` key. These new keys mean that the output JSON cannot be created directly in HDX. The `fs_check_info` and `hxl_preview_config` keys which previously contained a JSON object serialised as a single string are expanded as dictionaries so that they are printed out in an easy to read format.
 
+## Quick Charts
+
 A Quick Chart can be uploaded from a JSON file using a commandline like where the `dataset_filter` specifies a single dataset and the `resource_name` specifies the resource to which the Quick Chart is attached:
 
 ```
@@ -171,6 +189,8 @@ A Quick Chart can be uploaded from a JSON file using a commandline like where th
 
 The `hdx_hxl_preview_file_path` points to a JSON format file with the key `hxl_preview_config` which contains the Quick Chart definition. This file is converted to a single string via a temporary yaml file so should be easily readable. Quick Chart recipe documentation can be found [here](https://github.com/OCHA-DAP/hxl-recipes?tab=readme-ov-file). There is an example file in the `hdx-cli-toolkit` [repo](https://github.com/OCHA-DAP/hdx-cli-toolkit/blob/main/tests/fixtures/quickchart-flood.json).
 
+## Showcases
+
 A showcase can be uploaded from attributes found in either a CSV format file like this:
 ```
 dataset_name,timestamp,attribute,value,secondary_value
@@ -217,21 +237,45 @@ hdx-toolkit update_resource --dataset_name=hdx_cli_toolkit_test --resource_name=
 
 Without the `--live` flag no update on HDX is made.
 
+## Downloading Data
 The resources of a dataset can be downloaded with a commandline like:
 
 ```shell
 hdx-toolkit download --dataset=bangladesh-bgd-attacks-on-protection --resource_filter=* --hdx_site=stage
 ```
 by default files are downloaded to a subdirectory `output` with no download if a file already exists.
 
+## Scan
+The `scan` command takes the dataset and resource information returned by the CKAN `package_search` endpoint for all the datasets in HDX and then applies an action to them. The downloaded information can be cached and reloaded from a specified JSON file. This is useful because the full catalogue is approximately 865MB and takes 10 minutes to download.
+
+The supported actions are:
+1. `survey` - count the number of occurrences of a key or list of keys across
+  datasets in HDX
+2. `distribution` - calculate the histogram of values for a key across
+  datasets in HDX
+3. `delete_key` - delete occurrences of a key across all datasets in HDX; this
+  is currently configured so that it only accepts "extras" and
+  "resources._csrf_token" as valid keys to delete
+4. `list` - replicates the list command, providing a table of datasets with values
+  of selected keys
+
+Examples of invocations of the scan command are as follows:
+```
+hdx-toolkit scan --hdx_site="stage" --action=survey --key=resources._csrf_token --output_path=output/2024-08-25-hdx-snapshot.json --verbose
+hdx-toolkit scan --hdx_site="stage" --action=distribution --key=data_update_frequency
+hdx-toolkit scan --hdx_site="stage" --input_path=output/2024-08-24-hdx-snapshot.json --action=delete_key --key=extras --verbose
+hdx-toolkit scan --hdx_site="stage" --action=list --key=organization.name,data_update_frequency --rows=100
+```
+## Miscellaneous
+
 There is an issue with some datasets where a key, `extras` is found which is not valid, it prevents
-the dataset being updated. The `extras` key be removed from a set of datasets with a commandline
-like:
+the dataset being updated. The `extras` key can be removed from a set of datasets with the `remove_extras_key` command:
 
 ```
 hdx-toolkit remove_extras_key --organization=healthsites --dataset_filter=*al*-healthsites --hdx_site=stage --output_path=temp.csv
 ```
 
+
 ## Future Work
 
 Potential new features can be found in the [GitHub issue tracker](https://github.com/OCHA-DAP/hdx-cli-toolkit/issues)
@@ -260,4 +304,9 @@ hdx-toolkit showcase --showcase_name=climada-litpop-showcase --hdx_site=stage --
 hdx-toolkit update_resource --dataset_name=hdx_cli_toolkit_test --resource_name="test_resource_1" --hdx_site=stage --resource_file_path=test-2.csv --live
 hdx-toolkit download --dataset=bangladesh-bgd-attacks-on-protection --hdx_site=stage
 hdx-toolkit remove_extras_key --organization=healthsites --dataset_filter=*al*-healthsites --hdx_site=stage --output_path=temp.csv
+hdx-toolkit scan --hdx_site="stage" --action=survey --key=resources._csrf_token --output_path=output/2024-08-25-hdx-snapshot.json --verbose
+hdx-toolkit scan --hdx_site="stage" --action=distribution --key=data_update_frequency
+hdx-toolkit scan --hdx_site="stage" --input_path=output/2024-08-24-hdx-snapshot.json --action=delete_key --key=extras --verbose
+hdx-toolkit scan --hdx_site="stage" --action=list --key=organization.name,data_update_frequency --rows=100
+hdx-toolkit scan --hdx_site="stage" --action=list --key=data_update_frequency --input_path=output/2024-08-24-hdx-snapshot.json --result_path=output/2024-09-03-scan-results.csv
 ```
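As an illustration of what the `survey` and `distribution` actions described in the USERGUIDE.md changes above compute, here is a minimal sketch that runs against a small in-memory list of dataset dictionaries. The helper names `iter_values`, `survey` and `distribution` and the sample data are hypothetical, and the toolkit's own `query_dict` helper is not reproduced, so the counting conventions may differ in detail from the real command.

```python
from collections import Counter


def iter_values(record: dict, dotted_key: str):
    """Yield every value found at a dotted key such as 'resources._csrf_token'.

    Hypothetical helper for illustration only; it is not the toolkit's query_dict.
    """
    head, _, tail = dotted_key.partition(".")
    if head not in record:
        return
    value = record[head]
    if not tail:
        yield value
    elif isinstance(value, list):
        # A list of sub-records (e.g. resources): descend into each one
        for item in value:
            if isinstance(item, dict):
                yield from iter_values(item, tail)
    elif isinstance(value, dict):
        yield from iter_values(value, tail)


def survey(datasets: list, keys: str) -> Counter:
    """Count occurrences of each key in a comma-separated list of keys."""
    counts = Counter()
    for dataset in datasets:
        for key in keys.split(","):
            counts[key] += sum(1 for _ in iter_values(dataset, key))
    return counts


def distribution(datasets: list, key: str) -> Counter:
    """Build a histogram of the values found for a single key."""
    counts = Counter()
    for dataset in datasets:
        for value in iter_values(dataset, key):
            counts[str(value)] += 1
    return counts


# Invented sample data standing in for a cached package_search response
datasets = [
    {
        "name": "a",
        "data_update_frequency": "30",
        "resources": [{"name": "r1", "_csrf_token": "x"}, {"name": "r2"}],
    },
    {"name": "b", "data_update_frequency": "30", "extras": [], "resources": []},
    {
        "name": "c",
        "data_update_frequency": "365",
        "resources": [{"name": "r3", "_csrf_token": "y"}],
    },
]

survey_counts = survey(datasets, "extras,resources._csrf_token")
frequency_counts = distribution(datasets, "data_update_frequency")
print(survey_counts)
print(frequency_counts)
```

Running against a cached snapshot rather than the live API is what makes this kind of analysis cheap to repeat, which is the motivation the user guide gives for `--input_path`.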
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "hdx_cli_toolkit"
-version = "2024.8.1"
+version = "2024.8.2"
 description = "HDX CLI tool kit for commandline interaction with HDX"
 readme = {file = "README.md", content-type = "text/markdown"}
 license = {file = "LICENSE"}
@@ -13,6 +13,7 @@ authors = [
 dependencies = [
   "hdx-python-api==6.3.1",
   "hdx-python-country",
+  "ckanapi",
   "quantulum3[classifier]", # This stops the UserWarning but has a number of large dependencies
   "click",
   "hatch",

diff --git a/src/hdx_cli_toolkit/ckan_utilities.py b/src/hdx_cli_toolkit/ckan_utilities.py
@@ -0,0 +1,142 @@
+#!/usr/bin/env python
+# encoding: utf-8
+
+import json
+import urllib3
+
+from collections import Counter
+
+import ckanapi
+
+from hdx_cli_toolkit.hdx_utilities import get_hdx_url_and_key, configure_hdx_connection
+from hdx_cli_toolkit.utilities import query_dict
+
+DEFAULT_ROW_LIMIT = 100
+
+
+def fetch_data_from_ckan_package_search(
+    query_url: str, query: dict, hdx_api_key: str, fetch_all: bool = False
+) -> dict:
+    headers = {
+        "Authorization": hdx_api_key,
+        "Content-Type": "application/json",
+    }
+
+    start = 0
+    if "start" not in query.keys():
+        query["start"] = start
+    if "rows" not in query.keys():
+        query["rows"] = DEFAULT_ROW_LIMIT
+    payload = json.dumps(query)
+    i = 1
+    print(f"{i}. Querying {query_url} with {payload}", flush=True)
+    response = urllib3.request("POST", query_url, headers=headers, json=query, timeout=20)
+    full_response_json = json.loads(response.data)
+    n_expected_result = full_response_json["result"]["count"]
+
+    result_length = len(full_response_json["result"]["results"])
+
+    if fetch_all:
+        if result_length != n_expected_result:
+            while result_length != 0:
+                i += 1
+                start += query["rows"]
+                query["start"] = start
+                payload = json.dumps(query)
+                print(f"{i}. Querying {query_url} with {payload}", flush=True)
+                new_response = urllib3.request(
+                    "POST", query_url, headers=headers, json=query, timeout=20
+                )
+                new_response_json = json.loads(new_response.data)
+                result_length = len(new_response_json["result"]["results"])
+                full_response_json["result"]["results"].extend(
+                    new_response_json["result"]["results"]
+                )
+        else:
+            print(
+                f"CKAN API returned all results ({result_length}) on first page of "
+                f"{query['rows']}",
+                flush=True,
+            )
+        assert n_expected_result == len(full_response_json["result"]["results"])
+
+    return full_response_json
+
+
+def scan_survey(response: dict, key: str, verbose: bool = False) -> Counter:
+    key_occurence_counter = Counter()
+    list_of_keys = key.split(",")
+
+    for dataset in response["result"]["results"]:
+        output_row = {"dataset_name": dataset["name"]}
+        for key_ in list_of_keys:
+            output_row[key_] = f"{key_} key absent"
+        output_rows = query_dict(list_of_keys, dataset, output_row)
+
+        for row in output_rows:
+            for key_ in list_of_keys:
+                if "key absent" not in str(row[key_]):
+                    key_occurence_counter[key_] += 1
+                    if verbose:
+                        if key_ != "resources.name":
+                            comment = f"has {key_}"
+                            if key_.startswith("resources.") and "resources.name" in row.keys():
+                                print(
+                                    f"{dataset['name']} Resource:{row['resources.name']} {comment}",
+                                    flush=True,
+                                )
+                            else:
+                                print(f"{dataset['name']} {comment}", flush=True)
+
+    return key_occurence_counter
+
+
+def scan_delete_key(
+    response: dict, key: str, hdx_site: str = "stage", verbose: bool = False
+) -> Counter:
+    # Does not use query_dict because we want this to be as controlled as possible
+    configure_hdx_connection(hdx_site, verbose=True)
+    hdx_site_url, hdx_api_key, user_agent = get_hdx_url_and_key(hdx_site=hdx_site)
+    ckan = ckanapi.RemoteCKAN(
+        hdx_site_url,
+        apikey=hdx_api_key,
+        user_agent=user_agent,
+    )
+
+    key_occurence_counter = Counter()
+    for i, dataset in enumerate(response["result"]["results"]):
+        if key.startswith("resources."):
+            resource_key = key.split(".")[1]
+            for resource in dataset["resources"]:
+                if resource_key in resource.keys():
+                    resource.pop(resource_key)
+                    assert resource_key not in resource.keys()
+                    ckan.action.resource_update(**resource)
+                    key_occurence_counter[key] += 1
+                    if verbose:
+                        comment = f"has {key} - deleted"
+                        print(dataset["name"], flush=True)
+                        print(f"\t{resource['name']} {comment}", flush=True)
+        else:
+            if key in dataset.keys():
+                dataset.pop(key)
+                assert key not in dataset.keys()
+                hdx_site_url, hdx_api_key, user_agent = get_hdx_url_and_key(hdx_site=hdx_site)
+                ckan.action.package_update(**dataset)
+                key_occurence_counter[key] += 1
+                if verbose:
+                    comment = f"has {key} - deleted"
+                    print(f"{dataset['name']} {comment}", flush=True)
+
+    return key_occurence_counter
+
+
+def scan_distribution(response: dict, key: str, verbose: bool = False) -> Counter:
+    value_occurence_counter = Counter()
+
+    for i, dataset in enumerate(response["result"]["results"]):
+        output_row = {key: ""}
+        output_rows = query_dict([key], dataset, output_row)
+        for row in output_rows:
+            if "key absent" not in str(row[key]):
+                value_occurence_counter[row[key]] += 1
+
+    return value_occurence_counter
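The `start`/`rows` paging used by `fetch_data_from_ckan_package_search` above can be sketched in isolation as follows. This is a simplified variant, not the code in the diff: it stops once the reported `count` has been collected instead of waiting for an empty page, and it takes a hypothetical `fetch_page` callable so that it can be exercised without network access. The names `fetch_all_pages`, `fake_fetch_page` and `FAKE_CATALOGUE` are assumptions made for the example.

```python
from typing import Callable


def fetch_all_pages(fetch_page: Callable[[int, int], dict], rows: int = 100) -> list:
    """Collect all results from a package_search-style paged endpoint.

    fetch_page(start, rows) is assumed to return a dict shaped like
    {"result": {"count": <total>, "results": [...]}}.
    """
    start = 0
    first = fetch_page(start, rows)
    expected = first["result"]["count"]
    results = list(first["result"]["results"])

    while len(results) < expected:
        start += rows
        page = fetch_page(start, rows)["result"]["results"]
        if not page:
            # Defensive stop: the server returned fewer records than it reported
            break
        results.extend(page)

    return results


# Fake endpoint with 250 records, standing in for the real API
FAKE_CATALOGUE = [{"name": f"dataset-{i}"} for i in range(250)]


def fake_fetch_page(start: int, rows: int) -> dict:
    return {
        "result": {
            "count": len(FAKE_CATALOGUE),
            "results": FAKE_CATALOGUE[start : start + rows],
        }
    }


all_results = fetch_all_pages(fake_fetch_page, rows=100)
print(len(all_results))
```

Passing the page fetcher in as a callable keeps the loop testable and separates the paging logic from the HTTP client, at the cost of one extra layer of indirection.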