Commit message:

* Manager - Hilltop Crawler changes
* Shuffle files around
* Crawler - updates
* Integration testing
* Improve DB naming
* Big refactor of test setup
* Correct time to wait between quiet polls
* Add global error handling around task
* Add check to make sure Virtual Measurement matches Measurement Name
* Add a bunch to the readme
* Add views to aggregate daily allocations and usage by area (#100)
  * Add views to aggregate consent allocations and usage by day
  * Start writing tests
  * Allow easier setup and assertion against test data, start adding more tests
  * Add tests for various allocation scenarios
  * Add failing test for aggregating data across different areas
  * Simplify views
  * Split out the calculation of daily usage by area
  * Add more assertions for aggregation of different areas
  * water_allocation.meter should correlate with observation_site.name
  * Format code
  * Feedback from PR review
* Remove unnecessary `yield`
* Add index on next_fetch_at column
* Add some more readme details about the task queue
* Add notes about partition sizes
* Improve note about how scheduling tasks works

Co-authored-by: Steve Mosley <[email protected]>
Co-authored-by: Vimal Jobanputra <[email protected]>
Showing 53 changed files with 3,759 additions and 547 deletions.
# Hilltop Crawler

## Description

This app extracts observation data from Hilltop servers and publishes it to a Kafka `observations` topic for other applications to consume. The Hilltop servers to crawl, and the measurement types to extract, are configured in a database table that can be added to while the system is running.

The app intends to keep every version of the data pulled from Hilltop, allowing downstream systems to track when data has been updated in Hilltop. Note that because Hilltop does not expose when changes were made, we can only capture when the crawler first saw the data change. Keeping every version is also intended to let us update parts of the system without having to re-read all of the data from Hilltop.

## Process

Overall, this is built around the Hilltop API, which exposes three levels of GET request:

* SITES_LIST — list of sites, including names and locations
* MEASUREMENTS_LIST — per site, the list of measurements available for that site, including details about the first and last observation dates
* MEASUREMENT_DATA — per site, per measurement, the timeseries of observed data

The app crawls these by reading each level to decide what to read from the next level, i.e. the SITES_LIST response tells the app which sites to call MEASUREMENTS_LIST for, which in turn tells it which measurements to call MEASUREMENT_DATA for. Requests for MEASUREMENT_DATA are also split into monthly chunks to avoid issues with too much data being returned in one request.
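
The monthly chunking can be sketched in SQL. This is an illustration only, assuming hypothetical first/last observation dates taken from a MEASUREMENTS_LIST response; the crawler computes its chunks in application code, not in the database:

```sql
-- Sketch: month-aligned chunk boundaries covering an observation range.
-- The literal dates are illustrative assumptions, not real data.
SELECT month_start                      AS chunk_from,
       month_start + interval '1 month' AS chunk_to
FROM generate_series(
         date_trunc('month', timestamp '2021-03-14'), -- first observation
         date_trunc('month', timestamp '2021-06-02'), -- last observation
         interval '1 month') AS month_start;
```

Each row is then one MEASUREMENT_DATA request covering a single month.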

The app keeps a list of API requests that it keeps up to date by calling each API on a schedule. This list is stored in the `hilltop_fetch_tasks` table and works like a task queue. Each time a request is made, the result is used to decide when next to schedule the task. The simple example is MEASUREMENT_DATA: if the last observation was recent, then a refresh should be attempted soon; if it was a long way in the past, it should be refreshed less often.

The next schedule time is implemented as a random time within an interval, to add jitter between task requeue times. This should spread the tasks out, and with them the load on the servers we are calling.
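
The jitter could be expressed in SQL along these lines. The window bounds and the `id` column are illustrative assumptions, not the app's actual tuning or schema:

```sql
-- Sketch: requeue a task at a random point inside a window, here between
-- 1 and 2 hours from now. Bounds and the id value are assumptions.
UPDATE hilltop_fetch_tasks
SET next_fetch_at = now() + interval '1 hour' + random() * interval '1 hour'
WHERE id = 42;
```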

The task queue also keeps metadata about the previous history of each task. This is not used by the app; it exists to let engineers monitor how the system is working.

This process is built around three main components:

* Every hour, monitor the configuration table
  * Read the configuration table
  * From the configuration, add any new tasks to the task queue

* Continuously monitor the task queue
  * Read the next task to work on
  * Fetch data from Hilltop for that task
  * If the result is valid and not the same as the previous version
    * Queue any new tasks derived from the result
    * Send the result to the Kafka `hilltop_raw` topic
  * Requeue the task for some time in the future, based on the type of request

* Kafka Streams component
  * Monitor the stream
  * Map each message to either a `site_details` or an `observations` message

Currently, the Manager component listens to the `observations` topic and stores the data from it in a DB table.

## Task Queue

The task queue is currently a custom implementation on top of a single table, `hilltop_fetch_tasks`.

This is a slightly specialized queue where:

* Each task has a scheduled run time, backed by the `next_fetch_at` column
* Each time a task runs, it is requeued for some time in the future
* The same task can be enqueued multiple times; the queue relies on Postgres `ON CONFLICT DO NOTHING` to avoid duplicates

The implementation relies on the Postgres `SKIP LOCKED` feature to allow multiple worker threads to pull from the queue at the same time without picking up the same task.

See this [reference](https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/) for a discussion of the `SKIP LOCKED` query.
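
The two sides of the queue can be sketched in SQL. Apart from `next_fetch_at` and the table name, the column names here are illustrative assumptions rather than the actual schema:

```sql
-- Enqueue: a duplicate task is silently dropped, assuming a unique
-- constraint over the identifying columns. Column names are assumptions.
INSERT INTO hilltop_fetch_tasks (source_id, request_type, url, next_fetch_at)
VALUES (9, 'SITES_LIST', 'https://hilltop.gw.govt.nz/WaterUse.hts', now())
ON CONFLICT DO NOTHING;

-- Dequeue: claim one due task. SKIP LOCKED makes concurrent workers skip
-- rows already locked by another transaction instead of blocking on them.
SELECT *
FROM hilltop_fetch_tasks
WHERE next_fetch_at <= now()
ORDER BY next_fetch_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
```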

The queue implementation is fairly simple and specific to this use. If it becomes more of a generic work queue, it might be worth moving to a standard implementation such as Quartz.

## Example Configuration

These are a couple of insert statements that are not stored in migration scripts, so developer machines don't include them by default.

GW Water Use

```sql
INSERT INTO hilltop_sources (council_id, hts_url, configuration)
VALUES (9, 'https://hilltop.gw.govt.nz/WaterUse.hts',
        '{ "measurementNames": ["Water Meter Volume", "Water Meter Reading"] }');
```

GW Rainfall

```sql
INSERT INTO hilltop_sources (council_id, hts_url, configuration)
VALUES (9, 'https://hilltop.gw.govt.nz/merged.hts', '{ "measurementNames": ["Rainfall"] }');
```

### TODO / Improvements

* The `previous_history` on tasks will grow unbounded. This needs to be capped.
* The algorithm for determining the next time to schedule an API refresh could be improved; something could be built from the previous history, based on how often the data is unchanged.
* To avoid hammering Hilltop, there is rate limiting using a "token bucket" library; currently this uses one bucket for all requests. It could be split to use one bucket per server.
* Because of the chunking by month, every time new data arrives we end up storing the whole month up to that new data again in the `hilltop_raw` topic. This seems wasteful, and there are options for cleaning up when a record merely supersedes the previous one. But cleaning up would mean losing some knowledge about when an observation was first seen.
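
Capping the history could look something like the following. This assumes, hypothetically, that `previous_history` is a jsonb array ordered oldest-first; the actual column type and the cap of 50 are not part of this commit:

```sql
-- Sketch: keep only the 50 most recent entries of a jsonb array column.
-- Both the jsonb-array assumption and the cap of 50 are illustrative.
UPDATE hilltop_fetch_tasks
SET previous_history =
      (SELECT jsonb_agg(entry ORDER BY idx)
       FROM (SELECT entry, idx
             FROM jsonb_array_elements(previous_history)
                  WITH ORDINALITY AS t(entry, idx)
             ORDER BY idx DESC
             LIMIT 50) AS recent)
WHERE jsonb_array_length(previous_history) > 50;
```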
`packages/HilltopCrawler/src/main/kotlin/nz/govt/eop/hilltop_crawler/Application.kt` (1 addition, 18 deletions)
```diff
 package nz.govt.eop.hilltop_crawler

 import java.security.MessageDigest
 import org.springframework.boot.autoconfigure.SpringBootApplication
-import org.springframework.boot.autoconfigure.jackson.Jackson2ObjectMapperBuilderCustomizer
 import org.springframework.boot.context.properties.EnableConfigurationProperties
 import org.springframework.boot.runApplication
-import org.springframework.context.annotation.Bean
-import org.springframework.http.converter.json.Jackson2ObjectMapperBuilder
 import org.springframework.kafka.annotation.EnableKafka
 import org.springframework.scheduling.annotation.EnableScheduling

 @SpringBootApplication
 @EnableScheduling
 @EnableKafka
 @EnableConfigurationProperties(ApplicationConfiguration::class)
-class Application {
-
-  @Bean
-  fun jsonCustomizer(): Jackson2ObjectMapperBuilderCustomizer {
-    return Jackson2ObjectMapperBuilderCustomizer { _: Jackson2ObjectMapperBuilder -> }
-  }
-}
+class Application {}

 fun main(args: Array<String>) {
   System.setProperty("com.sun.security.enableAIAcaIssuers", "true")
   runApplication<Application>(*args)
 }

 fun hashMessage(message: String) =
     MessageDigest.getInstance("SHA-256").digest(message.toByteArray()).joinToString("") {
       "%02x".format(it)
     }
```
`...ages/HilltopCrawler/src/main/kotlin/nz/govt/eop/hilltop_crawler/CheckForNewSourcesTask.kt` (32 additions)
```kotlin
package nz.govt.eop.hilltop_crawler

import java.util.concurrent.TimeUnit
import nz.govt.eop.hilltop_crawler.api.requests.buildSiteListUrl
import nz.govt.eop.hilltop_crawler.db.DB
import nz.govt.eop.hilltop_crawler.fetcher.HilltopMessageType.SITES_LIST
import org.springframework.context.annotation.Profile
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

/**
 * This task is responsible for triggering the first fetch task for each source stored in the DB.
 *
 * It makes sure any new rows added to the DB will start to be pulled from within an hour.
 *
 * Each time it runs, it will create the initial fetch task for each source found in the DB. This
 * relies on the task queue (via "ON CONFLICT DO NOTHING") making sure that duplicate tasks will not
 * be created.
 */
@Profile("!test")
@Component
class CheckForNewSourcesTask(val db: DB) {

  @Scheduled(fixedDelay = 1, timeUnit = TimeUnit.HOURS)
  fun triggerSourcesTasks() {
    val sources = db.listSources()

    sources.forEach {
      db.createFetchTask(DB.HilltopFetchTaskCreate(it.id, SITES_LIST, buildSiteListUrl(it.htsUrl)))
    }
  }
}
```
`packages/HilltopCrawler/src/main/kotlin/nz/govt/eop/hilltop_crawler/Constants.kt` (1 addition)
```kotlin
package nz.govt.eop.hilltop_crawler

const val HILLTOP_RAW_DATA_TOPIC_NAME = "hilltop.raw"
const val OUTPUT_DATA_TOPIC_NAME = "observations"
```
`packages/HilltopCrawler/src/main/kotlin/nz/govt/eop/hilltop_crawler/TaskScheduler.kt` (128 deletions): this file was deleted.
`packages/HilltopCrawler/src/main/kotlin/nz/govt/eop/hilltop_crawler/api/HilltopFetcher.kt` (40 additions)
```kotlin
package nz.govt.eop.hilltop_crawler.api

import io.github.bucket4j.Bandwidth
import io.github.bucket4j.BlockingBucket
import io.github.bucket4j.Bucket
import java.net.URI
import java.time.Duration
import mu.KotlinLogging
import org.springframework.stereotype.Component
import org.springframework.web.client.RestClientException
import org.springframework.web.client.RestTemplate

/**
 * Simple wrapper around RestTemplate to limit the number of requests per second. This is to avoid
 * overloading the Hilltop servers.
 *
 * Could be extended to have a bucket per host to increase throughput. But currently the bottleneck
 * is when initially loading data from Hilltop, which ends up effectively processing one host at a
 * time.
 */
@Component
class HilltopFetcher(val restTemplate: RestTemplate) {

  private final val logger = KotlinLogging.logger {}

  private final val bucketBandwidthLimit: Bandwidth = Bandwidth.simple(20, Duration.ofSeconds(1))
  private final val bucket: BlockingBucket =
      Bucket.builder().addLimit(bucketBandwidthLimit).build().asBlocking()

  fun fetch(fetchRequest: URI): String? {
    bucket.consume(1)
    logger.trace { "Downloading $fetchRequest" }
    return try {
      restTemplate.getForObject(fetchRequest, String::class.java)
    } catch (e: RestClientException) {
      logger.info(e) { "Failed to download $fetchRequest" }
      null
    }
  }
}
```