-
Notifications
You must be signed in to change notification settings - Fork 15
Monitors
Previous Chapter: Graph Managers with Frontera
Spidermon is the standard and complete package for monitoring spiders jobs, raise alerts and generating stats in the context of a single spider job. But when you have multiple spider jobs and scripts, a big view of what is going on in the entire workflow is usually required. And spidermon has no this big view. In addition, events that are usually a bad sign within a single job, like zero items extracted, is not necessarily an error in the context of a big amount of jobs. Lets think for example in a job that makes searches all the time. Some jobs may not yield results. Others does.
shub-workflow
provides a base monitor class, easily configurable, that allows to quickly set up a monitor script that aggregates and generate stats from spiders and scripts stats and logs, over
any custom period of time. The base class monitor already gathers and aggregates some default stats, like scrapy downloader responses, scraped items and jobs. For example, this is the most
basic monitor you can build:
import logging
from shub_workflow.utils.monitor import BaseMonitor
class Monitor(BaseMonitor):
pass
if __name__ == "__main__":
import logging
from shub_workflow.utils import get_kumo_loglevel
logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
script = Monitor()
script.run()
Then run the help on the resulting script:
> python monitor.py -h
Couldn't find Dash project settings, probably not running in Dash
usage: You didn't set description for this script. Please set description property accordingly. [-h] [--project-id PROJECT_ID] [--flow-id FLOW_ID] [--children-tag CHILDREN_TAG] [--period PERIOD]
[--end-time END_TIME] [--start-time START_TIME]
options:
-h, --help show this help message and exit
--project-id PROJECT_ID
Overrides target project id.
--flow-id FLOW_ID If given, use the given flow id.
--children-tag CHILDREN_TAG, -t CHILDREN_TAG
Additional tag added to the scheduled jobs. Can be given multiple times.
--period PERIOD, -p PERIOD
Time window period in seconds. Default: 86400
--end-time END_TIME, -e END_TIME
End side of the time window. By default it is just now. Format: any string that can be recognized by dateparser.
--start-time START_TIME, -s START_TIME
Starting side of the time window. By default it is the end side minus the period. Format: any string that can be recognized by dateparser.
As you can see from the help, the default time window that the monitor will explore are the last 24 hours, ending right now. But let's say you want to print all stats for all the spiders that ran in your project in the last hour:
> python monitor.py --project-id=<project id> -s "one hour ago"
python monitor.py --project-id=645812 -s "one hour ago"
Couldn't find Dash project settings, probably not running in Dash
2024-08-08 15:46:51,809 shub_workflow.script [WARNING]: SHUB_JOBKEY not set: not running on ScrapyCloud.
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Start time: 2024-08-08 14:46:51.839459
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: End time: 2024-08-08 15:46:51.809693
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Checking script_logs...
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Checking scripts_stats...
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Checking spiders...
2024-08-08 15:46:57,525 shub_workflow.script [INFO]: {'downloader/response_count/35photo': 1655,
'downloader/response_count/500px': 6545,
'downloader/response_count/behance': 9976,
'downloader/response_count/deviantart': 28275,
'downloader/response_count/imgur': 13417,
'downloader/response_count/instagram': 11959,
'downloader/response_count/pinterest': 3543,
'jobs/35photo': 8,
'jobs/500px': 7,
'jobs/behance': 1,
'jobs/deviantart': 60,
'jobs/imgur': 301,
'jobs/instagram': 90,
'jobs/pinterest': 4,
'scraped_items/35photo': 849,
'scraped_items/500px': 12,
'scraped_items/behance': 8606,
'scraped_items/deviantart': 10971,
'scraped_items/imgur': 16415,
'scraped_items/instagram': 8232,
'scraped_items/pinterest': 3188,
'scraped_items/total': 48472,
'scraped_items_ratio/35photo': 1.75,
'scraped_items_ratio/500px': 0.02,
'scraped_items_ratio/behance': 17.75,
'scraped_items_ratio/deviantart': 22.63,
'scraped_items_ratio/imgur': 33.86,
'scraped_items_ratio/instagram': 16.98,
'scraped_items_ratio/pinterest': 6.58}
The same for the entire previous day:
> python monitor.py --project-id=645812 -e "today at 0:00 GMT"
Couldn't find Dash project settings, probably not running in Dash
2024-08-08 16:12:46,632 shub_workflow.script [WARNING]: SHUB_JOBKEY not set: not running on ScrapyCloud.
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Period set: 1 day, 0:00:00
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Start time: 2024-08-06 21:00:00
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: End time: 2024-08-07 21:00:00
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Checking script_logs...
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Checking scripts_stats...
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Checking spiders...
2024-08-08 16:12:57,949 shub_workflow.utils.monitor [INFO]: Read 1000 jobs
2024-08-08 16:13:01,692 shub_workflow.utils.monitor [INFO]: Read 2000 jobs
2024-08-08 16:13:05,377 shub_workflow.utils.monitor [INFO]: Read 3000 jobs
2024-08-08 16:13:09,081 shub_workflow.utils.monitor [INFO]: Read 4000 jobs
2024-08-08 16:13:12,947 shub_workflow.utils.monitor [INFO]: Read 5000 jobs
2024-08-08 16:13:16,638 shub_workflow.utils.monitor [INFO]: Read 6000 jobs
2024-08-08 16:13:20,439 shub_workflow.utils.monitor [INFO]: Read 7000 jobs
2024-08-08 16:13:24,368 shub_workflow.utils.monitor [INFO]: Read 8000 jobs
2024-08-08 16:13:28,422 shub_workflow.utils.monitor [INFO]: Read 9000 jobs
2024-08-08 16:13:31,799 shub_workflow.utils.monitor [INFO]: Read 10000 jobs
2024-08-08 16:13:35,587 shub_workflow.utils.monitor [INFO]: Read 11000 jobs
2024-08-08 16:13:39,580 shub_workflow.utils.monitor [INFO]: Read 12000 jobs
2024-08-08 16:13:43,327 shub_workflow.utils.monitor [INFO]: Read 13000 jobs
2024-08-08 16:13:47,874 shub_workflow.utils.monitor [INFO]: Read 14000 jobs
2024-08-08 16:13:47,905 shub_workflow.script [INFO]: {'downloader/response_count/35photo': 169251,
'downloader/response_count/500px': 26526,
'downloader/response_count/behance': 567736,
'downloader/response_count/deviantart': 1582627,
'downloader/response_count/fandom': 534506,
'downloader/response_count/imgur': 356781,
'downloader/response_count/instagram': 233300,
'downloader/response_count/pexels': 535933,
'downloader/response_count/pinterest': 126707,
'jobs/35photo': 636,
'jobs/500px': 588,
'jobs/behance': 80,
'jobs/deviantart': 2447,
'jobs/fandom': 6,
'jobs/imgur': 8393,
'jobs/instagram': 1695,
'jobs/pexels': 105,
'jobs/pinterest': 172,
'scraped_items/35photo': 87439,
'scraped_items/500px': 200,
'scraped_items/behance': 496376,
'scraped_items/deviantart': 611786,
'scraped_items/fandom': 509872,
'scraped_items/imgur': 513995,
'scraped_items/instagram': 155828,
'scraped_items/pexels': 252080,
'scraped_items/pinterest': 115061,
'scraped_items/total': 2744043,
'scraped_items_ratio/35photo': 3.19,
'scraped_items_ratio/500px': 0.01,
'scraped_items_ratio/behance': 18.09,
'scraped_items_ratio/deviantart': 22.3,
'scraped_items_ratio/fandom': 18.58,
'scraped_items_ratio/imgur': 18.73,
'scraped_items_ratio/instagram': 5.68,
'scraped_items_ratio/pexels': 9.19,
'scraped_items_ratio/pinterest': 4.19}
As you can see, the default stats aggregated are total scraped items, discriminated by spider, responses discriminated by spider, and number of jobs per spider. For all spider jobs that finished in
the selected time window. But if you provide the command line option --flow-id
, or the monitor script is scheduled from a graph manager, the monitor will select only those jobs belonging to the same
flow id.
You can also see that the -s
and -e
options are very flexible for defining specific time windows thanks to its support of any string that dateparser can understand. Notice that the applied time window
is printed in the log (here, the printed hour limit is different in number because the asked end hour is 0 GMT and the test was run in a -3 time zone). In addition, there is an option, -p
in order
to set the size of the time window (in secs) in case the -s
option, which has no default value, is missing. By default, with the time window parameters not specified, the end of the time window is just
now, and the start is just now minus the period provided by -p
, which defaults to 1 day (86400 secs). But whenever -s
is provided, it overrides the value of -p
. An important notion is that,
the time window filters spider and script jobs to collect stats from, based on their finish time. In case of long living scripts, you can generate monitor aggregated stats from the script log. In this case
the time window is used to select log lines based on its time stamp. How to configure the different capturing modes will be explained in the following sections.
The basic instantiation of a monitor from BaseMonitor
class aggregates stats, as we saw above, for any spider job found within the selected conditions (time window, workflow id if any).
But you may want to capture separately the stats from different groups of spiders, and only for specific ones. For these purposes, the monitor has a specialized attribute, target_spider_classes
:
(...)
from scrapy import Spider
(...)
class BaseMonitor(BaseScript):
# a map from spiders classes to check, to a stats prefix to identify the aggregated stats.
target_spider_classes: Dict[Type[Spider], str] = {Spider: ""}
(...)
This attribute is a dictionary from a spider class to a string which adds a prefix on the stats captured from the jobs of the spiders that are a subclass of the indicated class. By default, as
you can see above, the default value of this attribute is {"Spider": ""}
which means that the stats from all spiders are captured, and the resulting aggregated stats are prefixed with an empty
string, that is, no prefix added.
Lets suppose we have a project where we discover url videos from different sources, and another separated spider actually downloads the images from the discovered urls. And we want separated stats for the discovery and for the download. We can define our monitor in this way:
import logging
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider
class Monitor(BaseMonitor):
target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
(...)
The resulting stats:
{'discovery/downloader/response_count/tiktok': 183168,
'discovery/downloader/response_count/youtube': 9257802,
'discovery/jobs/tiktok': 6508,
'discovery/jobs/youtube': 10552,
'discovery/scraped_items/tiktok': 156325,
'discovery/scraped_items/total': 8753838,
'discovery/scraped_items/youtube': 8597513,
'discovery/scraped_items_ratio/tiktok': 1.79,
'discovery/scraped_items_ratio/youtube': 98.21,
'vdownload/downloader/response_count/downloader': 322440,
'vdownload/jobs/downloader': 9,
'vdownload/scraped_items/downloader': 319999,
'vdownload/scraped_items/total': 319999,
'vdownload/scraped_items_ratio/downloader': 100.0}
So, discovery spiders stats are prefixed with "discovery/", while downloader stats are prefixed with "vdownload/", so you can see separated the stats from different spider groups of the workflow.
As mentioned before, by default the collected stats from spiders are the jobs, the scraped items and the response counts. But you may also want to collect other stats, and not only standard scrapy ones but
also any custom one added by the spiders. A tuple attribute, target_spider_stats
, is stablished for this purpose:
(...)
class BaseMonitor(BaseScript):
(...)
# stats aggregated from spiders. A tuple of stats prefixes.
target_spider_stats: Tuple[str, ...] = ()
(...)
Lets continue with the previous example, by setting additional collected stats prefixes, in this case, dropped_videos/length_exceeded
,delivered_count/
and delivered_mb/
:
import logging
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider
class Monitor(BaseMonitor):
target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
target_spider_stats = ("dropped_videos/length_exceeded", "delivered_count/", "delivered_mb/")
(...)
And here a sample result:
{'discovery/downloader/response_count/tiktok': 179653,
'discovery/downloader/response_count/youtube': 8951749,
'discovery/dropped_videos/length_exceeded/10_to_20/tiktok': 102,
'discovery/dropped_videos/length_exceeded/10_to_20/total': 8707,
'discovery/dropped_videos/length_exceeded/10_to_20/youtube': 8605,
'discovery/dropped_videos/length_exceeded/20_to_30/tiktok': 24,
'discovery/dropped_videos/length_exceeded/20_to_30/total': 3094,
'discovery/dropped_videos/length_exceeded/20_to_30/youtube': 3070,
'discovery/dropped_videos/length_exceeded/30_more/tiktok': 7,
'discovery/dropped_videos/length_exceeded/30_more/total': 8109,
'discovery/dropped_videos/length_exceeded/30_more/youtube': 8102,
'discovery/dropped_videos/length_exceeded/4_to_6/tiktok': 1904,
'discovery/dropped_videos/length_exceeded/4_to_6/total': 8666,
'discovery/dropped_videos/length_exceeded/4_to_6/youtube': 6762,
'discovery/dropped_videos/length_exceeded/6_to_10/tiktok': 1081,
'discovery/dropped_videos/length_exceeded/6_to_10/total': 8427,
'discovery/dropped_videos/length_exceeded/6_to_10/youtube': 7346,
'discovery/dropped_videos/length_exceeded/tiktok': 3118,
'discovery/dropped_videos/length_exceeded/total': 37003,
'discovery/dropped_videos/length_exceeded/youtube': 33885,
'discovery/jobs/tiktok': 6382,
'discovery/jobs/youtube': 11424,
'discovery/scraped_items/tiktok': 164604,
'discovery/scraped_items/total': 8461749,
'discovery/scraped_items/youtube': 8297145,
'discovery/scraped_items_ratio/tiktok': 1.95,
'discovery/scraped_items_ratio/youtube': 98.05,
'vdownload/delivered_count/youtube': 272024,
'vdownload/delivered_mb/youtube': 5934998.122000018,
'vdownload/downloader/response_count/downloader': 322440,
'vdownload/jobs/downloader': 9,
'vdownload/scraped_items/downloader': 319999,
'vdownload/scraped_items/total': 319999,
'vdownload/scraped_items_ratio/downloader': 100.0}
We have again aggregated stats grouped by discovery and vdownload and the same default stats already collected in previous examples. And, whenever any of the spiders in the target groups have any stats with the prefixes declared on target spider stats, they are collected and aggregated as well.
Our monitor also supports aggregation of stats generated by scripts. As scripts are so diverse in function, they don't generate stats by default, but every script subclassed from
shub_workflow.script.BaseScript
class does have a stats object that allows the coder to log any required stats, similar to what spiders does natively. In fact the BaseScript
stats collector class is the same defined for spiders, and configured by the project STATS_CLASS
scrapy setting.
So, assuming your scripts generate stats, the shub_workflow
monitor can collect and aggregate them. The configuration attribute for this purpose is target_script_stats
, defined as:
(...)
class BaseMonitor(BaseScript):
(...)
# - a map from script name into a tuple of 2-elem tuples (aggregating stat regex, aggregated stat prefix)
# - the aggregating stat regex is used to match stat on target script
# - the aggregated stat prefix is used to generate the monitor stat. The original stat name is appended to
# the prefix.
# - if a group is present in the regex, its value is used as suffix of the generate stat, instead of
# the complete original stat name.
target_script_stats: Dict[str, Tuple[Tuple[str, str], ...]] = {}
(...)
Continuing with our example:
import logging
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider
class Monitor(BaseMonitor):
target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
target_spider_stats = ("dropped_videos/length_exceeded", "delivered_count/", "delivered_mb/")
target_script_stats = {
"py:deliver.py": ((r"^videos/(?!total)(.+)_time_secs", "delivered_time_secs"),),
}
(...)
For having a context, lets suppose that the py:delivery.py
script generates the following stats:
videos/youtube_time_secs
videos/tiktok_time_secs
videos/total_time_secs
We want to capture the first two, but not the total one (otherwise it will be wrongly aggregated as if there were another spider called total
, when it is already a sum of youtube and tiktok ones)
So, the defined target_script_stats
instructs the monitor to:
- collect in the
py:delivery.py
script, stats that matches the regexr"^videos/(.+)_time_secs"
, except the onevideos/total_time_secs
. That is why the final regexr"^videos/(?!total)(.+)_time_secs"
. - it will generate stats with the prefix
delivered_time_secs
, appended by the value of the regex group, in this case eithertiktok
oryoutube
.
The result:
{'delivered_time_secs/total': 91403985,
'delivered_time_secs/twitch': 2206250,
'delivered_time_secs/youtube': 89197735,
'discovery/downloader/response_count/tiktok': 175636,
'discovery/downloader/response_count/youtube': 10131111,
'discovery/dropped_videos/length_exceeded/10_to_20/tiktok': 162,
'discovery/dropped_videos/length_exceeded/10_to_20/total': 8205,
'discovery/dropped_videos/length_exceeded/10_to_20/youtube': 8043,
'discovery/dropped_videos/length_exceeded/20_to_30/tiktok': 19,
'discovery/dropped_videos/length_exceeded/20_to_30/total': 10019,
'discovery/dropped_videos/length_exceeded/20_to_30/youtube': 10000,
'discovery/dropped_videos/length_exceeded/30_more/tiktok': 10,
'discovery/dropped_videos/length_exceeded/30_more/total': 6291,
'discovery/dropped_videos/length_exceeded/30_more/youtube': 6281,
'discovery/dropped_videos/length_exceeded/4_to_6/tiktok': 2508,
'discovery/dropped_videos/length_exceeded/4_to_6/total': 10860,
'discovery/dropped_videos/length_exceeded/4_to_6/youtube': 8352,
'discovery/dropped_videos/length_exceeded/6_to_10/tiktok': 1501,
'discovery/dropped_videos/length_exceeded/6_to_10/total': 9886,
'discovery/dropped_videos/length_exceeded/6_to_10/youtube': 8385,
'discovery/dropped_videos/length_exceeded/tiktok': 4200,
'discovery/dropped_videos/length_exceeded/total': 45261,
'discovery/dropped_videos/length_exceeded/youtube': 41061,
'discovery/jobs/tiktok': 6201,
'discovery/jobs/youtube': 10119,
'discovery/scraped_items/tiktok': 191510,
'discovery/scraped_items/total': 9585329,
'discovery/scraped_items/youtube': 9393819,
'discovery/scraped_items_ratio/tiktok': 2.0,
'discovery/scraped_items_ratio/youtube': 98.0,
'vdownload/delivered_count/twitch': 80348,
'vdownload/delivered_count/youtube': 1053336,
'vdownload/delivered_mb/twitch': 1360238.9330000002,
'vdownload/delivered_mb/youtube': 22628197.08300002,
'vdownload/downloader/response_count/downloader': 1400806,
'vdownload/jobs/downloader': 69,
'vdownload/scraped_items/downloader': 1385000,
'vdownload/scraped_items/total': 1385000,
'vdownload/scraped_items_ratio/downloader': 100.0}
Notice the new delivered_time_secs
stats. The total here is not a capture of the videos/total_time_sec
, but the sum of the collected stats for youtube and tiktok.
In your project, you can add stats capture for every script you need, and you can add any amount of pair (regex, stat prefix).
In all the collecting modes we explained above, the monitor collects final stats of spider and script jobs that finished within the selected time window. But in some projects there are long lived scripts that run for days, weeks or even months, for which stats collecting is not adequate. You can still collect stats, but they will reflect only the current status of the job. The history, required for building aggregated stats for specific time windows, is lost. In general, if the time window is much smaller than the life time of the jobs you want to monitor, this problem with stats may always be present. Even for short lived scripts and spiders. However, in the majority of these cases, collecting final stats is what we want and we don't care about its history.
But when we do require to collect historic data, and this is specially the case when there are long running scripts that process data and other jobs in a continuous fashion, the monitor allows
to generate stats from the script logs. For configuring stats collecting from logs, the monitor has the attribute target_script_logs
:
(...)
class BaseMonitor(BaseScript):
(...)
# - a map from script name into a tuple of 2-elem tuples (aggregating log regex, aggregated stat name)
# - the aggregating log regex must match log lines in the target script job, with a group to extract a number
# from it. If not a group number is extracted, the match alone aggregates 1.
# - the final aggregated stat name is the specified aggregated stat name, plus a second non numeric group in the
# match, if exists.
target_script_logs: Dict[str, Tuple[Tuple[str, str], ...]] = {}
(...)
Lets see how this works with an example, continuation of the previos one:
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider
class Monitor(BaseMonitor):
target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
target_spider_stats = ("dropped_videos/length_exceeded", "delivered_count/", "delivered_mb/")
target_script_stats = {
"py:deliver.py": ((r"^videos/(?!total)(.+)_time_secs", "delivered_time_secs"),),
}
target_script_logs = {
"py:balancer.py": (
(r"Wrote (\d+).+?([a-z]+)_\d+T\d+.jl.gz", "new_urls"),
(r"Wrote (\d+).+?Filter_([a-z]+)_\d+T\d+.jl.gz", "new_filters_urls"),
),
}
(...)
Here the target_script_logs
instructs the monitor to capture log from the a script named "py:balancer.py". There are two regular expressiones defined. One will aggregate to the new_urls
stat prefix,
while the other to the new_filters_urls
stat prefix. For context, the first regex is intended to match log lines like this one:
Wrote 40000 records to youtube_20240803T160512.jl.gz
and regex groups are defined to match the number (in this case, 40000) and the stat suffix (in this case, youtube
). So, 40000 will be added to a stat called new_urls/youtube
.
The second regex tries to match lines like this one:
Wrote 17000 records to CHNFilter_youtube_20240803T174413.jl.gz
The two groups matches 17000 and youtube
. So in this case, 17000 is added to a stat called new_filters_urls/youtube
. Notice that the first regex also matches the lines matched
by the second regex, but not at the inverse. So the first stat prefix is more general and will aggregate this example line too. This of course is not a requirement. It is only the
case for this tutorial example.
The groups are optional. If no number group is extracted, it will just sum up a unity. If no suffix group is present, then the resulting stat name will be only the prefix. In addition, the number and the suffix groups, in case both are present, can be in any order.
And here the sample result for the example above:
{'delivered_time_secs/total': 94604203,
'delivered_time_secs/twitch': 2315144,
'delivered_time_secs/youtube': 92289059,
'discovery/downloader/response_count/tiktok': 175811,
'discovery/downloader/response_count/youtube': 10141688,
'discovery/dropped_videos/length_exceeded/10_to_20/tiktok': 148,
'discovery/dropped_videos/length_exceeded/10_to_20/total': 7738,
'discovery/dropped_videos/length_exceeded/10_to_20/youtube': 7590,
'discovery/dropped_videos/length_exceeded/20_to_30/tiktok': 18,
'discovery/dropped_videos/length_exceeded/20_to_30/total': 9736,
'discovery/dropped_videos/length_exceeded/20_to_30/youtube': 9718,
'discovery/dropped_videos/length_exceeded/30_more/tiktok': 9,
'discovery/dropped_videos/length_exceeded/30_more/total': 4575,
'discovery/dropped_videos/length_exceeded/30_more/youtube': 4566,
'discovery/dropped_videos/length_exceeded/4_to_6/tiktok': 2461,
'discovery/dropped_videos/length_exceeded/4_to_6/total': 10287,
'discovery/dropped_videos/length_exceeded/4_to_6/youtube': 7826,
'discovery/dropped_videos/length_exceeded/6_to_10/tiktok': 1471,
'discovery/dropped_videos/length_exceeded/6_to_10/total': 9486,
'discovery/dropped_videos/length_exceeded/6_to_10/youtube': 8015,
'discovery/dropped_videos/length_exceeded/tiktok': 4107,
'discovery/dropped_videos/length_exceeded/total': 41822,
'discovery/dropped_videos/length_exceeded/youtube': 37715,
'discovery/jobs/tiktok': 6203,
'discovery/jobs/youtube': 10163,
'discovery/scraped_items/tiktok': 188387,
'discovery/scraped_items/total': 9594702,
'discovery/scraped_items/youtube': 9406315,
'discovery/scraped_items_ratio/tiktok': 1.96,
'discovery/scraped_items_ratio/youtube': 98.04,
'vdownload/delivered_count/twitch': 82660,
'vdownload/delivered_count/youtube': 1086359,
'vdownload/delivered_mb/twitch': 1388436.563,
'vdownload/delivered_mb/youtube': 23503451.690000016,
'vdownload/downloader/response_count/downloader': 1443307,
'vdownload/jobs/downloader': 71,
'vdownload/scraped_items/downloader': 1427500,
'vdownload/scraped_items/total': 1427500,
'vdownload/scraped_items_ratio/downloader': 100.0,
'new_filters_urls': 600000,
'new_filters_urls/youtube': 600000,
'new_urls': 739342,
'new_urls/twitch': 139342,
'new_urls/youtube': 600000}
The configuration attributes described in the previous sections covers a broad set of most common use cases, but still may not contemplate other ones. For example, we may want to count the files in some filesystem (i.e. s3, gcs), or we may want to count the number of jobs with some specific argument, or whatever.
The monitor admits two kinds of extensions: job hooks and check methods. Job hooks come in two flavours: spider job hooks and script job hooks.
The base monitor has the following method:
(...)
class BaseMonitor(BaseScript):
(...)
def spider_job_hook(self, jobdict: JobDict):
"""
This is called for every spider job retrieved, from the spiders declared on target_spider_classes,
so additional per job customization can be added.
"""
(...)
which is called on every spider job selected by the target_spider_classes
attribute and matches the other selectors like workflow id or time window. The parameter passed to this method is a
JobDict
object containing the spider
name, spider_args
, scrapystats
, tags
and close_reason
. The default method does nothing. But you can override it.
Example override in our project monitor:
from shub_workflow.script import JobDict
from shub_workflow.utils.monitor import BaseMonitor
class Monitor(BaseMonitor):
(...)
def spider_job_hook(self, jobdict: JobDict):
if jobdict["spider"] == "downloader" and "Filter_youtube_" in jobdict["spider_args"]["source"]:
self.stats.inc_value(
"vdownload/delivered_count/youtube/filters", jobdict["scrapystats"].get("delivered_count/youtube", 0)
)
This example code aggregates the spider job stat delivered_count/youtube
for a specific spider with a specific pattern in spider argument, into the monitor stat
vdownload/delivered_count/youtube/filters
. Notice that basically this is what the combination of target_spider_classes
and target_spider_stats
instructs to do:
get some stat from spiders and aggregate into a target monitor stat. However, with these attributes you cannot select jobs based on spider arguments, so most complex
use cases can be handled via this hook.
Similarly, we have the same hook, but for scripts, in this case for the ones declared on target_script_stats
:
(...)
class BaseMonitor:
(...)
def script_job_hook(self, jobdict: JobDict):
"""
This is called for every script job retrieved, from the scripts declared on target_script_stats,
so additional per job customization can be added
"""
(...)
If you don't need to aggregate any specific stat using the target_script_stats
, and only need to declare a script only to be handled by this hook, you can just set the
value of the corresponding entry to an empty tuple:
from shub_workflow.script import JobDict
from shub_workflow.utils.monitor import BaseMonitor
class Monitor(BaseMonitor):
target_script_stats = {
"py:myscript.py": ()
}
def script_job_hook(self, jobdict: JobDict):
if jobdict["spider"] == "py:myscript.py":
(...)
If you have a single script declared on target_script_stats
, it may seem redundant to check in the script hook if the job script name is the one you want to target. But if you later add
more scripts to target_script_stats
, you may forgot that you have the script job hook. So for safety it is not bad to check the script name anyway.