Custom scraper
Here we go through how to create a custom scraper plugin for AKL. We assume you already know the basics of coding in Python, creating Kodi addons and working with git. On this page we only go into the specifics of the needed code, so if you are not prepared, start here.
A good example is the default plugin for AKL. Get the code here. We will refer to this codebase.
The implementation of a scraper is more elaborate than that of a launcher or scanner. The scraper class takes care of finding metadata and assets that correspond to the given ROM entries. During this process it might scan disks or web sources. Depending on the source, it might need to copy or download files to the correct directories on your setup. Scrapers are not associated or stored with a specific ROM or ROM collection, so there are no configuration options stored with those items; you simply run the scraper for a collection or for a particular ROM. However, the way you run it for that item can be highly configurable.
The constructor for scrapers only needs a path to the caching directory it can use during scraping. If the path does not exist, it will be created automatically by the scraper base class. The path must be of the io.FileName type.
```python
# --- Constructor ----------------------------------------------------------------------------
# @param cache_dir: [io.FileName] Path to scraper cache dir.
def __init__(self, cache_dir: io.FileName):
    ...
```
So creating a new scraper class is not that exciting. Executing the scraper, however, is a bit more work. The scraper needs to be handed to a ScrapeStrategy, a piece of logic that actually manages the whole process of going through ROMs, assets etc. and keeping track of the progress. The scraper implementation contains the logic for checking a source for data about a certain game and retrieving that data; the strategy maintains the complete scraping process.
To get the job done you will need to provide the ScrapeStrategy with the webserver arguments retrieved from the initial addon call, the scraper settings from the --settings argument, a new progress dialog and, of course, your scraper implementation. After that, depending on whether the romcollection argument is set or only the rom argument, you will need to instruct the scrape strategy to perform a multiple or single scrape action, and process the scraped results afterwards.
It will look something like:
```python
def run_scraper(args):
    pdialog = kodi.ProgressDialog()
    settings = ScraperSettings.from_settings_dict(args.settings)
    scraper_strategy = ScrapeStrategy(
        args.server_host,
        args.server_port,
        settings,
        LocalFilesScraper(),
        pdialog)

    if args.rom_id is not None:
        scraped_rom = scraper_strategy.process_single_rom(args.rom_id)
        pdialog.endProgress()
        pdialog.startProgress('Saving ROM in database ...')
        scraper_strategy.store_scraped_rom(args.akl_addon_id, args.rom_id, scraped_rom)
        pdialog.endProgress()

    if args.romcollection_id is not None:
        scraped_roms = scraper_strategy.process_collection(args.romcollection_id)
        pdialog.endProgress()
        pdialog.startProgress('Saving ROMs in database ...')
        scraper_strategy.store_scraped_roms(args.akl_addon_id, args.romcollection_id, scraped_roms)
        pdialog.endProgress()
```
Is your scraper ready after this? Maybe; we still want to check whether the scraper is actually able to work. For example, check if required API keys are configured, etc. If there are fatal errors during that check we want to deactivate the scraper; this helps avoid abusing the source that is being scraped. You can override the following method and do your checks there.
def check_before_scraping(self, status_dic) -> dict
It returns the status_dic in normal state if everything is ok.
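A minimal sketch of such a check, assuming an API-key based source. Only the method name and the status_dic convention come from AKL; the attribute names (api_key, scraper_deactivated) and the status_dic keys used here are illustrative assumptions:

```python
# Hypothetical scraper that refuses to run without a configured API key.
class MyScraper:
    def __init__(self, api_key: str = ''):
        self.api_key = api_key              # assumed setting, not an AKL field
        self.scraper_deactivated = False    # illustrative flag

    def check_before_scraping(self, status_dic) -> dict:
        # A missing API key is a fatal error: deactivate the scraper so the
        # source is not hit with requests that are doomed to fail.
        if not self.api_key:
            self.scraper_deactivated = True
            status_dic['status'] = False
            status_dic['msg'] = 'MyScraper error: no API key configured in settings.'
        return status_dic
```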
The following methods will probably be deprecated in the near future, with the data coming from the settings file instead.
get_name() : A simple method to give back the 'friendly' name of this addon.
get_filename() : The name to be used for cache files for this particular scraper.
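A minimal sketch of these two naming methods; the returned values are examples only:

```python
class MyScraper:
    def get_name(self) -> str:
        # 'Friendly' name shown to the user in dialogs and logs.
        return 'My Example Scraper'

    def get_filename(self) -> str:
        # Base name used for this scraper's cache files.
        return 'MyExampleScraper'
```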
Several methods need to be implemented to give information back about what your scraper is capable of. This will help the scraperstrategy class to execute the correct actions.
def supports_disk_cache(self) -> bool
Returns a boolean flag indicating if your scraper uses the disk cache.
def supports_search_string(self) -> bool
Returns a boolean flag indicating if your scraper supports free-text searching. This makes it possible for the user to simply enter some text and have the scraper look for matching hits.
def supports_metadata_ID(self, metadata_ID) -> bool
Returns a boolean flag indicating whether the given metadata ID is among the IDs this scraper supports and can search for.
def supports_metadata(self) -> bool
Flag to indicate if this scraper supports metadata in general.
def supports_asset_ID(self, asset_ID) -> bool
Returns a boolean flag indicating whether the given asset ID is among the IDs this scraper supports and can search for.
def supports_assets(self) -> bool
Flag to indicate if this scraper supports assets in general.
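Taken together, a scraper that handles a fixed set of IDs could declare its capabilities as below. The ID strings are illustrative stand-ins for AKL's constants, and the per-ID methods are sketched here as simple membership tests:

```python
# Assumed ID values; a real plugin would use the constants from the AKL module.
SUPPORTED_METADATA_IDS = ['title', 'year', 'genre', 'plot']
SUPPORTED_ASSET_IDS = ['boxfront', 'snap', 'fanart']

class MyScraper:
    def supports_disk_cache(self) -> bool:
        return False  # everything is fetched on the fly

    def supports_search_string(self) -> bool:
        return True   # the user may type a free-text search term

    def supports_metadata_ID(self, metadata_ID) -> bool:
        return metadata_ID in SUPPORTED_METADATA_IDS

    def supports_metadata(self) -> bool:
        return len(SUPPORTED_METADATA_IDS) > 0

    def supports_asset_ID(self, asset_ID) -> bool:
        return asset_ID in SUPPORTED_ASSET_IDS

    def supports_assets(self) -> bool:
        return len(SUPPORTED_ASSET_IDS) > 0
```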
The following methods need to be implemented to do the actual scraping for the desired items.
def get_candidates(self, search_term, rom: ROMObj, platform, status_dic)
Search for candidates and return a list of candidate dictionaries created with _new_candidate_dic().
- This function is never cached. What is cached are the chosen candidate games.
- If no candidates are found by the scraper, return an empty list and set status to True.
- If there is an error/exception (network error, bad data returned), return None; the cause is printed in the log, status is False and the status dictionary contains a user notification.
- The number of network errors/exceptions is recorded internally by the scraper. If the number of errors exceeds a threshold, the scraper is deactivated (no more errors reported in the future, just empty data is returned).
- If the scraper is overloaded (maximum number of API/web requests) it is considered an error and the scraper is internally deactivated immediately. The error message associated with the scraper overloading must be printed once, like any other error.
Details:
- @param search_term: [str] String to be searched.
- @param rom: [ROMObj] Scraper gets the known metadata set to use in searching candidates
- @param platform: [str] AKL platform.
- @param status_dic: [dict] kodi_new_status_dic() status dictionary.
- @return: [list] or None.
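The rules above can be sketched with a scraper backed by a small in-memory index standing in for a disk or web source. The candidate dictionary fields ('id', 'display_name', 'order') and the _new_candidate_dic() helper are modelled on the default plugin and should be treated as assumptions:

```python
# Hypothetical game index standing in for a disk or web source.
GAME_INDEX = {
    'Sega Mega Drive': [
        {'id': '101', 'name': 'Sonic The Hedgehog'},
        {'id': '102', 'name': 'Sonic The Hedgehog 2'},
    ],
}

class MyScraper:
    def _new_candidate_dic(self):
        return {'id': '', 'display_name': '', 'platform': '', 'order': 0}

    def get_candidates(self, search_term, rom, platform, status_dic):
        try:
            entries = GAME_INDEX.get(platform, [])
        except Exception as ex:  # network errors would be caught here
            status_dic['status'] = False
            status_dic['msg'] = 'MyScraper error: {}'.format(ex)
            return None
        candidates = []
        for entry in entries:
            if search_term.lower() not in entry['name'].lower():
                continue
            candidate = self._new_candidate_dic()
            candidate['id'] = entry['id']
            candidate['display_name'] = entry['name']
            candidate['platform'] = platform
            # Rank exact title matches before partial matches.
            candidate['order'] = 1 if entry['name'].lower() == search_term.lower() else 0
            candidates.append(candidate)
        candidates.sort(key=lambda c: c['order'], reverse=True)
        return candidates  # an empty list means 'no hits', status stays True
```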
def get_metadata(self, status_dic)
Returns the metadata for a candidate (search result). See comments in get_candidates()
- @param status_dic: [dict] kodi_new_status_dic() status dictionary.
- @return: [dict] Dictionary self._new_gamedata_dic().
If no metadata is found (very unlikely), a dictionary with default values is returned.
If there is an error/exception, None is returned, the cause is printed in the log and status_dic has a message to show.
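Continuing the sketch, get_metadata() could look like the following. The assumption that the chosen candidate is stored on the scraper instance, and the field names in _new_gamedata_dic(), are modelled on the default plugin:

```python
class MyScraper:
    def __init__(self):
        self.candidate = None  # set once the user picks a search result

    def _new_gamedata_dic(self):
        return {'title': '', 'year': '', 'genre': '', 'developer': '', 'plot': ''}

    def get_metadata(self, status_dic):
        gamedata = self._new_gamedata_dic()
        if self.candidate is None:
            # Very unlikely case: nothing found, return the defaults.
            return gamedata
        gamedata['title'] = self.candidate.get('display_name', '')
        gamedata['year'] = self.candidate.get('year', '')
        gamedata['plot'] = self.candidate.get('plot', '')
        return gamedata
```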
def get_assets(self, asset_info_id:str, status_dic)
Returns a list of assets for a candidate (search result). See comments in get_candidates()
- @param status_dic: [dict] kodi_new_status_dic() status dictionary.
- @return: [list] List of _new_assetdata_dic() dictionaries.
If no assets found then an empty list is returned.
If there is an error/exception, None is returned, the cause is printed in the log and status_dic has a message to show.
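A sketch of get_assets() over a static asset table keyed by candidate ID. The _new_assetdata_dic() fields and the example URLs are assumptions:

```python
# Hypothetical asset table mapping candidate IDs to asset entries.
ASSET_INDEX = {
    '101': [
        {'asset_ID': 'boxfront', 'url_thumb': 'http://example.org/101/box_t.jpg',
         'url': 'http://example.org/101/box.jpg'},
        {'asset_ID': 'snap', 'url_thumb': 'http://example.org/101/snap_t.jpg',
         'url': 'http://example.org/101/snap.jpg'},
    ],
}

class MyScraper:
    def __init__(self, candidate_id):
        self.candidate_id = candidate_id

    def _new_assetdata_dic(self):
        return {'asset_ID': '', 'display_name': '', 'url_thumb': '', 'url': ''}

    def get_assets(self, asset_info_id, status_dic):
        assets = []
        for entry in ASSET_INDEX.get(self.candidate_id, []):
            if entry['asset_ID'] != asset_info_id:
                continue
            asset = self._new_assetdata_dic()
            asset['asset_ID'] = entry['asset_ID']
            asset['display_name'] = entry['url'].rsplit('/', 1)[-1]
            asset['url_thumb'] = entry['url_thumb']
            asset['url'] = entry['url']
            assets.append(asset)
        return assets  # empty list when nothing matches
```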
def resolve_asset_URL(self, selected_asset, status_dic)
When returning the asset list with get_assets(), some sites return thumbnail images because the real assets are on a separate dedicated page. For these sites, resolve_asset_URL() returns the true, full-size URL of the asset.
Other scrapers, for example MobyGames, return both the thumbnail and the true asset URLs in get_assets(). In such case, the implementation of this method is trivial.
- @param selected_asset:
- @param status_dic: [dict] kodi_new_status_dic() status dictionary.
- @return: [tuple of strings] or None
First item, string with the URL to download the asset.
Second item, string with the URL for printing in logs. URL may have sensitive information in some scrapers.
None is returned in case of error and status_dic updated.
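For the trivial case, where get_assets() already returned full-size URLs, a sketch could look like this. The api_key masking illustrates why a separate log URL exists; the attribute name is an assumption:

```python
class MyScraper:
    def __init__(self, api_key=''):
        self.api_key = api_key  # assumed setting, not an AKL field

    def resolve_asset_URL(self, selected_asset, status_dic):
        url = selected_asset['url']
        # Strip the API key before the URL is written to the log.
        log_url = url.replace(self.api_key, '***') if self.api_key else url
        return url, log_url
```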
def resolve_asset_URL_extension(self, selected_asset, image_url, status_dic)
Get the URL image extension. In some scrapers the type of asset cannot be derived from the asset URL and must be resolved in order to save the asset in the filesystem.
- @param selected_asset:
- @param image_url:
- @param status_dic: [dict] kodi_new_status_dic() status dictionary.
- @return: [str] String with the image extension in lowercase 'png', 'jpg', etc.
None is returned in case of error/exception and status_dic is updated.
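A sketch for the common case where the extension can be read off the URL path; a real scraper might otherwise issue a HEAD request and inspect the Content-Type header:

```python
import os
from urllib.parse import urlparse

class MyScraper:
    def resolve_asset_URL_extension(self, selected_asset, image_url, status_dic):
        # urlparse() drops the query string, so '?key=...' does not leak into
        # the extension.
        path = urlparse(image_url).path
        ext = os.path.splitext(path)[1].lstrip('.').lower()
        if ext in ('png', 'jpg', 'jpeg', 'gif'):
            return ext
        status_dic['status'] = False
        status_dic['msg'] = 'Cannot determine image extension of {}'.format(image_url)
        return None
```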
And that is basically it for the scraper.