resmgr list-installed only knows about 3 processors with preconfigured resources #1251

Open
bertsky opened this issue Jul 4, 2024 · 4 comments

@bertsky
Collaborator

bertsky commented Jul 4, 2024

But surely, such a user database was at least created by that call. And if you did not run any list-available prior to that, then that database would be just a mirror of the distributed ocrd/resource_list.yml (hence only those 3 processors).

So we just uncovered another serious bug: initialisation does not search the PATH for ocrd-* executables, only list-available does. But without these database entries, list-installed never even attempts to look for other executables!

Originally posted by @bertsky in #1246 (comment)

@bertsky
Collaborator Author

bertsky commented Jul 4, 2024

explanation

ResourceManager inits its database from the predistributed ocrd/resource_list.yml:

    self.load_resource_list(Path(RESOURCE_LIST_FILENAME))
    if not self.user_list.exists():
        if not self.user_list.parent.exists():
            self.user_list.parent.mkdir(parents=True)
        self.save_user_list()
    self.load_resource_list(self.user_list)

New database entries only get made by either

  1. list-available (with some executable glob pattern)
    if dynamic:
        for exec_dir in environ['PATH'].split(':'):
            for exec_path in Path(exec_dir).glob(f'{executable}'):
                self.log.debug(f"Inspecting '{exec_path} --dump-json' for resources")
                ocrd_tool = get_ocrd_tool_json(exec_path)
                for resdict in ocrd_tool.get('resources', ()):
                    if exec_path.name not in database:
                        database[exec_path.name] = []
                    database[exec_path.name].insert(0, resdict)
        database = self._dedup_database(database)
  2. list-installed (when explicitly naming the executable)
    resdict = self.add_to_user_database(this_executable, res_filename, resource_type=res_type)

(So not even a download ensures the respective entry exists!)

However, list-installed only lists models found for processors in the database, plus any found under XDG_DATA_HOME (data location) and /usr/local/share (system location).

    all_executables = list(self.database.keys())
    # resources in the file system
    parent_dirs = [join(x, 'ocrd-resources') for x in [self.xdg_data_home, '/usr/local/share']]
    for parent_dir in parent_dirs:
        if Path(parent_dir).exists():
            all_executables += [x for x in listdir(parent_dir) if x.startswith('ocrd-')]

So it does not cover:

  • processors that (only) use a module location for resources
  • processors other than the 3 in ocrd/resource_list.yml if XDG_DATA_HOME is just a symlink (as is the case in ocrd/all Docker)
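To make the gap concrete, here is a minimal standalone sketch that mirrors the candidate collection quoted above (database keys plus `ocrd-*` directory names under existing resource directories). The function name `candidate_executables`, the temporary directory layout and the processor name `ocrd-example-module-only` are assumptions made up for this illustration, not part of ocrd.

```python
import tempfile
from pathlib import Path


def candidate_executables(database, parent_dirs):
    """Collect names like the quoted list-installed code does:
    database keys, plus 'ocrd-*' entries under existing resource dirs."""
    candidates = list(database.keys())
    for parent_dir in parent_dirs:
        parent = Path(parent_dir)
        if parent.exists():
            candidates += [p.name for p in parent.iterdir()
                           if p.name.startswith('ocrd-')]
    return candidates


def demo():
    # Hypothetical state: the database only mirrors a preconfigured entry,
    # and one other processor has a directory in the data location.
    database = {'ocrd-sbb-binarize': []}
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / 'ocrd-resources'
        (data_dir / 'ocrd-tesserocr-recognize').mkdir(parents=True)
        # A system location that does not exist is silently skipped.
        missing_dir = Path(tmp) / 'does-not-exist' / 'ocrd-resources'
        return candidate_executables(database, [data_dir, missing_dir])


found = demo()
print(found)
# A processor that keeps resources only in its module location, and has
# no database entry, never becomes a candidate:
print('ocrd-example-module-only' in found)
```

Nothing in this collection step looks at PATH, so an installed processor without a database entry or a resource directory is never inspected.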

expectation

list-installed * or just list-installed (without a name) should look for all executables in PATH, regardless of existing database entries.

Perhaps, considering #1250, we could make an exception if some ocrd-all-tool.json is installed: in that case, one should not waste time searching PATH, but can just pick the precomputed list.
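A rough sketch of what such a lookup could look like, independent of any database state. The function name `find_ocrd_executables` and its `all_tool_json` parameter are hypothetical, and the sketch assumes that ocrd-all-tool.json is a JSON object keyed by executable name; it is not the actual ocrd implementation.

```python
import fnmatch
import json
import os
import tempfile
from pathlib import Path


def find_ocrd_executables(path_env, pattern='ocrd-*', all_tool_json=None):
    """Return sorted names of executables matching `pattern`.

    If `all_tool_json` (hypothetical) names an existing file, its top-level
    keys are used as the precomputed list and PATH is not searched.
    """
    if all_tool_json is not None and Path(all_tool_json).is_file():
        with open(all_tool_json, encoding='utf-8') as f:
            names = json.load(f).keys()
        return sorted(n for n in names if fnmatch.fnmatch(n, pattern))
    names = set()
    # os.pathsep instead of a literal ':' keeps the split portable
    for exec_dir in path_env.split(os.pathsep):
        directory = Path(exec_dir)
        if not exec_dir or not directory.is_dir():
            continue
        for exec_path in directory.glob(pattern):
            if exec_path.is_file() and os.access(exec_path, os.X_OK):
                names.add(exec_path.name)
    return sorted(names)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        bin_dir = Path(tmp) / 'bin'
        bin_dir.mkdir()
        for name in ('ocrd-dummy-a', 'ocrd-dummy-b', 'unrelated-tool'):
            script = bin_dir / name
            script.write_text('#!/bin/sh\n')
            script.chmod(0o755)
        # Matching name, but not executable: must be ignored.
        plain = bin_dir / 'ocrd-not-executable'
        plain.write_text('')
        plain.chmod(0o644)
        path_env = os.pathsep.join([str(bin_dir), str(Path(tmp) / 'missing')])
        scanned = find_ocrd_executables(path_env)
        tool_json = Path(tmp) / 'ocrd-all-tool.json'
        tool_json.write_text(json.dumps({'ocrd-dummy-c': {}, 'other': {}}))
        precomputed = find_ocrd_executables(path_env, all_tool_json=tool_json)
    return scanned, precomputed


scanned, precomputed = demo()
print(scanned)
print(precomputed)
```

With a result like this, list-installed could iterate over every processor on PATH and query its resources, instead of starting from the database keys.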

@bertsky
Collaborator Author

bertsky commented Jul 4, 2024

No, wait.

  • processors other than the 3 in ocrd/resource_list.yml if XDG_DATA_HOME is just a symlink (as is the case in ocrd/all Docker)

That's not true: it should be independent of whether it's a symlink. More likely, we just ran into OCR-D/ocrd_all#394 again, without noticing.

(So not even a download ensures the respective entry exists!)

If you enter via cli.resmgr.download, then a (dynamic) list_available (creating entries) will be part of the process.

@MehmedGIT
Contributor

I removed ~/.config/ocrd/resources.yml, then reinstalled core from the current master branch. This is the result:

(venv38-core) mm@MM-Notebook:~/repos/core$ ocrd resmgr list-installed
12:38:19.387 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource '3gs.csv.gz' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/3gs.csv.gz) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.374 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'config.json' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/config.json) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.387 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'model.zip' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/model.zip) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.402 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'ocrd-cis.jar' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/ocrd-cis.jar) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.418 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'stopwords.json' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/div/stopwords.json) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:25.828 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'Fraktur.traineddata' (/home/mm/venv38-all/share/tessdata/Fraktur.traineddata) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:26.614 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'alto' (/home/mm/venv38-all/share/tessdata/configs/alto) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:26.644 ERROR ocrd.resource_manager - [ocrd-tesserocr-recognize.2] Additional properties are not allowed ('path' was unexpected)
Traceback (most recent call last):
  File "/home/mm/venv38-core/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/cli/resmgr.py", line 64, in list_installed
    for executable, reslist in resmgr.list_installed(executable):
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 168, in list_installed
    resdict = self.add_to_user_database(this_executable, res_filename, resource_type=res_type)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 202, in add_to_user_database
    self.load_resource_list(self.user_list)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 84, in load_resource_list
    raise ValueError("Resource list %s is invalid!" % (list_filename))
ValueError: Resource list /home/mm/.config/ocrd/resources.yml is invalid!

I am not even sure why we have something like a database. It is for caching purposes obviously, but the state becomes inconsistent and leads to unexpected errors over time.

@bertsky
Collaborator Author

bertsky commented Jul 5, 2024

I am not even sure why we have something like a database. It is for caching purposes obviously, but the state becomes inconsistent and leads to unexpected errors over time.

I agree – the user database (as a file) does not seem useful. Any subsequent list-installed will have to do a filesystem search anyway. And we do get lots of false positive entries – like the config/* stuff in Tesseract, or in other cases confusing model directories with model files.

We should also get rid of the preconfigured ocrd/resource_list.yml: the ocrd-sbb-binarize model info is outdated, the ocrd-cis-ocropy-recognize resources I have just added to its ocrd-tool.json (this just needs an update in ocrd_all), and the ocrd-calamari-recognize resources will follow as soon as OCR-D/ocrd_calamari#112 gets merged and updated in ocrd_all.
