Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Solr OCR Highlighting Plugin in Search in Document Plugin #587

Merged

Conversation

beatrycze-volk
Copy link
Collaborator

@beatrycze-volk beatrycze-volk commented Feb 11, 2021

The main aim of this PR is to implement Solr OCR Highlighting Plugin inside the existing Search in Document Plugin.

Tasks:

  • adjust SOLR schema
  • remap index fields or possible usage of indexes with different schema than default one
  • remap input fields names to make query parameters configurable
  • add backend classes for handling results returned from SOLR index
  • update SearchInDocumentTool.php and file and its template to pass all needed parameters
  • rewrite SearchInDocument,js file to handle search and processing of the result
    • display result snippets
    • highlight searched word in the full text
    • highlight searched word in the image
  • update documentation for TypoScript config

Changes which should be discussed:

  • Access-Control-Allow-Origin
    Currently in PageViewProxy it is set. It causes problem when this header is also configured in .htaccess.
  • replace of submit button with standard function button in search form
    For unknown reason submit form button with prevent page reload was not working for the partner, so it got changes to standard button which calls a method. It work for the partner now, but it has some implications for e.g. pressing Enter after writing search term simply submits form without executing search function. One of the possible solutions would be also map the key to the function, but it is not sure that there are no other implications not detected currently.

@albig
Copy link
Collaborator

albig commented Feb 11, 2021

What do I have to do to install/activate the OCR Highlighting Plugin? Does the code work, if it's not installed?

@beatrycze-volk
Copy link
Collaborator Author

What do I have to do to install/activate the OCR Highlighting Plugin? Does the code work, if it's not installed?

https://dbmdz.github.io/solr-ocrhighlighting/installation/

I would assume that it doesn't work if not installed. I have only tested with SOLR index which had the plugin installed.

@albig albig added this to the Kitodo.Presentation 4.0.0 milestone Feb 11, 2021
@albig
Copy link
Collaborator

albig commented Feb 11, 2021

What do I have to do to install/activate the OCR Highlighting Plugin? Does the code work, if it's not installed?

https://dbmdz.github.io/solr-ocrhighlighting/installation/

I would assume that it doesn't work if not installed. I have only tested with SOLR index which had the plugin installed.

Ok. Than we have to include this information to the documentation.

And I will setup a testsystem with this Solr plugin. May take some days.

@beatrycze-volk beatrycze-volk force-pushed the use-solr-highlighting branch 2 times, most recently from a32763a to c924cd1 Compare February 15, 2021 12:49
@albig albig added the ⚙ feature A new feature or enhancement. label Feb 17, 2021
@beatrycze-volk
Copy link
Collaborator Author

beatrycze-volk commented Mar 15, 2021

Should I mark it as ready to merge or we also decide that some changes can't be merged in and I will create new branch in which I will implement new stuff without changing old ones (I mean way of submitting the form and cross origin access)?

@beatrycze-volk beatrycze-volk force-pushed the use-solr-highlighting branch from 0498523 to 97ebdaa Compare March 15, 2021 15:08
@beatrycze-volk beatrycze-volk marked this pull request as ready for review March 22, 2021 15:12
@albig
Copy link
Collaborator

albig commented Mar 26, 2021

https://dbmdz.github.io/solr-ocrhighlighting/installation/

I would assume that it doesn't work if not installed. I have only tested with SOLR index which had the plugin installed.

You are right. It doesn't work out-of-the box. The search plugin still finds no snippets of course.The searchInDocument fails with exception.

I still don't like the idea to make the solr-ocrhighlighting plugin obligatory. And I will investigate howto upgrade an existing installation.

Copy link
Collaborator

@albig albig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't manage to get the Solr working with

  • Solr 8.8
  • solr-ocrhighlighting 0.5.0 (current release)
  • schema.xml and solrconfig.xml from this PR
  • updated extension configuration for solr (using default values)

On indexing a document, I end up with the following error

org.apache.solr.common.SolrException: Exception writing document id 36LOG_0003 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:256)
	at org.apache.solr.update.processor.RunUpdateProcessorFactory$RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:73)
...
Caused by: java.lang.RuntimeException: Could not parse pointer: 
	at de.digitalcollections.solrocr.model.SourcePointer.parse(SourcePointer.java:81)
	at de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory.create(ExternalUtf8ContentFilterFactory.java:42)
	at org.apache.solr.analysis.TokenizerChain.initReader(TokenizerChain.java:97)
...

Please check again and tell me your system setup to reproduce.

Classes/Common/SolrSearchResult/ResultDocument.php Outdated Show resolved Hide resolved
Classes/Common/SolrSearchResult/ResultDocument.php Outdated Show resolved Hide resolved
@beatrycze-volk beatrycze-volk force-pushed the use-solr-highlighting branch 2 times, most recently from 200fe81 to a4f5373 Compare March 29, 2021 15:46
{
$id = $this->doc->uid;

if (!empty($this->conf['documentIdUrlSchema'])) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this? Could you please provide an example string to verify the string splitting magic below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes id of the document is actually not an id but a link to xml file. Then to search in the index the full link can't be used so it is needed to extract real id out of the full link.

Example:
https://dev-ddb.fiz-karlsruhe.de/api/items/F45KF24U7UIIVCUJJBV334PD7IPE7JBW/source/record -> https://dev-ddb.fiz-karlsruhe.de/api/items/*id*/source/record

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to distinguish more cases here:

  • The normal case would be documents which are properly indexed in the TYPO3 database. Those don't need any special treatment for their ID anyway.
  • The "DFG-Viewer" case would be documents which are not indexed in the database nor in the Solr index. Normally those can be identified by the fact that they have an URL as ID.
  • Then there is the special "Zeitungsportal" case where we have documents, which are not indexed in the database, but are indexed in the Solr index. Those have an URL as ID as well, but need special treatment to extract the indexed ID from the URL.

I suggest using the TYPO3 hook mechanism to handle all those cases:

  • If a document has a numerical ID, no further treatment is necessary.
  • If a document has a string ID, this string is processed through a hook.
    • If the hook returns an integer, the document is treated as indexed.
    • If the hook returns the string unchanged, the document is considered unindexed and ignored when searching. (This would be the default if no hook is registered, e. g. in "DFG-Viewer mode".)

The advantages would be:

  • We don't need to add Zeitungportal-specific logic to the main extension. Instead the hook can be part of the supplemental Zeitungsportal extension.
  • The solution is extensible, because you can register more than one hook to handle different kinds of identifiers. So anybody with a setup of a pre-existing Solr index could map their document IDs to the index IDs and re-use their Solr index. (This could even be interesting for SLUB, when we think about consolidating our different indices - CMR, Find, Kitodo, ...)
  • Processing the ID string wouldn't be limited to splitting a string, because a hook could include more complex logic like regular expressions or even database lookups.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using the TYPO3 hook mechanism to handle all those cases:

Do you mean https://typo3.org/article/how-to-use-existing-hooks-in-your-own-extension ?

If a document has a string ID, this string is processed through a hook.
If the hook returns an integer, the document is treated as indexed.
If the hook returns the string unchanged, the document is considered unindexed and ignored when searching. (This would be the default if no hook is registered, e. g. in "DFG-Viewer mode".)

But how to convert string ID to an integer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually already have some hooks in Kitodo.Presentation. A good example is the way how we extract the record ID from METS:
Normally we would take /mets:mets[@OBJID] as the document's unique record identifier:

// Check for METS object @ID.
if (!empty($this->mets['OBJID'])) {
$this->recordId = (string) $this->mets['OBJID'];
}

But if this attribute is not set (which is the case for old versions of Kitodo.Production), we use hooks to have a fallback. The hook is registered in ext_localconf.php:
$GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['dlf/Classes/Common/MetsDocument.php']['hookClass'][] = \Kitodo\Dlf\Hooks\KitodoProductionHacks::class;
and called in Classes/Common/MetsDocument.php:
// Get hook objects.
$hookObjects = Helper::getHookObjects('Classes/Common/MetsDocument.php');
// Apply hooks.
foreach ($hookObjects as $hookObj) {
if (method_exists($hookObj, 'construct_postProcessRecordId')) {
$hookObj->construct_postProcessRecordId($this->xml, $this->recordId);
}
}

The hook itself can be found in Classes/Hooks/KitodoProductionHacks.php:
public function construct_postProcessRecordId(\SimpleXMLElement &$xml, &$record_id)
{
if (!$record_id) {
$xml->registerXPathNamespace('mods', 'http://www.loc.gov/mods/v3');
// Get all logical structure nodes with metadata, but without associated METS-Pointers.
if (($divs = $xml->xpath('//mets:structMap[@TYPE="LOGICAL"]//mets:div[@DMDID and not(./mets:mptr)]'))) {
$smLinks = $xml->xpath('//mets:structLink/mets:smLink');
if (!empty($smLinks)) {
foreach ($smLinks as $smLink) {
$links[(string) $smLink->attributes('http://www.w3.org/1999/xlink')->from][] = (string) $smLink->attributes('http://www.w3.org/1999/xlink')->to;
}
foreach ($divs as $div) {
if (!empty($links[(string) $div['ID']])) {
$id = (string) $div['DMDID'];
break;
}
}
}
if (empty($id)) {
$id = (string) $divs[0]['DMDID'];
}
$dmdIds = explode(' ', $id);
foreach ($dmdIds as $dmdId) {
$recordIds = $xml->xpath('//mets:dmdSec[@ID="' . $dmdId . '"]//mods:mods/mods:recordInfo/mods:recordIdentifier');
if (!empty($recordIds)) {
$record_id = (string) $recordIds[0];
break;
}
}
}
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hook is a possibility. I think the modern way would be to make some "middleware". But I haven't done this before. Hook support itself seem to be not deprecated, although lots of core hooks are.

The DFG-Viewer case is not relevant in the SearchInDocument tool. This tool won't work with it anyway. The DFG-Viewer has it's own SRU search plugin, which requires a SRU server of course.

I suggest to first test and documentate this PR including indexing (doesn't work yet) and searching for Kitodo.Presentation alone. If this works, we can discuss further at this point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add at least a sample of the documentIdUrlSchema setting. Or is this part of the documentation?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be already part of #630. But I still need some (inline) documentation about what is going on here, why we are doing this and what string you expect in documentIdUrlSchema.

@beatrycze-volk
Copy link
Collaborator Author

@albig: Is your error similar to something like that:

org.apache.solr.common.SolrException: Exception writing document id ocrdoc-1 to the index; possible analysis error: Illegal char <:> at index 5: https://digital.slub-dresden.de/data/kitodo/TheDarea_416971482-19100223/TheDarea_416971482-19100223_ocr/00000001.xml at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:247) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)

?

After research I have found out that indexing when XML files are stored in data folder of the core works correctly but when they are passed from external URL the error appears.

I have checked in the GItHub page of the plugin and found dbmdz/solr-ocrhighlighting#49:

Currently we only support loading field values/OCR fragments from the file system.

@beatrycze-volk beatrycze-volk force-pushed the use-solr-highlighting branch 3 times, most recently from 45ebbb9 to 1c7c1f3 Compare April 14, 2021 18:10
@beatrycze-volk
Copy link
Collaborator Author

I'm sorry for this few rebase pushes, I have made a bit of mess and needed to fix it. Now all commits should be correct.

@albig
Copy link
Collaborator

albig commented Apr 15, 2021

@albig: Is your error similar to something like that:

org.apache.solr.common.SolrException: Exception writing document id ocrdoc-1 to the index; possible analysis error: Illegal char <:> at index 5: https://digital.slub-dresden.de/data/kitodo/TheDarea_416971482-19100223/TheDarea_416971482-19100223_ocr/00000001.xml at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:247) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)

?

After research I have found out that indexing when XML files are stored in data folder of the core works correctly but when they are passed from external URL the error appears.

I have checked in the GItHub page of the plugin and found dbmdz/solr-ocrhighlighting#49:

Currently we only support loading field values/OCR fragments from the file system.

Yes, I think this was the error I've seen. So we have to wait for the fix in the ocrhighlighting plugin?

@beatrycze-volk
Copy link
Collaborator Author

Yes, I think this was the error I've seen. So we have to wait for the fix in the ocrhighlighting plugin?

Yes, of course if they are going to fix it actually. That's not so sure after reading the conversation in this issue.

Other possiblity is to store content of the xml files instead pointers to those files.

@beatrycze-volk
Copy link
Collaborator Author

So, the text highlighting only works with the SearchInDocument plugin used together with the pageview plugin AND the TypoScript configuration for the "Toolbox" must be included.

Currently yes. There is also possibility to implement it in the way that coordinates will be already passed in URL parameter.

Nevertheless, it's still not working. I'll add some remarks in the following review on the right lines.

Thanks!

Copy link
Collaborator

@albig albig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have look at my comments.

for(var i = 0; i < queryParams.length; i++) {
var queryParam = queryParams[i].split('=');

if(queryParam[0] === $("input[id='tx-dlf-search-in-document-page']").attr('name')) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comparison doesn't respect the possible encoding of the parameters. This is wrong on other places, too.

If you click next/last page, the link parameter get's encoded:

/workview?tx_dlf%5Bdouble%5D=0&tx_dlf%5Bhighlight_word%5D=pieschen&tx_dlf%5Bid%5D=1&tx_dlf%5Bpage%5D=3&cHash=04b8a699974161fd87b83c9b36d65483

But 'tx_dlf%5Bpage%5D' === 'tx_dlf[page]' is not true so the highlighting fails.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highlighting seems to work now. Thank you!

Copy link
Collaborator

@albig albig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the patched ocr-highlighting plugin, I see the toplevel element. But there are some issues with the facets and the filter queries. Please see my comments.

Classes/Common/DocumentList.php Outdated Show resolved Hide resolved
Classes/Common/DocumentList.php Outdated Show resolved Hide resolved
foreach ($filterQueries as $key => $value) {
if (isset($value['query'])) {
$filterQuery[$key] = $value['query'] . ' OR ' . $fields['toplevel'] . ':true';
$query->addFilterQuery($filterQuery);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work either. But I found no way to test this. I think, there is another issue in Search:makeFacetsMenuArray(). So I can't get the available facets from Solr and show the facets menu.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have taken look to this method. Do you think that it needs to be adjusted now or it can be done later?

Copy link
Collaborator

@albig albig Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addFilterQuery() above has to be fixed. But before doing this, I'm looking for working facets. Currently, the Solr has no facets available and I still don't know, why.

A query like

/solr/dlfCore0/select?facet=on&q=*%3A*&rows=0

Should return something like

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"*:*",
      "rows":"0",
      "facet":"on"}},
  "response":{"numFound":4517,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

In my case - with this PR working - I get

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "rows":"0",
      "facet":"on"}},
  "response":{"numFound":4376,"start":0,"numFoundExact":true,"docs":[]
  }}

That's why no facetting is shown.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The faceting is solved (see other commit suggest).

With selecting a facet, we get the (expected) exception:

"A filterquery must have a key value" in Line 283.

The used GET-parameter are:

tx_dlf[fq][0]=collection_faceting:("Projekt: Illustrierte Magazine der Klassischen Moderne")&tx_dlf[fq][1]=type_faceting:("volume")&tx_dlf[query]=*

So please rewrite Line 283.

Classes/Common/DocumentList.php Outdated Show resolved Hide resolved
beatrycze-volk and others added 3 commits August 17, 2021 17:46
Co-authored-by: Alexander Bigga <[email protected]>
Co-authored-by: Alexander Bigga <[email protected]>
Co-authored-by: Alexander Bigga <[email protected]>
Classes/Common/Indexer.php Outdated Show resolved Hide resolved
This line is not needed.
Copy link
Collaborator

@albig albig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the last change request for now. Please have a look at the comments.

foreach ($filterQueries as $key => $value) {
if (isset($value['query'])) {
$filterQuery[$key] = $value['query'] . ' OR ' . $fields['toplevel'] . ':true';
$query->addFilterQuery($filterQuery);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The faceting is solved (see other commit suggest).

With selecting a facet, we get the (expected) exception:

"A filterquery must have a key value" in Line 283.

The used GET-parameter are:

tx_dlf[fq][0]=collection_faceting:("Projekt: Illustrierte Magazine der Klassischen Moderne")&tx_dlf[fq][1]=type_faceting:("volume")&tx_dlf[query]=*

So please rewrite Line 283.

Classes/Plugin/Eid/SearchInDocument.php Show resolved Hide resolved
@beatrycze-volk
Copy link
Collaborator Author

With selecting a facet, we get the (expected) exception:

"A filterquery must have a key value" in Line 283.

Is $key not a key value?

@albig
Copy link
Collaborator

albig commented Aug 19, 2021

I think we finally reached the state, all features seems to work and I can have a look at the performance of the new ocr highlighting plugin and about the "migration" path from last release to next release containing this feature. So maybe, there is coming another change request - or the final merge in new future.

@albig
Copy link
Collaborator

albig commented Sep 6, 2021

Just to give an update.

Indexing with the same set of documents seems to work with this PR. But there are missing some pages. Obviously the ocrhighlight plugin doesn't like pages with the <Illustration> tag.

Example:
https://digital.slub-dresden.de/data/kitodo/lebe_364572701-19320800/lebe_364572701-19320800_ocr/00000068.xml

at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags! (Source was: [unknown])
	at de.digitalcollections.solrocr.formats.OcrParser.(OcrParser.java:109)
	at de.digitalcollections.solrocr.formats.alto.AltoParser.(AltoParser.java:27)
	at de.digitalcollections.solrocr.formats.alto.AltoFormat.getParser(AltoFormat.java:38)
	at de.digitalcollections.solrocr.model.OcrFormat.filter(OcrFormat.java:90)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:51)

The other point is, that the index size is 3 times (!) bigger than before.

@albig
Copy link
Collaborator

albig commented Sep 20, 2021

Just to give an update.

I'm again on this PR. Just a new update.

Indexing with the same set of documents seems to work with this PR. But there are missing some pages. Obviously the ocrhighlight plugin doesn't like pages with the <Illustration> tag.

It has been to very long <Polygon> element. See dbmdz/solr-ocrhighlighting#210. This has been fixed.

The other point is, that the index size is 3 times (!) bigger than before.

Storing the whole ALTO is not that nice. This PR ignores additionally the ALTO namespace. Not sure, if we should go this way.

If we store only the suggested miniOcr-format, the index has "only" double size than before.

@beatrycze-volk
Copy link
Collaborator Author

Just to give an update.

I'm again on this PR. Just a new update.

Indexing with the same set of documents seems to work with this PR. But there are missing some pages. Obviously the ocrhighlight plugin doesn't like pages with the <Illustration> tag.

It has been to very long <Polygon> element. See dbmdz/solr-ocrhighlighting#210. This has been fixed.

I was actually also on this PR. Right after push my 'fix' commit I have seen this comment. Probably I will need to delete it.

The other point is, that the index size is 3 times (!) bigger than before.

Storing the whole ALTO is not that nice. This PR ignores additionally the ALTO namespace. Not sure, if we should go this way.

If we store only the suggested miniOcr-format, the index has "only" double size than before.

Ok. Would it be our approach now to convert files to miniOCR? Should it be done in our code or by some external tool?

@albig
Copy link
Collaborator

albig commented Sep 20, 2021

I was actually also on this PR. Right after push my 'fix' commit I have seen this comment. Probably I will need to delete it.

Yes, I think so. This should'nt be fixed on our side.

Ok. Would it be our approach now to convert files to miniOCR? Should it be done in our code or by some external tool?

I've done this in our code because this miniOCR is no standard format.

I'd suggest the following procedure to continue:

  1. I merge Use Solr OCR Highlighting Plugin in Search in Document Plugin #587 and Remove obsolete hl parameter use in image highlighting #639 into a new feature branch add-ocrhighlighting-plugin
  2. I'll add my prototype code for miniOCR to this branch
  3. We can discuss the best solution in a new PR add-ocrhighlighting-plugin -> master

@albig albig changed the base branch from master to ocr-highlighting-plugin September 21, 2021 09:07
@albig albig merged commit 9432644 into kitodo:ocr-highlighting-plugin Sep 21, 2021
@beatrycze-volk
Copy link
Collaborator Author

  1. I merge Use Solr OCR Highlighting Plugin in Search in Document Plugin #587 and Remove obsolete hl parameter use in image highlighting #639 into a new feature branch add-ocrhighlighting-plugin
  2. I'll add my prototype code for miniOCR to this branch
  3. We can discuss the best solution in a new PR add-ocrhighlighting-plugin -> master

Ok!

@beatrycze-volk beatrycze-volk deleted the use-solr-highlighting branch September 21, 2021 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚙ feature A new feature or enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants