-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Solr OCR Highlighting Plugin in Search in Document Plugin #587
Use Solr OCR Highlighting Plugin in Search in Document Plugin #587
Conversation
What do I have to do to install/activate the OCR Highlighting Plugin? Does the code work, if it's not installed? |
https://dbmdz.github.io/solr-ocrhighlighting/installation/ I would assume that it doesn't work if not installed. I have only tested with SOLR index which had the plugin installed. |
7964285
to
c923341
Compare
Ok. Than we have to include this information to the documentation. And I will setup a testsystem with this Solr plugin. May take some days. |
a32763a
to
c924cd1
Compare
c924cd1
to
e236489
Compare
Should I mark it as ready to merge or we also decide that some changes can't be merged in and I will create new branch in which I will implement new stuff without changing old ones (I mean way of submitting the form and cross origin access)? |
0498523
to
97ebdaa
Compare
You are right. It doesn't work out-of-the box. The search plugin still finds no snippets of course.The searchInDocument fails with exception. I still don't like the idea to make the solr-ocrhighlighting plugin obligatory. And I will investigate howto upgrade an existing installation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't manage to get the Solr working with
- Solr 8.8
- solr-ocrhighlighting 0.5.0 (current release)
- schema.xml and solrconfig.xml from this PR
- updated extension configuration for solr (using default values)
On indexing a document, I end up with the following error
org.apache.solr.common.SolrException: Exception writing document id 36LOG_0003 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:256) at org.apache.solr.update.processor.RunUpdateProcessorFactory$RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:73) ... Caused by: java.lang.RuntimeException: Could not parse pointer: at de.digitalcollections.solrocr.model.SourcePointer.parse(SourcePointer.java:81) at de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory.create(ExternalUtf8ContentFilterFactory.java:42) at org.apache.solr.analysis.TokenizerChain.initReader(TokenizerChain.java:97) ...
Please check again and tell me your system setup to reproduce.
200fe81
to
a4f5373
Compare
{ | ||
$id = $this->doc->uid; | ||
|
||
if (!empty($this->conf['documentIdUrlSchema'])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this? Could you please provide an example string to verify the string splitting magic below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes id of the document is actually not an id but a link to xml file. Then to search in the index the full link can't be used so it is needed to extract real id out of the full link.
Example:
https://dev-ddb.fiz-karlsruhe.de/api/items/F45KF24U7UIIVCUJJBV334PD7IPE7JBW/source/record -> https://dev-ddb.fiz-karlsruhe.de/api/items/*id*/source/record
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to distinguish more cases here:
- The normal case would be documents which are properly indexed in the TYPO3 database. Those don't need any special treatment for their ID anyway.
- The "DFG-Viewer" case would be documents which are not indexed in the database nor in the Solr index. Normally those can be identified by the fact that they have an URL as ID.
- Then there is the special "Zeitungsportal" case where we have documents, which are not indexed in the database, but are indexed in the Solr index. Those have an URL as ID as well, but need special treatment to extract the indexed ID from the URL.
I suggest using the TYPO3 hook mechanism to handle all those cases:
- If a document has a numerical ID, no further treatment is necessary.
- If a document has a string ID, this string is processed through a hook.
- If the hook returns an integer, the document is treated as indexed.
- If the hook returns the string unchanged, the document is considered unindexed and ignored when searching. (This would be the default if no hook is registered, e. g. in "DFG-Viewer mode".)
The advantages would be:
- We don't need to add Zeitungportal-specific logic to the main extension. Instead the hook can be part of the supplemental Zeitungsportal extension.
- The solution is extensible, because you can register more than one hook to handle different kinds of identifiers. So anybody with a setup of a pre-existing Solr index could map their document IDs to the index IDs and re-use their Solr index. (This could even be interesting for SLUB, when we think about consolidating our different indices - CMR, Find, Kitodo, ...)
- Processing the ID string wouldn't be limited to splitting a string, because a hook could include more complex logic like regular expressions or even database lookups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest using the TYPO3 hook mechanism to handle all those cases:
Do you mean https://typo3.org/article/how-to-use-existing-hooks-in-your-own-extension ?
If a document has a string ID, this string is processed through a hook.
If the hook returns an integer, the document is treated as indexed.
If the hook returns the string unchanged, the document is considered unindexed and ignored when searching. (This would be the default if no hook is registered, e. g. in "DFG-Viewer mode".)
But how to convert string ID to an integer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually already have some hooks in Kitodo.Presentation. A good example is the way how we extract the record ID from METS:
Normally we would take /mets:mets[@OBJID]
as the document's unique record identifier:
kitodo-presentation/Classes/Common/MetsDocument.php
Lines 147 to 150 in f12eede
// Check for METS object @ID. | |
if (!empty($this->mets['OBJID'])) { | |
$this->recordId = (string) $this->mets['OBJID']; | |
} |
But if this attribute is not set (which is the case for old versions of Kitodo.Production), we use hooks to have a fallback. The hook is registered in
ext_localconf.php
: kitodo-presentation/ext_localconf.php
Line 219 in f12eede
$GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['dlf/Classes/Common/MetsDocument.php']['hookClass'][] = \Kitodo\Dlf\Hooks\KitodoProductionHacks::class; |
Classes/Common/MetsDocument.php
: kitodo-presentation/Classes/Common/MetsDocument.php
Lines 151 to 158 in f12eede
// Get hook objects. | |
$hookObjects = Helper::getHookObjects('Classes/Common/MetsDocument.php'); | |
// Apply hooks. | |
foreach ($hookObjects as $hookObj) { | |
if (method_exists($hookObj, 'construct_postProcessRecordId')) { | |
$hookObj->construct_postProcessRecordId($this->xml, $this->recordId); | |
} | |
} |
The hook itself can be found in
Classes/Hooks/KitodoProductionHacks.php
: kitodo-presentation/Classes/Hooks/KitodoProductionHacks.php
Lines 37 to 68 in f12eede
public function construct_postProcessRecordId(\SimpleXMLElement &$xml, &$record_id) | |
{ | |
if (!$record_id) { | |
$xml->registerXPathNamespace('mods', 'http://www.loc.gov/mods/v3'); | |
// Get all logical structure nodes with metadata, but without associated METS-Pointers. | |
if (($divs = $xml->xpath('//mets:structMap[@TYPE="LOGICAL"]//mets:div[@DMDID and not(./mets:mptr)]'))) { | |
$smLinks = $xml->xpath('//mets:structLink/mets:smLink'); | |
if (!empty($smLinks)) { | |
foreach ($smLinks as $smLink) { | |
$links[(string) $smLink->attributes('http://www.w3.org/1999/xlink')->from][] = (string) $smLink->attributes('http://www.w3.org/1999/xlink')->to; | |
} | |
foreach ($divs as $div) { | |
if (!empty($links[(string) $div['ID']])) { | |
$id = (string) $div['DMDID']; | |
break; | |
} | |
} | |
} | |
if (empty($id)) { | |
$id = (string) $divs[0]['DMDID']; | |
} | |
$dmdIds = explode(' ', $id); | |
foreach ($dmdIds as $dmdId) { | |
$recordIds = $xml->xpath('//mets:dmdSec[@ID="' . $dmdId . '"]//mods:mods/mods:recordInfo/mods:recordIdentifier'); | |
if (!empty($recordIds)) { | |
$record_id = (string) $recordIds[0]; | |
break; | |
} | |
} | |
} | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hook is a possibility. I think the modern way would be to make some "middleware". But I haven't done this before. Hook support itself seem to be not deprecated, although lots of core hooks are.
The DFG-Viewer case is not relevant in the SearchInDocument tool. This tool won't work with it anyway. The DFG-Viewer has it's own SRU search plugin, which requires a SRU server of course.
I suggest to first test and documentate this PR including indexing (doesn't work yet) and searching for Kitodo.Presentation alone. If this works, we can discuss further at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add at least a sample of the documentIdUrlSchema setting. Or is this part of the documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be already part of #630. But I still need some (inline) documentation about what is going on here, why we are doing this and what string you expect in documentIdUrlSchema.
@albig: Is your error similar to something like that:
? After research I have found out that indexing when XML files are stored in data folder of the core works correctly but when they are passed from external URL the error appears. I have checked in the GItHub page of the plugin and found dbmdz/solr-ocrhighlighting#49:
|
45ebbb9
to
1c7c1f3
Compare
I'm sorry for this few rebase pushes, I have made a bit of mess and needed to fix it. Now all commits should be correct. |
Yes, I think this was the error I've seen. So we have to wait for the fix in the ocrhighlighting plugin? |
Yes, of course if they are going to fix it actually. That's not so sure after reading the conversation in this issue. Other possiblity is to store content of the xml files instead pointers to those files. |
fbe6380
to
14c4e9f
Compare
Currently yes. There is also possibility to implement it in the way that coordinates will be already passed in URL parameter.
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please have look at my comments.
for(var i = 0; i < queryParams.length; i++) { | ||
var queryParam = queryParams[i].split('='); | ||
|
||
if(queryParam[0] === $("input[id='tx-dlf-search-in-document-page']").attr('name')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comparison doesn't respect the possible encoding of the parameters. This is wrong on other places, too.
If you click next/last page, the link parameter get's encoded:
/workview?tx_dlf%5Bdouble%5D=0&tx_dlf%5Bhighlight_word%5D=pieschen&tx_dlf%5Bid%5D=1&tx_dlf%5Bpage%5D=3&cHash=04b8a699974161fd87b83c9b36d65483
But 'tx_dlf%5Bpage%5D' === 'tx_dlf[page]'
is not true so the highlighting fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Highlighting seems to work now. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the patched ocr-highlighting plugin, I see the toplevel element. But there are some issues with the facets and the filter queries. Please see my comments.
foreach ($filterQueries as $key => $value) { | ||
if (isset($value['query'])) { | ||
$filterQuery[$key] = $value['query'] . ' OR ' . $fields['toplevel'] . ':true'; | ||
$query->addFilterQuery($filterQuery); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work either. But I found no way to test this. I think, there is another issue in Search:makeFacetsMenuArray()
. So I can't get the available facets from Solr and show the facets menu.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have taken look to this method. Do you think that it needs to be adjusted now or it can be done later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addFilterQuery() above has to be fixed. But before doing this, I'm looking for working facets. Currently, the Solr has no facets available and I still don't know, why.
A query like
/solr/dlfCore0/select?facet=on&q=*%3A*&rows=0
Should return something like
{ "responseHeader":{ "status":0, "QTime":2, "params":{ "q":"*:*", "rows":"0", "facet":"on"}}, "response":{"numFound":4517,"start":0,"numFoundExact":true,"docs":[] }, "facet_counts":{ "facet_queries":{}, "facet_fields":{}, "facet_ranges":{}, "facet_intervals":{}, "facet_heatmaps":{}}}
In my case - with this PR working - I get
{ "responseHeader":{ "status":0, "QTime":0, "params":{ "q":"*:*", "rows":"0", "facet":"on"}}, "response":{"numFound":4376,"start":0,"numFoundExact":true,"docs":[] }}
That's why no facetting is shown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The faceting is solved (see other commit suggest).
With selecting a facet, we get the (expected) exception:
"A filterquery must have a key value" in Line 283.
The used GET-parameter are:
tx_dlf[fq][0]=collection_faceting:("Projekt: Illustrierte Magazine der Klassischen Moderne")&tx_dlf[fq][1]=type_faceting:("volume")&tx_dlf[query]=*
So please rewrite Line 283.
Co-authored-by: Alexander Bigga <[email protected]>
Co-authored-by: Alexander Bigga <[email protected]>
Co-authored-by: Alexander Bigga <[email protected]>
This line is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the last change request for now. Please have a look at the comments.
foreach ($filterQueries as $key => $value) { | ||
if (isset($value['query'])) { | ||
$filterQuery[$key] = $value['query'] . ' OR ' . $fields['toplevel'] . ':true'; | ||
$query->addFilterQuery($filterQuery); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The faceting is solved (see other commit suggest).
With selecting a facet, we get the (expected) exception:
"A filterquery must have a key value" in Line 283.
The used GET-parameter are:
tx_dlf[fq][0]=collection_faceting:("Projekt: Illustrierte Magazine der Klassischen Moderne")&tx_dlf[fq][1]=type_faceting:("volume")&tx_dlf[query]=*
So please rewrite Line 283.
Co-authored-by: Alexander Bigga <[email protected]>
Co-authored-by: Alexander Bigga <[email protected]>
Is |
Co-authored-by: Alexander Bigga <[email protected]>
I think we finally reached the state, all features seems to work and I can have a look at the performance of the new ocr highlighting plugin and about the "migration" path from last release to next release containing this feature. So maybe, there is coming another change request - or the final merge in new future. |
Just to give an update. Indexing with the same set of documents seems to work with this PR. But there are missing some pages. Obviously the ocrhighlight plugin doesn't like pages with the at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags! (Source was: [unknown]) at de.digitalcollections.solrocr.formats.OcrParser.(OcrParser.java:109) at de.digitalcollections.solrocr.formats.alto.AltoParser.(AltoParser.java:27) at de.digitalcollections.solrocr.formats.alto.AltoFormat.getParser(AltoFormat.java:38) at de.digitalcollections.solrocr.model.OcrFormat.filter(OcrFormat.java:90) at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:51) The other point is, that the index size is 3 times (!) bigger than before. |
I'm again on this PR. Just a new update.
It has been to very long
Storing the whole ALTO is not that nice. This PR ignores additionally the ALTO namespace. Not sure, if we should go this way. If we store only the suggested miniOcr-format, the index has "only" double size than before. |
I was actually also on this PR. Right after push my 'fix' commit I have seen this comment. Probably I will need to delete it.
Ok. Would it be our approach now to convert files to miniOCR? Should it be done in our code or by some external tool? |
Yes, I think so. This should'nt be fixed on our side.
I've done this in our code because this miniOCR is no standard format. I'd suggest the following procedure to continue:
|
Ok! |
The main aim of this PR is to implement Solr OCR Highlighting Plugin inside the existing Search in Document Plugin.
Tasks:
Changes which should be discussed:
Currently in PageViewProxy it is set. It causes problem when this header is also configured in .htaccess.
For unknown reason submit form button with prevent page reload was not working for the partner, so it got changes to standard button which calls a method. It work for the partner now, but it has some implications for e.g. pressing Enter after writing search term simply submits form without executing search function. One of the possible solutions would be also map the key to the function, but it is not sure that there are no other implications not detected currently.