-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split data and completion caches, change APIs to use citekeys #627
Comments
OK @aikrahguzar and @roshanshariff - after experimenting with this over the weekend, I've closed the two related draft PRs, and opened this issue, where I summarize what I learned. If any comments, let's put them here. cc @localauthor |
My general opinion is that the best cache is not cache at all and so caching needs to be justified by performance and it is justified here. But even once we have caching it should not be used pervasively and only very few functions should directly access the cache. I think we would have had bugs due to functions that were needed to build cache were circularly trying to access cache. I agree with |
What do you mean by "directly access"? In the above strawman code,
Yes, I've run into that a few times, including when initially adding the cache. It also bit me experimenting on this branch. It's annoying.
That's as currently, but it requires (maybe aesthetically?) awkward code like: (citar--get-value (citar--get-entry "test") "title") Users and developers just naturally think in terms of citekeys. But that's a secondary consideration to other more substantive issues certainly. But ....
This indeed is another option. In effect, that would replace the parebib keys with the completion strings.
That's definitely the thing most performance-sensitive and data-dependent. I was going to say also the Embark stuff, but that's not true. BTW, the issue with the "Embark stuff" is broader than Embark; references in document buffers are in fact citekeys. So the same issue would apply with hydra or transient menus that are operating at point, or other functions which work with buffer citations. And, of course, the bibliographic formats we support (including the CSL JSON) all identify references with such keys. Which is why the parsebib hashes use them as the hash table keys. It's just that the completion case is the reverse: we don't necessarily remember the citekey, so we use the entry data to create a string to allow us to quickly find it. |
Yes, by direct access I mean that access to cache should not be in their code path and rather these functions should be supplied arguments that come from cache even if is more verbose. But maybe that concern can be mitigated just by renaming it to
I actually think of authors, and titles mostly, and not having to deal with citekeys is why I use
This ties back to using citar to avoid dealing with citekeys but for that it needs to be able to translate to (for inserting keys) and from (to tell me what the key actually represents). Selection needs the former while at point actions need the latter but for me there should be just these two functions which make use of the cache and then used as arguments of other functions. But doing this cleanly is difficult because there is no analogue of By the way, I don't think looking up entries is performance critical at all since I don't see any scenario where more than say 10 entries need to be looked up simultaneously, but it should still be fast if that is possible with a small effort. |
It's not in the current code, where as you say we mostly pass the entries as arguments through so lookup is not needed. It would be in the strawman though, since we'd need to get the citekey to get the entry, including for formatting the candidates? |
I don't know if this demonstrates anything; I don't know much about caching. But if I eval this in a buffer, it works as expected. Under what condition(s) would this fail (other than an empty data cache)? (require 'parsebib)
(require 'citar)
;; Internal variables
;; This design is adapted from org-mode 'oc-basic'.
(defvar citarn--data-cache nil
"Cache for parsed bibliography files.
This is an association list following the pattern:
(FILE-ID . ENTRIES)
FILE-ID is a cons cell (FILE . HASH), with FILE being the absolute file name of
the bibliography file, and HASH a hash of its contents.
ENTRIES is a hash table with citation references as keys and fields alist as
values.")
(defvar citarn--completion-cache nil
"Hash with key as completion string, value as citekey.")
;; Internal functions
(defun citarn--create-data-cache (files)
"Create data cache for FILES."
(seq-map
(lambda (file)
(let* ((path (file-truename file))
(checksum
(split-string
(call-process "md5sum" nil t nil path) " "))) ; TODO doesn't work correctly
(cons
(cons file checksum) ; get the checksum here
(parsebib-parse path))))
files))
(defun citarn-ref-read-make-candidates (hashes)
"Return a hash with completion candidates from HASHES."
(let ((chash (make-hash-table :test #'equal)))
(dolist (hash hashes)
(maphash
(lambda (citekey _entry)
(puthash (citarn-get-value citekey "title") citekey chash))
hash))
;; return new hash
chash))
(defun citarn--get-candidates ()
"Return completion hash."
(citarn-refresh)
citarn--completion-cache)
(defun citarn--get-data ()
"Return hashes."
(remq 'nil
(list
citarn--data-cache)))
(defun citarn-refresh ()
"Load caches."
(interactive)
(unless citarn--data-cache
(setq citarn--data-cache ; load this first
(parsebib-parse citar-bibliography)))
(unless citarn--completion-cache
(setq citarn--completion-cache
(citarn-ref-read-make-candidates
(citarn--get-data)))))
(defun citarn-select-ref ()
"Select reference, return citekey."
(let* ((candidates (citarn--get-candidates))
(choice (completing-read "Ref: " candidates)))
(gethash choice candidates)))
(defun citarn-get-entry (key)
"Return entry alist for KEY."
(seq-some
(lambda (ht)
(gethash key ht))
(citarn--get-data)))
(defun citarn-get-value (key-or-entry field)
"Retern FIELD value for KEY-OR-ENTRY."
(let ((entry (if (stringp key-or-entry)
(citarn-get-entry key-or-entry)
key-or-entry)))
(cdr (assoc-string field entry))))
;; Interactive commands
(defun citarn-example ()
"Return title as message."
(interactive)
(let* ((choice (citarn-select-ref))
(title (citarn-get-value choice "title")))
(message title)))
(citarn-example)
</details> |
A few observations about the org-cite It's well worth looking at that code for ideas. Details: First, I was just reminded of this explanation from the org-cite author on how he handled cache invalidation in oc-basic.
It looks like this: ((("/home/bruce/test.bib" . "afc432cb9103e4319e1402c2beff5838")
. #<hash-table equal 778/1095 0x160dc63>)) So he has two hash caches; one for bibliography files, and the other for completion, each with the exact same structure I am proposing here. Another interesting wrinkle is there's a cache for the bib files: (defvar org-cite-basic--file-id-cache nil
"Hash table linking files to their hash.") ELISP> (gethash "~/org/bib/academic.bib" org-cite-basic--file-id-cache)
"0f1fa900095a0a6b581197510122cafdfc14a8a4" In the function that creates the completion table (similar to So here you see he's iterating through the cache alist and checking the hashes for the key. (defun org-cite-basic--get-entry (key &optional info)
"Return BibTeX entry for KEY, as an association list.
When non-nil, INFO is the export state, as a property list."
(catch :found
(pcase-dolist (`(,_ . ,entries) (org-cite-basic--parse-bibliography info))
(let ((entry (gethash key entries)))
(when entry (throw :found entry))))
nil)) Aside: while org-cite supports local and global bib files, it doesn't have separate caches for them. Finally, it's equivalent of |
See #628. Sorry for the frantic activity. I have some free time, and really wanted to figure this out! |
For a long time I've been thinking there shouldn't just be local and global caches, but instead there should be one cache per bib file, stored globally. This way, if multiple buffers share the same local bibliography, it is only parsed once and stored in one place. The cache invalidation can be done either using fsnotify or the hashing scheme you mentioned (though there are some downsides to the latter in the context of Citar that don't apply to org-export). I had some ideas on how to implement this cleanly (including evicting a bib file from the cache when it is no longer referenced by any open buffer or the global bibliography list). I can probably code it up in the next few days and submit a PR to the new branch. I also like your proposal of having two hash tables, one keyed by citekeys and the other keyed by "display strings". At this point, we're basically thinking of the cache as an in-memory database (with the bib entries as records) and we want to query the database by citekey (the primary identifier) as well as the completion string. So it makes sense to create those two "indices" into the database. EDIT: I was also thinking the bib file parsing (and possibly the candidate string generation) could be done in a background emacs process using emacs-async so that the UI doesn't hang for several seconds every time the bib file needs to be re-parsed. This should be fairly easy to implement with the per-file cache I'm proposing. |
Indeed; I was just thinking about this, and included a note about it on the PR. I think we should seriously consider that possibility, mostly because org-cite already supports both seamlessly, without having separate caches. But I don't use local bibs, so @aikrahguzar will have to test it ;-)
That would be cool! |
I on the other hand I am about to get busy so I will mostly disappear for a while after this comment. I think the general structure should work but I think from performance point of view both the current approach and new one are not fundamentally different from each other so probably only profiling can tell us which is better. From the perspective of writing code, I think there are two big points:
|
Great points @aikrahguzar; I agree on both. I have a feeling we could reduce LOC a reasonable amount, but that will be one good benchmark. |
I must say I agree with @aikrahguzar on this. I'm on board with restructuring the structure of the cache as you've suggested, @bdarcus, but I don't like the idea of functions throughout citar implicitly accessing the cache whenever they need to. Instead, the cache should be exposed through a small number of functions, and everything else should be supplied arguments with all the information needed to do the job (i.e. the citekey and/or entry). This approach follows the principle of minimizing global state and cross-coupling of different components. On the surface it might be a little bit more verbose, but I think on the whole it's better to call a function that fetches data from the cache and then pass that data along to other functions. Moreover, you'll often be accessing many pieces of data about an entry: (citar-get-value "field1" "citekey")
(citar-get-value "field2" "citekey") Instead of writing code in this style, which has the overhead of repeatedly accessing the cache, it's not much worse to do this (let ((entry (citar-get-entry "citekey")))
(citar-get-value "field1" entry)
(citar-get-value "field2" entry)) If you're coming from an object-oriented language, think of a method like |
The initial PR code allows both, following I think we can decide later if we continue to allow that and document that issue, or we restrict it again to only accept the entry as arg. Even there, however, we'd want to document. This isn't any better, and is arguably worse. (citar-get-value "field1" (citar-get-entry "citekey"))
(citar-get-value "field2" (citar-get-entry "citekey")) |
@aikrahguzar before you get real busy elsewhere, anything in particular to look out for on the embark and multi-select end? |
PS - it may be that oc-basic doesn't need a local and global cache because it has the cache for the files? So can reparse depending? EDIT: actually, this is what ELISP> (citarn--parse-bibliography)
(("~/org/bib/academic.bib" . #<hash-table equal 778/1095 0x822025>)) So that should be helpful. |
I don't think so, they should work as long as single selection is working. Multiple selection use the |
I got it working, and returning a list of keys regardless of multiple or not. ELISP> (citar-select-refs)
("naim2005" "huck1977")
ELISP> (citar-select-ref)
("naim2005") That's what we want; right? It wouldn't make sense to have a string or a list? The primary issue I ran into was |
Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix #623 Close #627
Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix #623 Close #627
Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix #623 Close #627
Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix emacs-citar#623 Close emacs-citar#627
Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix emacs-citar#623 Close emacs-citar#627
Now that I merged the changes, to go back to this, @aikrahguzar:
There is a small net increase in lines of code, but functionality is increased; I'd say significantly. That, plus a cleaner and more flexible API, made this worth the breaking change in the end. |
I went down a rabbit hole with some code experiments in #625 and #626, but maybe it's better to just talk about it generally, to decide if there's any value in the idea, or to reject it.
What I'm proposing here, I realized after opening this, is exactly the same implementation as in org-mode
oc-basic
. See below: #627 (comment).There's some demo code at the end of this.
Basic Idea
The
(candidate . (key . entry))
candidate structure is complex, and difficult to work with. There were good reasons for it at the time, but perhaps that was only necessary because we used the same cache for both candidates and entry data?Aside from any performance considerations (which I actually don't think are significant; see below), am I missing something obvious, or would it not result in cleaner, easier to understand and debug code, if we got rid of the current candidate structure, and only use citekeys and hash tables to access everything?
The other huge advantage of this, I learned subsequently, is customizable hash table equality testing; code like this to invalidate the cache:
A simple diagram of the hash structures:
The Problem
This would break code throughout. A quick ripgrep shows 61 instances of "key-entry" in the repo, for example. And I haven't found it easy to adjust that code; which maybe supports my point about it being too complex (mostly because of the candidate structure)?
So it might be possible to make the code simpler, cleaner and easier to maintain in the end, with a more intuitive API, but to get there may be a PITA.
On Performance
I started out playing with this because I thought this could improve performance, and potentially solve a relatively minor issue in #623.
I no longer think this would improve performance overall, for two reasons:
Rather, I think the improved and scalable entry access speed of the hashes makes it possible for us to simplify the code and API without losing performance.
Benchmark examples with ~1000 entries:
Finally, to confirm @aikrahguzar's profiling, parsebib is reasonably fast; what's slow is creating the candidate strings (and per my point above, there's no real difference between the hash and list versions there).
Demo Code
Here is demo code you can
eval-buffer
on:The text was updated successfully, but these errors were encountered: