Refactor cache architecture #628

bdarcus · 2022-06-07T18:31:59Z

Convert the cache from a single list, to a list of structured objects, each of which contains two hash tables, each keyed by citekey:

:entries the raw bib data
:completions the formatted completion strings

The completion candidates themselves are now no longer cached, but quickly assembled on-the-fly.

The bibliography objects also include the following additional properties, to selectively update bibliographic data:

:hash the checksum of the file
:buffers the associated buffers

Close #627, fix #623 #610

Status

Roshan's #634 is an improvement on this (the text above reflects it actually), so closing this in favor of that.

Caching and completion all work, as do at least some of the interactive commands (I haven't tested them all; some may break currently because completion now always returns a list of keys).

I've added keyword comments in places.

Questions and notes

Can we avoid reintroducing a separate local cache? I think we can. See citar--parse-bibliography (and also org-cite-list-bibliography-files).
Can we remove some of the inotify code, since this should obsolete the part that deals with bib files? If yes, then we can also remove citar-refresh etc, which I have done for now (though adding it back would basically be a wrapper for (clrhash citar--completion-cache).
Can we resolve Optimize and generalize has-notes #623 along with this? (I think I did.)
Do we need "rebuild cache" now (see comments from its original author)? I assumed no, so removed it. What about ensure-entries?
Should we wrap the completion cache in citar--get-candidates, or deprecate that function (as now)?
Where else can we simplify, and remove code?
citar-select-ref now always returns a list. Is that a mistake? My key concern is avoiding bugs in Embark and such.
citar--ref-completion-table is conceptually the same as citar--get-candidates; should we rename it citar--get-ref-completions to be consistent?

bdarcus · 2022-06-07T22:46:09Z

@roshanshariff - FYI, I have a status section at the top of the PR, where I will briefly update where things stand as I make changes.

I moved everything from citarn.el to citar.el, but have left the citarn.el file there in case it's useful for to use experimentally.

~~The changes to citar.el actually didn't shrink the LOC. But hopefully we can trim elsewhere.~~ Actually, there is a net reduction of maybe 100 lines so far, if you exclude citarn.el.

bdarcus · 2022-06-08T01:04:10Z

ELISP> (benchmark-run-compiled 1 (citar--ref-completion-table))
(0.597032375 1 0.1635932209999993)

bdarcus · 2022-06-12T14:50:39Z

I actually think we can add the notes/files data fast enough without a cache, which would simplify matters greatly. I think that should be our first preference unless there's a compelling feature we can't get without a cache.

If we're just talking about #628 (comment), I guess that would solve the problem I raised there.

But creating the completion candidates is still the slowest operation, and turning off the completion cache is noticeable, even with my not very large library.

The completion cache is also brain-dead simple, so the cost seems low.

So I'll be curious what you find in your experiments.

Granted, keeping the caching would require a way to invalidate the cache from outside. But as I think about that, maybe its trivial?

roshanshariff · 2022-06-12T15:01:17Z

Even if we can invalidate a cache from outside, it's still inefficient to regenerate the cache repeatedly for every buffer every time any metadata changes. A cache only goes so far, and if it's invalidated too often you'll lose the benefits of caching.

Currently, generating the completion candidates is doing a lot of unnecessary work in the template expansion. I think it can be boiled down to just concatenating a few strings, which can in fact be done as-needed. The trick is to do most of the string formatting ahead of time per-bibliography, and only redo it whenever the bibliography changes.

Then the only work that needs to be done to generate the completion candidates is to gather the list of bibliographies, get their associated formatted strings, concatenate them (possibly adjusting for the frame width), and add the has-note/has-file metadata. This last step (I believe) can be done very quickly.

bdarcus · 2022-06-12T15:06:11Z

OIC.

I guess this also goes back to the earlier discussion of as-needed display formatting, which completing-read per doesn't support yet, but its group-function does.

Edit:

In the interim, I guess you're thinking, we can just stash the primary completion string in an entry field while parsing the bibliography?

roshanshariff · 2022-06-12T15:17:38Z

I don't think it's a matter of just completing-read supporting as-needed formatting. You pretty much need all the strings in memory if you're going to search them. You may be displaying only a few strings at a time, but as soon as you enter a single character/tab-complete you need to search all the strings in the completion table.

The only thing you can do at display-time is to add non-searchable but visible metadata (like Marginalia does). This doesn't help Citar much because we pretty much want the whole completion string to be searchable, so we must necessarily generate all the completion strings every time we call completing-read. The only thing that is "ornamental" is the column-width truncation that hides text when it's too long to fit on one line, and we can just do that with the completion strings when we generate them (they won't reformat if you resize the Emacs frame while completion is running, but that's a small tradeoff).

In the interim, I guess you're thinking, we can just stash the primary completion string in an entry field while parsing the bibliography?

I was thinking of another hash table in the cache with the display strings, just to avoid messing with the entry too much. Otherwise, it gets hard to keep track of all the non-standard bibliography fields we're adding and they become a de facto part of our public API, which means we can't change implementation details without breaking external code.

bdarcus · 2022-06-12T16:00:10Z

OIC; so basically moving something like the hash (minus the "has" stuff, and with the key-value structure inverted) that is now the completion cache into the bibliography cache?

Makes sense in any case. I get the basic idea, and agree it likely makes sense.

The end result is one cache, that is simple from an API standpoint, but with great functionality, and very snappy performance.

The downside, I'd guess, is the internals of that cache may be complex.

It also occurs to me it'd be best optimized, if one has a large number of entries, for having one large file, that doesn't change much, and maybe one or two that do.

Say a main file, and a "new" file.

Is that right?

roshanshariff · 2022-06-12T22:52:16Z

Yeah, exactly. The structure of the cache is actually not that complex, and even that complexity is localized in a few functions.

And yes, it is optimized for that use case, but there's no disadvantage to having a large number of small bibliographies either. It's just that all those bibliographies will be stored in a central cache, but fetching things from that cache is extremely efficient so it doesn't really matter.

bdarcus · 2022-06-12T23:33:29Z

The structure of the cache is actually not that complex ...

I was realizing after I wrote that something like this would probably work to represent a file, and it's indeed simple.

("/home/test/bib/test.bib" 
   :data #<hash-table equal 1/65 0x1b5b1e9> 
   :completions #<hash-table equal 0/65 0x208978f> 
   :checksm "1234" 
   :buffers nil)

roshanshariff · 2022-06-12T23:35:47Z

Yeah, that's pretty much what I have now, but defined using a cl-defstruct so you get accessor functions, documentation, etc.

(cl-defstruct (citar--bibliography
               (:constructor citar--make-bibliography (filename))
               (:copier nil))
  "Cached bibliography file."
  (filename
   nil
   :read-only t
   :documentation
   "True filename of a bibliography, as returned by `file-truename`.")
  (hash
   nil
   :documentation
   "Hash of the file's contents, as returned by `buffer-hash`.")
  (buffers
   nil
   :documentation
   "List of buffers that require this bibliography.")
  (entries
   (make-hash-table :test 'equal)
   :documentation
   "Hash table mapping citation keys to bibliography entries,
   as returned by `parsebib-parse`.")
  (formatted
   (make-hash-table :test 'equal)
   :documentation
   "Formatted strings used to display bibliography entries."))

bdarcus · 2022-06-12T23:40:34Z

but defined using a cl-defstruct so you get accessor functions, documentation, etc.

Oh, good; hadn't thought of that, but I like it a lot!

I had just run into cl-defstruct for the first time recently when working with the org-roam codebase.

bdarcus · 2022-06-12T23:54:51Z

This is a totally minor thing, but do you know if there's a way to get it so that marginalia actually shows the documentation strings in the annotations?

Is it worth asking Daniel and Omar about? I don't see anything about it in that issue tracker.

roshanshariff · 2022-06-13T00:00:27Z

It looks like cl-defstruct always makes the first line of the documentation be a generic string like what you see in that screenshot. The :documentation string you supply ends up on the next line, and you can see it if you use describe-function etc. It's slightly annoying, but I think you'd need to modify cl-defstruct itself to fix it. The first line of the documentation string is treated specially throughout Emacs, so I doubt it would be easy/desirable to hack around that in Marginalia.

bdarcus · 2022-06-13T00:10:48Z

Ah yeah; that makes sense.

BTW, another minor thing we can control ...

As I'm reading the cl-defstruct docs, we may as well include a :type property on each definition?

roshanshariff · 2022-06-13T00:12:16Z

I was looking at that, and it looks like that's mainly for compatibility with Common Lisp? It's ignored in Elisp, and in fact there isn't really a good vocabulary for naming types in Emacs.

bdarcus · 2022-06-13T02:07:26Z

This has some good tips on cl-defstruct etc:

https://nullprogram.com/blog/2018/02/14/

He recommends, and I think I agree, for a constructor name like citar--bibliography-create.

I also think completions may be more clear than formatted (which may be misleading).

 (cl-defstruct (citar--bibliography
        (:constructor citar--bibliography-create (filename))
               (:copier nil))
  "Cached bibliography file."
  (filename
   nil
   :read-only t
   :documentation
   "True filename of a bibliography, as returned by `file-truename`.")
  (hash
   nil
   :documentation
   "Hash of the file's contents, as returned by `buffer-hash`.")
  (buffers
   nil
   :documentation
   "List of buffers that require this bibliography.")
  (entries
   (make-hash-table :test 'equal)
   :documentation
   "Hash table mapping citation keys to bibliography entries,
   as returned by `parsebib-parse`.")
  (completions
   (make-hash-table :test 'equal)
   :documentation
   "Formatted completion strings used to display bibliography entries."))

Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix #623 Close #627

This functions returns all local and global bibliography files for 'citar--parse-bibliography' to parse.

Allows to independently turn off whether to do this by default, and whether to toggle the behavior.

bdarcus · 2022-06-13T12:40:52Z

FWIW, I just pushed a change here (rebased from main), but it was only adding the new citar-capf.el file I just merged to main. We'll need to adjust that to this branch.

Vidianos-Giannitsis · 2022-06-13T16:21:02Z

From my perspective, this idea seems pretty good. If I understand this correctly, this makes bibliography into a cl-defstruct which should make it much easier to then access each separate object of it. The org-roam-node which is defined this way has a lot of supporting functions (I think they are called methods in CL terminology) which allow you to access each component separately and I think it works very well and is extensible by the user. For example I have defined a method that accesses the buffer of an org-roam-node for use in zetteldesk. Making the citar bibliography like this would then make it much easier to access parts of it I feel.

bdarcus · 2022-06-13T16:28:05Z

Thanks for taking a look!

If I understand this correctly, this makes bibliography into a cl-defstruct which should make it much easier to then access each separate object of it.

It makes the cache a list of such objects, so we have finer-grained control of UI updating, and also, as a consequence the "has-note" indicators in the UI are always up-to-date with the bibliography files (edit: AND possible external note systems like org-roam).

The higher level functions (like citar--get-entry) will mostly stay the same.

Make sense?

Vidianos-Giannitsis · 2022-06-13T16:31:23Z

Yeah I get it. This definitely sounds like a step in the right direction. And by not changing the higher level functions it also doesn't break anybody's workflow which is a good thing (unless they use the low level functions at which point they can figure out what to change).

bdarcus · 2022-06-13T19:24:33Z

@roshanshariff any ETA on when you can open up at least a draft PR?

Even if not all functions are working ATM, I can still weigh on what they are, etc.

roshanshariff · 2022-06-13T19:26:16Z

ETA on when you can open up at least a draft PR?

Sure, I can do that in a couple of hours. Pretty much all the code is ready and tested, and the performance looks promising. Just need to do a bit of work to hook it up to the existing citar functions so you can use it.

bdarcus · 2022-06-14T00:30:00Z

Closing in favor of #634

bdarcus changed the title ~~adding the citarn minimal experiment~~ Split cache, API change, round 3 Jun 7, 2022

bdarcus force-pushed the cache--api-refactor branch from 2347077 to 6954d65 Compare June 7, 2022 18:45

bdarcus mentioned this pull request Jun 7, 2022

Split data and completion caches, change APIs to use citekeys #627

Closed

bdarcus force-pushed the cache--api-refactor branch 2 times, most recently from 92c8f69 to ac00995 Compare June 7, 2022 22:44

bdarcus added the enhancement New feature or request label Jun 7, 2022

bdarcus added this to the v1.0 milestone Jun 7, 2022

bdarcus force-pushed the cache--api-refactor branch 7 times, most recently from 5ac5098 to ca52191 Compare June 8, 2022 00:57

bdarcus force-pushed the cache--api-refactor branch 2 times, most recently from 787ec6d to 1d6b332 Compare June 8, 2022 01:17

bdarcus mentioned this pull request Jun 8, 2022

support for Citar Vidianos-Giannitsis/zetteldesk.el#5

Closed

bdarcus force-pushed the cache--api-refactor branch from 1d6b332 to 477f2b3 Compare June 8, 2022 02:26

This was referenced Jun 8, 2022

config example not setting capf correctly mclear-tools/citar-capf#2

Closed

Add prefix argument to force reloading of the cache #122

Merged

bdarcus force-pushed the cache--api-refactor branch 4 times, most recently from d04e44a to 1b6cbc1 Compare June 8, 2022 13:37

bdarcus mentioned this pull request Jun 8, 2022

Why not make this part of Citar itself? mclear-tools/citar-capf#3

Closed

bdarcus force-pushed the cache--api-refactor branch 3 times, most recently from 120f866 to 7afafc4 Compare June 8, 2022 20:18

bdarcus added 3 commits June 13, 2022 08:24

Add citar--bibliography-files

61e385a

This functions returns all local and global bibliography files for 'citar--parse-bibliography' to parse.

Add citar-prefilter-entries

c2a4855

Allows to independently turn off whether to do this by default, and whether to toggle the behavior.

bdarcus force-pushed the cache--api-refactor branch from a1d0085 to c2a4855 Compare June 13, 2022 12:24

bdarcus changed the title ~~Split cache, API change, round 3~~ Refactor the cache architecture Jun 13, 2022

bdarcus changed the title ~~Refactor the cache architecture~~ Refactor cache architecture Jun 13, 2022

roshanshariff mentioned this pull request Jun 13, 2022

New caching and formatting functionality #634

Merged

6 tasks

bdarcus closed this Jun 14, 2022

bdarcus deleted the cache--api-refactor branch August 8, 2022 00:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor cache architecture #628

Refactor cache architecture #628

bdarcus commented Jun 7, 2022 •

edited

Loading

bdarcus commented Jun 7, 2022 •

edited

Loading

bdarcus commented Jun 8, 2022 •

edited

Loading

bdarcus commented Jun 12, 2022 •

edited

Loading

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022 •

edited

Loading

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022 •

edited

Loading

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022 •

edited

Loading

bdarcus commented Jun 12, 2022 •

edited

Loading

roshanshariff commented Jun 13, 2022

bdarcus commented Jun 13, 2022

roshanshariff commented Jun 13, 2022

bdarcus commented Jun 13, 2022 •

edited

Loading

bdarcus commented Jun 13, 2022 •

edited

Loading

Vidianos-Giannitsis commented Jun 13, 2022

bdarcus commented Jun 13, 2022 •

edited

Loading

Vidianos-Giannitsis commented Jun 13, 2022

bdarcus commented Jun 13, 2022

roshanshariff commented Jun 13, 2022

bdarcus commented Jun 14, 2022

Refactor cache architecture #628

Refactor cache architecture #628

Conversation

bdarcus commented Jun 7, 2022 • edited Loading

Status

Questions and notes

bdarcus commented Jun 7, 2022 • edited Loading

bdarcus commented Jun 8, 2022 • edited Loading

bdarcus commented Jun 12, 2022 • edited Loading

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022 • edited Loading

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022 • edited Loading

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022

roshanshariff commented Jun 12, 2022

bdarcus commented Jun 12, 2022 • edited Loading

bdarcus commented Jun 12, 2022 • edited Loading

roshanshariff commented Jun 13, 2022

bdarcus commented Jun 13, 2022

roshanshariff commented Jun 13, 2022

bdarcus commented Jun 13, 2022 • edited Loading

bdarcus commented Jun 13, 2022 • edited Loading

Vidianos-Giannitsis commented Jun 13, 2022

bdarcus commented Jun 13, 2022 • edited Loading

Vidianos-Giannitsis commented Jun 13, 2022

bdarcus commented Jun 13, 2022

roshanshariff commented Jun 13, 2022

bdarcus commented Jun 14, 2022

bdarcus commented Jun 7, 2022 •

edited

Loading

bdarcus commented Jun 7, 2022 •

edited

Loading

bdarcus commented Jun 8, 2022 •

edited

Loading

bdarcus commented Jun 12, 2022 •

edited

Loading

bdarcus commented Jun 12, 2022 •

edited

Loading

bdarcus commented Jun 12, 2022 •

edited

Loading

bdarcus commented Jun 12, 2022 •

edited

Loading

bdarcus commented Jun 12, 2022 •

edited

Loading

bdarcus commented Jun 13, 2022 •

edited

Loading

bdarcus commented Jun 13, 2022 •

edited

Loading

bdarcus commented Jun 13, 2022 •

edited

Loading