Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor cache architecture #628

Closed
wants to merge 3 commits into from
Closed

Refactor cache architecture #628

wants to merge 3 commits into from

Conversation

bdarcus
Copy link
Contributor

@bdarcus bdarcus commented Jun 7, 2022

Convert the cache from a single list, to a list of structured objects, each of which contains two hash tables, each keyed by citekey:

  • :entries the raw bib data
  • :completions the formatted completion strings

The completion candidates themselves are now no longer cached, but quickly assembled on-the-fly.

The bibliography objects also include the following additional properties, to selectively update bibliographic data:

  • :hash the checksum of the file
  • :buffers the associated buffers

Close #627, fix #623 #610

Status

Roshan's #634 is an improvement on this (the text above reflects it actually), so closing this in favor of that.


Caching and completion all work, as do at least some of the interactive commands (I haven't tested them all; some may break currently because completion now always returns a list of keys).

image

I've added keyword comments in places.

Questions and notes

  1. Can we avoid reintroducing a separate local cache? I think we can. See citar--parse-bibliography (and also org-cite-list-bibliography-files).
  2. Can we remove some of the inotify code, since this should obsolete the part that deals with bib files? If yes, then we can also remove citar-refresh etc, which I have done for now (though adding it back would basically be a wrapper for (clrhash citar--completion-cache).
  3. Can we resolve Optimize and generalize has-notes #623 along with this? (I think I did.)
  4. Do we need "rebuild cache" now (see comments from its original author)? I assumed no, so removed it. What about ensure-entries?
  5. Should we wrap the completion cache in citar--get-candidates, or deprecate that function (as now)?
  6. Where else can we simplify, and remove code?
  7. citar-select-ref now always returns a list. Is that a mistake? My key concern is avoiding bugs in Embark and such.
  8. citar--ref-completion-table is conceptually the same as citar--get-candidates; should we rename it citar--get-ref-completions to be consistent?

@bdarcus bdarcus changed the title adding the citarn minimal experiment Split cache, API change, round 3 Jun 7, 2022
@bdarcus bdarcus force-pushed the cache--api-refactor branch from 2347077 to 6954d65 Compare June 7, 2022 18:45
@bdarcus bdarcus force-pushed the cache--api-refactor branch 2 times, most recently from 92c8f69 to ac00995 Compare June 7, 2022 22:44
@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 7, 2022

@roshanshariff - FYI, I have a status section at the top of the PR, where I will briefly update where things stand as I make changes.

I moved everything from citarn.el to citar.el, but have left the citarn.el file there in case it's useful for to use experimentally.

The changes to citar.el actually didn't shrink the LOC. But hopefully we can trim elsewhere. Actually, there is a net reduction of maybe 100 lines so far, if you exclude citarn.el.

@bdarcus bdarcus added the enhancement New feature or request label Jun 7, 2022
@bdarcus bdarcus added this to the v1.0 milestone Jun 7, 2022
@bdarcus bdarcus force-pushed the cache--api-refactor branch 7 times, most recently from 5ac5098 to ca52191 Compare June 8, 2022 00:57
@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 8, 2022

ELISP> (benchmark-run-compiled 1 (citar--ref-completion-table))
(0.597032375 1 0.1635932209999993)

@bdarcus bdarcus force-pushed the cache--api-refactor branch 2 times, most recently from 787ec6d to 1d6b332 Compare June 8, 2022 01:17
@bdarcus bdarcus force-pushed the cache--api-refactor branch from 1d6b332 to 477f2b3 Compare June 8, 2022 02:26
@bdarcus bdarcus force-pushed the cache--api-refactor branch 4 times, most recently from d04e44a to 1b6cbc1 Compare June 8, 2022 13:37
@bdarcus bdarcus force-pushed the cache--api-refactor branch 3 times, most recently from 120f866 to 7afafc4 Compare June 8, 2022 20:18
@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 12, 2022

I actually think we can add the notes/files data fast enough without a cache, which would simplify matters greatly. I think that should be our first preference unless there's a compelling feature we can't get without a cache.

If we're just talking about #628 (comment), I guess that would solve the problem I raised there.

But creating the completion candidates is still the slowest operation, and turning off the completion cache is noticeable, even with my not very large library.

The completion cache is also brain-dead simple, so the cost seems low.

So I'll be curious what you find in your experiments.

Granted, keeping the caching would require a way to invalidate the cache from outside. But as I think about that, maybe its trivial?

@roshanshariff
Copy link
Collaborator

Even if we can invalidate a cache from outside, it's still inefficient to regenerate the cache repeatedly for every buffer every time any metadata changes. A cache only goes so far, and if it's invalidated too often you'll lose the benefits of caching.

Currently, generating the completion candidates is doing a lot of unnecessary work in the template expansion. I think it can be boiled down to just concatenating a few strings, which can in fact be done as-needed. The trick is to do most of the string formatting ahead of time per-bibliography, and only redo it whenever the bibliography changes.

Then the only work that needs to be done to generate the completion candidates is to gather the list of bibliographies, get their associated formatted strings, concatenate them (possibly adjusting for the frame width), and add the has-note/has-file metadata. This last step (I believe) can be done very quickly.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 12, 2022

OIC.

I guess this also goes back to the earlier discussion of as-needed display formatting, which completing-read per doesn't support yet, but its group-function does.

Edit:

In the interim, I guess you're thinking, we can just stash the primary completion string in an entry field while parsing the bibliography?

@roshanshariff
Copy link
Collaborator

I don't think it's a matter of just completing-read supporting as-needed formatting. You pretty much need all the strings in memory if you're going to search them. You may be displaying only a few strings at a time, but as soon as you enter a single character/tab-complete you need to search all the strings in the completion table.

The only thing you can do at display-time is to add non-searchable but visible metadata (like Marginalia does). This doesn't help Citar much because we pretty much want the whole completion string to be searchable, so we must necessarily generate all the completion strings every time we call completing-read. The only thing that is "ornamental" is the column-width truncation that hides text when it's too long to fit on one line, and we can just do that with the completion strings when we generate them (they won't reformat if you resize the Emacs frame while completion is running, but that's a small tradeoff).

In the interim, I guess you're thinking, we can just stash the primary completion string in an entry field while parsing the bibliography?

I was thinking of another hash table in the cache with the display strings, just to avoid messing with the entry too much. Otherwise, it gets hard to keep track of all the non-standard bibliography fields we're adding and they become a de facto part of our public API, which means we can't change implementation details without breaking external code.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 12, 2022

OIC; so basically moving something like the hash (minus the "has" stuff, and with the key-value structure inverted) that is now the completion cache into the bibliography cache?

Makes sense in any case. I get the basic idea, and agree it likely makes sense.

The end result is one cache, that is simple from an API standpoint, but with great functionality, and very snappy performance.

The downside, I'd guess, is the internals of that cache may be complex.

It also occurs to me it'd be best optimized, if one has a large number of entries, for having one large file, that doesn't change much, and maybe one or two that do.

Say a main file, and a "new" file.

Is that right?

@roshanshariff
Copy link
Collaborator

Yeah, exactly. The structure of the cache is actually not that complex, and even that complexity is localized in a few functions.

And yes, it is optimized for that use case, but there's no disadvantage to having a large number of small bibliographies either. It's just that all those bibliographies will be stored in a central cache, but fetching things from that cache is extremely efficient so it doesn't really matter.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 12, 2022

The structure of the cache is actually not that complex ...

I was realizing after I wrote that something like this would probably work to represent a file, and it's indeed simple.

("/home/test/bib/test.bib" 
   :data #<hash-table equal 1/65 0x1b5b1e9> 
   :completions #<hash-table equal 0/65 0x208978f> 
   :checksm "1234" 
   :buffers nil)

@roshanshariff
Copy link
Collaborator

Yeah, that's pretty much what I have now, but defined using a cl-defstruct so you get accessor functions, documentation, etc.

(cl-defstruct (citar--bibliography
               (:constructor citar--make-bibliography (filename))
               (:copier nil))
  "Cached bibliography file."
  (filename
   nil
   :read-only t
   :documentation
   "True filename of a bibliography, as returned by `file-truename`.")
  (hash
   nil
   :documentation
   "Hash of the file's contents, as returned by `buffer-hash`.")
  (buffers
   nil
   :documentation
   "List of buffers that require this bibliography.")
  (entries
   (make-hash-table :test 'equal)
   :documentation
   "Hash table mapping citation keys to bibliography entries,
   as returned by `parsebib-parse`.")
  (formatted
   (make-hash-table :test 'equal)
   :documentation
   "Formatted strings used to display bibliography entries."))

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 12, 2022

but defined using a cl-defstruct so you get accessor functions, documentation, etc.

Oh, good; hadn't thought of that, but I like it a lot!

I had just run into cl-defstruct for the first time recently when working with the org-roam codebase.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 12, 2022

This is a totally minor thing, but do you know if there's a way to get it so that marginalia actually shows the documentation strings in the annotations?

image

Is it worth asking Daniel and Omar about? I don't see anything about it in that issue tracker.

@roshanshariff
Copy link
Collaborator

It looks like cl-defstruct always makes the first line of the documentation be a generic string like what you see in that screenshot. The :documentation string you supply ends up on the next line, and you can see it if you use describe-function etc. It's slightly annoying, but I think you'd need to modify cl-defstruct itself to fix it. The first line of the documentation string is treated specially throughout Emacs, so I doubt it would be easy/desirable to hack around that in Marginalia.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 13, 2022

Ah yeah; that makes sense.

BTW, another minor thing we can control ...

As I'm reading the cl-defstruct docs, we may as well include a :type property on each definition?

@roshanshariff
Copy link
Collaborator

I was looking at that, and it looks like that's mainly for compatibility with Common Lisp? It's ignored in Elisp, and in fact there isn't really a good vocabulary for naming types in Emacs.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 13, 2022

This has some good tips on cl-defstruct etc:

https://nullprogram.com/blog/2018/02/14/

He recommends, and I think I agree, for a constructor name like citar--bibliography-create.

I also think completions may be more clear than formatted (which may be misleading).

 (cl-defstruct (citar--bibliography
        (:constructor citar--bibliography-create (filename))
               (:copier nil))
  "Cached bibliography file."
  (filename
   nil
   :read-only t
   :documentation
   "True filename of a bibliography, as returned by `file-truename`.")
  (hash
   nil
   :documentation
   "Hash of the file's contents, as returned by `buffer-hash`.")
  (buffers
   nil
   :documentation
   "List of buffers that require this bibliography.")
  (entries
   (make-hash-table :test 'equal)
   :documentation
   "Hash table mapping citation keys to bibliography entries,
   as returned by `parsebib-parse`.")
  (completions
   (make-hash-table :test 'equal)
   :documentation
   "Formatted completion strings used to display bibliography entries."))

bdarcus added 3 commits June 13, 2022 08:24
Sorry for this breaking change, but I wanted to get the foundations
right before tagging 1.0.

This completely restructures the core of citar to borrow some code and
ideas from the org-mode oc-basic package.

In particular, it changes to using two primary caches:

- bibliography
- completion

Both of these now use hash tables, rather than lists.

Caching functionality is also changed, and the API now focuses on
citekeys as arguments for key functions.

Finally, citar--parse-bibliography should re-parse bibliography files
upon change.

Fix #623 Close #627
This functions returns all local and global bibliography files for
'citar--parse-bibliography' to parse.
Allows to independently turn off whether to do this by default, and
whether to toggle the behavior.
@bdarcus bdarcus force-pushed the cache--api-refactor branch from a1d0085 to c2a4855 Compare June 13, 2022 12:24
@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 13, 2022

FWIW, I just pushed a change here (rebased from main), but it was only adding the new citar-capf.el file I just merged to main. We'll need to adjust that to this branch.

@Vidianos-Giannitsis
Copy link

From my perspective, this idea seems pretty good. If I understand this correctly, this makes bibliography into a cl-defstruct which should make it much easier to then access each separate object of it. The org-roam-node which is defined this way has a lot of supporting functions (I think they are called methods in CL terminology) which allow you to access each component separately and I think it works very well and is extensible by the user. For example I have defined a method that accesses the buffer of an org-roam-node for use in zetteldesk. Making the citar bibliography like this would then make it much easier to access parts of it I feel.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 13, 2022

Thanks for taking a look!

If I understand this correctly, this makes bibliography into a cl-defstruct which should make it much easier to then access each separate object of it.

It makes the cache a list of such objects, so we have finer-grained control of UI updating, and also, as a consequence the "has-note" indicators in the UI are always up-to-date with the bibliography files (edit: AND possible external note systems like org-roam).

The higher level functions (like citar--get-entry) will mostly stay the same.

Make sense?

@Vidianos-Giannitsis
Copy link

Yeah I get it. This definitely sounds like a step in the right direction. And by not changing the higher level functions it also doesn't break anybody's workflow which is a good thing (unless they use the low level functions at which point they can figure out what to change).

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 13, 2022

@roshanshariff any ETA on when you can open up at least a draft PR?

Even if not all functions are working ATM, I can still weigh on what they are, etc.

@roshanshariff
Copy link
Collaborator

ETA on when you can open up at least a draft PR?

Sure, I can do that in a couple of hours. Pretty much all the code is ready and tested, and the performance looks promising. Just need to do a bit of work to hook it up to the existing citar functions so you can use it.

@bdarcus bdarcus changed the title Split cache, API change, round 3 Refactor the cache architecture Jun 13, 2022
@bdarcus bdarcus changed the title Refactor the cache architecture Refactor cache architecture Jun 13, 2022
@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 14, 2022

Closing in favor of #634

@bdarcus bdarcus closed this Jun 14, 2022
@bdarcus bdarcus deleted the cache--api-refactor branch August 8, 2022 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Split data and completion caches, change APIs to use citekeys Optimize and generalize has-notes
4 participants