First load is too slow #659
Comments
@mclearc, I'll have to look into what's causing this, since in my testing a bibliography file of that size took much less time. Would you be willing to share your bibliography, privately if you prefer?
Sure -- here it is.
Oh, and if it's of use, here's my config:

(use-package citar
  :straight (:host github :repo "emacs-citar/citar")
  :commands (citar-open-beref
             citar-open-notes
             citar-insert-citation)
  :bind (:map citar-map
              ("b" . #'citar-open-beref))
  :custom
  ;; Use with org citation
  (org-cite-global-bibliography `(,lem-bibliography))
  (citar-bibliography `(,lem-bibliography))
  (org-cite-insert-processor 'citar)
  (org-cite-follow-processor 'citar)
  (org-cite-activate-processor 'citar)
  :config
  ;; use embark with at-point
  (setq citar-at-point-function 'embark-act)
  (setq citar-default-action 'citar-open-beref)
  ;; add beref entry for bookends
  (setq citar-additional-fields '("doi" "url" "beref"))
  (setq citar-templates
        `((main . " ${=key= id:15} ${title:48}")
          (suffix . "${author editor:30} ${=type=:12} ${=beref=:12} ${tags keywords:*}")
          (preview . "${author editor} (${year issued date}) ${title}, ${journal journaltitle publisher container-title collection-title}.\n")
          (note . ,lem-citar-note)))
  (when (display-graphic-p)
    (setq citar-symbols
          `((file ,(all-the-icons-octicon "file-pdf" :face 'error) . " ")
            (note ,(all-the-icons-octicon "file-text" :face 'warning) . " ")
            (link ,(all-the-icons-octicon "link-external" :face 'org-link) . " "))))
  ;; edit notes
  (setq citar-notes-paths `(,lem-bib-notes))
  ;; Citar & Bookends
  (defun citar-get-beref (entry)
    (let* ((field (citar-has-a-value '(beref) entry))
           (base-url (pcase field
                       ('beref "bookends://sonnysoftware.com/"))))
      (when field
        (concat base-url (citar-get-value field entry)))))
  (defun citar-open-beref (keys-entries)
    "Open bookends link associated with the KEYS-ENTRIES in bookends.
With prefix, rebuild the cache before offering candidates."
    (interactive (list (citar-select-refs
                        :rebuild-cache current-prefix-arg)))
    (dolist (key-entry keys-entries)
      (let ((link (citar-get-beref (cdr key-entry))))
        (if link
            (browse-url-default-browser link)
          (message "No ref found for %s" key-entry))))))
Just putting this note from Roshan here:
From some profiling on a file with about 1500 entries, it seems like about half the time is spent inside parsebib's TeX-markup clean-up. I think this was a recent addition to parsebib, so it might have been responsible for the slowdown and not the new caching mechanism. Edit: for me, evaluating that clean-up accounts for most of the time.
That might make sense, @aikrahguzar. @joostkremers added that on June 16, so actually before we merged the cache update. Does that timing fit your experience, @mclearc? Still, that doesn't explain the discrepancy in performance between the two of them. If, in fact, that explains it, any thoughts on what to do, @roshanshariff? I wonder also if there's room for optimizations in that parsebib function (though I can't see how)?
Well, one could remove some items from the default replacement alist.
Profiling the code suggests that the TeX-markup conversion is indeed where the time goes. Another option that comes to mind: citar could perhaps do the loading of the bib files in the background. Alternatively, perhaps the conversion can be done on the fly, only for the field values that are actually shown to the user. I really have no idea if that would even be possible to implement, though.
My own experience, on a less-than-a-year-old laptop, seems more in line with 15-20 seconds for a bib file with 7200 entries.
I think this time is well spent, and the big default alist is a good thing, since it would be a pain to recreate it in a personal config, and it does lead to much nicer-looking results. Somewhere in the citar documentation, people with large bib files should be made aware of it, and then, as you said, they can customize it.
My impression is that building one huge regexp with a lot of parenthesized sub-expressions should be faster than applying each replacement separately. I think another thing that can help is to make the elements of the replacement alist specific to particular fields.
I suppose one simple thing we could do is a shadow variable that we use to let-bind the parsebib variable around parsing; see the sketch below.
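For illustration, a minimal sketch of that shadow-variable idea. None of this is actual citar code, and the parsebib variable name used here (parsebib-TeX-markup-replace-alist) is an assumption; substitute whatever variable parsebib actually exposes.

;; Rough sketch of the shadow-variable idea (hypothetical names).
(defvar citar--markup-replacements 'use-parsebib-default
  "Replacement rules citar passes down to parsebib.
The default value means: leave parsebib's own setting alone.")

(defun citar--parse-bibliography-shadowed (files)
  "Parse FILES, let-binding parsebib's replacement alist if we shadow it."
  (if (eq citar--markup-replacements 'use-parsebib-default)
      (parsebib-parse files :display t)
    (let ((parsebib-TeX-markup-replace-alist citar--markup-replacements))
      (parsebib-parse files :display t))))

A user with a huge bib file could then set citar--markup-replacements to a smaller alist (or nil) without touching parsebib's global default.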
One could also always use CSL-JSON instead, and/or there's #653, which I still think may be the ideal solution longer term, once we've optimized everything else.
If I find some time, I can try it out and see if it brings any improvements. My laptop is grossly underpowered, so it will be noticeable immediately. 😆 One possibly complicating factor is that the replacement in this case needs to be a function, which would need to look up the actual replacement string in an alist. That may slow things down again.
Yeah, I added a commit a few days ago that provides an exclusion list. Currently, it only contains a handful of fields. In fact, given the resource-intensiveness, which I didn't think about before, it may actually make sense to make it an inclusion list. @bdarcus what do you think?
I think probably yes also. I'd be interested in input from others on this thread too, though. @roshanshariff, for example, wrote the new cache code and has been doing all the performance tuning on our side.
Since the regexp is built from an alist and we know the order of the sub-expressions, the replacement can be a loop over the alist instead of a lookup for each match. I think that shouldn't be too bad.
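A minimal sketch of that combined-regexp idea, not parsebib's actual implementation. It assumes the individual patterns contain no capture groups of their own (otherwise the group numbering would shift) and handles only literal replacements; the real clean-up also needs back-references and replacement functions.

(require 'cl-lib)

(defun my/clean-markup (string replacements)
  "Apply REPLACEMENTS, an alist of (REGEXP . LITERAL), to STRING in one pass.
Each regexp becomes one parenthesized alternative of a single combined
regexp, so the index of the matching group tells us which replacement
to use, with no per-match lookup."
  (let ((combined (mapconcat (lambda (pair) (concat "\\(" (car pair) "\\)"))
                             replacements "\\|")))
    (with-temp-buffer
      (insert string)
      (goto-char (point-min))
      (while (re-search-forward combined nil t)
        ;; Find the first sub-expression that participated in the match;
        ;; its position gives the corresponding alist entry directly.
        (let* ((idx (cl-position-if #'match-beginning
                                    (number-sequence 1 (length replacements))))
               (replacement (cdr (nth idx replacements))))
          (replace-match replacement t t)))
      (buffer-string))))

;; Example: (my/clean-markup "x \\ldots y" '(("\\\\ldots" . "…") ("~" . " ")))
;;          => "x … y"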
I think it makes sense to allow regexps to be specific to the field, so something like an alist where the car is a list of fields and the cdr is the alist of regexps and replacements. I am not sure that is the best design, but I think it makes sense to be able to target regexps with some specificity.
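Purely as an illustration of the shape being proposed here (the variable name and patterns below are made up, not an existing option), it could look something like:

;; Hypothetical customization shape: the car is a list of fields, the
;; cdr an alist of regexps and replacements applied only to those fields.
(defvar my/field-specific-replacements
  '((("title" "booktitle")
     ("\\\\mathcal{\\([A-Z]\\)}" . "\\1"))   ; math markup mostly in titles
    (("author" "editor")
     ("{\\\\'e}" . "é")))                    ; accents mostly in names
  "Alist mapping lists of field names to (REGEXP . REPLACEMENT) alists.")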
Hmm, interesting idea. I'll see if I can implement it.
I'm not sure that would be very useful. I don't think there are TeX constructs that are (much) more likely to occur in some fields but not in others. Do you have any specific examples in mind? I do think it makes sense to limit which fields the clean-up is applied to, though.
FWIW, I've removed this from the 1.0 milestone, since I don't think this should hold us up from tagging 1.0. It seems most of the code changes will happen on the parsebib side, and in any case won't impact our API.
You are right about this; the only examples I can think of are different calligraphic letters, which are only likely to occur in the titles of math papers, and accents, which are more likely in author names (but that is not going to be true for bibliographies where not all the papers are in English). My motivation was not actually cleaning markup, but to use this mechanism as a general way of getting a pretty representation of field values, since it is already very close to that. One example of this has already come up elsewhere.
Sorry, I was out of the loop for a bit. Thanks @aikrahguzar for digging into this more! That would explain why I couldn't initially reproduce the slowdown with my own setup. I won't be able to look at the code for a few days, but based on the discussion here, my initial impression is that we shouldn't perform the de-texification as part of parsing itself. This way, code that needs the raw field value (for opening URLs, DOIs, and the like) can still get at it, and the prettified version can be produced separately for display. This might not be enough to completely solve the performance problem. I have some ideas for further mitigations, but will have to look at the code to see what's reasonable. I'll probably get to that on the weekend.
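To make the intended raw/display split concrete, a rough sketch; the helper names here (citar--raw-value, citar--display-value, citar--prettify-value) are hypothetical, not actual citar functions.

;; Keep the raw parse untouched; derive the display form only where
;; something is actually shown to the user.
(defun citar--raw-value (entry field)
  "Return the raw, unmodified value of FIELD in ENTRY."
  (cdr (assoc field entry)))

(defun citar--display-value (entry field)
  "Return FIELD of ENTRY with TeX markup cleaned up, for display only."
  (let ((raw (citar--raw-value entry field)))
    (when raw (citar--prettify-value raw))))  ; the expensive clean-up step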
The scenario where @mclearc was experiencing pain was a very large file that changes frequently, which will probably never be exactly fast (I still think we need #653 for that). Wouldn't your suggested changes here merely move where the work is done, but otherwise not improve overall performance? Or would the improvement be marginal? I guess you'll only know when you have a chance to test, though. In any case, it would still help us (and other parsebib consumers) if @joostkremers can find a way to increase the performance of that function.

EDIT: I guess one potential other advantage of doing the conversion on our side shows up in the benchmarks below.

Benchmarks, etc.

BibTeX

Benchmark comparisons ATM, with a file with less than 1,000 entries:

ELISP> (benchmark-run-compiled 10 (parsebib-parse citar-bibliography :display t))
(41.728177873 130 23.535983567999992)

ELISP> (benchmark-run-compiled 10 (parsebib-parse citar-bibliography :display nil))
(1.5515592139999999 2 0.36761948199999495)

FWIW, I don't see any performance difference from limiting the :fields:

ELISP> (benchmark-run-compiled 10 (parsebib-parse citar-bibliography :display nil :fields (list "title")))
(1.567477618 2 0.3720656209999902)

CSL-JSON

Finally, we do need to remember that we, and parsebib, also support CSL-JSON. Note this file has more entries than the one I use above.

ELISP> (benchmark-run-compiled 10 (parsebib-parse (list "~/org/bib/MyLibrary.json") :display nil))
(0.591864735 1 0.189613928)

ELISP> (benchmark-run-compiled 10 (parsebib-parse (list "~/org/bib/MyLibrary.json") :display t))
(1.693483346 3 0.7157699330000042)

So performance with the JSON is really good, even with :display t.

Here, BTW, is what the values look like with CSL-JSON:

((id . "zonszein_2020-amazing-image")
 (abstract . "Amazing image of rally in Tel Aviv now against corrupt Netanyahu and the damage to democracy. Every protestor standing on Xs marking 6 feet apart from one another. https://t.co/Qhd9h6Za6s\" / Twitter")
 (accessed (date-parts . [[2020 4 19]]))
 (author . [((family . "Zonszein") (given . "Mairav"))])
 (container-title . "Twitter")
 (issued (date-parts . [[2020 4 19]]))
 (language . "en")
 (source . "twitter.com")
 (title . "Amazing image of rally in Tel Aviv ...")
 (title-short . "Mairav Zonszein מרב זונשיין on Twitter")
 (type . "webpage")
 (URL . "https://twitter.com/MairavZ/status/1251928834267443201"))

If we go that direction, the cache will need to track what format the raw data is in.
Removing markup is a one-time but expensive operation, and that seems like a very good use case for caching. I think trying to do it on the fly is unlikely to work, since we can't do it lazily just for the entries that are on the screen. And it is slow for a good reason. So I think that, for managing large bibliographies, we have to document it and direct people to have less markup, and perhaps make it possible to do it just for citar by providing a variable which we let-bind.
FWIW, I believe this is a current limitation of completing-read. In Helm, and I think also Ivy, they support this via the notion of display transforms, which would be nice to have here. But this project has always tried to find creative ways to work around such limitations.
I don't think so. As soon as you present candidates to the user for completion, they must be in an appropriate form to match against (in your case, the parsed BibTeX entries converted to a displayable string). Display transformers only work for superficial candidate formatting/transformation, e.g., changing the colors of directories in find-file or modes in M-x. If you used something like consult--async, you could load the candidates in the background, but I am not sure that would lead to a satisfactory UI. For consult-grep, it is more expected that the candidates appear in batches. There is probably no magic bullet here.
Yes, unless you have one very large bib file (in this case, with 7200 entries) that does change frequently. In that case, the OP was seeing load times of 25-30 seconds. But we've always known that case can only be optimized so much. Still, some mix of your suggestions (and maybe @aikrahguzar's) for the parsebib function and for citar should help a lot.
I think caching the de-TeXed/prettified fields is reasonable. There only needs to be one universal cache for this, since the rules for transforming field values to their display form don't change. And with some clever use of weak references, you never have to clear this cache (it won't keep growing indefinitely). This should speed up updating bib files on modification, because presumably most of the strings in the bib file will remain the same. We still need the clean-up function to be fast so the initial load doesn't take too long, but I think optimizing the regexp search/replace will accomplish that. I'll be experimenting with this over the next couple of days.
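A minimal sketch of that caching idea, under assumed names (my/prettify stands in for the expensive clean-up): a single hash table, weak on keys, that memoizes the display form of each raw string.

(defvar my/display-cache (make-hash-table :test #'equal :weakness 'key)
  "Maps raw field strings to their prettified display form.")

(defun my/display-value (raw)
  "Return the cached display form of RAW, computing it on first use."
  (or (gethash raw my/display-cache)
      (puthash raw (my/prettify raw) my/display-cache)))

;; Note: with :weakness 'key, an entry survives only as long as the exact
;; key object (not just an `equal' copy) is referenced outside the table,
;; e.g. by the parsed entry itself.  That is what keeps the cache from
;; growing indefinitely without ever needing an explicit flush.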
Not to be pushy, @roshanshariff, but any update? If this looks like it may be more involved, I think I'll just do the Doom module update now (which is mostly about other changes anyway).
Yeah, it's going to be a little more involved; there are some niggling design details to get right. Feel free to do the update in the meantime, @bdarcus.
The Doom update is merged; it now includes a specific parsebib pin. BTW, @roshanshariff, if you haven't noticed already, @aikrahguzar has a linked draft parsebib PR that speeds up that function pretty dramatically. It seems it will be a couple of weeks before that can be merged over there (Joost is on vacation), and when it is, it should largely fix this problem.
https://www.reddit.com/r/emacs/comments/wk4dur/citar_10_citarorgroam_doom_biblio_update/
Just to update: when I tested @aikrahguzar's draft PR code at joostkremers/parsebib#25 about a week ago, it improved loading performance by about 50%. I'd assume that would get us back to what it was before these big changes.
As I mentioned before, I think it might be a good idea to use an inclusion list after all, rather than an exclusion list. For one, it wouldn't result in an unexpected increase in parsing time if a user decides they want to retrieve some additional fields from their bib files. So the clean-up would only be applied to the fields that are explicitly listed. If I don't get any complaints, remarks or comments, I'll make this change some time next week.
Joost merged this PR, which should fix this issue. Please test and report back if not.
On the latest commit (really, I think, since the change to the new cache system), I've noticed that citar takes 25-30 seconds to load on first being called (and the same for any subsequent change in the bib file). I have a somewhat large .bib file (~7200 entries), but it did not take this long to load previously. Is there a way to avoid this without splitting my bib file into multiple smaller ones (something I am not intending to do)? Thanks!
Originally posted by @mclearc in #653 (comment)