Support org-bibtex as a bibliographic backend #397
Comments
I really know nothing about org-bibtex, and what would be involved in supporting it. @JonathanReeve - how would it work from a config POV? Would you add these org files as part of Any input on this from an implementation POV @andras-simonyi? Would I just read the file using |
I wonder if this is something parsebib can support, and org-cite itself. Looking at the citeproc.el code, it includes some functions to parse org-bibtex (see example), but I'm not sure how comfortable I am including that here. Even if it's not so bad to display, I'd guess performance will be terrible, and other features would be missing (would only work with oc-csl, for example; and open-notes would not work without adding code for it). I think this is one of those things I probably won't have time to support, but would consider a PR. Until that happens, might be best to just auto-run (say via a hook or something) FWIW, the current design for notes is more optimized around org-roam (though doesn't in any way require it of course). |
I think it would be rather trivial to add some support for that format utilizing the functions provided by |
@bdarcus why do you think that performance would suffer? My impression was that it'd be enough to parse the |
There are really two pieces of functionality here:
If it's true that parsing performance is a non-issue, then it doesn't really matter. I was just assuming it would be a lot slower than bibtex or json. I have no data to back up that assumption. Aside: in playing with org-bibtex functions, I ran into unexpected behavior. Almost none of it behaves as I expect. For example, if I do |
I tend to agree that parsing another bibliography format is best handled in parsebib, but looking at So this raises the question, what library is used to read the relevant data? |
org-bibtex.el? |
I thought I had read somewhere that |
Hah; you probably know more than I. |
Seems @andras-simonyi is using three functions for this in |
Yes, |
I thought you said performance shouldn't be an issue? |
What I meant was that, after parsing, performance should be the same in citar. As for parsing, in principle it shouldn't be an issue either if |
No need; I was just curious. So to clarify my earlier point on "performance", I also meant parsing. But that can be up to a user to balance, if it's an issue. |
@JonathanReeve - per the discussion, support here would depend on upstream enhancements to parsebib. If and when that's added, library browsing should automatically be upgraded, and we can discuss other possible enhancements here. |
I'll see about adding that in the near future. Where can I learn about the details of the format? Are the field names taken directly from BibTeX? What about biblatex? And what about the field values? If the format is essentially BibTeX-based, does it allow For |
@andras-simonyi @JonathanReeve - any info? |
All I know about the format is from those .el files. Just from a glance at the code, it looks like it supports cross-references. One thing I'm not sure it does, but would love it to do, is to infer a cross-reference-like hierarchy from the org hierarchy. That way you can have an org heading with type |
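For what it's worth, Org's own property inheritance already gives something close to an inferred hierarchy. Below is a minimal sketch, assuming a chapter heading nested under a book heading; the helper names and the sample entries are made up for illustration, and only `org-entry-get` with its INHERIT argument is actual Org API.

```elisp
(require 'org)

(defun my/demo-field-with-inheritance (property)
  "Return PROPERTY for the heading at point, falling back to its ancestors.
Hypothetical helper: a non-nil third argument to `org-entry-get' turns on
property inheritance, so a chapter can pick up fields from its book."
  (org-entry-get nil property t))

(defun my/demo-chapter-year ()
  "Look up YEAR for a chapter nested under a book heading (made-up data)."
  (with-temp-buffer
    (insert "* An edited volume\n"
            ":PROPERTIES:\n"
            ":BTYPE: book\n"
            ":CUSTOM_ID: volume2019\n"
            ":YEAR: 2019\n"
            ":END:\n"
            "** A chapter\n"
            ":PROPERTIES:\n"
            ":BTYPE: incollection\n"
            ":CUSTOM_ID: chapter2019\n"
            ":END:\n")
    (org-mode)
    (goto-char (point-min))
    (search-forward "** A chapter")
    (list (org-entry-get nil "YEAR")                 ; own properties only
          (my/demo-field-with-inheritance "YEAR")))) ; falls back to the parent
```

Here the first lookup only consults the chapter's own drawer, while the second walks up to the enclosing book heading. Whether this would map cleanly onto BibTeX cross-references is an open question.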
Unfortunately, similarly to Jonathan, I only know about the format what I could gather from the implementation in |
Here's a similar use case, but using embedded BibTeX. https://www.reddit.com/r/emacs/comments/rsyqxu/literate_annotated_bibliography_wip/ Ideally the solution could support both (though the BibTeX would be much simpler of course; I guess this is what tangle is for?). |
I created an issue on the parsebib tracker, and linked it here. |
Any plans on this very interesting approach, or similar workflows people have found? An advantage of bibtex-fields-as-properties is you can use org-agenda to do detailed searches on date, journal name, etc. And of course include notes, images, citations to other articles, even the PDF of the article itself, all in one place. |
Well, per discussion above, I think part of this would depend on parsebib enhancements, and even then, there are open questions on this end. See, for example: #397 (comment). I don't really consider it a high priority, particularly if it comes with big implementation headaches, or would result in a less-than-elegant-or-consistent UX. 🤷 |
I see. Perhaps there's an interesting middle-ground: use |
Yes; definitely. Performance would likely also be much better, I'd guess. |
The problem with org-bibtex is that it's pretty addictive once you start to use it 😃 (it supports some very attractive workflows) and -- given the already existing infrastructure -- it's rather trivial to add support to tools like Citar. For reference, I've just posted this gist on Reddit, which patches Citar to kind of work with org-bibtex bibliographies directly. I know that adding support to Parsebib would be more appropriate but I needed the functionality quickly. I plan to come up with a Parsebib PR when I find the time but it's not a priority. Just for the record, if property inheritance is disregarded (and I don't use it) then I don't think that there should be any noticeable performance difference compared to parsing other bibliography formats. In a sense, org-bibtex is just a simple notational variant of BibTeX and co. |
Cool! "Kind of works"? How do you find the UX? For reference, here's the reddit post. https://www.reddit.com/r/emacs/comments/wqjare/comment/iku77h0/ |
Well, I meant only that I don't see a difference compared to "normal" BibTeX or biblatex entries, but there might be some things which I haven't noticed yet.
See above -- if the question is about Citar in general, then I love the UI, the only wrinkle I remember noticing (using embark) is that it makes me choose the associated URL from a list even if there is only one. |
Should be faster now, though I did not manage to do much about the performance of indenting the 9.4 MB file. Apart from indentation, things should be imported in a few seconds. |
I don't have time ATM to test the huge file and Ihor's branch, but I did test (my modified version of) your "toy" code, and it works fine with my pinned version of the main org branch. Note that BTW, not sure if this is equivalent, but here's with a cached ~1000 entry biblatex file:
    ELISP> (benchmark-run-compiled (citar-get-entries))
    (0.0007769770000000001 0 0.0)
|
You can store a processed version of a headline alongside the Org cache with |
@bdarcus what are your thoughts on leaving the data (including perhaps a TeX-cleaned copy) in the auto-repairing org-element cache vs. duplicating it in a hash table? What all information is needed on the completing-read/embark side for all the various actions that can be taken on a given selected ref? I ran your
So once the hash is built, it is instantly available after this. But as soon as you edit anything... "Updating Bibliography" and 20 more seconds of wait. |
To avoid needing the citar cache, we'll need to sort out the two linked issues #681 and #680. I have a feeling that will be tricky. Also, note: parsing is fast regardless. What's slow is formatting for display. E.g. So when citar (re)parses files, it also (re)generates the preformatted display strings, which it then concatenates with the "has related resource" symbols every time one runs It might be, then, this still needs the cache for that. See the Could also be we do this iteratively, with deeper performance optimizations coming later (which will probably require the more radical step that Ihor was suggesting, so that at least the formatting is done real-time, but incrementally)? |
I wouldn't necessarily agree, given 20s for 10k refs every time the file changes! That's obviously a heavy case, but not that extreme.
That's unfortunate. So for my example case you'd reformat 10k strings after every invocation? I think the way consult-*+marginalia handle this is to format the 10 or 20 lines "just in time" as they are scrolled into view, and cache the formatted text.
One advantage is that for org-records this info would be "all in one" under the same heading, so it would be easy to grab those details from the element in a single pass. |
I'm just saying most of that time isn't in fact parsing (notwithstanding a temporary parsebib regression).
No; just once, until the data changes. The cache stores the "preformatted" strings (the candidates, minus the "has" prefix). Compare these:
    (map-values
     (citar-cache--bibliography-entries
      (citar-cache--get-bibliography "~/example.bib")))
    (map-values
     (citar-cache--bibliography-preformatted
      (citar-cache--get-bibliography "~/example.bib")))
The "has" prefix really has to be regenerated every time else that data can get out of sync. But it's fast enough it shouldn't matter.
Marginalia is just annotation functions (+ supporting code), which by definition are incremental (the argument for an annotation function is a single candidate). Not sure about Consult; that might be worth taking a closer look at, in conjunction with the But yeah: a custom programmed completion function that incrementally formats the candidates is probably where we need to end up; hence #681. When we have a better handle on this, we might ask minad for some feedback, but I don't want to bother him ATM. |
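To make the "incremental" point concrete, here is a minimal sketch of just-in-time formatting: an annotation function that formats one candidate at a time and memoizes the result. All names and the sample data are hypothetical, and this is not how citar formats candidates; it only illustrates the idea using the standard `completion-extra-properties` mechanism.

```elisp
(defvar my/demo-entries
  '(("doe2020" . (("author" . "Doe, Jane") ("title" . "A made-up paper") ("year" . "2020")))
    ("roe2021" . (("author" . "Roe, Richard") ("title" . "Another one") ("year" . "2021"))))
  "Made-up bibliography data: an alist of (KEY . FIELDS).")

(defvar my/demo-format-cache (make-hash-table :test #'equal)
  "Formatted annotation strings, keyed by citation key.")

(defvar my/demo-format-calls 0
  "Number of times an annotation was actually formatted (for illustration).")

(defun my/demo-annotate (key)
  "Return a display string for KEY, formatting it only on first use."
  (or (gethash key my/demo-format-cache)
      (let ((fields (cdr (assoc key my/demo-entries))))
        (setq my/demo-format-calls (1+ my/demo-format-calls))
        ;; `puthash' returns the stored value, so this is also the result.
        (puthash key
                 (format "  %s (%s) %s"
                         (cdr (assoc "author" fields))
                         (cdr (assoc "year" fields))
                         (cdr (assoc "title" fields)))
                 my/demo-format-cache))))

(defun my/demo-read-key ()
  "Read a citation key; annotations are computed per candidate, on demand."
  (let ((completion-extra-properties
         '(:annotation-function my/demo-annotate)))
    (completing-read "Reference: " (mapcar #'car my/demo-entries))))
```

The cache would need to be cleared (for example with `clrhash`) whenever the underlying data changes. Note that with this approach the formatted text is not part of the candidate string, so it would not be searchable the way citar's preformatted candidates are.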
On consult, I think the "just in time" part of it is in the asynchronous support, which you can see in action with |
I don't think it's suitable. That code is tailored to inferior processes (created with |
@yantar92 - got it. Where's the code you were mentioning earlier with this?
Is that in your custom config, or do you mean with org-ql itself? For anyone wanting to experiment with this (I don't have the time or skill), I'd suggest the first step is making sure you understand the current capabilities of the new citar cache, rather than only the limitations with one very large org file that changes a lot, which has not been our priority. I've tried to list those high-level requirements in #681 under the "requirements" section. To be clear, though, I'm not sure that more comprehensive solution is necessary for a first step. It just probably is for @jdtsmith's use case. PS - one downside of org-bibtex we haven't discussed: org-cite doesn't support it. |
With org-ql itself. alphapapa/org-ql@master...yantar92:org-ql:master |
Maybe the same fast cache+read framework being discussed could be re-used to shim org-bib data into org-cite? |
I'm not sure what you mean -- |
Sorry; meant oc-basic mostly, but also the latex processors.
It's not a big deal in any case.
|
Patches welcome :) |
Thinking a bit more, perhaps the easiest interim path is to address #680 (use the cache, but make it smarter and more configurable), and then do just this (convert the list into a hash-table), but use @andras-simonyi's approach to detecting bib change so it's only regenerated when the bib data changes. Maybe there's a better approach, but this would work, and is simple and clear.
    (defun citar-org-bibtex--collect-bib (el)
      "Return an alist of bib fields for EL."
      (cons (org-bibtex-get org-bibtex-key-property)
            (mapcar
             (lambda (p)
               (let* ((fieldn (cadr (split-string (downcase (symbol-name p)) ":")))
                      (fieldv (org-element-property p el)))
                 (cons fieldn fieldv)))
             '(:BTYPE :AUTHOR :TITLE :YEAR :JOURNAL))))
    (defun citar-org-bibtex--obt-parse-buffer (&optional _filename)
      "Parse org-bibtex buffer and return hash-table."
      (map-into
       (org-element-cache-map #'citar-org-bibtex--collect-bib
                              :granularity 'headline :next-re ":BTYPE:")
       'hash-table))
|
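The "only regenerate when the bib data changes" part could look roughly like the sketch below. It is an assumption on my part, not necessarily the approach referred to above: it compares `buffer-chars-modified-tick` against the tick recorded at the last rebuild. The `my/demo-` names are hypothetical, and it uses plain `org-map-entries` and `org-entry-get` rather than the element cache so that it runs without cache setup.

```elisp
(require 'org)

(defvar-local my/demo-bib-cache nil
  "Cons of (TICK . HASH-TABLE) from the last rebuild in this buffer.")

(defun my/demo-build-bib-table ()
  "Collect headings that have a BTYPE property into a fresh hash table.
Keys are CUSTOM_ID values; values are alists of lower-cased field names."
  (let ((table (make-hash-table :test #'equal)))
    (org-map-entries
     (lambda ()
       (let ((key (org-entry-get nil "CUSTOM_ID")))
         (when (and key (org-entry-get nil "BTYPE"))
           (puthash key
                    (mapcar (lambda (name)
                              (cons (downcase name) (org-entry-get nil name)))
                            '("BTYPE" "AUTHOR" "TITLE" "YEAR" "JOURNAL"))
                    table)))))
    table))

(defun my/demo-bib-table ()
  "Return the bibliography table, rebuilding only if the text changed."
  (let ((tick (buffer-chars-modified-tick)))
    (unless (and my/demo-bib-cache
                 (eql (car my/demo-bib-cache) tick))
      (setq my/demo-bib-cache (cons tick (my/demo-build-bib-table))))
    (cdr my/demo-bib-cache)))
```

The character tick is used rather than `buffer-modified-tick` so that text-property changes alone do not invalidate the table. This still rebuilds everything on any edit, so it does not help the 10k-entry case; it only avoids rebuilding when nothing changed.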
Sounds reasonable. And/or if the on-element-change hook @yantar92 mentions comes to pass, you could directly update the hash table rather than re-build it each time a change occurs in some element(s). I also suspect sanitizing the data for TeX literals could be done together with incremental formatting, just in time at display time. Some UIs/helpers like marginalia already cache annotation results, btw. |
It does raise the question for me again if the parsing part of this should be in parsebib instead (I updated the example code above so that it generates the sort of hash-table that parsebib does), so that we can just do But we'd still need the function mentioned by Roshan to know when to run that, and we may well while we're at it remove the hard-coded Regardless, I started a |
Note |
@yantar92 - thanks. Is it better to use Also, I'm trying to do a PR for parsebib to add the parser, but I get this:
    Debugger entered--Lisp error: (error "Cache must be active.")
    error("Cache must be active.")
The |
|
Org parser only works when org-mode is the major mode. |
So to go back to @jdtsmith's earlier question, for org-bibtex files, it probably makes sense to just open the file(s) if not already open, in a standard buffer. That would be a deviation from how citar and parsebib now work, though for citar, we can just document the difference. |
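A minimal sketch of that "open the file in a standard buffer" idea, assuming nothing about how citar or parsebib would actually wire it in. `my/demo-call-in-org-buffer` is a hypothetical helper; `find-file-noselect` and `derived-mode-p` are standard Emacs functions.

```elisp
(require 'org)

(defun my/demo-call-in-org-buffer (file fn)
  "Call FN with no arguments in a buffer visiting FILE.
Reuses an existing buffer if FILE is already open, and makes sure
`org-mode' is the major mode so the Org parser can be used."
  (with-current-buffer (find-file-noselect file)
    (unless (derived-mode-p 'org-mode)
      (org-mode))
    (funcall fn)))
```

For example, `(my/demo-call-in-org-buffer "~/refs.org" #'org-element-parse-buffer)` would parse the file in a real org-mode buffer (the path is made up). The buffer stays open afterwards, which is the deviation from the temp-buffer approach mentioned above.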
@yantar92 this is only tangentially related, but I was trying to "plump-up" my testbib.org file with some lorem ipsum content. I found this to be absurdly slow on this 10k-record file. Looks like almost all the time is in
(org-map-entries
(lambda ()
(when (< (random 100) 30)
(save-restriction
(org-narrow-to-subtree)
(goto-char (point-max))
(org-insert-subheading 1)
(insert "Abstract\nSed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?\n")
(when (< (random 100) 50)
(org-insert-heading)
(insert "Notes
- At vero eos et accusamus et iusto odio dignissimos ducimus
- qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et
- quas molestias excepturi sint occaecati cupiditate non provident,
- similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.
* Et harum quidem rerum facilis est et expedita distinctio.
* Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus.
- Temporibus autem quibusdam et aut officiis debitis aut rerum
* necessitatibus saepe eveniet ut et
* voluptates repudiandae sint et molestiae non recusandae.
* Itaque earum rerum hic tenetur a sapiente delectus,
* ut aut reiciendis voluptatibus maiores alias consequatur aut
* perferendis doloribus asperiores repellat\n")))))) |
Which Org version did you use? |
Yours: b5a135960 feature/org-fold-universal-core |
Please do not use that one. It diverges from main in some aspects. Either use feature/org-font-lock-element (if you need to play around with cache hooks) or, better, the official main branch. |
* lisp/org-element.el: Do not disable GC. This can make Emacs hang in some particularly bad scenarios. It is better to lose on performance a bit compared to Emacs GC hanging. The edge case is described in emacs-citar/citar#397 (comment)
Is your feature request related to a problem? Please describe.
I've been really enjoying Citar lately, and I'd love to get it to work with my favorite bibliographic format, org-bibtex.
In org-bibtex, instead of maintaining separate files for your notes and your bibliographic metadata, and therefore requiring functions for jumping between the two, you just have everything written in your org file. The heading which contains your notes has a properties drawer that contains elements like author, title, year, and so on. You don't need parsebib or another library to parse it, since it's already in a structured data format:
Instead of jumping between two files all the time, you can edit your notes and your source's metadata all from the same place, and hide it when you don't need it by closing the property drawer.
Best of all, org-bibtex is already included in org by default, and it's already supported by citeproc-el.
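As an illustration of the format described above, here is a minimal sketch that builds one org-bibtex style heading in a temporary buffer and reads its fields back with plain Org functions. The entry itself is made up, and `my/demo-org-bibtex-entries` is a hypothetical name; the property names assume org-bibtex's defaults as I understand them (BTYPE for the entry type, CUSTOM_ID for the citation key).

```elisp
(require 'org)

(defun my/demo-org-bibtex-entries ()
  "Return a list of (KEY . FIELDS) for headings carrying a BTYPE property.
The sample entry is invented for illustration only."
  (with-temp-buffer
    (insert "* A made-up paper\n"
            ":PROPERTIES:\n"
            ":BTYPE: article\n"
            ":CUSTOM_ID: doe2020\n"
            ":AUTHOR: Doe, Jane\n"
            ":TITLE: A made-up paper\n"
            ":YEAR: 2020\n"
            ":JOURNAL: Journal of Examples\n"
            ":END:\n"
            "Notes on the paper live here, under the same heading.\n")
    (org-mode)
    ;; `org-map-entries' returns one result per heading; drop the nils
    ;; produced by headings that are not bibliography entries.
    (delq nil
          (org-map-entries
           (lambda ()
             (when (org-entry-get nil "BTYPE")
               (cons (org-entry-get nil "CUSTOM_ID")
                     (mapcar (lambda (name)
                               (cons (downcase name) (org-entry-get nil name)))
                             '("BTYPE" "AUTHOR" "TITLE" "YEAR" "JOURNAL")))))))))
```

The result is an alist keyed by citation key, which is roughly the shape a bibliographic backend would need, with the notes sitting in the same heading as the metadata.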
Describe the solution you'd like
Add a backend for bibliographic metadata that reads the properties of note files, and, if those properties include the bibliographic type, use that as the metadata source.