-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an API for altering XML documents #10
Comments
Hi @p-himik, I’m glad you found the library useful! Adding support for the XMLModifier API sounds good if there’s an ergonomic way we could present this in Clojure. Unfortunately my time is spread very thin these days but if you’d like to try contributing something, I’d be happy to review. |
Turns out it's quite a bit trickier than I'd like:
Given that, I'm definite not keen on creating a generic API. Below is something that I came up with for my current needs, in case someone else finds it useful (note also that it has a (ns ...
(:require [riveted.core :as vtd])
(:import (com.ximpleware VTDNav XMLModifier)
(riveted.core Navigator)))
(set! *warn-on-reflection* true)
(defn serialize-doc-to-string [doc]
(vtd/fragment (vtd/parent doc)))
(defn element-fragment-offset+len [^VTDNav nav]
(let [r (.getElementFragment nav)]
(when (not= r -1)
[(bit-and r 16rFFFFFFFF) (bit-shift-right r 32)])))
(let [m (doto (.getDeclaredMethod VTDNav "getCharUnit" (into-array Class [Integer/TYPE]))
(.setAccessible true))]
(defn -get-char-unit [^VTDNav nav ^long idx]
(.invoke m nav (object-array [(int idx)]))))
(defn offset-inside-empty-element?
"Returns true if the offset points to the character
near the end of an empty element.
<some-tag/>
^- here
Counter-intuitive, but that's what VTD-XML returns
when asked for e.g. `.getOffsetBeforeTail`.
Also deals with an apparent bug where VTD-XML means to
flag such an offset but actually doesn't when there's anything
in between the end of the empty element and the start of the next element,
e.g. as in
<some-tag/><!-- comment --><some-other-tag>..."
[offset ^VTDNav nav]
(or (= 16rFFFFFFFF (bit-shift-right offset 32))
(and (= (-get-char-unit nav (inc offset)) (long \/))
(= (-get-char-unit nav (+ offset 2)) (long \>)))))
(defn element-fragment [^Navigator nav]
(when-let [^VTDNav nav (.-nav nav)]
(if-let [[offset len] (element-fragment-offset+len nav)]
(.toString nav offset len)
"")))
(defn get-direct-text [node]
;; Not using `vtd/text` here because it does a lot of work to retrieve
;; all text contained within an element, whereas we only need the first
;; and only text node in the element.
;; It also uses `.toNormalizedString` but we don't need any normalization.
(when node
(let [^VTDNav nav (.-nav ^Navigator node)
idx (.getText nav)]
(when (pos? idx)
#_(.toString nav idx)
(.toNormalizedString nav idx)))))
(defn ->bytes ^bytes [data]
(if (string? data)
(.getBytes ^String data)
data))
(defrecord Modifier [^XMLModifier mod ^VTDNav nav
insertion_buffer])
(defn create-modifier [^Navigator nav]
(let [nav (.-nav nav)]
(Modifier. (XMLModifier. nav) nav (atom {:offset->data {}
:offset->tag {}}))))
(let [vec-conj (fnil conj [])]
(defn add-insert! [^Modifier mod ^Navigator nav offset data]
(swap! (.-insertion_buffer mod)
(fn [b]
(let [^VTDNav nav (.-nav nav)]
(cond-> (update-in b [:offset->data offset] vec-conj (->bytes data))
(offset-inside-empty-element? offset nav)
(update-in [:offset->tag offset]
(fn [t]
(let [idx (.getCurrentIndex nav)
new-t (.toString nav idx)]
(when (and (some? t) (not= t new-t))
(throw (ex-info "The same offset points at elements with different tags"
{:offset offset, :idx idx, :old-tag t, :new-tag new-t})))
new-t)))))))
nil))
(defn update-offset-or-len-for-encoding
"Has to be used with all values that are derived from `.getTokenOffset`."
[x encoding]
(cond-> x
(>= encoding VTDNav/FORMAT_UTF_16BE)
(bit-shift-left 1)))
(defn modifier->navigator! [^Modifier mod]
(let [^XMLModifier xml-mod (.-mod mod)
^VTDNav nav (.-nav mod)
encoding (.getEncoding nav)
char-width (update-offset-or-len-for-encoding 1 encoding)
{:keys [offset->data offset->tag]} @(.-insertion_buffer mod)]
(doseq [[offset data] offset->data]
(let [empty-el? (offset-inside-empty-element? offset nav)
^bytes tail (when empty-el?
(let [tag (offset->tag offset)]
(assert tag (pr-str offset->tag))
(.getBytes (str "</" tag ">"))))
data-length (reduce (fn [l ^bytes d]
(+ l (alength d)))
0 data)
total-length (cond-> data-length empty-el? (+ (alength tail)))
all-data (byte-array total-length)]
(reduce (fn [offset ^bytes d]
(let [n (alength d)]
(System/arraycopy d 0 all-data offset n)
(+ offset n)))
0 data)
(if empty-el?
(let [slash-offset (+ offset char-width)]
(System/arraycopy tail 0 all-data data-length (alength tail))
(.removeContent xml-mod slash-offset char-width)
(.insertBytesAt xml-mod (+ offset (* 3 char-width)) all-data))
(.insertBytesAt xml-mod offset all-data))))
(Navigator. (.outputAndReparse xml-mod))))
(defn replace-element! [^Modifier mod ^Navigator nav data]
(let [^VTDNav vtd-nav (.-nav nav)
fr (.getElementFragment vtd-nav)
offset (unchecked-int fr)
len (unsigned-bit-shift-right fr 32)
^XMLModifier mod (.-mod mod)]
(.removeContent mod offset len)
(when data
(.insertBytesAt mod offset (->bytes data)))))
(defn replace-content! [^Modifier mod ^Navigator nav data]
(let [^VTDNav vtd-nav (.-nav nav)
fr (.getContentFragment vtd-nav)
offset (unchecked-int fr)
len (unsigned-bit-shift-right fr 32)
^XMLModifier mod (.-mod mod)]
(.removeContent mod offset len)
(when data
(.insertBytesAt mod offset (->bytes data)))))
(defn insert-after-element! [^Modifier mod ^Navigator nav data]
(let [[offset len] (element-fragment-offset+len (.-nav nav))]
(add-insert! mod nav (+ offset len) data)))
(defn insert-before-element! [^Modifier mod ^Navigator nav data]
(let [^VTDNav mod-nav (.-nav nav)
start-tag-index (.getCurrentIndex mod-nav)
offset (-> (dec (.getTokenOffset mod-nav start-tag-index))
(update-offset-or-len-for-encoding (.getEncoding mod-nav)))]
(add-insert! mod nav offset data)))
(defn insert-after-head! [^Modifier mod ^Navigator nav data]
(let [offset (.getOffsetAfterHead ^VTDNav (.-nav nav))]
(when (neg? offset)
(throw (ex-info "Inserting content into empty elements is not supported" {:nav nav})))
(add-insert! mod nav offset data)))
(let [m (doto (.getDeclaredMethod VTDNav "getOffsetBeforeTail" (into-array Class []))
(.setAccessible true))
a (into-array [])]
;; No clue why `VTDNav.getOffsetBeforeTail` is protected
;; when `VTDNav.getOffsetAfterHead` isn't.
(defn -get-offset-before-tail [^VTDNav nav]
(.invoke m nav a)))
(defn insert-before-tail! [^Modifier mod ^Navigator nav data]
(let [offset (-get-offset-before-tail (.-nav nav))]
(if (= offset -1)
(insert-after-head! mod nav data)
(add-insert! mod nav offset data)))) |
Thanks for sharing that, @p-himik. |
Thanks for this library! I got recommended to try it out and was blown away by the performance compared to other solutions that I've tried.
With that being said, being able to also modify XML documents is the only thing that I currently miss. It seems that the underlying library makes it possible via
com.ximpleware.XMLModifier
.What do you think about adding such an API? If you're open to the idea but don't want to do it yourself, I might be able to submit a PR once I'm familiar enough with VTD-XML.
The text was updated successfully, but these errors were encountered: