Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an API for altering XML documents #10

Open
p-himik opened this issue Apr 7, 2023 · 3 comments
Open

Add an API for altering XML documents #10

p-himik opened this issue Apr 7, 2023 · 3 comments

Comments

@p-himik
Copy link

p-himik commented Apr 7, 2023

Thanks for this library! I got recommended to try it out and was blown away by the performance compared to other solutions that I've tried.

With that being said, being able to also modify XML documents is the only thing that I currently miss. It seems that the underlying library makes it possible via com.ximpleware.XMLModifier.

What do you think about adding such an API? If you're open to the idea but don't want to do it yourself, I might be able to submit a PR once I'm familiar enough with VTD-XML.

@mudge
Copy link
Owner

mudge commented Apr 8, 2023

Hi @p-himik,

I’m glad you found the library useful!

Adding support for the XMLModifier API sounds good if there’s an ergonomic way we could present this in Clojure.

Unfortunately my time is spread very thin these days but if you’d like to try contributing something, I’d be happy to review.

@p-himik
Copy link
Author

p-himik commented Apr 28, 2023

Turns out it's quite a bit trickier than I'd like:

  • XMLModifier doesn't expose its navigator, so I had to create a wrapper that would store both a navigator and a modifier
  • There seems to be at least one bug in VTDNav where it incorrectly determines that an offset points to an empty element like <a/>
  • XMLModifier throws if you insert multiple times into the same offset for some reason. Had to work around that by creating an intermediate insert buffer

Given that, I'm definite not keen on creating a generic API. Below is something that I came up with for my current needs, in case someone else finds it useful (note also that it has a get-direct-text function which I find to be quite handy):

(ns ...
  (:require [riveted.core :as vtd])
  (:import (com.ximpleware VTDNav XMLModifier)
           (riveted.core Navigator)))

(set! *warn-on-reflection* true)

(defn serialize-doc-to-string [doc]
  (vtd/fragment (vtd/parent doc)))

(defn element-fragment-offset+len [^VTDNav nav]
  (let [r (.getElementFragment nav)]
    (when (not= r -1)
      [(bit-and r 16rFFFFFFFF) (bit-shift-right r 32)])))

(let [m (doto (.getDeclaredMethod VTDNav "getCharUnit" (into-array Class [Integer/TYPE]))
          (.setAccessible true))]
  (defn -get-char-unit [^VTDNav nav ^long idx]
    (.invoke m nav (object-array [(int idx)]))))

(defn offset-inside-empty-element?
  "Returns true if the offset points to the character
  near the end of an empty element.
      <some-tag/>
              ^- here
  Counter-intuitive, but that's what VTD-XML returns
  when asked for e.g. `.getOffsetBeforeTail`.
  Also deals with an apparent bug where VTD-XML means to
  flag such an offset but actually doesn't when there's anything
  in between the end of the empty element and the start of the next element,
  e.g. as in
      <some-tag/><!-- comment --><some-other-tag>..."
  [offset ^VTDNav nav]
  (or (= 16rFFFFFFFF (bit-shift-right offset 32))
      (and (= (-get-char-unit nav (inc offset)) (long \/))
           (= (-get-char-unit nav (+ offset 2)) (long \>)))))

(defn element-fragment [^Navigator nav]
  (when-let [^VTDNav nav (.-nav nav)]
    (if-let [[offset len] (element-fragment-offset+len nav)]
      (.toString nav offset len)
      "")))

(defn get-direct-text [node]
  ;; Not using `vtd/text` here because it does a lot of work to retrieve
  ;; all text contained within an element, whereas we only need the first
  ;; and only text node in the element.
  ;; It also uses `.toNormalizedString` but we don't need any normalization.
  (when node
    (let [^VTDNav nav (.-nav ^Navigator node)
          idx (.getText nav)]
      (when (pos? idx)
        #_(.toString nav idx)
        (.toNormalizedString nav idx)))))

(defn ->bytes ^bytes [data]
  (if (string? data)
    (.getBytes ^String data)
    data))

(defrecord Modifier [^XMLModifier mod ^VTDNav nav
                     insertion_buffer])

(defn create-modifier [^Navigator nav]
  (let [nav (.-nav nav)]
    (Modifier. (XMLModifier. nav) nav (atom {:offset->data {}
                                             :offset->tag  {}}))))

(let [vec-conj (fnil conj [])]
  (defn add-insert! [^Modifier mod ^Navigator nav offset data]
    (swap! (.-insertion_buffer mod)
           (fn [b]
             (let [^VTDNav nav (.-nav nav)]
               (cond-> (update-in b [:offset->data offset] vec-conj (->bytes data))
                 (offset-inside-empty-element? offset nav)
                 (update-in [:offset->tag offset]
                            (fn [t]
                              (let [idx (.getCurrentIndex nav)
                                    new-t (.toString nav idx)]
                                (when (and (some? t) (not= t new-t))
                                  (throw (ex-info "The same offset points at elements with different tags"
                                                  {:offset offset, :idx idx, :old-tag t, :new-tag new-t})))
                                new-t)))))))
    nil))

(defn update-offset-or-len-for-encoding
  "Has to be used with all values that are derived from `.getTokenOffset`."
  [x encoding]
  (cond-> x
    (>= encoding VTDNav/FORMAT_UTF_16BE)
    (bit-shift-left 1)))

(defn modifier->navigator! [^Modifier mod]
  (let [^XMLModifier xml-mod (.-mod mod)
        ^VTDNav nav (.-nav mod)
        encoding (.getEncoding nav)
        char-width (update-offset-or-len-for-encoding 1 encoding)
        {:keys [offset->data offset->tag]} @(.-insertion_buffer mod)]
    (doseq [[offset data] offset->data]
      (let [empty-el? (offset-inside-empty-element? offset nav)
            ^bytes tail (when empty-el?
                          (let [tag (offset->tag offset)]
                            (assert tag (pr-str offset->tag))
                            (.getBytes (str "</" tag ">"))))
            data-length (reduce (fn [l ^bytes d]
                                  (+ l (alength d)))
                                0 data)
            total-length (cond-> data-length empty-el? (+ (alength tail)))
            all-data (byte-array total-length)]
        (reduce (fn [offset ^bytes d]
                  (let [n (alength d)]
                    (System/arraycopy d 0 all-data offset n)
                    (+ offset n)))
                0 data)
        (if empty-el?
          (let [slash-offset (+ offset char-width)]
            (System/arraycopy tail 0 all-data data-length (alength tail))
            (.removeContent xml-mod slash-offset char-width)
            (.insertBytesAt xml-mod (+ offset (* 3 char-width)) all-data))
          (.insertBytesAt xml-mod offset all-data))))
    (Navigator. (.outputAndReparse xml-mod))))

(defn replace-element! [^Modifier mod ^Navigator nav data]
  (let [^VTDNav vtd-nav (.-nav nav)
        fr (.getElementFragment vtd-nav)
        offset (unchecked-int fr)
        len (unsigned-bit-shift-right fr 32)
        ^XMLModifier mod (.-mod mod)]
    (.removeContent mod offset len)
    (when data
      (.insertBytesAt mod offset (->bytes data)))))

(defn replace-content! [^Modifier mod ^Navigator nav data]
  (let [^VTDNav vtd-nav (.-nav nav)
        fr (.getContentFragment vtd-nav)
        offset (unchecked-int fr)
        len (unsigned-bit-shift-right fr 32)
        ^XMLModifier mod (.-mod mod)]
    (.removeContent mod offset len)
    (when data
      (.insertBytesAt mod offset (->bytes data)))))

(defn insert-after-element! [^Modifier mod ^Navigator nav data]
  (let [[offset len] (element-fragment-offset+len (.-nav nav))]
    (add-insert! mod nav (+ offset len) data)))

(defn insert-before-element! [^Modifier mod ^Navigator nav data]
  (let [^VTDNav mod-nav (.-nav nav)
        start-tag-index (.getCurrentIndex mod-nav)
        offset (-> (dec (.getTokenOffset mod-nav start-tag-index))
                   (update-offset-or-len-for-encoding (.getEncoding mod-nav)))]
    (add-insert! mod nav offset data)))

(defn insert-after-head! [^Modifier mod ^Navigator nav data]
  (let [offset (.getOffsetAfterHead ^VTDNav (.-nav nav))]
    (when (neg? offset)
      (throw (ex-info "Inserting content into empty elements is not supported" {:nav nav})))
    (add-insert! mod nav offset data)))

(let [m (doto (.getDeclaredMethod VTDNav "getOffsetBeforeTail" (into-array Class []))
          (.setAccessible true))
      a (into-array [])]
  ;; No clue why `VTDNav.getOffsetBeforeTail` is protected
  ;; when `VTDNav.getOffsetAfterHead` isn't.
  (defn -get-offset-before-tail [^VTDNav nav]
    (.invoke m nav a)))

(defn insert-before-tail! [^Modifier mod ^Navigator nav data]
  (let [offset (-get-offset-before-tail (.-nav nav))]
    (if (= offset -1)
      (insert-after-head! mod nav data)
      (add-insert! mod nav offset data))))

@mudge
Copy link
Owner

mudge commented Apr 28, 2023

Thanks for sharing that, @p-himik.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants