Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] Regular expressions #498

Open
wants to merge 34 commits into
base: master
Choose a base branch
from
Open

[wip] Regular expressions #498

wants to merge 34 commits into from

Conversation

mks-m
Copy link
Contributor

@mks-m mks-m commented Feb 18, 2016

first of all, this is not merge-ready. this PR introduces regexp literal and builds cre2 C library that wraps google's re2. to get all of it working you need:

  • build pixie-vm normally
  • make re2_cre2 - this will build re2 and cre2 wrapper in externals/
  • ln -s 'pwd'/externals/cre2/build/.libs/libcre2* /usr/local/lib/ - i have no idea what i'm doing
  • pixie-vm -c pixie/regex.pxi - this might require something like export DYLD_LIBRARY_PATH='pwd'/externals/cre2/build/.libs, see previous point

after this is done, fire up your repl and run (ns user (:require [pixie.regex :refer [regexp match]]))

this PR introduces regexp literal like in clojure:

#"123" - is translated in RegexReader into (regexp "123" #{})
#"123"iL - is translated into(regexp "123" #{:ignore_case :literal})

most options available in re2 are supported

user => (match #"123" "123")
1
user => (match #"123" "124")
0

this is how far i got and i will certainly need help from somebody more knowledgable in building and using C libraries.

@mpenet
Copy link
Contributor

mpenet commented Feb 18, 2016

Interesting work!

One thing that (I believe) is problematic with this approach is the "freeing" of resources. You are never calling cre2_delete or cre2_opt_delete, you leave that to the user (which is fine, there's no other way when doing ffi for now). http://marcomaggi.github.io/docs/cre2.html#matching

It seems that one idiomatic way of handling this in pixie would be to wrap regex (and options) instances in a type and implement IDisposable (a bit like here:

pixie/pixie/io/tcp.pxi

Lines 9 to 14 in 189395f

(defrecord TCPServer [ip port on-connect uv-server bind-addr on-connection-cb]
IDisposable
(-dispose! [this]
(uv/uv_close uv-server st/close_cb)
(dispose! @on-connection-cb)
(dispose! bind-addr)))
).

The user still has to release resources manually but at least it's a bit more explicit/obvious with this protocol, and I guess down the road pixie could/might handle this internally once no ref to an instance remains (this is probably the idea behind IDisposable).

@mpenet
Copy link
Contributor

mpenet commented Feb 18, 2016

One could also imagine we'd have a generic regex protocol, that cre2 would implement with that type I mention. That would unify a bit the interface for if/when people make other implementations (pcre2 etc).

@mks-m
Copy link
Contributor Author

mks-m commented Feb 18, 2016

@mpenet you're absolutely right about freeing resources, it's one large problem i was hoping somebody would point me to a decent solution (which you did, thanks!). i imagine it should also be possible to somehow hook into GC and tell which function to use to free certain objects?

protocol is a good idea as well, i wasn't focusing much on how to do everything properly. one problem with different regexp engines is how to deal with "late binding" of pixie.regex/regexp function that implements literal conversion into native object? if somebody refers to re2 and passes literal into namespace that refers pcre - that obviously won't work.

@mpenet
Copy link
Contributor

mpenet commented Feb 18, 2016

Right, I didn't think about the literal, when you expand #"" you can just translate into a fn call that does the right thing, which would use the default active re engine (that can be set in a global var). and have that re-pattern fn have 2 signature with one that allows to specify another engine.

Something like this (dont mind the awful naming):

(def ^:dynamic *default-re-engine* :cre2)

;; an "open" engine registry
(defmulti re-engine (fn [k s] k))

;; add cre2 to registry
(defmethod re-engine :cre2 [_ s] 
  (new cre2-re-engine s))

;; dispatch on the right engine constructor via registry
(defn re-pattern 
   ([s] (re-pattern s *default-re-engine*))
   ([s kw] (re-engine kw s)))

(defprotocol IRegex 
  ... 
  (re-find [r s])
  (re-seq [r s]))


(deftype cre2-re-engine [...] 
IRegex 
... 

IDisposable 
... ))

Just thinking out loud, there might be better approaches, but that way you can change impl using binding, a redef of the default-re-engine or simply by passing the engine kw to re-pattern manually)

@halgari
Copy link
Member

halgari commented Feb 18, 2016

You're right IDispose doesn't hook into the GC, but IFinalize does (with
the associated -finalize! function). With a caveat: the GC can't look
outside of its own heap, so if you enable -finalize! on and that object is
stored outside the semantic bounds of a function call, then you may get a
segfault. Pixie makes sure to to never GC arguments to a FFI function while
that function is being called, but if a argument is stashed away somewhere
on the C heap, the GC may try to free it after the call is finished.

All that to say, I think IFinalize should work just fine for regexes.

On Thu, Feb 18, 2016 at 5:43 AM, Max Penet [email protected] wrote:

Right, I didn't think about the literal, when you expand #"" you can just
translate into a fn call that does the right thing, which would use the
default active re engine (that can be set in a global var). and have that
re-pattern fn have 2 signature with one that allows to specify another
engine.

Something like this (dont mind the awful naming):

(def default-re-engine :cre2)
(set-dynamic! default-re-engine)
;; an "open" engine registry
(defmulti re-engine (fn [k opts])
;; add cre2 to registry
(defmethod re-engine :cre2 [_ opts](new cre2-re-engine opts))
;; dispatch on the right engine constructor via registry
(defn re-pattern
([s](re-pattern s default-re-engine)
([s kw](init-re-engine kw))

(defprotocol IRegex
...
(re-find [r s])
(re-seq [r s]))

(deftype cre2-re-engine [...]
IRegex
...

IDisposable
... ))

Just thinking out loud, there might be better approaches.


Reply to this email directly or view it on GitHub
#498 (comment).

“One of the main causes of the fall of the Roman Empire was that–lacking
zero–they had no way to indicate successful termination of their C
programs.”
(Robert Firth)

@mpenet
Copy link
Contributor

mpenet commented Feb 18, 2016

@halgari awesome, I didn't know about IFinalize, thanks for the tip!

@mks-m
Copy link
Contributor Author

mks-m commented Feb 19, 2016

@mpenet @halgari thanks for insights, i've added deftype and defprotocol and multimethod dispatch to current (only) engine.

i've also changed literal expansion into (pixie.re/regex ...) - which means to have literals working one needs to always refer to pixie.re namespace. what do you think is best way to address this? maybe (in-ns pixie.stdlib) just before (defn regex ...) so that regex is always in stdlib?

i think expected functionality is for literals to be available in any part of codebase but not sure how to achieve that properly.

@mks-m mks-m force-pushed the re2 branch 3 times, most recently from 927c66f to ecd68b9 Compare February 19, 2016 10:49
@mks-m
Copy link
Contributor Author

mks-m commented Feb 25, 2016

bit stuck with alloc'ing pointer to / array of cre2_string_t structs that will contain capture groups

example usage in cre2 library: https://github.com/marcomaggi/cre2/blob/master/src/cre2.cpp#L222

how do i do that in pixie?

right now i'm just doing this: 4c61558#diff-a532eb022eaed3bf076b7d519289551aR63

@thomasmulvaney
Copy link
Member

I'm not sure if there is a good way for allocating a pointer to an array in pixie at the moment. I don't think there is a nice way of creating an array for a given struct/type either. Perhaps we could start a list of FFI things to work on as a github issue?

@mks-m
Copy link
Contributor Author

mks-m commented Mar 15, 2016

@thomasmulvaney i'm afraid my ffi-fu is not strong enough to properly describe what are we missing using the right language but i would like to put more effort into this pr if it has any chance of being merged. will appreciate any help with ffi things even if it is just issue description and maybe some links to get me started.

@thomasmulvaney
Copy link
Member

@keymone I pulled down this branch last night. Nice work!

To extract the strings from the matches we need an ffi function which when given a CharP and a length returns a pixie string which is the result of reading from the pointer to length.

I also noticed that ffi-infer thinks the :data part of the cre2_string_t is a VoidP rather than a CharP so we should add a function to ffi-infer that lets one cast between pointer types.

I have had a look at implementing these 2 functions.

@thomasmulvaney
Copy link
Member

I've made some progress with handling charps: master...thomasmulvaney:charp

Currently this seems to only work when run interactively. When compiled, charp->str crashes. I'm guessing because whatever part of memory the pointer was addressing is no longer a char. I'm not sure if I'll have much time to look into this further until next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants