Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashconsed maps nodes #1

Merged
merged 33 commits into from
May 15, 2024
Merged

Hashconsed maps nodes #1

merged 33 commits into from
May 15, 2024

Conversation

dlesbre
Copy link
Collaborator

@dlesbre dlesbre commented Apr 25, 2024

Needs:

  • tests
  • doc (another example perhaps?).

@dlesbre dlesbre marked this pull request as draft April 25, 2024 17:14
patriciaTree.ml Outdated
| NBranch{prefix=prefixa;branching_bit=branching_bita;tree0=tree0a;tree1=tree1a;_},
NBranch{prefix=prefixb;branching_bit=branching_bitb;tree0=tree0b;tree1=tree1b;_} ->
prefixa == prefixb && branching_bita == branching_bitb &&
get_id tree0a = get_id tree0b && get_id tree1a = get_id tree1b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use == instead of get_id here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not, because these values don't have the same type... This is also why I need an Obj.magic in the leaf case.

However, the more I think about it, the more I believe we should get rid of arbitrary value maps for hash-consing. It makes little sense to hashcons maps from 'a key to ('a, 'b) value together with map from 'a key to ('a, 'c) value. So I'll probably change it to use a 'a value type instead of the more generic ('a, 'b) value type.

Otherwise we might have problems with different types having equal values.

@dlesbre
Copy link
Collaborator Author

dlesbre commented Apr 26, 2024

So I've commited to the single value type for hash-consed version, this leads to a few interface changes from the previous version:

  • Renamed MakeCustom to MakeCustomMap, added new functor MakeCustomSet.
  • Renamed MakeCustomHeterogeneous to MakeCustomHeterogeneousMap, added new functor
    MakeCustomHeterogeneousSet.
  • add a second MAP interface, MAP_WITH_VALUE which changes 'a to 'a value,
  • MakeCustomMap changed to take a new argument to specify the 'a value type.
  • Added HashconsedNode, HashconsedSetNode as well as four functors to create
    hash-consed heterogeneous/homogeneous maps/sets: MakeHashconsedMap, MakeHashconsedSet,
    MakeHashconsedHeterogeneousMap and MakeHashconsedHeterogeneousSet.

I've also added test:

  • Turn the main test module into a functor to reuse with multiple map types
  • Add a specific test for hash consing, checks that physical equality is preserved no matter the order in which elements are added.

@dlesbre dlesbre marked this pull request as ready for review April 26, 2024 11:41
@dlesbre dlesbre changed the title Draft: hashconsed maps nodes Hashconsed maps nodes Apr 26, 2024
@mlemerre
Copy link
Contributor

I don't think we needed to change the existing interface. The following code should work on 0.9.0:

module HashConsedNode(Key:sig type 'a t end)(Value:VALUE):NODE_WITH_ID
  with type 'key key = 'key Key.t
   and type ('key,'map) value = ('key,'map) Value.t
= struct

  type 'a key = 'a Key.t
  type ('key,'map) value = ('key,'map) Value.t

  type 'map view =
    | Empty: 'map view
    | Branch: {prefix:intkey;branching_bit:mask;tree0:'map t;tree1:'map t} -> 'map view
    | Leaf: {key:'key key; value:('key,'map) value} -> 'map view
  and 'map t =
    | NEmpty: 'map t
    | NBranch: {prefix:intkey;branching_bit:mask;tree0:'map t;tree1:'map t;id:int} -> 'map t
    | NLeaf: {key:'key key;value:('key,'map) value;id:int} -> 'map t

  let view = function
    | NEmpty -> Empty
    | NBranch{prefix;branching_bit;tree0;tree1;_} -> Branch{prefix;branching_bit;tree0;tree1}
    | NLeaf{key;value;_} -> Leaf{key;value}

  let get_id = function
    | NEmpty -> 0
    | NBranch{id;_} -> id
    | NLeaf{id;_} -> id

  type any_map = Any_map: 'map t -> any_map [@@unboxed]
  (* We can't get rid of Obj.magic because of the float hack, which prevents unboxing here.  *)
  (* type any_key = Any_key: 'a Key.t -> any_key [@@unboxed] *)
  (* type any_value = Any_value: ('key,'map) Value.t -> any_value [@@unboxed]     *)

  module WeakHash = Weak.Make(struct
      type t = any_map
      let sdbm x y = y + (x lsl 16) + (x lsl 6) - x;;        
      let hash (Any_map x) = match x with
        | NEmpty -> 0
        | NLeaf{key=_;value=_;id=_} -> assert false (* XXX: We have
                                                       the int for the
                                                       key, but we may
                                                       want a hashing
                                                       function for
                                                       the value
                                                       too.  *)
        | NBranch{prefix;branching_bit;tree0;tree1;id=_} ->
          let hash = (get_id tree0) in
          let hash = sdbm (get_id tree1) hash in
          let hash = sdbm prefix hash in
          let hash = sdbm branching_bit hash in
          hash
      ;;
          
      let equal (Any_map a) (Any_map b) = match a,b with
        | NEmpty,NEmpty -> true
        | NLeaf{key=keya;value=valuea;id=_},NLeaf{key=keyb;value=valueb;id=_} ->
          (Obj.magic keya) == keyb && (Obj.magic valuea) == valueb
        | NBranch{prefix=pa;branching_bit=bba;tree0=tree0a;tree1=tree1a;id=_},
          NBranch{prefix=pb;branching_bit=bbb;tree0=tree0b;tree1=tree1b;id=_} ->
          pa == pb && bba == bbb && (Any_map tree0a) == (Any_map tree0b) && (Any_map tree1a) == (Any_map tree1b)
        | _ -> false
    end)

  let count = ref 1;;
  let hashtbl = WeakHash.create 17;;
  let empty = NEmpty
  let is_empty x = x == NEmpty
  let leaf (type k m) (key: k key) (value: (k,m) value): m t =
    let tentative = NLeaf{key;value;id= !count} in
    let any_tentative = Any_map tentative in
    let res = WeakHash.merge hashtbl any_tentative in
    if(any_tentative == res) then incr count; 
    let Any_map x = res in
      (* Obj.magic is fine, as the returned object has the same
         content than the tentative one. *)          
    let x:m t = Obj.magic x in
    x
  ;;

  let branch: type m. prefix:int -> branching_bit:int -> tree0:m t ->tree1:m t-> m t =
    fun ~prefix ~branching_bit ~tree0 ~tree1 ->
    match tree0,tree1 with
    | NEmpty, x -> x
    | x, NEmpty -> x
    | _ ->
      let tentative = NBranch{prefix;branching_bit;tree0;tree1;id=(!count)} in
      let any_tentative = Any_map tentative in
      let res = WeakHash.merge hashtbl any_tentative in
      if(any_tentative == res) then incr count; 
      let Any_map x = res in
      (* Obj.magic is fine, as the returned object has the same
         content than the tentative one. *)                
      let x:m t = Obj.magic x in
      x
end```

Some Obj.magic seem unavoidable, but they are all correct. We can further improve on this code by using 2 Hashtbl (and using GADT to distinguish the leaf and branch cases, to avoid pattern matching).

@mlemerre mlemerre marked this pull request as draft April 27, 2024 02:20
@dlesbre
Copy link
Collaborator Author

dlesbre commented Apr 27, 2024

No, any Obj.magic on values is unsafe, since different types can have equal representations:

# 97 == Obj.magic 'a';;
- : bool = true

@mlemerre
Copy link
Contributor

We need to clarify what safe means here.

let f a b = (Obj.magic a) == b is safe because:

  1. f cannot go wrong on any input (the untyped version has a well-defined behaviour)
  2. We cannot write any function g that would use f, which would type-check, but whose execution go wrong.

An exemple of a f function that does not have 1. but has the problem described in 2. is f: int -> int * int = fun x -> Obj.magic x.

The f function above does not have any of these problems, so is safe. It is similar to what you would do in Rust: create unsafe code that can be safely be called in code safe for the Rust type system.

Another way to be convinced that this code is safe to to consider the following fragment:

type any = Any: 'a -> any  [@@unboxed]
let f a b = (Any a) == (Any b)

The only reason why this code does not work is because of the float array hack; if it were removed, it should then work, and this code is equivalent to the f function above.

It is true that values of different types can have equal representation; which is why we want to hashcons values of different types that have the same memory representation.

@dlesbre dlesbre added the enhancement New feature or request label Apr 30, 2024
@dlesbre
Copy link
Collaborator Author

dlesbre commented Apr 30, 2024

Ok I see your point.

I guess one final objection I'd have is that this can result in trees of different types (but same representation, like leaf k 97 and leaf k 'a') being given the same identifier, which might be a problem as users will reasonably assume that identifiers are unique, when they are only unique within a given type.

On the other hand, the only things we really lose with the new interface are:

  • you need to instantiate a functor per map key-value pair in the hashconsed case
  • functions like map_no_share, mapi_no_share and slow_merge which are now restricted to the same value type on output (Although for the first two, you can use WithForeign.filter_map_no_share as a replacement, so slow_merge is the only one who can't be replicated).

So sure, it is a bit less user friendly, but I'll argue it also avoids confusion (different functor calls make it explicit the ids from different value types may clash).

@mlemerre
Copy link
Contributor

I guess one final objection I'd have is that this can result in trees of different types (but same representation, like leaf k 97 and leaf k 'a') being given the same identifier, which might be a problem as users will reasonably assume that identifiers are unique, when they are only unique within a given type.

We explicitly require different keys in the map to be mapped to different integers. So the problem would be a user that succeeds in putting in the map two keys with the same representation but different types. I don't think it is possible to write polyeq in this case. Well, I have trouble finding an example that would be problematic.

On the other hand, the only things we really lose with the new interface are:

* you need to instantiate a functor per map key-value pair in the hashconsed case

* functions like `map_no_share`, `mapi_no_share` and `slow_merge` which are now restricted to the same value type on output (Although for the first two, you can use `WithForeign.filter_map_no_share` as a replacement, so `slow_merge` is the only one who can't be replicated).

So sure, it is a bit less user friendly, but I'll argue it also avoids confusion (different functor calls make it explicit the ids from different value types may clash).

I don't see the confusion (I don't say there is none), so all I can see for now is the more complicated interface. If there is a possible legitimate confusion (or worse, a type safety bug), and if the change of interface is the best way to avoid it, then changing the interface could be a good idea (I don't think the change is a good idea is an improvement if there are no confusion). So, I propose that you write a problematic example first, and then we can see if avoiding this problem requires an interface change.

@dlesbre
Copy link
Collaborator Author

dlesbre commented May 2, 2024

My example shows the problem happening with using the same key k twice in two different maps. For a quick example where this can lead to surprising behavior:

module Key = HeterogeneousKeyFromKey(struct
  type t = int
  let to_int x = x
end)
module Node = HashconsedNode(Key)
module HMap = MakeCustomMap(Key)(Node)

let m1 = HMap.singleton 5 97 (* int HMap.t *)
let m2 = HMap.singleton 5 'a' 
(* char HMap.t, but with same id and physically equal to m1 *)

Using this, you can easily arrive at surprising behaviors:

module MapOfMaps = MakeMap(struct
  type t = Any : 'a HMap.t -> t
  let to_int (Any x) = Node.get_id x
end)

let m3 = MapOfMaps.of_list [ (m1, "foo"); (m2, "bar") ]
(* m3 has cardinal 1, the m1->foo binding has been overwritten.
   Users might reasonably assume there shouldn't be a conflict here *)

Furthermore, if users assume id's are unique, it becomes very easy to write unsafe code which looks safe:

type _ value =
  | OfInt : string -> int value
  | OfChar : unit -> char value

module HetMapOfMaps = MakeHeterogeneousMap(struct
  type 'a t = 'a HMap.t
  let to_int = Node.get_id
  let polyeq: type a b. a t -> b t -> (a, b) cmp = fun a b ->
    if to_int a = to_int b 
    then Obj.magic Eq (* This looks safe, if one assumes id are unique *)
    else Diff
end)(struct type ('a, 'b) t = 'a value end)

let m4 = HetMapOfMaps.of_list [ (m2, OfChar ()); (m1; OfInt "hello") ]
(* Same as m3, m4 only has a single binding *)
let error = HetMapOfMaps.get m2 m4
(* Typechecker thinks this is an [char value], with unit as only argument,
   it is in fact a [int value] with "hello" as only argument *)

let not_unit = match error with OfChar x -> x
(* This should be unit, but it isn't. The match will either fail
   or be optimized away (which allows creating a value of type unit which isn't ()... *)

This is quite a subtle issue. It is admittedly very rare (you need trees with same keys and physically equal values for it to crop up), but that also makes it very hard to debug when it does appear. I've used these sorts of unique id's into obj.magic for poly equality before, since it is generally a safe pattern. We can write documentation explaining the issue, but we can never be sure users have seen that warning or will remember it when the time counts. Another option is to rename get_id into something that suggest the id is only unique for a given type (get_type_unique_id? get_id_for_type?).

This cannot happen if we split Hashconsed maps by value as in the new interface (eg. require knowing the value type when instantiating a Hashconsed functor), since there is no single get_id function, there is one per value.

Also I should mention in case it isn't clear: the new interface only impacts the hashconsing functors, all other functors (non-hashconsed maps and sets) work just like before and don't exhibit any breaking changes.

@dlesbre
Copy link
Collaborator Author

dlesbre commented May 3, 2024

I've thought of another advantage of requiring explicit value types for hash-consed functors: it enables custom hash and equality functions on values.

In the fully generic version, the only hash we can use on values is Hashtbl.hash, and the only equality is == (I'm not sure if we can't use = safely behind an Obj.magic). This is quite restrictive, as physical equality may not be what we want on values:

let l1 = 5 :: 4 :: []
let l2 = 5 :: 4 :: [] (* equal to l1, but not physically equal *)

let m1 = HMap.singleton 5 l1 (* int list HMap.t *)
let m2 = HMap.singleton 5 l2 (* int list HMap.t *)
let b = HMap.equal m1 m2 (* false: values aren't physically equal *)

That seems like a pretty hefty limitation don't you think ?

@mlemerre
Copy link
Contributor

mlemerre commented May 3, 2024

I agree with this one. Some comments:

  • I am not sure if this would be a subject only for hash-consing (i.e., we might want equality between maps in general, using forall2 for instance). Probably nobody wants to compare maps without hash-consing, though.
  • Perhaps we want both: a parametric version, and a functorized version, like the standard library does for keys
  • I notice that I made a comment on my HashConsNode implementation indeed sayting that having a hash function for Value would be needed. Maybe we can just change my HashConsNode above to juste take a more restricted Value module, that would require an equality and hash function ? And provide a default parametric implementation using ==? This would avoid code duplication, and the interface change would be minimal.

@dlesbre
Copy link
Collaborator Author

dlesbre commented May 6, 2024

@mlemerre this should be ready for merge now, let me know what you think of the interface and included documentation. If it suits you, we'll merge and release.

@dlesbre dlesbre marked this pull request as ready for review May 6, 2024 15:04
@dlesbre dlesbre mentioned this pull request May 7, 2024
Merged
@dlesbre dlesbre added this to the 0.10.0 milestone May 8, 2024
@dlesbre dlesbre requested a review from mlemerre May 11, 2024 15:21
@dlesbre dlesbre merged commit a702db3 into main May 15, 2024
3 checks passed
@dlesbre dlesbre deleted the hashconsed-maps branch May 15, 2024 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants