Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validator cause hasher return different hash for the same key #31

Open
zhenzou opened this issue Dec 6, 2023 · 14 comments
Open

Validator cause hasher return different hash for the same key #31

zhenzou opened this issue Dec 6, 2023 · 14 comments

Comments

@zhenzou
Copy link

zhenzou commented Dec 6, 2023

Golang: 1.21.4

And it will be ok if remove the validate step.

It maybe not the problem of the theine, but would you like give some suggestions to solve this problem, Validator is a wild used library, many other uses maybe face the same issues.

package main

import (
	"slices"
	"unsafe"

	"github.com/go-playground/validator/v10"
	"github.com/zeebo/xxh3"
)

type Hasher[K comparable] struct {
	ksize int
	kstr  bool
}

func NewHasher[K comparable]() *Hasher[K] {
	h := &Hasher[K]{}
	var k K
	switch ((interface{})(k)).(type) {
	case string:
		h.kstr = true
	default:
		h.ksize = int(unsafe.Sizeof(k))
	}
	return h
}

func (h *Hasher[K]) Hash(key K) uint64 {
	var strKey string
	if h.kstr {
		strKey = *(*string)(unsafe.Pointer(&key))
	} else {
		strKey = *(*string)(unsafe.Pointer(&struct {
			data unsafe.Pointer
			len  int
		}{unsafe.Pointer(&key), h.ksize}))
	}
	return xxh3.HashString(strKey)
}

// For simulate real case, instance a new string every time
func newArgs(id string) Args {
	return Args{
		Id: string(slices.Clone([]byte(id))),
		// Id: id,
	}
}

type Args struct {
	Id string `url:"id" validate:"required"`
}

func main() {
	hasher := NewHasher[Args]()

	validate := validator.New()
	for i := 0; i < 100; i++ {
		args := newArgs("123")
		err := validate.Struct(args)
		if err != nil {
			panic(err)
		}
                // expect to same for all 100 times, but not
		println(hasher.Hash(args))
	}
}
@maypok86
Copy link

maypok86 commented Dec 6, 2023

Spoiler: it's actually a problem for theine as well. Since a key containing strings will select a random shard and hit ratio will be very small.

@maypok86
Copy link

maypok86 commented Dec 6, 2023

If you just need a hasher for comparable types, I can recommend https://github.com/dolthub/maphash that uses a set of dirty tricks to pull a hash function from a standard map

@Yiling-J
Copy link
Owner

Yiling-J commented Dec 7, 2023

@zhenzou as @maypok86 pointed out, this is not related to validator. This is because current hash function of Theine calculate hash based on object's memory bytes. String is pointer to bytes so two strings have different hash even their value is same . I copy this from https://github.com/tidwall/hashmap because first version of Theine use that as internal map.

I think the best way to handle this is adding an optional Hash interface, instead of using the maphash package which hack Go's runtime a lot. Let me do some investigations first and decide how to handle this properly

@zhenzou
Copy link
Author

zhenzou commented Dec 7, 2023

@Yiling-J Yes, I don't believe it is issue of validator either. The strange thing is the issue will be disappears if remove the validate step. In my opinion, I create a new string every time, it should be different every time for current version even removed the validate step.
I tried some escape demos, but none of them can reproduce the issue.
Would you help to explain the detail for this.
Thanks a lot.

func escape(obj any) {
	if obj == nil {
		panic("nil")
	}
	val := reflect.ValueOf(obj)

	if val.Kind() != reflect.Struct {
		panic("not struct")
	}
}

@zhenzou
Copy link
Author

zhenzou commented Dec 7, 2023

@Yiling-J Hi, We are using theine in our new project, and this issue block us, would you provide a workaround for this.
e.g. the add the interface to build a key to return a string, the hasher recognize if the key is a CacheKey.

type CacheKey interface{
    BuildCacheKey() string
}

@Yiling-J
Copy link
Owner

Yiling-J commented Dec 7, 2023

@zhenzou I do a quick fix on master branch so you are not blocked. I still need some investigations before release a new version. About your validator question, I think "123" is interning automatically by Go(in string() method), and validator somehow break this and create a new string each time string() is called.

new interface:

type StringKey interface {
	StringKey() string
}

@Yiling-J
Copy link
Owner

Yiling-J commented Dec 7, 2023

@zhenzou Actually there might be the case your key struct already has a StringKey method and do something else. So what i think is adding an extra method to builder, so the options are:

  • still use StringKey() string interface
  • use StringKey() theine.Key interface, in case StringKey method already exists and do something else
  • add an optional method to client builder, so no need to worry about conflicts
builder := theine.NewBuilder[Foo, string](1000)

builder.StringKey(func(key Foo) string {
    return key.name
})

which one you think is better?

@zhenzou
Copy link
Author

zhenzou commented Dec 7, 2023

@Yiling-J You are correct, add a custom StringKey builder instead of StringKey interface is a better solution.

@maypok86
Copy link

maypok86 commented Dec 7, 2023

@Yiling-J I noted maphash because all other solutions will raise the same question, "What will you do if the key contains multiple strings?".

@Yiling-J
Copy link
Owner

Yiling-J commented Dec 7, 2023

@zhenzou main branch updated, please use the new builder method:

builder := theine.NewBuilder[Foo, int](10000)
builder.StringKey(func(k Foo) string { return k.Bar })

@Yiling-J
Copy link
Owner

Yiling-J commented Dec 7, 2023

@maypok86 seems the only special type as map key is string? because bytes is not allowed in comparable and pointers are always not equal. if so it's also possible to do a recursive check with reflect when hasher initialized, if there is string field and user doesn't provide a custom StringKey func(or maybe hash function), just panic.

@maypok86
Copy link

maypok86 commented Dec 7, 2023

@Yiling-J Eh, if only it were that simple...

Since Go 1.20 interfaces are comparable and this example will not pass:

package kek

import (
	"fmt"
	"github.com/dolthub/maphash"
	"github.com/zeebo/xxh3"
	"testing"
	"unsafe"
)

type Hasher[K comparable] struct {
	ksize int
	kstr  bool
}

func NewHasher[K comparable]() *Hasher[K] {
	h := &Hasher[K]{}
	var k K
	switch ((interface{})(k)).(type) {
	case string:
		h.kstr = true
	default:
		h.ksize = int(unsafe.Sizeof(k))
	}
	return h
}

func (h *Hasher[K]) Hash(key K) uint64 {
	var strKey string
	if h.kstr {
		strKey = *(*string)(unsafe.Pointer(&key))
	} else {
		strKey = *(*string)(unsafe.Pointer(&struct {
			data unsafe.Pointer
			len  int
		}{unsafe.Pointer(&key), h.ksize}))
	}
	return xxh3.HashString(strKey)
}

type LolKek struct {
	lol any
	kek any
}

func TestHasher(t *testing.T) {
	lol := 236
	kek := [34]byte{1, 2, 3}
	hasher := NewHasher[LolKek]()
	h1 := hasher.Hash(LolKek{
		lol: lol,
		kek: kek,
	})
	fmt.Println(h1)
	h2 := hasher.Hash(LolKek{
		lol: lol,
		kek: kek,
	})
	fmt.Println(h2)
	if h1 != h2 {
		t.Fatal("hashes should be the same")
	}
}

func TestMapHash(t *testing.T) {
	lol := 236
	kek := [34]byte{1, 2, 3}
	hasher := maphash.NewHasher[LolKek]()
	h1 := hasher.Hash(LolKek{
		lol: lol,
		kek: kek,
	})
	fmt.Println(h1)
	h2 := hasher.Hash(LolKek{
		lol: lol,
		kek: kek,
	})
	fmt.Println(h2)
	if h1 != h2 {
		t.Fatal("hashes should be the same")
	}
}

But it's more rare than this one (LolKek is just a composite key):

type LolKek struct {
	LolID string
	KekID string
}

@tv42
Copy link

tv42 commented Dec 31, 2023

If you're heading toward the route of needing StringKey functions, you should probably stop insisting on comparable. If I can make my StringKey function return a random number, that comparable didn't actually enforce the keys to be comparable.

I'd really prefer to see an API that doesn't force me to combine multiple []byte into a newly-allocated single string -- especially multiple times! And my loading function wants to access the fields separately, so encoding them into a string would just mean I have to decode the string to fetch the value.

I haven't looked at the internals of theine to know if this is feasible, but one way around this would be to say that a cache key must implement Equals(other: K) bool and Hash(sum *maphash.Hash) (https://pkg.go.dev/hash/maphash). That way you can do e.g.

package main

import (
	"bytes"
	"encoding/binary"
	"hash/maphash"
)

type cacheKey struct {
	foo  uint64
	bar  []byte
	quux []byte
}

func (c cacheKey) Equals(other cacheKey) bool {
	return c.foo == other.foo && bytes.Equal(c.bar, other.bar) && bytes.Equal(c.quux, other.quux)
}

func (c cacheKey) Hash(sum *maphash.Hash) {
	var tmp [8]byte
	binary.LittleEndian.PutUint64(tmp[:], c.foo)
	sum.Write(tmp[:])
	sum.Write(c.bar)
	sum.Write(c.quux)
}

@Yiling-J
Copy link
Owner

@tv42 I think comparable is the right key type, and if we consider cache a hashmap, it's nature to use comparable as key. Go knows exactly how to handle comparable, it just didn't export interface or useful functions. The best 2 options to me is: Go provide a generic hash function or Go make hashmap key type implementable, like Rust.

Currently I'm still thinking do a recursive reflect check when cache initilaized, if there is string and StringKey method is not implemented, just panic(any is valid in this check though, but when saving to real map, Go will panic). Another option is rewrite and avoid sharded map, because the hash value is only used to choose a shard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants