Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in Table (hash table) implementation (and potential GC bug) #144

Open
srdjanstipic opened this issue Oct 2, 2024 · 2 comments
Open

Comments

@srdjanstipic
Copy link

srdjanstipic commented Oct 2, 2024

I was playing with libCello (git commit: dfcd86c)
and I found a strange behaviour in Table implementation.
When the table resizes, some values are modified/lost (check the lines with // BUG).

Also, it looks like GC is taking forever (O(n^2) or more) as the PROBLEM SIZE increases.

The code to reproduce the issue is here:

#include <assert.h>
#include "Cello.h"

// count word frequency
int main(int argc, char **argv) {
  (void)argc, (void)argv;
  // wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
  FILE *fp =  fopen("./t8.shakespeare.txt", "r");
  char *line = NULL, *sep = " \f\n\r\t\v";
  var d = new (Table, String, Int), the = $S("the");
  int i = 0, the_count = 0;
  for (size_t len = 0; getline(&line, &len, fp) != EOF;) {
    for (char *tmp = line, *word; (word = strsep(&tmp, sep));) {
      if (i % 10000 == 0) {
        printf("%d\n", i);
      }
      if (i++ > 300000) { // ******* PROBLEM SIZE *********
        goto end;
      }
      the_count += strcmp(*(char**)the, word) == 0;
      if (!mem(d, $S(word))) {    // create if missing
        set(d, $S(word), $I(0));
      }
      *(long *)get(d, $S(word)) += 1; // OK, time 0.127s
      // set(d, $S(word), $I(c_int(get(d, $S(word))) + 1)); // BUG, time 0.261s
      // set(d, $S(word), new(Int, $I(c_int(get(d, $S(word))) + 1))); // BUG, SLOW, time 32.854s
    }
  }
end:
  if (line) {
    free(line);
  }
  if (fp) {
    fclose(fp);
  }
  println("%$ %$\n", $I(the_count), get(d, the));
  assert(the_count == c_int(get(d, the))); // FAIL for BUGs
  return 0;
}
@awmorgan
Copy link
Contributor

awmorgan commented Dec 7, 2024

@orangeduck There appears to be a bug in Table.c because for this simplified test case I get unexpected results:

#include "Cello.h"
int main(int argc, char *argv[]) {

  var d = new (Table, String, Int);
  set(d, $S("a"), $I(1));
  set(d, $S("b"), $I(1));
  set(d, $S("c"), $I(1));
  set(d, $S("d"), $I(1));
  set(d, $S("a"), $I(2));
  set(d, $S("b"), $I(2));
  set(d, $S("c"), $I(2));
  set(d, $S("d"), $I(2));

  foreach (w in d) {
    println("%$ %$", w, get(d, w));
  }
  return 0;
}

output shows duplicated keys "b" and "d":

"b" 1
"d" 1
"c" 2
"d" 2
"b" 2
"a" 2

however, changing to use assign for the existing keys that were added in first pass works:

#include "Cello.h"
int main(int argc, char *argv[]) {

  var d = new (Table, String, Int);
  set(d, $S("a"), $I(1));
  set(d, $S("b"), $I(1));
  set(d, $S("c"), $I(1));
  set(d, $S("d"), $I(1));
  assign(get(d, $S("a")), $I(2));
  assign(get(d, $S("b")), $I(2));
  assign(get(d, $S("c")), $I(2));
  assign(get(d, $S("d")), $I(2));

  foreach (w in d) {
    println("%$ %$", w, get(d, w));
  }
  return 0;
}

output is correct:

"b" 2
"c" 2
"a" 2
"d" 2

This change seems to work but is probably covering up a deeper issue.

 static void Table_Set(var self, var key, var val) {
+  if (mem(self, key)) {
+    assign(get(self, key), val);
+    return;
+  }
   Table_Set_Move(self, key, val, false);
   Table_Resize_More(self);
 }

@orangeduck
Copy link
Owner

Yeah looks like there is a bug somewhere...

I wrote this code almost 10 years ago so unfortunately I have no real memory of how it works or what might be wrong. All I really remember is that it is using robin hood hashing, which is also used by Dictionary and the Garbage Collector.

Given it seems this bug is specific to Table (and does not appear in Dictionary or the GC as far as I know) it is probably somehow related to the fact that Table is meant to be like an owning container - so it might not be moving the contents properly during "rehashing" or "set".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants