a recursive call is not 100% reliably tail recursive, perhaps provoked by Arch #1006
Maybe same as #990. The machine on which unison is invoked is also an Arch derivative. |
Yes, it appears to be the same issue as #990 (comment) @gdt, is it possible to move comments starting from #990 (comment) out of #990 and into this ticket here? It seems that the original #990 is a different issue but the linked comment and this ticket (#1006) are the same issue. If this is not possible then we'll just have to keep this slight mix-up in mind and treat #1006 as the main ticket for this issue going forward. |
To add one point here: I had these crashes with client and server on the same version of ocaml, not like the OP above. |
My current understanding is that this is a problem with OCaml 5.x (or maybe specifically 5.1.x). The OCaml 5.x series is currently still considered experimental. Unison 2.53.4 compiled with OCaml 4.x does not seem to crash like this. |
I think this combination should work, and I don't think the segfault is related to this version mismatch. It may well be, but it is difficult to say without a proper backtrace. @callegar instead of the version built with OCaml 5.1.0, could you grab the build from the Github releases page (which is built with 4.14), or build yourself with OCaml 4.12, 4.13 or 4.14 and try again? That should work without any issues against version 2.51.5 (ocaml 4.13.1); this combination is expected to work for those OCaml versions, but all bets are off for the 5.x series. If you can't try a different version and you still get a segfault, could you get a backtrace from the crash? |
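In case it helps, here is a minimal sketch of one way to do such a test build with opam. The switch version, tag name and plain `make` invocation are assumptions on my part, not official build instructions:

```sh
# Hedged sketch: build Unison against an OCaml 4.14 switch using opam.
opam switch create 4.14.1        # any 4.12-4.14 release should do
eval "$(opam env)"
git clone https://github.com/bcpierce00/unison.git
cd unison
git checkout v2.53.4             # assumed tag name for the release under test
make                             # text UI; the GUI additionally needs lablgtk3 from opam
./src/unison -version
```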
I think this should be retitled to "(probably) ocaml 5 miscompilation leads to segfault", at least for now, until we understand what is happening better. |
No, I can't move comments. |
I can, but it won't help me check if it fixes the issue, since the directory that was giving me problems has already been (painfully) synchronized manually with rsync.
So you suspect that the ocaml 5 compiler does not play well with unison? Isn't issue #990 more likely, given that I was synchronizing a directory that had a git repo with a lot of loose objects nested inside it, leading to a very large number of files? |
That is Toivo's suspicion as I understand it. My basic concept is to have one bug open per theorized defect, at least for now. If you or someone wants to reproduce problems, and preferably write a script that will reproduce them, we can test and look into it. I realize you just want to get past this -- and I would too, but not creating a repro recipe/script leads to "zero people currently have this problem". For now we have no examples of this segfault with mainstream ocaml, just new/experimental ocaml. |
More than this, the matter is that it is quite hard to pack a test case: when I encountered the bug it was on a largish tree, with a git repo inside that had a lot of loose objects and data that I cannot freely share. What is really needed is a "synthetic" example that can be generated on demand, but that is not easy to do. In the meantime, if you suspect that ocaml 5 may break unison, it would probably be good to warn distributions and packagers prominently about this: my ocaml/unison combination comes from a linux distro, and I suspect many of them do not consider ocaml 5 experimental and, in the absence of any strong motivation to keep it, are dropping version 4 (fedora and derivatives, arch and derivatives, etc.) |
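For what it's worth, here is a hedged sketch of how such a synthetic tree could be generated. All paths, counts and sizes are made up, and nobody has confirmed that this actually reproduces the crash:

```sh
# Hypothetical generator for a tree with many small files fanned out over
# subdirectories, mimicking a .git/objects directory full of loose objects.
root=/tmp/unison-repro/a
for i in $(seq 0 255); do
  d=$(printf '%s/.git/objects/%02x' "$root" "$i")
  mkdir -p "$d"
  for j in $(seq 1 200); do
    head -c 512 /dev/urandom > "$d/$(printf 'obj%04d' "$j")"
  done
done
mkdir -p /tmp/unison-repro/b
unison -batch /tmp/unison-repro/a /tmp/unison-repro/b   # local sync of ~50k files
```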
The upstream ocaml webpage says
so I don't think it's unison's place to have to warn about that. Given that statement, packagers should not have the 5.1 ocaml package be the main one. |
I do suspect these are different issues, and there is at least one other user whose experience seems to confirm that. It is difficult to say without a crash trace or other information about your setup.
- If you still have the binary you ran when it crashed, could you attach it here so I can perhaps investigate further?
- Perhaps you still have the core dump from the crash lying around somewhere? If yes, please attach it here.
- Please post some information about your configuration. We don't need the common preferences such as root, path, ignore, etc., but it would be good to know if you're using some lesser-used preferences such as atomic, maxthreads or stream.
- Could you estimate the number of updates that led to the crash?
- Finally, please clarify whether you were using the text UI or the GUI.

I do not think it is #990 because there are currently no known stack overflow bugs and I test with relatively small stacks (< 100 KB, whereas the default on most systems is 8 MB or even 10 MB) and relatively large replicas (100 000+ updates, sometimes 1 000 000+ updates). |
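For anyone who wants to mirror the small-stack testing mentioned above and capture a core dump, a hedged sketch; the 100 KB figure just echoes the comment, the profile name is a placeholder, and on systems where cores are piped to a handler (e.g. systemd-coredump) the plain `core` file will not appear:

```sh
# Hedged sketch: run the text UI with core dumps enabled and a small OS stack.
ulimit -c unlimited                  # allow a core file to be written
ulimit -s 100                        # limit the OS stack to ~100 KB in this shell
unison myprofile -ui text -batch     # "myprofile" is a placeholder profile name
# If it segfaults and a core file appears, pull a backtrace out of it:
gdb "$(command -v unison)" core -batch -ex "bt full" > unison-backtrace.txt
```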
Just noticed that the unison version provided in arch linux carries this downstream patch: That is a patch meant to fix #48. Is that still needed at all? Can it be the cause of the misbehavior? |
No, that patch is ancient. It can't even apply on the current code. |
I built the package without the patch and it is still crashing. So if the patch is a cause, it's not the whole cause. |
How do you know it is still crashing? I thought there was no reproducer. If you can reproduce the crash, can you extract a stack trace or a core dump from it? Can you pinpoint which conditions are required for the crash? |
I'm syncing the day's changes at the end of my workday, and on most days it triggers the crash 😅. When that happens, I need to run unison multiple times, crash after crash, to incrementally get all files synced.
Apparently systemd has captured a coredump for me. This is the backtrace: unison-gui-backtrace.txt. Note that this is the GUI; I haven't actually tried it with the CLI yet because the CLI isn't really practical for larger numbers of files. |
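Just in case it is useful to others: a hedged sketch of how a backtrace can be pulled out of a systemd-captured core dump. The GUI binary name here is an assumption; use whatever `coredumpctl list` actually shows on your system:

```sh
# Hedged sketch: extract a backtrace from a systemd coredump.
coredumpctl list unison-gtk2                 # binary name is a guess; check the real one
coredumpctl gdb unison-gtk2                  # interactive gdb on the newest matching dump
#   (gdb) bt full
# or non-interactively:
coredumpctl dump unison-gtk2 -o unison.core
gdb "$(command -v unison-gtk2)" unison.core -batch -ex "bt full" > unison-gui-backtrace.txt
```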
I see, I mixed up the reporters :) Thank you for the backtrace, this has been most helpful. The trace looks very much the same as the one @norbusan provided, so it most likely is the same issue, and for now I'm still very much convinced that this is an OCaml 5.x issue (not to say that the bug couldn't still be in Unison code, just that it is not triggered by OCaml 4.x). What is the exact Unison version you are using? If you can trigger the crash so frequently then my first advice is to try and compile with OCaml 4.x or download from the GH releases page, and then see if the crashes go away. |
It's the archlinux version, so
Yeah, makes sense. I just haven't done it yet because things have been kind of busy lately, and I only ever remember the crashes when they happen at the end of the day, when I have other places to be and things to do 😆. But given that I have my own |
Yes, please do, that would be very helpful. Another thing, which presumably also comes from arch build scripts, is that both your and @norbusan's traces show that ocaml itself is stripped of debug symbols. If compiling with 5.1.1, is it possible to keep the debugging symbols in the compiler/compiler stdlib? |
@FSMaxB I have made an AUR package with the unison binaries from upstream (namely compiled with OCaml 4.14). Can you check that too? It is |
I have been trying to reproduce the issue myself but have been unsuccessful so far. @callegar @FSMaxB @norbusan if you have a binary that you know would crash, please attach it here so I can take a look at it. Right now it seems there are three unique reporters reporting the exact same issue (though we can't be 100% sure, because we didn't get a backtrace from one reporter). There are a couple of things common between the reporters:
Seeing as arch is the common theme here, I looked at its build recipe for ocaml-5 but couldn't really see anything suspicious. They do enable frame pointers, though. I also tried my own build with OCaml 5.1.1 with frame pointers enabled. Aside from making things run slower, I still could not reproduce a crash. We really need your help to find out what's going on here, so I ask all the reporters for the following:
Additionally, please provide as much information as possible about your setup:
Since I've been unable to reproduce the crash with binaries built with OCaml 5.x, I can only suspect that there is either something in your replicas/configuration that is a precondition for the crash, or something in the Arch build process. It's not impossible that the Arch build process makes the compiler produce different code (mainly by affecting the runtime itself), or the linker produce a different executable; or the problem could be in a linked library, such as lablgtk/GTK. |
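For anyone who wants to replicate the frame-pointers experiment mentioned above, a hedged sketch of one way to do it from source. The configure flag exists on x86-64; the prefix, version and surrounding steps are assumptions:

```sh
# Hedged sketch: build an OCaml 5.1.1 compiler with frame pointers enabled,
# then rebuild Unison against it.
git clone https://github.com/ocaml/ocaml.git && cd ocaml
git checkout 5.1.1
./configure --prefix="$HOME/ocaml-5.1.1-fp" --enable-frame-pointers
make -j"$(nproc)" && make install
export PATH="$HOME/ocaml-5.1.1-fp/bin:$PATH"
cd ../unison && make clean && make      # assumes a Unison source checkout next to it
```

With opam, the `ocaml-option-fp` variant package should achieve the same thing, if I am not mistaken.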
Not sure if I succeeded in that, but here is another stacktrace: I just tried to reproduce without the GUI and wasn't able to, but that doesn't necessarily mean anything: the GUI had previously been run and made some progress before crashing, so the number of remaining changes might just have been brought below the threshold where it crashes. I will try with upstream release binaries next, but testing is slow because I only get to test every one or two workdays.
The replica is 511 GiB with 1839500 files, but only a small portion of that usually changes. I don't know how far into the sync it crashes, but I think the number of changed items is at least several thousand when it happens.
The only preferences I'm using are
Haven't tried a local sync, what crashes is a remote sync.
x86_64. I also recently removed all the archive files and tried again. This didn't fix anything, so it is definitely not caused by a corrupted archive file. |
Ok, it also crashes when syncing locally! And I think I found a way to reproduce it locally. When I take our frontend project, jump between versions that are far apart, and run the local sync, it crashes. Still need to find a step-by-step reproducible way that doesn't include proprietary data. One thing I'll try once I find the time (probably not today) is to switch between versions in the linux kernel source tree; that might trigger it. |
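In case someone else wants to try the same experiment, here is a hedged sketch of that idea. The kernel tags and directories are arbitrary, and it is not confirmed that this actually triggers the bug:

```sh
# Hypothetical reproduction attempt: generate a huge set of changes by
# jumping between two distant kernel releases, then sync locally.
git clone --branch v5.10 --depth 1 https://github.com/torvalds/linux.git /tmp/sync-src
mkdir -p /tmp/sync-dst
unison -batch /tmp/sync-src /tmp/sync-dst                 # initial local sync
cd /tmp/sync-src
git fetch --depth 1 origin tag v6.8 && git checkout v6.8  # big version jump
unison -batch /tmp/sync-src /tmp/sync-dst                 # second sync carries thousands of updates
```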
I have to completely retract my previous comment (I have now deleted it because it has become misleading). I've tested with the following binaries (all of Unison 2.53.4):
I tested on a freshly installed Arch system and on openSUSE. The results:
"OK" means no crash. Simply "OK" without additional comments means no unwanted stack growth whatsoever. *** I see something weird that I can't explain at the moment. Many of the OCaml 5.x builds run the local sync correctly without crashing but when I extract the trace of the running process with gdb I can see a large number of frames. Large enough that it should easily crash with the limited stack size. Yet, no crashes... Can anyone explain what is going on here? New conclusions from this:
|
Update: I can only reproduce the crash with the GUI. With my frontend-code test case, the GUI crashes every time, but I was able to sync with the CLI with no problem. |
It crashes with 5.1.1 btw. Although I probably won't be able to test a lot more today. |
@FSMaxB thank you for testing. What is interesting is that I have not been able to crash a remote sync, even if I limit the stack size to the smallest possible for a GTK application. Remote syncs do not show any signs of stack exhaustion. I have finally been able to make a local sync crash and can now investigate further. |
That is weird, the unison binary I get from Arch claims it's been compiled with 5.1.0. Could this be a difference in Arch mirrors? (I know nothing about Arch, so there's that...) I'd like to test with the Arch 5.1.1 binary if you can instruct me on how to get it. |
Oh, sorry about that. I totally forgot that this was still the binary I produced to try to get more debug information. I took the archlinux PKGBUILD, removed the patch and just rebuilt it locally: https://gitlab.archlinux.org/archlinux/packaging/packages/unison This then automatically uses OCaml 5.1.1. BUT: I can confirm that the crashes don't happen with |
Last test for today, the binary from this link: https://github.com/tleedjarv/unison/actions/runs/8481008964 This does not crash. So it seems like the crashes, for me at least, are actually only happening with Arch Linux-produced binaries 👀 |
Thank you @FSMaxB. Your testing confirms my findings that Arch-built binaries do seem to behave differently. |
I have a patch that I hope will resolve the crashes that I've been able to reproduce (and hopefully those seen by each reporter). @callegar @FSMaxB if you are building from source then please apply this patch: tleedjarv@b186791 Alternatively, you can also test a CI-built binary that you can download from https://github.com/tleedjarv/unison/actions/runs/8498976286 (that's a clean 2.53.4 with only this patch applied). If you have a hard time reproducing the crash then try to limit the stack size (for me 144 KB for the GUI worked, for example). That should limit the number of updates required to trigger the crash. You can also artificially create a large number of updates to sync (just empty files, for example). Clarifications on some previous comments:
Yes, the patch in PKGBUILD (largefiles) is bogus and should be removed. Also, PKGBUILD lists
Ok, so we know it still crashes with 5.1.1 but the binary built with 5.1.1 is much faster. Let's see if the patch fixes the crash. |
@tleedjarv, looking at your comment on the stack size, are you limiting the OCaml stack size with |
Ah, that explains the results I got! Thank you for pointing it out.
I need to retest with |
That part of the manual has indeed bitrotted; this will be fixed in the 5.2.0 manual (see ocaml/ocaml#13066). |
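To spell the practical difference out for anyone retracing these tests, a hedged sketch based on my reading of the thread and of the 5.x runtime; the profile name is a placeholder, and the exact semantics and units of the `l` parameter should be checked against the manual for your compiler version:

```sh
# Hedged sketch: two different knobs that both get called "stack size".
# With OCaml 4.x native code the OS stack limit is what matters:
ulimit -s 1024                       # ~1 MB OS stack for this shell
unison myprofile -ui text -batch     # "myprofile" is a placeholder
# With OCaml 5.x the runtime manages its own growable stacks, so the
# OCAMLRUNPARAM "l" parameter (reportedly measured in words) is the
# relevant limit rather than ulimit:
OCAMLRUNPARAM=l=16384 unison myprofile -ui text -batch
```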
I have now tested the tentative fix with |
Unfortunately, I am under a deadline and I will not be able to test before next week. Hope someone is faster than me. Thanks for all the analysis and fix work! |
I propose that this ticket can be closed once #1016 is merged, even if not all testing on Arch is finished. If the crashes persist then it can be reopened. In conclusion:
|
Thanks for looking into this. I've dropped the patch (originally from Fedora) that Arch Linux was carrying to fix problems you claim are now obsolete. That rebuild is now in the official [extra] repo. If there are ongoing issues with performance, could somebody please open a new issue, either here (and ping me) or on the Arch Linux packaging repo, specifically about that? If we're doing something wrong with build flags and optimizations I'd be happy to get it fixed. I'm not very familiar with OCaml but can certainly help with getting things changed if anybody knows what to do. If build logs showing the various flags or whatever would be helpful, ask in an issue and I'll see what I can provide. Also, all the tooling for building reproducible copies of Arch Linux packages and then playing with your own build settings is open source and pretty easy to set up. I can help somebody with that if you want to try to replicate build issues. |
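For anyone who wants to take up that offer, a hedged sketch of a local rebuild workflow; the exact edits to the PKGBUILD depend on what you want to test, and the commands assume the usual Arch tooling (git, base-devel, pacman-contrib):

```sh
# Hedged sketch: rebuild the Arch unison package locally with modified settings.
git clone https://gitlab.archlinux.org/archlinux/packaging/packages/unison.git
cd unison
# edit PKGBUILD here, e.g. drop a patch or tweak build flags
updpkgsums                      # refresh checksums if the source array changed
makepkg -srf                    # build against the system toolchain
sudo pacman -U unison-*.pkg.tar.zst
```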
I had a quick look at the PKGBUILD and I don't see anything there that could have caused the performance issue. It may have been due to the 5.1.0 compiler having some regression. If I understand correctly, the new builds are compiled with 5.1.1 (my own build on Arch with 5.1.1 did not seem to suffer from performance issues). I had suspected that this may have been a bytecode build, but at least in my testing it was even slower than a bytecode build! @alerque is there a way to see the build logs somewhere? @alerque A side remark about PKGBUILD: you can remove emacs as a build dependency, and you can remove
For now, I would like to think that this was some regression with the 5.1.0 compiler and doesn't require any further attention, but I will also test the new binary from the rebuild you mentioned. |
There was an important GC performance regression between 5.0.0 and 5.1.0, which was fixed in 5.1.1. The regression was due to a bad interaction between the GC and bigarrays or C-allocated objects. I don't know if this applies to unison, but it is at least a likely culprit. |
Yes, it does apply. This surely must have been it. |
Thanks again @tleedjarv. I've just rebuilt again with those tips in place. That's looking nicer all the time; you can see pkgrel 4. @Octachron I'd like to think that's probably the issue, but it would be nice to confirm it is not present in current builds. @tleedjarv which pkgrel were you testing that had speed issues? We can check the archives and look inside the pkg.zst file itself for the .BUILDINFO file to see exactly what version of OCaml it was compiled with. The actual build logs are ephemeral unless the person doing the building stashes them somewhere, but we can see rebuild logs from when rebuilderd tries to see if the packages are reproducible. Here is the dashboard. Since I just sent a new package version it isn't checked yet, but it should show up soon, at which point the build logs should be visible. |
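For reference, a hedged one-liner for that .BUILDINFO check; the package file name below is just a placeholder for whichever pkgrel you pull from the archive:

```sh
# Hedged sketch: read the OCaml version recorded in an Arch package's .BUILDINFO.
bsdtar -xOf unison-2.53.4-2-x86_64.pkg.tar.zst .BUILDINFO | grep -i ocaml
```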
I tested pkgrel 2, built with 5.1.0, slow. Now I tested pkgrel 4, built with 5.1.1, performance is ok. I think this concludes the performance issue, so we no longer need the build logs. Edit: Just checked the rebuild logs for pkgrel 4 and all looks good there. |
Using unison version 2.53.4 (ocaml 5.1.0), I need to sync with a host running unison version 2.51.5 (ocaml 4.13.1).
It is unclear to me if this combination is supported at all. Unison does not complain, though.
The changes are collected just fine.
When I tell unison to propagate the changes it starts, but at about 50% it crashes with a segmentation fault. Even if the version combination above is weird, IMHO unison should at most refuse to run, not segfault.
Unfortunately I cannot install a more recent unison on the target machine to test with it.
Turning on debugging shows
[files] Creating directory <long path>
as the last action before the crash. From the path, unison is trying to transfer a piece of a .git directory (in fact an object directory). Because of how git objects are named, I suspect a rather long path.