Client caused black screen BUG #118
Comments
As mentioned in Discord, that's most likely either a PSU that is too weak for two GPUs, or poor cooling of the GPUs.
I have a Corsair AX1600i PSU. The problem always happens, whether I use -gpu 0 -gpu 1 or only one GPU.
You didn't specify which GPU you have and what backend is being used. In any case, the client is unlikely to be causing this, but the way lc0 is called may be loading the GPU too much. Try lowering the parallelism to 4 (or less) to see if this helps.
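A minimal sketch of what "lowering the parallelism" could look like when lc0 is started by hand rather than through the client. It only rewrites the `--parallelism` flag in an argument list; the path and flags are copied from the command lines quoted in this thread, the helper name `with_parallelism` is hypothetical, and nothing is launched.

```python
def with_parallelism(args, n):
    """Return a copy of an lc0 argument list with --parallelism set to n."""
    flag = "--parallelism="
    # Drop any existing --parallelism=... entry, then append the new value.
    out = [a for a in args if not a.startswith(flag)]
    out.append(f"{flag}{n}")
    return out


# Shortened version of the selfplay command quoted in this thread.
base = [
    r"Z:\LC0\lc0.exe",
    "selfplay",
    "--backend-opts=backend=cudnn-fp16,gpu=0",
    "--parallelism=32",
]

lowered = with_parallelism(base, 4)
print(" ".join(lowered))
```

The printed line can be pasted into a command prompt opened in the lc0 directory; the remaining selfplay flags from the full command would be kept unchanged.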
I have two identical RTX 2080 Ti cards. Okay, then I will use 2 clients instead of one. But I still have the same problem. When using chess GUIs I use: (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
I asked about the backend in case you were using the relatively new dx12, but I see you use cudnn-fp16, so this shouldn't be an issue.
No, this also happens when running run 1 or 2.
I have the same problem when I use a power limit of 40%.
When you run the client, the output near the top contains the exact lc0 command line used. Can you try this on its own to confirm the client has nothing to do with this?
@borg323 On my machine it looks like this:
Then the command to run would be:
@borg323 Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 - Also, I got a small Task Manager window with the message: lc0.exe doesn't work anymore.
You are probably not running lc0 from the same directory the client (and lc0) are in. I assume this is
I have them inside Z:\LC0. I want to donate GPUs to -run 3.
We appreciate it, but first we need to figure out what is causing the problem. Here is the procedure:
This will take you to the LC0 directory, and then run the command I gave earlier. I expect it will have the same problem. We can then try to modify the command to see if we can isolate the issue.
I typed: Z:\LC0>Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --paralle
No crash after 8 hours with gpu 0.
No crash after 8 hours with gpu 1. What does this mean, and what should I do next?
Can you leave the above command running and open up a new command prompt and run the same command again but this time with
I didn't see you trying these suggestions yet -- can you test whether the power spikes are still bad enough with lower parallelism to crash your PC?
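One way to get a rough view of GPU power draw and temperature while testing is to poll `nvidia-smi` and parse its CSV output. The sketch below is only an illustration: the sample readings are made up, the function name `parse_samples` is hypothetical, and `nvidia-smi` reports averaged values at a coarse interval, so it will not show the millisecond-scale transients that can trip a PSU.

```python
import csv
import io

# Query to run for live data (not executed in this sketch).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_samples(text):
    """Parse 'index, power.draw, temperature.gpu' CSV rows into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if len(rec) != 3:
            continue  # skip blank or malformed lines
        idx, power, temp = (field.strip() for field in rec)
        try:
            rows.append(
                {"gpu": int(idx), "power_w": float(power), "temp_c": float(temp)}
            )
        except ValueError:
            continue  # e.g. a field reported as not available
    return rows


# Hypothetical readings, for illustration only.
sample = "0, 251.30, 74\n1, 248.90, 71\n"
samples = parse_samples(sample)
for row in samples:
    print(row)
```

For live readings, the text could come from `subprocess.run(QUERY, capture_output=True, text=True).stdout`, called in a loop with a short sleep between polls.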
The only way I could see the client being the cause is if your networking driver crashes from uploads/downloads and takes the GPU with it.
Or it may be some weird antivirus software reaction having the same effect. The client doesn't do much more than download network files from the server, upload results and run lc0.
When I open cmd and use this: and then that: Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9 Then it looks like I have no problems. But the first is client.exe and the second is lc0.exe. I also tested gpu 0 and gpu 1 at the same time with two cmds and I have no problems when using lc0.exe.
Is there a way I can check if the networking driver has crashed?
Do you have a command line for me showing how it should look when using parallelism 4 and that:
I'm using 360 Total Security as antivirus software.
And when I use two cmds:
Z: Z: This works fine. Does anyone have any idea what exactly caused the bug?
by mooskagh, first post:
It's good to know that lower parallelism helps with stabilizing the power demand enough. We basically use parallelism to load the GPU more, but apparently that puts too much load variation on the PSU.
Lower parallelism didn't help. The PSU isn't too weak, because it's the best PSU on the market that money can buy, and it can easily be used with 4 GPUs. Any other ideas?
PSUs do deteriorate, so I wouldn't be so confident about it no matter what. Tensorflow and AI apps put a "spiky" load on it, and the minute it exceeds a threshold, your CPU will shut down. I recently had a very bad experience where I could do many things just fine, but when trying to train a net it shut down within 30 minutes. The PSU had maybe 200 W of extra headroom, but that didn't help.
If you had crashes at 4, then higher values for parallelism are likely worse. What I don't see from this thread: Did you try starting two separate clients for the two GPUs with Still, the cause of your crashes is 99% not software related, but comes from an apparently too unstable power demand of two GPUs, and the fact that your PSU is good doesn't necessarily mean that it is good enough for this extreme scenario.
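To make the "two separate clients" idea concrete, here is a small sketch that builds one client command line per GPU, each with a single `-gpu` flag. The flags are the ones from the command in the original post; the helper name `client_commands` is hypothetical, the `(name)` placeholders stand in for real credentials, and the commands are only printed, not started.

```python
def client_commands(gpus, run=3, user="(name)", password="(name)"):
    """Build one client command line per GPU index."""
    commands = []
    for gpu in gpus:
        commands.append([
            "client",
            "-run", str(run),
            "-gpu", str(gpu),      # exactly one GPU per client process
            "-report-gpu",
            "-report-host",
            "-user", user,
            "-password", password,
        ])
    return commands


commands = client_commands([0, 1])
for cmd in commands:
    print(" ".join(cmd))
```

Each printed line would be run in its own command prompt, so that a crash can be tied to a specific client and GPU.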
When I start the client:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)
it works fine for some minutes, but then the client causes a black screen.
The machine is still running but no games are played.
Nothing else is possible and I need to restart the PC.
I have two identical GPUs.
And no problems with fritz gui or chessbase 15 gui.
When running the client I see that it is using only one GPU and not both. How can I fix this?
I can see it with msi afterburner and with gpu z.
Do I need parallelism?
Are there other options I can also use with cmd?
Do we have something like logfile.txt when running the client?
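Whether the client has a log option of its own is not stated in this thread. As a generic workaround, a wrapper can copy everything a process prints into a file while still showing it on screen. The sketch below assumes Python is available; `run_logged` is a hypothetical helper, and the demo runs a harmless stand-in command instead of the client.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(cmd, log_path):
    """Run cmd, echo its output, and append every line to log_path."""
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge error output into the same log
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            # Push each line to disk so it is more likely to survive a hang.
            log.flush()
            os.fsync(log.fileno())
        return proc.wait()


# Stand-in command for the demo; replace with the real client command line.
log_path = os.path.join(tempfile.gettempdir(), "logfile.txt")
exit_code = run_logged([sys.executable, "-c", "print('hello from child')"], log_path)
print("exit code:", exit_code)
```

After a black screen and restart, the last lines in the log file show how far the process got before the machine stopped responding.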