Client caused black screen BUG #118

lex312 · 2020-05-06T05:59:50Z

When I start the client:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)

it works fine for a few minutes, but then the client causes a black screen.
The machine is still running, but no games are played.
Nothing else is possible and I need to restart the PC.

Both GPUs are exactly the same model.
And I have no problems with the Fritz GUI or the ChessBase 15 GUI.

When running the client I see that it is using only one GPU and not both. How can I fix this?
I can see it with MSI Afterburner and with GPU-Z.

Do I need parallelism?
Are there other options I can also use on the command line?

Do we have something like a logfile.txt when running the client?

mooskagh · 2020-05-06T15:26:49Z

As mentioned in Discord, that's most likely either a PSU that is too weak for two GPUs, or poor cooling of the GPUs.
Both are quite common in dual-GPU systems.

lex312 · 2020-05-06T15:34:53Z

I have a Corsair AX1600i PSU.
And lots of high-end air coolers.
One GPU is at only 74 degrees Celsius.
The other one is at 50.

The problem always happens, no matter whether I use -gpu 0 -gpu 1 or only one GPU.

borg323 · 2020-05-06T16:08:40Z

You didn't specify which GPU you have and what backend is being used. No matter, the client is unlikely to be causing this, but the way lc0 is called may be loading the GPU too much. Try lowering the parallelism to 4 (or less) to see if this helps.
Finally, note that the client only uses one gpu by default. Adding a second -gpu to the command line just overrides the first. If you want to run on both gpus, the most efficient way is to run a second client instance, with a different -gpu number.
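
Editor's sketch of the "one client instance per GPU" advice above, in Python. It only uses client flags that appear in this thread (`-run`, `-gpu`, `-report-gpu`, `-report-host`, `-parallelism=N`, `-user`, `-password`); the helper names `client_command` and `launch_clients`, and the assumption that `client` is found from the current directory or PATH, are illustrative and not part of the client itself.

```python
import subprocess


def client_command(gpu, run=3, user="(name)", password="(name)",
                   parallelism=None, exe="client"):
    """Build the argument list for one client instance bound to one GPU."""
    args = [exe, "-run", str(run), "-gpu", str(gpu),
            "-report-gpu", "-report-host"]
    if parallelism is not None:
        # Optional: the -parallelism=N form is the one used later in this thread.
        args.append(f"-parallelism={parallelism}")
    args += ["-user", user, "-password", password]
    return args


def launch_clients(gpus, **kwargs):
    """Start one client process per GPU and return the process handles."""
    return [subprocess.Popen(client_command(gpu, **kwargs)) for gpu in gpus]


# Hypothetical usage: one process for GPU 0 and one for GPU 1.
# procs = launch_clients([0, 1])
# for proc in procs:
#     proc.wait()
```

Passing a single `-gpu` per process avoids the override behaviour described above, because each instance only ever sees one GPU number.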

lex312 · 2020-05-06T19:25:32Z

I have 2x the same RTX 2080 Ti.
I'm running the client with this:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)
As you can see, I set no backend and no parallelism.

Okay, then I will use 2 clients instead of one. But I still have the same problem.
Also note that I think most people don't know that the second GPU overrides the first GPU when only one client is in use.
So this should also be fixed.

When using chess guis I use: (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
and roundrobin.

borg323 · 2020-05-06T19:59:56Z

I asked about backend in case you were using the relatively new dx12, but I see you use cudnn-fp16 so this shouldn't be an issue.
Is this only happening on run 3?

lex312 · 2020-05-06T20:47:17Z

No, this also happens when running run 1 or 2.

lex312 · 2020-05-08T05:39:29Z

I have the same problem when I use a power limit of 40%.

borg323 · 2020-05-08T12:32:15Z

When you run the client, the output near the top contains the exact lc0 command line used. Can you try this on its own to confirm the client has nothing to do with this?
Example from an old log I had: /content/lc0/build/lc0 selfplay --backend-opts=backend=cudnn-fp16 --parallelism=32 --visits=10000 --cpuct=2.5 --cpuct-at-root=2.5 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.90 --temp-endgame=0.75 --temp-cutoff-move=16 --temp-visit-offset=-0.8 --fpu-strategy=absolute --fpu-value=-1.0 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000012 --policy-softmax-temp=1.2 --resign-wdlstyle=true --training=true --weights=client-cache/3eb9d62ecc6aa2a84b7cdb789c50702a02477cf969949cf7ed788b71a3ea9cfa

lex312 · 2020-05-08T13:04:04Z

@borg323
What exactly do you want me to do?

On my machine it looks like this:
Z:\LC0>client -run 3 -gpu 0 -report-gpu -report-host -user (name) -password (name)
Lc0 client version 26
2020/05/08 15:00:09 lc0_main.go:956: serverParams: [--visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0]
Args: [Z:\LC0/lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9]
       _
|   _ | |
|_ |_ |_| v0.25.1+git.69105b4 built Apr 30 2020
id name Lc0 v0.25.1+git.69105b4
id author The LCZero Authors.
Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
Creating backend [multiplexing]...
Creating backend [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 10.1.0
GPU: GeForce RTX 2080 Ti
GPU memory: 11 Gb
GPU clock frequency: 1545 MHz
GPU compute capability: 7.5
PGN: [FEN "bnrnkbqr/pppppppp/8/8/8/8/PPPPPPPP/BNRNKBQR w KQkq - 0 1"]

borg323 · 2020-05-08T14:03:13Z

Then the command to run would be:
Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
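
Editor's sketch: the step above (taking the `Args: [...]` line from the client output and turning it into a standalone lc0 command) can be done mechanically. The Python below is a minimal illustration, assuming the client output has been saved as text with the `Args:` line on a single, unwrapped line and that no argument contains spaces; `extract_lc0_args` is a hypothetical helper name, not part of the client.

```python
import re


def extract_lc0_args(client_output):
    """Return the lc0 argument list from the client's 'Args: [...]' line.

    Returns None if no such line is present. Arguments are split on
    whitespace, so paths containing spaces are not handled.
    """
    match = re.search(r"^Args: \[(.*)\]\s*$", client_output, flags=re.MULTILINE)
    if match is None:
        return None
    return match.group(1).split()


# Hypothetical usage with a shortened sample of the output shown in this thread.
sample = (
    "Lc0 client version 26\n"
    "Args: [Z:\\LC0/lc0.exe selfplay "
    "--backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32]\n"
    "id name Lc0 v0.25.1+git.69105b4\n"
)
print(" ".join(extract_lc0_args(sample)))
```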

lex312 · 2020-05-08T14:14:20Z

@borg323
I tried to run the command and got this:

Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
       _
|   _ | |
|_ |_ |_| v0.25.1+git.69105b4 built Apr 30 2020
id name Lc0 v0.25.1+git.69105b4
id author The LCZero Authors.
Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
Unhandled exception: Cannot read weights from client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9

Also, I got a small Task Manager window with the message: lc0.exe doesn't work anymore.

borg323 · 2020-05-08T14:52:21Z

Probably you are not running lc0 from the same directory the client (and lc0) are in. I assume this is Z:\LC0. There should be books and client-cache subdirectories, the first one containing 960fen.pgn and the second one containing fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9.
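
Editor's sketch: the `--weights=` and `--openings-pgn=` values in the command are relative paths, so they only resolve when lc0 is started from the directory that contains `client-cache` and `books`. The Python below is a minimal pre-flight check under that assumption; `missing_relative_files` is a hypothetical helper, and only the two flags named here are inspected.

```python
from pathlib import Path


def missing_relative_files(args, workdir):
    """List relative --weights / --openings-pgn paths that do not exist under workdir."""
    missing = []
    for arg in args:
        for flag in ("--weights=", "--openings-pgn="):
            if arg.startswith(flag):
                # Normalise Windows backslashes so the check also runs on other systems.
                rel = Path(arg[len(flag):].replace("\\", "/"))
                if not rel.is_absolute() and not (Path(workdir) / rel).is_file():
                    missing.append(rel.as_posix())
    return missing


# Hypothetical usage: an empty result means both files resolve from Z:\LC0.
# print(missing_relative_files(lc0_args, r"Z:\LC0"))
```

If the list is not empty, changing into the right directory before starting lc0 (or passing `cwd=` to `subprocess.run`) should make the relative paths resolve.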

lex312 · 2020-05-08T18:48:11Z

@borg323

Inside Z:\LC0 I have lc0.exe, client.exe and the other basic lc0 files. The books and client-cache subdirectories are there too. 960fen.pgn is inside books, and client-cache contains the right fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9

I want to donate GPUs to -run 3.
That's why I opened an empty cmd window and pasted in what you wrote to me before.

borg323 · 2020-05-08T19:18:39Z

We appreciate it, but first we need to figure out what is causing the problem. Here is the procedure:
Open a cmd window and then type:

Z:
CD \LC0

This will take you to the LC0 directory. Then run the command I gave earlier. I expect it will have the same problem. We can then try to modify the command to see if we can isolate the issue.

lex312 · 2020-05-08T23:29:31Z

@borg323

I have typed:
Z:
CD \LC0
and then the command you gave me.
I will tell you later if it crashes again.
This is how it looks now:

Z:\LC0>Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
       _
|   _ | |
|_ |_ |_| v0.25.1+git.69105b4 built Apr 30 2020
id name Lc0 v0.25.1+git.69105b4
id author The LCZero Authors.
Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
Creating backend [multiplexing]...
Creating backend [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 10.1.0
GPU: GeForce RTX 2080 Ti
GPU memory: 11 Gb
GPU clock frequency: 1545 MHz
GPU compute capability: 7.5
gameready trainingfile Z:\LC0/data-hyepracghopp/game_000029.gz gameid 29 play_start_ply 0 player1 white result blackwon moves b2b3 g7g5 b1c3 f7f5 g2g4 f5f4 e2e3 b8c6 d2d4 e8g6 f1e2 e7e6 d1b2 d7d5 e2d2 c6b4 c1a1 g6c2 a2a3 c2d1 d2d1 b4c6 g1g3 from_fen rnknbqrb/pppppppp/8/8/8/8/PPPPPPPP/RNKNBQRB w KQkq - 0 1
tournamentstatus P1: +0 -1 =0 LOS: 15.87% P1-W: +0 -1 =0 P1-B: +0 -0 =0 npm 600.875000 nodes 14421 moves 24
gameready trainingfile Z:\LC0/data-hyepracghopp/game_000001.gz gameid 1 play_start_ply 0 player1 white result whitewon moves d2d3 c7c6 f1g3 f8e6 b2b4 d8c7 e2e4 a7a5 b4b5 g8f6 e4e5 f6d5 a2a4 e8g8 g1f3 d7d6 e5d6 c7d6 e1g1 b7b6 f1e1 a8b7 g3f5 b7c7 e1e6 d6h2 from_fen qrbbknnr/pppppppp/8/8/8/8/PPPPPPPP/QRBBKNNR w KQkq - 0 1

lex312 · 2020-05-09T07:56:09Z

@borg323

No crash after 8 hours with gpu 0.
I'm now running gpu 1 for 8 hours.

lex312 · 2020-05-09T15:46:51Z

@borg323

No crash after 8 hours with gpu 1.

What does this mean, and what should I do next?

cn4750 · 2020-05-09T16:00:07Z

Can you leave the above command running, open a new command prompt, and run the same command again, but this time with gpu=0 changed to gpu=1? This way you can test running two at the same time on your two GPUs.
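
Editor's sketch: to get the second command without retyping it, the `gpu=N` entry inside `--backend-opts` can be swapped programmatically. The Python below is a minimal illustration, assuming the command is available as a single string in the form used in this thread; `with_gpu` is a hypothetical helper name.

```python
import re


def with_gpu(command, gpu):
    """Return the command with the gpu=N entry inside --backend-opts set to `gpu`."""
    new_command, count = re.subn(
        r"(--backend-opts=\S*?\bgpu=)\d+", rf"\g<1>{gpu}", command
    )
    if count == 0:
        raise ValueError("no gpu=N entry found in --backend-opts")
    return new_command


# Hypothetical usage with a shortened command line.
base = "lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32"
print(with_gpu(base, 1))
```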

Naphthalin · 2020-05-09T16:05:47Z

> No matter, the client is unlikely to be causing this, but the way lc0 is called may be loading the GPU too much. Try lowering the parallelism to 4 (or less) to see if this helps.

I didn't see you trying these suggestions yet -- can you test whether the power spikes are still bad enough with lower parallelism to crash your PC?

cn4750 · 2020-05-09T16:07:50Z

The only way I could see the client being the cause is if your networking driver crashes from upload/downloads and it takes the GPU with it.

borg323 · 2020-05-09T16:27:08Z

> The only way I could see the client being the cause is if your networking driver crashes from upload/downloads and it takes the GPU with it.

Or it may be some weird antivirus software reaction, having the same effect. The client doesn't do much more than downloading network files from the server, uploading results and running lc0.

lex312 · 2020-05-09T16:28:11Z

@cn4750

When I open cmd and use this:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)
Then the problem still happens.

When I open cmd and use this:
Z:
CD \LC0

and then that:

Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9

Then it looks like I have no problems.

But the first is client.exe and the second is lc0.exe.
And I have no problems when using lc0.exe to play games or for anything else through a GUI.

I also tested gpu 0 and gpu 1 at the same time with two cmd windows, and I have no problems when using lc0.exe.
But I still have the problem when using client.exe.

lex312 · 2020-05-09T16:32:56Z

@cn4750

Is there a way I can check whether the networking driver has crashed?
Is it possible to upload less often?
I think the occasional download should not be a problem, but it looks to me like the GPUs are producing material extremely fast, so the client uploads again and again.
Maybe that is taking the client or the GPUs down with it.

lex312 · 2020-05-09T16:35:35Z

@Naphthalin

Do you have a command line for me, showing how it should look when using parallelism 4 with this:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)

lex312 · 2020-05-09T16:40:19Z

@borg323

I'm using 360 Total Security as antivirus software.
But the software would have asked me for a decision if it had found a virus or anything else.

lex312 · 2020-05-09T17:05:36Z

And when I use two cmd windows:
client -run 3 -gpu 0 -report-gpu -report-host -user (name) -password (name)
client -run 3 -gpu 1 -report-gpu -report-host -user (name) -password (name)
Then the problem still happens.

lex312 · 2020-05-10T20:04:10Z

Z:
cd Z:\LC0
client -run 3 -gpu 0 -report-gpu -report-host -parallelism=4 -user (name) -password (name)

Z:
cd Z:\LC0
client -run 3 -gpu 1 -report-gpu -report-host -parallelism=4 -user (name) -password (name)

This works fine.
I have no problems and no black screen after 9 hours of running both gpus.
The only difference is that I use -parallelism=4 here.

Does anyone have any idea what exactly caused the bug?
Can it be solved somehow, or do I need to check parallelism from =5 to =31 too?

Naphthalin · 2020-05-10T20:19:36Z

> Does anyone have any idea what exactly caused the bug?

by mooskagh, first post:

> As mentioned in Discord, that's most likely either a PSU that is too weak for two GPUs, or poor cooling of the GPUs.
> Both are quite common in dual-GPU systems.

It's good to know that lower parallelism helps stabilize the power demand enough. We basically use parallelism to load the GPU more, but apparently that puts too much load variation on the PSU.

lex312 · 2020-05-13T20:35:58Z

@Naphthalin

Lower parallelism didn't help.
I used 4, 8, 16, 17, 18, 19, 20, 21, 22, 23, 24, 32.
I also repeated parallelism 4 and it crashed.
Sometimes it crashes after 30 minutes and sometimes after up to 13 hours, and it doesn't matter which parallelism I'm using.

The PSU isn't too weak: it's the best PSU money can buy, and it can easily be used with 4 GPUs.
There is also no poor cooling, because both GPUs are at only 50 degrees Celsius when I decrease the power limit. The GPUs can also run at 88 degrees Celsius without problems.

Any other ideas?

dshawul · 2020-05-14T04:14:30Z

PSUs do deteriorate, so I wouldn't be so confident about it no matter what. TensorFlow and AI apps put a "spiky" load on it, and the minute it exceeds a threshold, your CPU will shut down. I recently had a very bad experience where I could do many things just fine, but when trying to train a net it shut down within 30 minutes. The PSU had maybe 200 W of extra headroom, but that didn't help.
Your case may be different, but monitoring power usage right before it goes blank may give clues.
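
Editor's sketch of the monitoring idea above: poll `nvidia-smi` once per second and append power draw and temperature to a file, flushing after every sample so that the last readings before a black screen survive. This is a minimal Python illustration, assuming `nvidia-smi` is on PATH and reports numeric values for `power.draw` and `temperature.gpu` on these cards; the log file name, the interval and the helper names are hypothetical.

```python
import os
import subprocess
import time

QUERY = "timestamp,index,power.draw,temperature.gpu"


def parse_sample(line):
    """Parse one CSV line of nvidia-smi output into a dict."""
    timestamp, index, power, temp = [field.strip() for field in line.split(",")]
    return {"timestamp": timestamp, "gpu": int(index),
            "power_w": float(power), "temp_c": int(temp)}


def read_samples():
    """Query all GPUs once; raises if nvidia-smi is missing or fails."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]


def log_forever(path="gpu_power.log", interval_s=1.0):
    """Append one line per GPU per interval and force it to disk each time."""
    with open(path, "a") as log:
        while True:
            for s in read_samples():
                log.write(f"{s['timestamp']} gpu={s['gpu']} "
                          f"power={s['power_w']}W temp={s['temp_c']}C\n")
            log.flush()
            os.fsync(log.fileno())  # keep the last samples even after a hard reset
            time.sleep(interval_s)
```

A one-second poll will not catch millisecond-scale power spikes, so a clean log does not rule out the PSU; it mainly shows whether sustained draw or temperature was unusual just before the crash.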

Naphthalin · 2020-05-14T08:37:14Z

If you had crashes at 4, then higher values for parallelism are likely worse.

What I don't see from this thread: Did you try starting two separate clients for the two GPUs with --parallelism=4 and experience the same crashes? I don't know the technical details of the client, as it is always recommended to start one client per GPU, but it could theoretically be that the client isn't as sophisticated when distributing jobs between several GPUs.

Still, the cause of your crashes is 99% not software related; it comes from an apparently too unstable power demand of two GPUs, and the fact that your PSU is good doesn't necessarily mean that it is good enough for this extreme scenario.
