Client caused black screen BUG #118

Open
lex312 opened this issue May 6, 2020 · 31 comments

Comments

@lex312

lex312 commented May 6, 2020

When I start the client:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)

it works fine for some minutes, but then the client causes a black screen.
The machine is still running, but no games are played.
Nothing else is possible and I need to restart the PC.

I have exactly the same gpus.
And no problems with fritz gui or chessbase 15 gui.

When running the client I see that it is using only one GPU and not both. How can I fix this?
I can see it with MSI Afterburner and with GPU-Z.

Do I need parallelism?
Are there other things which I can also use with cmd?

Do we have something like logfile.txt when running the client?
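
Whether the client writes its own log file is not stated in this thread. One way to get a log regardless is to wrap the command and copy every output line to a file. The following is a minimal Python sketch; `run_with_log` is a hypothetical helper, not part of the client or lc0.

```python
import subprocess
import sys

def run_with_log(command, log_path, cwd=None):
    """Run `command`, echo its output, and append every line to `log_path`."""
    with open(log_path, "a", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            command,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            log.flush()  # keep the file current in case the machine hangs
        return proc.wait()
```

It could be called with the client arguments shown above as a list, for example `["client", "-run", "3", "-gpu", "0", "-report-gpu", "-report-host", ...]`, plus a log path such as `logfile.txt`.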

@mooskagh
Member

mooskagh commented May 6, 2020

As mentioned in Discord, that's most likely either a PSU that is too weak for two GPUs, or poor cooling of the GPUs.
Both are quite common in dual-GPU systems.

@lex312
Author

lex312 commented May 6, 2020

I have a Corsair AX1600i PSU.
And lots of high-end air coolers.
One GPU has only 74 degrees celsius.
The other one has 50.

The problem happens always, no matter if I use -gpu 0 -gpu 1 or only one gpu.

@borg323
Member

borg323 commented May 6, 2020

You didn't specify which gpu you have and what backend is being used. No matter, the client is unlikely to be causing this, but the way lc0 is called may be loading the gpu too much. Try lowering the parallelism to 4 (or less) to see if this helps.
Finally, note that the client only uses one gpu by default. Adding a second -gpu to the command line just overrides the first. If you want to run on both gpus, the most efficient way is to run a second client instance, with a different -gpu number.
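
A minimal Python sketch of the "one client instance per GPU" setup described above. It assumes `client` is found in the working directory and accepts the flags used earlier in this thread; `build_client_command` and `launch_clients` are hypothetical helper names, and the user name and password are placeholders.

```python
import subprocess

def build_client_command(gpu, user, password, run=3, client="client"):
    """Build the argument list for one client instance bound to a single GPU."""
    return [
        client,
        "-run", str(run),
        "-gpu", str(gpu),  # exactly one -gpu per instance
        "-report-gpu",
        "-report-host",
        "-user", user,
        "-password", password,
    ]

def launch_clients(gpus, user, password, cwd="."):
    """Start one client process per GPU and return the process handles."""
    return [
        subprocess.Popen(build_client_command(gpu, user, password), cwd=cwd)
        for gpu in gpus
    ]

# Example (not executed here):
# procs = launch_clients([0, 1], "name", "secret", cwd=r"Z:\LC0")
```

Each process gets its own `-gpu` value, so no instance relies on a repeated flag.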

@lex312
Author

lex312 commented May 6, 2020

I have 2x the same RTX 2080 Ti.
I'm running the client with this:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)
As you can see, I set no backend and no parallelism.

Okay, then I will use two clients instead of one. But I still have the same problem.
Also note that I think most people don't know that a second -gpu overrides the first when only one client is in use.
So this should also be fixed.

When using chess guis I use: (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
and roundrobin.

@borg323
Member

borg323 commented May 6, 2020

I asked about backend in case you were using the relatively new dx12, but I see you use cudnn-fp16 so this shouldn't be an issue.
Is this only happening on run 3?

@lex312
Author

lex312 commented May 6, 2020

No, this also happens with run 1 or 2.

@lex312
Author

lex312 commented May 8, 2020

I have the same problem when I use a power limit of 40%.

@borg323
Member

borg323 commented May 8, 2020

When you run the client, the output near the top contains the exact lc0 command line used. Can you try this on its own to confirm the client has nothing to do with this?
Example from an old log I had: /content/lc0/build/lc0 selfplay --backend-opts=backend=cudnn-fp16 --parallelism=32 --visits=10000 --cpuct=2.5 --cpuct-at-root=2.5 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.90 --temp-endgame=0.75 --temp-cutoff-move=16 --temp-visit-offset=-0.8 --fpu-strategy=absolute --fpu-value=-1.0 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000012 --policy-softmax-temp=1.2 --resign-wdlstyle=true --training=true --weights=client-cache/3eb9d62ecc6aa2a84b7cdb789c50702a02477cf969949cf7ed788b71a3ea9cfa
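
If the console has wrapped the client output, the command can be rejoined with a few lines of Python. This is only a sketch under stated assumptions: the command is printed in a block starting with `Args: [`, the console wraps at 80 columns, and a line shorter than that lost a trailing space when it was copied. `extract_lc0_command` is a hypothetical helper, not part of the client.

```python
import re

def extract_lc0_command(console_text, width=80):
    """Rejoin a console-wrapped 'Args: [...]' block into one command line.

    Returns None if no such block is found.
    """
    lines = console_text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("Args: [")), None
    )
    if start is None:
        return None
    joined = ""
    for line in lines[start:]:
        # Pad short lines: a wrap at a space leaves the space at the line end,
        # where copy and paste usually drops it.
        joined += line.ljust(width)
        if line.rstrip().endswith("]"):
            break
    match = re.search(r"Args: \[(.*)\]", joined)
    if match is None:
        return None
    # Collapse the padding back to single spaces.
    return " ".join(match.group(1).split())
```

The result is a single line that can be pasted into a command prompt opened in the client's directory.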

@lex312
Author

lex312 commented May 8, 2020

@borg323
What exactly do you want me to do?

On my machine it looks like this:
Z:\LC0>client -run 3 -gpu 0 -report-gpu -report-host -user (name) -password (name)
Lc0 client version 26
2020/05/08 15:00:09 lc0_main.go:956: serverParams: [--visits=10000 --cpuct=1.32
--cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --r
esign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60
--temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strate
gy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040
--policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alp
ha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=s
huffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slop
e=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0]
Args: [Z:\LC0/lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --paralle
lism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-par
ams=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --tem
p-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=red
uction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0
--minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=
true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pg
n=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-
left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --
moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b57
96723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9]
       _
|   _ | |
|_ |_ |_| v0.25.1+git.69105b4 built Apr 30 2020
id name Lc0 v0.25.1+git.69105b4
id author The LCZero Authors.
Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a58
2ebf028ece402eb6fe50c3a9
Creating backend [multiplexing]...
Creating backend [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 10.1.0
GPU: GeForce RTX 2080 Ti
GPU memory: 11 Gb
GPU clock frequency: 1545 MHz
GPU compute capability: 7.5
PGN: [FEN "bnrnkbqr/pppppppp/8/8/8/8/PPPPPPPP/BNRNKBQR w KQkq - 0 1"]

@borg323
Member

borg323 commented May 8, 2020

Then the command to run would be:
Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9

@lex312
Author

lex312 commented May 8, 2020

@borg323
I tried to run the command and got this:

Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 -
-parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-c
puct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0
.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-stra
tegy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-r
oot=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-w
dlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --ope
nings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2
--moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-facto
r=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fd
f4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
       _
|   _ | |
|_ |_ |_| v0.25.1+git.69105b4 built Apr 30 2020
id name Lc0 v0.25.1+git.69105b4
id author The LCZero Authors.
Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a58
2ebf028ece402eb6fe50c3a9
Unhandled exception: Cannot read weights from client-cache\fdf4c93b5796723fd1ec8
8b09dcc92474a727a582ebf028ece402eb6fe50c3a9

I also got a small Task Manager window with the message: lc0.exe doesn't work anymore.

@borg323
Member

borg323 commented May 8, 2020

Probably you are not running lc0 from the same directory the client (and lc0) are in. I assume this is Z:\LC0. There should be books and client-cache subdirectories, the first one containing 960fen.pgn and the second one containing fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9.
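
Relative paths such as `books/960fen.pgn` are resolved against the directory the command is started from, so a quick check of that directory can rule this cause in or out. A minimal Python sketch follows; `missing_relative_inputs` is a hypothetical helper name.

```python
from pathlib import Path

def missing_relative_inputs(workdir, relative_paths):
    """Return the relative paths that do not exist under `workdir`."""
    base = Path(workdir)
    # Accept both slash styles, since the lc0 command mixes them.
    return [
        p for p in relative_paths
        if not (base / p.replace("\\", "/")).exists()
    ]
```

Calling it with the working directory and the two relative paths from the command (the `--openings-pgn` file and the `--weights` file) returns an empty list when both are reachable from that directory.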

@lex312
Author

lex312 commented May 8, 2020

@borg323

I have inside Z:\LC0
lc0.exe and client.exe and the other basic lc0 things. Also the books and client-cache subdirectories are there. 960fen.pgn is inside books and inside client-cache I have the right fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9

I want to donate gpus to -run 3.
That's why I opened an empty cmd window and pasted in what you wrote to me before.

@borg323
Member

borg323 commented May 8, 2020

We appreciate it, but first we need to figure out what is causing the problem. Here is the procedure:
Open a cmd window and then type:

Z:
CD \LC0

This will take you to the LC0 directory. Then run the command I gave earlier. I expect it will have the same problem. We can then try to modify the command to see if we can isolate the issue.

@lex312
Author

lex312 commented May 8, 2020

@borg323

I have typed:
Z:
CD \LC0
and then the command to run, which I got from you.
I will tell you later when it crashes again.
This is how it looks now:

Z:\LC0>Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --paralle
lism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-par
ams=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --tem
p-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=red
uction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0
--minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=
true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pg
n=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-
left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --
moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b57
96723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
       _
|   _ | |
|_ |_ |_| v0.25.1+git.69105b4 built Apr 30 2020
id name Lc0 v0.25.1+git.69105b4
id author The LCZero Authors.
Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a58
2ebf028ece402eb6fe50c3a9
Creating backend [multiplexing]...
Creating backend [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 10.1.0
GPU: GeForce RTX 2080 Ti
GPU memory: 11 Gb
GPU clock frequency: 1545 MHz
GPU compute capability: 7.5
gameready trainingfile Z:\LC0/data-hyepracghopp/game_000029.gz gameid 29 play_st
art_ply 0 player1 white result blackwon moves b2b3 g7g5 b1c3 f7f5 g2g4 f5f4 e2e3
b8c6 d2d4 e8g6 f1e2 e7e6 d1b2 d7d5 e2d2 c6b4 c1a1 g6c2 a2a3 c2d1 d2d1 b4c6 g1g3
from_fen rnknbqrb/pppppppp/8/8/8/8/PPPPPPPP/RNKNBQRB w KQkq - 0 1
tournamentstatus P1: +0 -1 =0 LOS: 15.87% P1-W: +0 -1 =0 P1-B: +0 -0 =0 npm 600.
875000 nodes 14421 moves 24
gameready trainingfile Z:\LC0/data-hyepracghopp/game_000001.gz gameid 1 play_sta
rt_ply 0 player1 white result whitewon moves d2d3 c7c6 f1g3 f8e6 b2b4 d8c7 e2e4
a7a5 b4b5 g8f6 e4e5 f6d5 a2a4 e8g8 g1f3 d7d6 e5d6 c7d6 e1g1 b7b6 f1e1 a8b7 g3f5
b7c7 e1e6 d6h2 from_fen qrbbknnr/pppppppp/8/8/8/8/PPPPPPPP/QRBBKNNR w KQkq - 0 1

@lex312
Author

lex312 commented May 9, 2020

@borg323

No crash after 8 hours with gpu 0.
I'm now running gpu 1 for 8 hours.

@lex312
Author

lex312 commented May 9, 2020

@borg323

No crash after 8 hours with gpu 1.

What does this mean, and what should I do next?

@cn4750

cn4750 commented May 9, 2020

Can you leave the above command running, open a new command prompt, and run the same command again, but this time with gpu=0 changed to gpu=1? That way you can test running two at the same time on your two GPUs.
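
Because the command is long, editing it by hand is error-prone. A small Python sketch that swaps only the GPU index inside `--backend-opts` is shown below; `command_for_gpu` is a hypothetical helper, and it assumes the command contains exactly one `gpu=N` entry in that option.

```python
import re

def command_for_gpu(command, gpu):
    """Return `command` with the gpu=N entry in --backend-opts set to `gpu`."""
    new_command, count = re.subn(
        r"(--backend-opts=\S*?gpu=)\d+", rf"\g<1>{gpu}", command
    )
    if count != 1:
        raise ValueError("expected exactly one gpu=N entry in --backend-opts")
    return new_command
```

All other options are left untouched, so both instances run with identical search parameters.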

@Naphthalin

No matter, the client is unlikely to be causing this, but the way lc0 is called may be loading the gpu too much. Try lowering the parallelism to 4 (or less) to see if this helps.

I didn't see you trying these suggestions yet -- can you test whether the power spikes are still bad enough with lower parallelism to crash your PC?

@cn4750

cn4750 commented May 9, 2020

The only way I could see the client being the cause is if your networking driver crashes from upload/downloads and it takes the GPU with it.

@borg323
Member

borg323 commented May 9, 2020

The only way I could see the client being the cause is if your networking driver crashes from upload/downloads and it takes the GPU with it.

Or it may be some weird antivirus software reaction, having the same effect. The client doesn't do much more than downloading network files from the server, uploading results and running lc0.

@lex312
Author

lex312 commented May 9, 2020

@cn4750

When I open cmd and use this:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)
Then the problem still happens.

When I open cmd and use this:
Z:
CD \LC0

and then that:

Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9

Then it looks like I have no problems.

But the first is client.exe and the second is lc0.exe.
And I have no problems when using lc0.exe to play games or something using a gui.

I also tested gpu 0 and gpu 1 at the same time in two cmd windows, and I have no problems when using lc0.exe.
But I still have the problem when using client.exe.

@lex312
Author

lex312 commented May 9, 2020

@cn4750

Is there a way I can check if the networking driver has crashed?
Is it possible to upload less often?
I think the occasional download should not be a problem, but it looks to me like the GPUs are producing material extremely fast, so the client uploads again and again.
Maybe that is taking the client or the GPUs down with it.

@lex312
Author

lex312 commented May 9, 2020

@Naphthalin

Do you have a command line for me that shows how it should look when using parallelism 4 together with this:
client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)

@lex312
Author

lex312 commented May 9, 2020

@borg323

I'm using 360 Total Security as antivirus software.
But the software would have asked me for a decision if it had found a virus or something else.

@lex312
Author

lex312 commented May 9, 2020

And when I use two cmds:
client -run 3 -gpu 0 -report-gpu -report-host -user (name) -password (name)
client -run 3 -gpu 1 -report-gpu -report-host -user (name) -password (name)
Then the problem still happens.

@lex312
Author

lex312 commented May 10, 2020

Z:
cd Z:\LC0
client -run 3 -gpu 0 -report-gpu -report-host -parallelism=4 -user (name) -password (name)

Z:
cd Z:\LC0
client -run 3 -gpu 1 -report-gpu -report-host -parallelism=4 -user (name) -password (name)

This works fine.
I have no problems and no black screen after 9 hours of running both gpus.
The only difference is that I use here -parallelism=4.

Does anyone have any idea what exactly caused the bug?
Can it be solved somehow, or do I need to check parallelism from =5 to =31 too?

@Naphthalin

Does anyone have any idea what exactly caused the bug?

by mooskagh, first post:

As mentioned in Discord, that's most likely either a PSU that is too weak for two GPUs, or poor cooling of the GPUs.
Both are quite common in dual-GPU systems.

It's good to know that lower parallelism helps with stabilizing the power demand enough. We basically use parallelism to load the GPU more, but apparently that puts too much load variation on the PSU.

@lex312
Author

lex312 commented May 13, 2020

@Naphthalin

Lower parallelism didn't help.
I used 4, 8, 16, 17, 18, 19, 20, 21, 22, 23, 24, 32.
I also repeated parallelism 4 and it crashed.
Sometimes it crashes after 30 minutes and sometimes after up to 13 hours, and it doesn't matter what parallelism I'm using.

The PSU isn't too weak, because it's the best PSU someone can buy on the market for a lot of money, and it can easily be used with 4 GPUs.
The cooling isn't poor either, because both GPUs are at only 50 degrees Celsius when I decrease the power limit. The GPUs can also run at 88 degrees Celsius without problems.

Any other ideas?

@dshawul

dshawul commented May 14, 2020

PSUs do deteriorate, so I wouldn't be so confident about it no matter what. TensorFlow and AI apps put a "spiky" load on it, and the minute it exceeds a threshold, your CPU will shut down. I recently had a very bad experience where I could do many things just fine, but when trying to train a net it shut down within 30 minutes. The PSU had maybe an extra 200 watts of capacity, but that didn't help.
Your case may be different, but monitoring power usage right before it goes blank may give clues.
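
One way to capture power usage up to the moment the screen goes blank is to poll `nvidia-smi` and write each sample straight to disk. The following Python sketch assumes `nvidia-smi` is on the PATH and that the driver reports numeric values for every queried field; `parse_sample` and `log_gpu_power` are hypothetical helper names. A one-second poll cannot show very short spikes, so this only records the slower trend.

```python
import os
import subprocess
import time

QUERY = "timestamp,index,power.draw,temperature.gpu,utilization.gpu"

def parse_sample(line):
    """Parse one CSV line produced by the nvidia-smi query below."""
    timestamp, index, power, temp, util = [f.strip() for f in line.split(",")]
    return {
        "timestamp": timestamp,
        "gpu": int(index),
        "power_w": float(power),
        "temp_c": float(temp),
        "util_pct": float(util),
    }

def log_gpu_power(path, interval_s=1.0, samples=None):
    """Append GPU samples to `path`; runs until interrupted if samples is None."""
    taken = 0
    with open(path, "a", encoding="utf-8") as log:
        while samples is None or taken < samples:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            ).stdout
            log.write(out)
            # Flush and sync so the last readings survive a hard crash.
            log.flush()
            os.fsync(log.fileno())
            taken += 1
            time.sleep(interval_s)
```

After a crash, the last lines of the log can be read back with `parse_sample` to see what each GPU was drawing shortly before the machine went down.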

@Naphthalin

If you had crashes at 4, then higher values for parallelism are likely worse.

What I don't see from this thread: Did you try starting two separate clients for the two GPUs with --parallelism=4 and experience the same crashes? I don't know the technical details of the client, as it is always recommended to start one client per GPU, but it could theoretically be that the client isn't as sophisticated when distributing jobs between several GPUs.

Still, the cause of your crashes is 99% not software related; it comes from an apparently too unstable power demand of the two GPUs, and the fact that your PSU is good doesn't necessarily mean that it is good enough for this extreme scenario.
