Unexpected exit from client #39

Open
Ilis opened this issue Aug 7, 2018 · 9 comments

Comments

Ilis commented Aug 7, 2018

The client exits unexpectedly when a match times out.

2018/08/07 08:55:21 lc0_main.go:403: Bestmove has timed out, aborting match
2018/08/07 08:55:21 lc0_main.go:336: Waiting for candidate to exit.
2018/08/07 08:55:21 lc0_main.go:326: Waiting for baseline to exit.
2018/08/07 08:55:21 lc0_main.go:552: playMatch: timeout

Suggestion: the client should try to reconnect or restart after aborting a match.
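
For readers unfamiliar with the pattern behind the "Bestmove has timed out" line, here is a minimal Go sketch of waiting for an engine move with a deadline. It is not taken from lc0_main.go; the names `waitBestmove` and `errBestmoveTimeout` and the channel-based design are assumptions made for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBestmoveTimeout is a hypothetical sentinel error for this sketch.
var errBestmoveTimeout = errors.New("bestmove has timed out")

// waitBestmove waits for a move on the channel and gives up after timeout.
// In a real client the channel would be fed by a goroutine that reads the
// engine's output; that part is omitted here.
func waitBestmove(moves <-chan string, timeout time.Duration) (string, error) {
	select {
	case m := <-moves:
		return m, nil
	case <-time.After(timeout):
		return "", errBestmoveTimeout
	}
}

func main() {
	moves := make(chan string, 1)

	// A move is already available, so this call returns it immediately.
	moves <- "e2e4"
	m, err := waitBestmove(moves, time.Second)
	fmt.Println(m, err)

	// Nothing arrives this time, so the call ends with the timeout error.
	_, err = waitBestmove(moves, 10*time.Millisecond)
	fmt.Println(err)
}
```

Returning a sentinel error lets the caller tell a timeout apart from other failures and decide whether to abort, retry, or exit.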

Member

borg323 commented Aug 7, 2018

Can you give some more details, like client version, command line and a few more lines of text output?

Author

Ilis commented Aug 7, 2018

Well, it is the latest client for Windows. I ran it in a cmd window without any parameters.

The other lines are the usual ones: PGN, tournament status and so on. I'll provide more lines the next time it exits. It has exited about half a dozen times so far.

The first lines after startup are:

Args: [C:\Progs\leela-chess/lc0 selfplay --visits=800 --cpuct=1.2 --resign-percentage=0 --resign-playthrough=100 --training=true --weights=networks\f10efec80a2c3089f00cbcb2a2e78a8da8193574f326c38325415a6e3204b02f]
       _
|   _ | |
|_ |_ |_| v0.16.0 built Jul 20 2018
id name The Lc0 chess engine. v0.16.0
id author The LCZero Authors.
Creating backend [multiplexing]...
Creating backend [blas]...
BLAS vendor: OpenBlas.
OpenBlas [DYNAMIC_ARCH NO_AFFINITY Sandybridge].
OpenBlas found 4 Sandybridge core(s).
OpenBLAS using 1 core(s) for this backend.
BLAS max batch size is 256.
resign_report fp_threshold 0.523432
PGN:
1.f4 Nf6 2.Nf3 c5 3.c4 Nc6 4.e3 e6 5.Nc3 d5 6.Rb1 d4 7.exd4 cxd4 8.Na4 Bd6 9.d3 a6 10.c5 Be7 11.g3 Nd5 12.Nh4 Bxh4 13.gxh4 Qxh4+ 14.Kd2 Qxf4+ 15.Ke1 Qh4+ 16.Ke2 e5 17.Qe1 Bg4+ 18.Kd2 Qh6+ 19.Kc2 Qe6 20.Qc3 dxc3 21.bxc3 Bf3 22.Rg1 e4 23.Rg3 exd3+ 24.Bxd3 Rd8 25.Rxf3 Ne5 26.Rh3 Nxd3 27.Be3 Nxe3+ 28.Kd2 Qxa2+ 29.Kxe3 Qf2+ 30.Ke4 Qf4#  0-1
tournamentstatus win 0 0 lose 1 0 draw 0 0
Uploading game: 1
2018/08/07 04:26:53 lc0_main.go:468: trainDir=C:\Progs\leela-chess/data-fztxydfmlkgs
2018/08/07 04:26:54 lc0_main.go:144: Completed 9 games in 15h42m22.5560492s time
2018/08/07 05:23:08 lc0_main.go:450: Received message to end training, killing lc0
2018/08/07 05:23:08 lc0_main.go:477: Waiting for lc0 to stop
lc0 exited with: exit status 1
2018/08/07 05:23:08 lc0_main.go:482: lc0 stopped
2018/08/07 05:23:08 lc0_main.go:484: Waiting for uploads to complete
2018/08/07 05:23:08 lc0_main.go:435: Removing traindir: C:\Progs\leela-chess/data-fztxydfmlkgs
2018/08/07 05:23:08 lc0_main.go:537: serverParams: [--tempdecay-moves=20 --temperature=1 --cpuct=1.2 --fpu-reduction=0.0 --policy-softmax-temp=1.0]
2018/08/07 05:23:08 lc0_main.go:540: Starting match
Downloading network...
2018/08/07 05:25:00 lc0_main.go:549: Starting match
2018/08/07 05:25:00 lc0_main.go:322: launching 1
lc0 is never quiet.
Args: [C:\Progs\leela-chess/lc0 uci --backend=multiplexing --tempdecay-moves=20 --temperature=1 --cpuct=1.2 --fpu-reduction=0.0 --policy-softmax-temp=1.0 --weights=networks\f10efec80a2c3089f00cbcb2a2e78a8da8193574f326c38325415a6e3204b02f]
2018/08/07 05:25:00 lc0_main.go:332: launching 2
lc0 is never quiet.
Args: [C:\Progs\leela-chess/lc0 uci --backend=multiplexing --tempdecay-moves=20 --temperature=1 --cpuct=1.2 --fpu-reduction=0.0 --policy-softmax-temp=1.0 --weights=networks\68e4bd959131674d453bbde21c9e091e02d58beb24008077b3452652d1ee86da]
2018/08/07 05:25:00 lc0_main.go:348: writing uci

Member

borg323 commented Aug 7, 2018

You are using the blas backend, which is slow for training. By default it uses a single core, but the client tries to run 8 matches in parallel, which slows everything down, so this may be the root cause of your issues.

If you only want to use 1 core, add the --parallelism=1 option. If you want to use more cores, use the --parallelism=4 --backend-opts="blas(threads=4)" command line, changing both 4s to the number of cores you want to use.
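
As a small illustration of "changing both 4s", the following Go sketch builds the two options from a single core count so that they cannot drift apart. The helper name `blasArgs` is hypothetical; only the option spellings come from the comment above, and the quotes around `blas(threads=4)` are shell quoting, so they do not appear in the argument strings.

```go
package main

import (
	"fmt"
	"strings"
)

// blasArgs returns the options suggested above for the given core count.
// Hypothetical helper for illustration; it is not part of the client.
func blasArgs(cores int) []string {
	if cores <= 1 {
		// A single core only needs the parallelism limit.
		return []string{"--parallelism=1"}
	}
	return []string{
		fmt.Sprintf("--parallelism=%d", cores),
		fmt.Sprintf("--backend-opts=blas(threads=%d)", cores),
	}
}

func main() {
	// Print the options for a four-core machine as one command-line fragment.
	fmt.Println(strings.Join(blasArgs(4), " "))
}
```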

Since this isn't yet widely used, do let us know if it helped.

Author

Ilis commented Aug 8, 2018

OK, I added the --parallelism=1 parameter.

But the client exited after about 20 hours with the same messages.

https://gist.github.com/Ilis/cbc5dc8138ed64d1fdef4077c3a071a0

Author

Ilis commented Aug 13, 2018

Today it was extremely fast.

https://gist.github.com/Ilis/8ddab5865cf535d6bb1aed5401e400f1

Member

borg323 commented Aug 27, 2018

If you are still having trouble, a patch just committed may help. You can find a binary here: https://ci.appveyor.com/api/buildjobs/kapjgioegjyawwwr/artifacts/client.exe

Author

Ilis commented Aug 29, 2018

Well, now it doesn't exit on the error. But I'm not sure whether it is actually doing any work. I can't find any lines about games being played and uploaded to the server.

Creating backend [multiplexing]...
Creating backend [blas]...
BLAS, maximum batch size set to 256.
BLAS vendor: OpenBlas.
OpenBlas [DYNAMIC_ARCH NO_AFFINITY Sandybridge].
OpenBlas found 4 Sandybridge core(s).
OpenBLAS using 1 core(s) for this backend.
BLAS max batch size is 256.
2018/08/29 09:54:49 lc0_main.go:406: Bestmove has timed out, aborting match
2018/08/29 09:54:49 lc0_main.go:339: Waiting for candidate to exit.
2018/08/29 09:54:49 lc0_main.go:329: Waiting for baseline to exit.
2018/08/29 09:54:50 lc0_main.go:618: playMatch: timeout
2018/08/29 09:54:50 lc0_main.go:727: timeout
2018/08/29 09:54:50 lc0_main.go:728: Sleeping for 30 seconds...
2018/08/29 09:55:20 lc0_main.go:603: serverParams: [--visits=800 --cpuct=2.4 --resign-percentage=4 --resign-playthrough=10]
Removing 7d49e70866d71808c3f329dcd110d56ce1bb83a0156fcbed0beea8c1286a94ae
lc0 is never quiet.
Args: [C:\Progs\leela-chess-blas/lc0 selfplay --parallelism=1 --visits=800 --cpuct=2.4 --resign-percentage=4 --resign-playthrough=10 --training=true --weights=networks\22d918f8f1872177bc7b3a98eb24f1968393d93349d45c8565ffe9daa127f284]
       _
|   _ | |
|_ |_ |_| v0.17.0 built Aug 27 2018
id name The Lc0 chess engine. v0.17.0
id author The LCZero Authors.
Creating backend [multiplexing]...
Creating backend [blas]...
BLAS, maximum batch size set to 256.
BLAS vendor: OpenBlas.
OpenBlas [DYNAMIC_ARCH NO_AFFINITY Sandybridge].
OpenBlas found 4 Sandybridge core(s).
OpenBLAS using 1 core(s) for this backend.
BLAS max batch size is 256.
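
The "timeout" and "Sleeping for 30 seconds..." lines in the log above suggest that the patched client now pauses and starts over instead of exiting. A rough Go sketch of that kind of retry loop is shown here; it is not the actual patch, and `runWithRetry`, `errTimeout`, and the attempt limit are assumptions introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is a hypothetical sentinel error standing in for a match timeout.
var errTimeout = errors.New("timeout")

// runWithRetry calls work until it succeeds, fails with a non-timeout error,
// or maxTries attempts (at least 1) have been made. After a timeout it sleeps
// for pause before the next attempt. It returns the number of attempts made
// and the last error.
func runWithRetry(work func() error, pause time.Duration, maxTries int) (int, error) {
	var err error
	for try := 1; try <= maxTries; try++ {
		err = work()
		if err == nil {
			return try, nil
		}
		if !errors.Is(err, errTimeout) {
			// Anything other than a timeout is reported to the caller at once.
			return try, err
		}
		if try < maxTries {
			fmt.Printf("timeout, sleeping for %v...\n", pause)
			time.Sleep(pause)
		}
	}
	return maxTries, err
}

func main() {
	calls := 0
	// Simulated task that times out twice and then succeeds.
	work := func() error {
		calls++
		if calls < 3 {
			return errTimeout
		}
		return nil
	}
	tries, err := runWithRetry(work, 10*time.Millisecond, 5)
	fmt.Println(tries, err)
}
```

Retrying only on the timeout error keeps other failures visible, so a broken setup does not loop silently.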

Author

Ilis commented Aug 29, 2018

Oh, it logged some more lines while I was writing the message above.

PGN:
1.a4 e5 2.e3 Nf6 3.d4 Nc6 4.a5 d5 5.Bb5 exd4 6.exd4 a6 7.Bxc6+ bxc6 8.Nf3 Bd6 9.O-O O-O 10.Bg5 Ra7 11.Nc3 Rb7 12.b3 h6 13.Bh4 Re8 14.Qd3 Rb8 15.Qf5 Bxf5  *
tournamentstatus win 0 0 lose 1 0 draw 0 0
Uploading game: 1
2018/08/29 10:09:55 lc0_main.go:471: trainDir=C:\Progs\leela-chess-blas/data-gkzckxkfqhyk
2018/08/29 10:10:06 lc0_main.go:147: Completed 6 games in 16h17m11.0915066s time

Member

borg323 commented Aug 29, 2018

Unfortunately, it seems your CPU is not fast enough to finish more games in time. You can use --parallelism=1 --backend-opts=blas_cores=4 to approximately double the speed by using all four cores. However, you may still get occasional timeouts in long games.
