lorserker/ben

Creating bidding matches between different versions of BEN

Closed this issue · 16 comments

Following the instructions from https://github.com/lorserker/ben/tree/main/scripts/match/bidding, I am trying to create a match between two different versions of BEN.

Following the instructions, it is not clear that this should be executed in the Anaconda environment.

cat is not available on Windows, so these commands from the instructions do not work as written:

cat auctions1.json | python lead.py > leads1.json
cat auctions2.json | python lead.py > leads2.json

But ChatGPT gave me this fine answer
image

So just replace cat with type and you are good to go:

type auctions1.json | python lead.py > leads1.json
type auctions2.json | python lead.py > leads2.json

Then, to sum up the Imps, there is this command:

cat compare.json | jq .imp | awk '{s+=$1} END {print s}'

On Windows awk does not exist, and jq was not in the install instructions.

I ended up using

conda install -c conda-forge jq

to install jq, and then PowerShell (in the Anaconda prompt) to sum the Imps like this:

powershell -Command "Get-Content compare.json | jq .imp | Measure-Object -Sum | Select-Object -ExpandProperty Sum"
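If you would rather avoid installing jq at all, a small Python sketch can do the same summing (this assumes compare.json contains one JSON object per line with an imp field, which is what the jq pipeline above suggests):

import json

total = 0
with open("compare.json") as f:
    for line in f:
        line = line.strip()
        if line:
            total += json.loads(line)["imp"]
print(total)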

I have now played the first match between two versions of BEN.

I created 100 boards and ran the test with the same configuration, first with --search and then without --search.

The result for the 100 boards was that BEN with --search scored 109 Imps, and lost 85 Imps, so it was 0.24 Imp / board better.

BUT both NS and EW are using --search, so in principle EW lost 0.24 Imp/board and should not use the search parameter.

So the right way to find the difference might be to let NS at table 1 and EW at table 2 use --search, and switch at the other table, just like in a real bridge match.

I now have some preliminary results

I have used 4 different versions:
BEN1: Only using the neural network
BEN2: Using the neural network, and if multiple bids score above 0.05 from the neural network, then follow the bidding to the end
BEN3: Using the neural network, and if multiple bids score above 0.1 from the neural network, then follow the bidding to the end
BEN4: Using the neural network, and if multiple bids score above 0.2 from the neural network, then follow the bidding to the end

BEN3 is the default when running the gameserver

Results

BEN3-BEN1 155 - 96 (+59)
BEN2-BEN1 157 - 109 (+48)
BEN4-BEN1 103 - 61 (+42)

So adding the follow-the-bidding search is worth many Imps.

Now normally an average of 2 Imps a board per table is good bridge, but here we have not included any play of the cards, so I think the numbers are pretty high.

But it is interesting where the Imps came from, and this was the biggest swing of the match
image
where BEN1 bid a terrible 6C, and the other bots managed to bid 6H, swinging 17 Imps.

The problem on this board is North's rebid over 3H, and the neural network returned
image
So selecting the right bid here calls for following the bidding, as the uncertainty is too high for the network.

I tested this hand using GIB, and it bid 2D after 2C, a bid that BEN did not consider

Anyway, following the bidding resulted in
image
So here 3S was selected.

Now instead of letting BEN2, BEN3 and BEN4 meet each other, I realized that I could use BEN1 as a baseline and compare how the 3 other BENs scored against BEN1.

When I started comparing the results I realized that the forward simulation of the bidding was using random numbers without a seed, so the runs were not comparable - that might explain the sometimes weird slam bidding.
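For anyone wanting to make their own rollouts reproducible, this is the general idea (a sketch only; which random modules BEN actually uses may differ):

import random
import numpy as np

SEED = 42  # any fixed value

# Seed every source of randomness before sampling hands / rolling out auctions,
# so two runs on the same boards produce identical results.
random.seed(SEED)
np.random.seed(SEED)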

Story to be continued

@ThorvaldAagaard nice! definitely BEN2-4 are all better than the baseline. which one of them is best is probably hard to tell as there haven't been enough boards played.

although it is clearly good to do search, there is sometimes a problem when doing too much search, as you noted the other day:

  1. the NN strongly suggests a bid, but we search just because there is another candidate and the other candidate might be weird.
  2. after the search is done, two candidates differ in EV by just a tiny bit (this is very prone to randomness) and we choose based only on EV even if the NN clearly preferred one of the candidates

Currently 100 boards are fine to get some indication, then I will later run it on more boards to confirm my observations.

I found some interesting observations that could give some ideas about how to improve BEN.

Introducing "search" at different confidence levels gave this result:

0.05 changed the contract in 53 cases
0.10 changed the contract in 45 cases
0.20 changed the contract in 31 cases

In an ideal world, the neural network would suggest a bid on every hand with at least 50% confidence, so this is telling us that for these hands there is not enough training data.
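One way to quantify that is to count, per threshold, how often more than one candidate bid clears the bar, i.e. how often search kicks in at all. A small sketch with made-up data (not BEN's actual API):

def boards_needing_search(nn_scores_per_board, threshold):
    # nn_scores_per_board: per board, the nn's probabilities for the candidate bids
    return sum(
        1
        for scores in nn_scores_per_board
        if sum(score > threshold for score in scores) > 1
    )

# Example: a lower threshold triggers search on more boards.
boards = [[0.90, 0.07, 0.03], [0.55, 0.40, 0.05], [0.48, 0.30, 0.22]]
for t in (0.05, 0.1, 0.2):
    print(t, boards_needing_search(boards, t))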

BEN (0.1) beat BEN (0.2) by 23 Imp with 12 different boards
BEN (0.05) beat BEN (0.1) by 16 Imp with 13 different boards
BEN (0.05) beat BEN (0.2) by 39 Imp with 23 different boards
(All baselined against BEN with only nn)

Updated scores - now reproducible with a seed for random

BEN (0.1) -BEN 152 - 115 (+37)
BEN (0.2) -BEN 106 - 55 (+51)
BEN (0.05) - BEN 189 - 155 (+34)

So BEN(0.05) is creating a lot more swingy bridge, and even though it "wins" all the matches I am not sure it is the way to go, as we are "searching" more bids (there are more situations where the nn returns 2 bids to consider).

Well again the gain for BEN(0.05) was due to two slams.
This was the first:
image
A slam that is only good because it happens to make.

Funnily enough, if we switch seats so BEN(0.05) is EW, West will get into trouble
image
Going 1100, and thus losing the board.

The other was more decent
image

But when recreating the bidding it settled in 4S, and it showed the problem with using "search" alone, as this was the input to the bot's last bid

image

The expected scores for the first four bids are almost the same, and here 4S was just a little higher than the rest, but this is a fine example of why it should not just be the expected score that is used, but probably a combination of the value from the nn and the expected score.

@ThorvaldAagaard very interesting!

i like this last example. it really shows what the problem is.

if we figure out the solution, we can create a lot more data for bidding by letting ben bid hands out and do search (in the correct way)

if particular sequences are problematic, then we can put ben in those situations, sample the hands and bid them out

Yes, I am thinking about rounding Expected_score to the nearest 10, and then if two (or more) bids have the same value, selecting the nn-bid with the highest value.
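Something like this (a sketch with made-up names, not BEN's actual code):

def select_bid(candidates):
    # candidates: list of (bid, expected_score, nn_score) tuples.
    # Round the simulated expected score to the nearest 10, then let the
    # neural network's own score break ties between bids in the same bucket.
    return max(candidates, key=lambda c: (round(c[1] / 10) * 10, c[2]))[0]

# Example: 4S and 4H land in the same bucket, so the nn score decides.
print(select_bid([("4S", 423.8, 0.12), ("4H", 421.3, 0.61), ("5H", 380.0, 0.20)]))
# prints 4H, even though 4S has the marginally higher expected score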

I also have an idea about making the bid, then asking for sample hands for each candidate bid, and then comparing them to the actual hand. Then commit to the bid where the actual hand best matches the samples. But I think that is tricky to implement; see the rough sketch below.
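Very roughly it could look like this (everything here is hypothetical, including the crude way a hand is compared to a sample):

def hand_features(hand):
    # hand: dict like {"S": "AKQ2", "H": "54", "D": "KT93", "C": "862"}
    hcp = sum({"A": 4, "K": 3, "Q": 2, "J": 1}.get(card, 0)
              for suit in hand.values() for card in suit)
    return [len(hand[s]) for s in "SHDC"] + [hcp]

def distance(a, b):
    # crude distance: suit lengths plus high-card points
    return sum(abs(x - y) for x, y in zip(hand_features(a), hand_features(b)))

def best_matching_bid(actual_hand, samples_per_bid):
    # samples_per_bid: {candidate bid: [sample hands that bid would show]}
    # Commit to the bid whose sampled hands look most like the hand actually held.
    def avg_dist(bid):
        samples = samples_per_bid[bid]
        return sum(distance(actual_hand, s) for s in samples) / len(samples)
    return min(samples_per_bid, key=avg_dist)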

But generally I think the slam bidding is difficult because we have a lot less training data for those situations.

Knowing GIB's system (and thus the training data), it is a good example of where the neural network gets confused.

On the South hand you would never bid 3D opposite a passed partner, but opposite a non-passed partner 3D (showing shortage) is the right bid. So the nn should not assign an insta_score to 3D in this situation.

I have now fixed the randomness of the program, so a --search should give the same result no matter the threshold for selecting a bid from the neural network. I will return with some results later.

I have taken Thorvald's BEN and, using my bidding and info files, I am hanging on to WBridge5 at about a 1 imp a board loss.
My files are trained on 2.5 million hands from Bidding Analyzer using mostly GIB conventions.
4.4 million iterations for bidding and 1.5 million for info.
Just need to solve the sample issue.

Have you made any statistics about where BEN is losing Imps?

You have mentioned earlier:

I can beat most 1 million iteration bens but no chance against even level 3 Wbridge5 where we lose by about 1.5 imps a board.

So did the change in filtering samples based on possible lead really gain 0.5 Imp on average?

Is WBridge5 playing 2/1 GF?

Bidding Analyzer is better than GIB but still there are many flaws.

I created a difficult set of hands and compared BEN without search to a version doing search, and the difference was almost 4.5 Imp/board, so searching is important.

Is it possible for you to send me the boards (and results)?

I will gather up the pbn files of my last 5 or 6 matches, clean them up, and then send them or post them ... Have to take the dots out of the Bridge Monitor records first with a macro. Current match has Thorvald's BEN with my bidding/binfo 165 imps behind on 187 boards ...
thor4480e v WBridge5.txt
(https://github.com/lorserker/ben/files/12242022/Match.sheet.Random-.thor4480e.v.WBridge5.txt)
report.txt

The pbn would not work, but the above is a txt file in pbn format ... the last match, with boards rotated 90 degrees for the replay.

Ok, I have had a look at the PBN file - annoying that it is rotated, but still readable.

The first interesting board is number 5, where WBridge5 plays 4SX, which is down one, but BEN lets it make, losing 10 instead of winning 6 IMP.

Looking deeper at this board, the problem is that BEN trusts the WBridge5 bidding too much, as this position is reached
image

and BEN is placing the CQ with declarer, so the contract will always make, and it plays the CK (in reality a 50-50 play), giving away the contract.

I think you should try to lower the trust in the bidding
image

Changing bid_accept_threshold to 0.05 gave me this
image
So now playing low was clearly better
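My understanding of what that threshold does, sketched below with hypothetical code (not BEN's implementation): a sampled deal is only kept if every bid in the observed auction looks plausible enough for the sampled hands, so lowering the threshold keeps samples where an opponent's bid was a stretch, i.e. less trust in their bidding.

def accept_sample(auction, sampled_hands, bid_probability, threshold):
    # bid_probability(hand, auction_so_far, bid): probability that the player
    # holding `hand` would make `bid` at that point (hypothetical callable).
    for i, bid in enumerate(auction):
        hand = sampled_hands[i % 4]  # hand of the player who made this bid
        if bid_probability(hand, auction[:i], bid) < threshold:
            return False
    return True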

At board 25 the training data is bad.

Bidding it with the default BEN, 3N is the contract.

Bidding Analyzer will in fact reach 7D
image

The implementation of Soloway's jump shift should find this hand too weak.

I was trying to run matches with NS and EW at tables one and two respectively having no search and the others having search, but I couldn't work it out at the time. I'll have another look. The main reason I wanted to be able to do this was so I could tune the parameter which decides how many bids to pull from the NN, i.e. how bad the threshold is. I think this is a very important number. I have reduced it significantly from the original in the best results.

I'll have another look at the script file and see if I was just stupid that I couldn't work it out, and then I'll run some matches when I get a chance. I found I could do 100,000 in a day if I remember right. That should be enough for anything, well for now anyway.

If you look in scripts/match/bidding there are scripts for setting up matches between bots playing the same system, with different search criteria (0.1, which is the default, plus 0.05 and 0.2).

In generateall.cmd I have

call match 0.05
call match 0.1
call match 0.2

call vs.cmd 0.2 0.05
call vs.cmd 0.1 0.05
call vs.cmd 0.1 0.2

Have a look at it, and please return with questions

Closing this as matches are playing fine