
WDL stuff

Moved here from the blog discussion thread on WDL and the sharpness concept:
lichess.org/forum/community-blog-discussions/ublog-spKmwjw5?page=4#35

@GnocchiPup said in #35:
>However, WDL as done by Lc0 is different as these are playouts from any given position.

Within the self-play RL batches during training, no? Which might mean more exploration in the first batches and, in my possibly wrong understanding, more of the "any" notion in early positions than in late positions (as there might be more early game terminations in the more exploratory batches).

But I do agree that high-level humans might not be very exploratory (and SF might not be either).

However, not knowing what we know by other means, the mechanics after the WDL conversion might be the same.

But you make a point; I only added a nuance, while asking whether that nuance makes sense to you.
GnocchiPup replying to the above in the original blog post thread:
lichess.org/forum/community-blog-discussions/ublog-spKmwjw5?page=4#39
"@dboing"
Take this with a grain of salt.

But here's how I understand how A0 and Lc0 implement search.

Self play to generate the NN. This provides the engine with the bias on which positions look good, and which to search first.

During the actual game, self-play Monte Carlo search. This is where its WDL comes from, not an a priori. In theory an SF 0.0 would produce a WDL of 0 100 0, but the same position could look very different for Lc0. Could be 10 80 10, could be 30 60 10. In both cases best play is a draw, but the second case is harder for Black. I have a position in mind to test this, might do a post in the future.

This is how I understand how they implement it.
@GnocchiPup said in #39:
>dboing
>
>Take this with a grain of salt.
>

Exchanging grains of salt, then; mine also depends on second-hand (non-dev) understanding. I think the probability setup you have during play is compatible with mine. There is indeed a tree search based on some probability model using state and action components (evaluation and policy). But the documentation I had read was not as clearly put, or not clear at all to me, about the difference between training and playing. The probability model with WDL is intrinsic to the NN training as well. The reward objective function being optimized is based on some "tournament"-like 1D function of the three outcomes. The probabilities are dependent variables at the statistical level (avoiding math here, but frankly the documentation also avoids that level of math).
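To make that "tournament"-like 1D function of the three outcomes concrete, here is a minimal Python sketch (my own illustration, not Lc0 code) that collapses a WDL vector into the usual expected score, using the example numbers from #39:

```python
# Collapse a WDL probability vector into a single expected score
# (win = 1, draw = 0.5, loss = 0). Purely illustrative, not Lc0 code.

def expected_score(w: float, d: float, l: float) -> float:
    """Expected game points for the side to move, from WDL percentages."""
    total = w + d + l
    return (1.0 * w + 0.5 * d + 0.0 * l) / total

print(expected_score(10, 80, 10))  # 0.50
print(expected_score(30, 60, 10))  # 0.60 - "best play is a draw" in both, but a harder game for Black
```

The point being that the 1D score is a dependent summary of the three probabilities; the draw mass that distinguishes the two evaluations is only visible in the full WDL vector.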

And yes, the initial self-play batch has a uniform policy (action) prior over all states ("positions"), but it is trained solely from the initial standard position for each game "sample" of the chess world (however, there have been relatively recent ensemble-bias learning experiments published whose details I still need to read up on; bad English, and just FYI, not a point here).
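For what "uniform policy prior" means at move one, a tiny illustration (my own, using the python-chess library, not anything from the Lc0 pipeline):

```python
# Illustration only (python-chess, not Lc0 code): the "no bias yet" uniform
# prior over legal moves that the very first self-play batch starts from.
import chess

board = chess.Board()                      # standard initial position
legal = list(board.legal_moves)            # 20 legal moves at the start
uniform_prior = {m.uci(): 1.0 / len(legal) for m in legal}

print(len(legal))                # 20
print(uniform_prior["e2e4"])     # 0.05 - every move equally likely before any training
```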

I do not understand the tree search aspect well, as it has not been explained for chess at the same level as it has for Go, or for more general games, at a mathematical level for the actual probability model. I might have given up too soon, which is why I try to get other angles on the same thing, having lost the digging energy.

>Recapping your post.
Self-play(1) to optimize all the parameters of the NN(2), upon games played out to their ruleset outcome (no resignations, yeah!).
Initial batch with a uniform bias, being trained with new information along that self-play learning trajectory, over game-set experience (batches). Yes, this keeps converging to something of a more expert bias (albeit with some obligatory RL compromise on exploration, per the batch chunking schedule).

>Glossary (well, added phrases) for the last paragraph.
(1) self-play: batches on a schedule about "gain", i.e. when to make another batch with improved self instances.
(2) NN: a function basis over the full state input "vector", i.e. the position, ... salt here for some post-implementation trickery I am also not clear on, and possibly not a point here.
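To fix the picture of (2), here is a minimal sketch of the general shape I have in mind: one trunk, a policy head and a WDL head. This is my own toy illustration in PyTorch, not the actual Lc0 architecture (which is a much larger residual/transformer network over board planes), and the sizes are placeholders:

```python
# Toy sketch (my assumption of the general shape, not the actual Lc0
# architecture): a network mapping a position encoding to
# (policy over moves, WDL distribution).
import torch
import torch.nn as nn

N_INPUT = 8 * 8 * 12      # toy encoding: 12 piece planes, flattened
N_MOVES = 1858            # fixed move vocabulary (Lc0's policy head is ~1858 wide)

class PolicyValueNet(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(N_INPUT, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, N_MOVES)  # logits over moves
        self.wdl_head = nn.Linear(hidden, 3)           # logits for win/draw/loss

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        policy = torch.softmax(self.policy_head(h), dim=-1)
        wdl = torch.softmax(self.wdl_head(h), dim=-1)   # sums to 1: W, D, L
        return policy, wdl

net = PolicyValueNet()
policy, wdl = net(torch.zeros(1, N_INPUT))
print(policy.shape, wdl.shape)   # torch.Size([1, 1858]) torch.Size([1, 3])
```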
>Now about the parts I am not sure about from your post (or the common target questions we have).

So, you say that there is no MCTS in the training phase, and that since they lost patience with the convergence, and some tournament for showcasing was putting a deadline on all that (kidding, this post needs some humor, even if just for my keeping at it), they have not yet converged to the true number-42 limit probability model that would actually be best chess (given the maximal extent of the position world explored throughout all the self-play batches; me reminding with a question agenda). So MCTS comes to the rescue to complete this imperfect view, for the remaining lack of confidence on a single position and its set of legal actions.
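To make the "MCTS to the rescue" part concrete, here is a bare-bones sketch of the PUCT-style selection rule as I read it from the A0 paper (my own Python illustration, not Lc0 source; the constant and the bookkeeping are simplified). The NN priors bias which child gets visited; the visit-averaged values are the playout statistics that end up displayed as WDL.

```python
# Bare-bones PUCT selection (my reading of the A0 paper, not Lc0 source).
# Each child of a node keeps: prior P (from the NN policy head),
# visit count N, and total backed-up value W.
import math
from dataclasses import dataclass

C_PUCT = 1.5  # exploration constant (value here is arbitrary)

@dataclass
class Child:
    prior: float            # P(s, a) from the policy head
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # W(s, a), e.g. expected score backed up from below

    @property
    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(children: dict) -> str:
    """Pick the move maximizing Q + U, where U favors high-prior, low-visit moves."""
    total_visits = sum(c.visits for c in children.values())
    def score(c: Child) -> float:
        u = C_PUCT * c.prior * math.sqrt(total_visits + 1) / (1 + c.visits)
        return c.q + u
    return max(children, key=lambda m: score(children[m]))

# Toy example: an unvisited high-prior move beats an already-visited mediocre one.
children = {"e2e4": Child(prior=0.4), "a2a3": Child(prior=0.01, visits=10, value_sum=5.0)}
print(select_child(children))  # e2e4
```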

But my discomfort here is that this is also what happens during self-play; this is how the game outcomes are produced.
Across the batches, only one of the "self" players is being adapted, through the reinforcement loop, from the game outcomes of that self-play game production batch. But both sides implement their probability model (state and action) under evolution (not necessarily progress, just change, I mean; but yes, otherwise it is rigged to improve).

One is fixed, the other has its parameters "tuned" (for SF readers; normally tuning is for hyperparameters, but those words get in the way of helping each other understand, they are shortcuts for my lacking the ability to explain: jargon as placeholders for my future iterations), upon chess environment feedback through game outcomes only. No oracle involved. I guess in RL one has to specify the learner's boundaries and the specifications of the environment being learned. In Lc0: the input is chess and the target output is the chess board response, as environment feedback doing the "psychological" learning-theory reinforcement, machine implemented.
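On the learner/environment boundary, this is the minimal framing I have in mind (my own sketch using python-chess, not Lc0 code): the environment is just the rules of chess, and the only feedback is the outcome at termination, no oracle in between.

```python
# Sketch (my framing, not Lc0 code) of the learner/environment boundary:
# the environment is the rules of chess, and the only feedback ("reward")
# is the game outcome once the game is over. Uses python-chess.
import chess

class ChessEnv:
    def reset(self) -> chess.Board:
        self.board = chess.Board()
        return self.board

    def step(self, move: chess.Move):
        """Apply a move; no intermediate reward, only the final result."""
        self.board.push(move)
        done = self.board.is_game_over()
        result = self.board.result() if done else None   # "1-0", "0-1", "1/2-1/2"
        return self.board, result, done
```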

I was just addressing what I thought you meant about MCTS being there as compensation for the not-yet-perfect probability model under evolution through RL. The "deep" comes from the NN being deep...

So I guess I agree up to where the WDL model is coming from. It seems you understand certain things I don't; is the above still compatible with your view, or does it call for another reply, adjusting toward a constructive, forward-seeking discussion? It may take time to chew. You have actually helped me make some progress, I think, confirming my hunch about the compensation for not having attained the thing I could understand from the math of the A0 paper. I could not understand the tree search math, because it is presented nowhere, and yet I understand the limit process, if I am not mistaken above. It means I have not wasted my past energy compartmentalizing this chunk of fog about tree search. I understand the one-node play model well. You helped me reinforce my confidence in saying so.

But even vanilla A0 has a restriction on the "any position" part, as I hinted above about a recent paper. Let me find it for you if there is interest; we could make a reading thread about it in the non-blog forum. Sorry, blog OP, for this possible hijacking.

Possible common underlying topic: covering the position world.

Yes, one has to pay attention to where the WDL comes from. That much is conclusive and not my discomfort or question set.
I know some of the theory pretty well (well, to some self-critical satisfaction level).

I am missing the gap from the A0 paper description to the implementation in Lc0, but not about the particular choice of algorithms, e.g. MCTS versus PUCT; that is not the level of documentation or information I seek. What I want, so that I can understand what WDL or Lc0 is doing to chess, is for the thing both these algorithms implement to be described at the same level as the A0 paper (or its back-references on deep RL) did it, working at the level of probability functions over the (position) state and (move) action domains.

In documentation a tad bit more elevated than source code, there are implementation details that obscure the picture I seek. I find it hard to chop their whole description in the Lc0 docs into something I can relate back to the A0 paper, and it seems to come from the run-of-the-mill assumptions, rarely shared, about how replay works. They keep describing what the code is doing step by step, but not the high-level point of view which I started above.

Self-play batches... first uniform random (the policy given any position, over all moves for that position, over all putative positions). Then, each time the game tally per batch (with only one engine instance improving) gets a certain amount above 50-50 in wins, they stop and start a new batch where both instances are the previously learned one.
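In code form, the schedule I am describing would look roughly like this (my own sketch of the description above; the gate threshold, game count and helper functions are placeholders, not the actual Lc0 or A0 pipeline, which handle this differently in their details):

```python
# Sketch of the self-play gating schedule described above (my illustration;
# threshold, game count and helpers are placeholders, not the real pipeline).
import random

WIN_GATE = 0.55          # "a certain amount above 50-50"
GAMES_PER_BATCH = 400

def play_game(challenger, incumbent):
    """Placeholder: one self-play game, returns the challenger's score (1/0.5/0)."""
    return random.choice([1.0, 0.5, 0.0])

def train_on(net, games):
    """Placeholder: update the net's parameters from the batch of game outcomes."""
    return net

def self_play_schedule(incumbent, n_batches=3):
    for _ in range(n_batches):
        challenger = incumbent          # start the batch from the current best instance
        while True:
            games = [play_game(challenger, incumbent) for _ in range(GAMES_PER_BATCH)]
            challenger = train_on(challenger, games)   # only one side is adapted
            if sum(games) / len(games) >= WIN_GATE:
                break                   # challenger is clearly better: close the batch
        incumbent = challenger          # next batch: both instances are the improved one
    return incumbent
```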

This is the RL self-play schedule level. Now I need your help, or anyone's help, about the production of those games within such a self-play batch.

Somewhere in there, there is the replay thing, the reward thing, the confidence termination criterion thing, and the backprop thing.

Above that, I am fine.
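For the record, here is the minimal picture I have of those pieces (a sketch under my own assumptions, not the Lc0 training code): each finished game dumps (position, search policy, final outcome) samples into a replay buffer, the outcome is the only reward, and gradient steps fit the policy head to the search policy and the WDL head to the outcome.

```python
# Sketch (my assumptions, not Lc0 code) of the replay/reward/backprop piece:
# finished self-play games are stored as (position, search_policy, outcome)
# samples; training minimizes policy cross-entropy + WDL cross-entropy.
import random
from collections import deque

import torch
import torch.nn.functional as F

REPLAY = deque(maxlen=500_000)   # replay buffer of recent self-play samples

def store_game(positions, search_policies, outcome_wdl_index):
    """outcome_wdl_index: 0=win, 1=draw, 2=loss, from the side to move's view."""
    for pos, pi in zip(positions, search_policies):
        REPLAY.append((pos, pi, outcome_wdl_index))

def train_step(net, optimizer, batch_size=256):
    """One gradient step; `net` is e.g. the two-head sketch given earlier."""
    batch = random.sample(REPLAY, batch_size)
    x = torch.stack([pos for pos, _, _ in batch])          # position encodings
    target_pi = torch.stack([pi for _, pi, _ in batch])    # search visit distributions
    target_wdl = torch.tensor([z for _, _, z in batch])    # game outcomes (the reward)

    policy, wdl = net(x)
    loss = (-(target_pi * torch.log(policy + 1e-8)).sum(dim=1).mean()  # policy term
            + F.nll_loss(torch.log(wdl + 1e-8), target_wdl))           # WDL term
    optimizer.zero_grad()
    loss.backward()           # the "backprop thing"
    optimizer.step()
    return loss.item()
```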
