Valett

From NASPAWiki

Valett is an open source software tool for calculating fair values for letters in word games, developed by Joshua Lewis and announced on his blog in late 2012.

Here are some mostly technical thoughts about Valett, by John Chew.

Catastrophic Outrage: A Reply to Joshua Lewis’ “Rethinking the value of Scrabble tiles”

On his blog, Joshua Lewis announced the publication of his open source software, Valett, a tool for calculating ideal letter values for word games, inspired by his experiences as a SCRABBLE enthusiast. While Lewis’ intentions are clearly good, he has stirred up a lot of public attention by suggesting that the SCRABBLE game would be better if it used tile values computed by his program instead of the traditional values. I disagree, and have some issues with his methodology as well.

As I understand it, Valett takes a word list and an estimate of the distribution of word lengths that will be played in a game, and creates a list of letter values based on how difficult it estimates that it will be to play each letter in the game. It estimates this difficulty based on how many words each letter appears in, and which other letters each letter appears beside. The Q is hard to play not only because it only appears in 1.4% (2576/178691) of the playable words, but also because in all but 42 of those words it is followed by a U.
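The two statistics mentioned here, the share of words that contain a letter and which letters tend to follow it, can be sketched in a few lines. This only illustrates that kind of counting and is not Valett's actual code; the toy word list and the function names are made up for the example.

```python
from collections import Counter


def share_of_words_containing(words, letter):
    """Fraction of words that contain `letter` at least once."""
    return sum(letter in w for w in words) / len(words)


def following_letters(words, letter):
    """Count which letters come immediately after each occurrence of `letter`."""
    counts = Counter()
    for w in words:
        for here, nxt in zip(w, w[1:]):
            if here == letter:
                counts[nxt] += 1
    return counts


# Hypothetical toy lexicon; a real run would load the full word list.
toy_words = ["QUIZ", "QI", "QUA", "ZA", "CAT", "TAX", "AXE", "QUIT"]

q_share = share_of_words_containing(toy_words, "Q")
q_next = following_letters(toy_words, "Q")

# The figure quoted in the text: 2576 of 178691 words contain a Q.
quoted_q_percent = round(100 * 2576 / 178691, 1)
```

On the full word list, the same counting would give the 1.4% figure for the Q and show the U dominating the letters that follow it.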

This kind of thinking is more advanced than what Alfred Butts did with the words that appeared on the front page of the New York Times to create the initial version of the tile values that eventually became enshrined in the SCRABBLE rules, but significantly behind the times in modern SCRABBLE strategy theory. Quackle, the free SCRABBLE artificial intelligence that has earned a 2224 Elo rating against human opponents, keeps track of how playable every possible combination of one to six tiles is, based on extensive simulation of game play. For example, it finds that a single “I” left on your rack after a play will lower your average final score by two points: as SCRABBLE players put it, an “I” has an equity value of -2. Given that the “I” currently has a face value of 1, if you wanted to create a “fair” SCRABBLE game that didn’t penalize players for drawing an “I”, you’d want to increase the face value by 2 points to 3. Likewise, you’d want your blanks to be worth -26 points. Yes, negative twenty-six: sure, you can often make a seven-tile “bingo” play if you draw the blank, but it wouldn’t help you much if it cost you 26 points every time you did.

If you did this, you’d reduce a little bit of the luck of the draw, but at the same time you’d be reducing the skill involved in recognizing which tiles are good or bad and playing accordingly. You’d end up with a game that was a little closer to just rolling a die to determine the winner. It’s worse, though, because as Quackle knows, equity values are not additive, and are not strongly correlated with adjacent tiles, as assumed by Valett. Two “I”s on your rack have an equity value of -12, and three “I”s -21. Even if those “I”s were now worth three points, you’d still be gnashing your teeth at your bad luck if you drew more than one of them at a time. There would always be lucky and unlucky tiles in the bag.
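The non-additivity can be made concrete with the figures quoted above. The sketch below adopts the same simplification as the argument so far, namely that each added point of face value offsets exactly one point of equity. That is an assumption made for illustration, not a claim about real play, and the variable names are made up.

```python
# Equity figures quoted in the text for one, two and three "I"s left on a rack.
observed_equity = {1: -2, 2: -12, 3: -21}

# Assumption: raising the "I" from 1 point to 3 compensates 2 points per tile.
compensation_per_tile = 2

# What the equity would be if it simply added up tile by tile.
additive_guess = {n: n * observed_equity[1] for n in observed_equity}

# Penalty still left over after revaluing the "I" at 3 points.
residuals = {n: eq + n * compensation_per_tile
             for n, eq in observed_equity.items()}

for n in observed_equity:
    print(f"{n} x I: observed {observed_equity[n]}, "
          f"additive guess {additive_guess[n]}, "
          f"left over after revaluing {residuals[n]}")
```

Under that assumption a single "I" is fully compensated, but two still cost 8 points and three cost 15.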

As an aside, this points out a smaller failing of Valett: it doesn’t prescribe a distribution of tiles. There are nine “I”s in a SCRABBLE set, and it’s generally understood that the “I” is a bad tile not only because multiple “I”s go poorly together in words, but because there are too many “I”s in the bag to begin with. There would be a lot more support for removing an “I” from the standard tile distribution than there would be for changing the value of an “I”.

The particular tile distribution also affects the playability of tiles. While there is a “Q” in 1.4% of words, we can’t have 1.4% of the tiles in the game be a “Q”, so we round down to one tile, making it slightly easier to play. For one thing, there’s no chance of drawing a second “Q” from a regular bag.

Valett’s requirement that you specify the rates at which words of each length are played is also problematic for a few reasons. The author finds in his own experience that he tends to play words of 2, 3, 7 or 8 letters more often than words of 4, 5 or 6 letters. This is typical for players who have done a little SCRABBLE word study, concentrating their efforts on where it pays off the most: 7- and 8-letter words to get that 50-point bingo bonus, and the 2- and 3-letter words that you need to know in order to find ways to fit those bingoes onto the board. It’s not typical of high-level or computer play, but who then is right? Should you play with different values against a weak opponent than against a strong opponent? Who should get to choose?

Finally, there’s the question of why the values should be changed at all. It would certainly cost the consumer a lot of money and lead to endless confusion about what version of the game was being played.

The design of the SCRABBLE game carefully balances skill and luck, and its enduring popularity is a consequence of this balance. The game has enough skill that a winner can take pride in his abilities, and if he hones his skills further, can win more often; the game has enough luck that a loser can tell himself that he could have won had he drawn better tiles. People often suggest to me ways in which this balance could be tipped, and given that the people who talk to me are usually skilled players, not surprisingly, the suggestions are almost always to remove a little luck from the game in favour of skill. The game doesn’t need this: it’s always had an intentional imbalance between the face and equity values of the tiles, and a deeper understanding of this tension can increase one’s enjoyment of the game.

Continuing Outrage

John Chew replies to comments on the essay above.

Let’s begin with a few corrections and clarifications. Craig Beevers wrote on the NASPA Facebook page:

“The bit about altering face value to compensate for equity is a bit dubious. Because you’re generally not getting face value for tiles you play. You are typically getting many times the face value. So adding 1 to the face value would not add 1 to the equity of the individual tile leave.”

Touché: I was oversimplifying the argument, but am delighted as always to fill in technical details. If I wanted to come up with a set of tile valuations that accurately represented their playability in the game, though, I would use an iterative process built on the basic idea of adjusting face values according to their observed equity. That is, I would try subtracting the observed equity value (multiplied by a scale factor that decreased with each iteration in the hope of better convergence) for each tile from its face value to create a new set of tiles, then repeat the process with those tiles, until the sum of the absolute values of the equity values (or their squares) reached some desired minimum. You would still end up with a set of tile values that had very low single-tile equity, but a wide range of multi-tile equity values.
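The iterative process described above might look roughly like the following sketch. Measuring real equity would require simulating many games, so `measure_equity` is a hypothetical callable supplied by the caller. The toy model used to exercise the loop, including its "balanced" values and response factor, is entirely invented for illustration.

```python
def rebalance(face_values, measure_equity, scale=0.5, decay=0.7,
              tolerance=0.1, max_rounds=50):
    """Repeatedly subtract scaled single-tile equity from face values.

    `measure_equity` is a hypothetical callable: it takes a dict of face
    values and returns a dict of single-tile equity values. In practice it
    would have to run a large number of simulated games.
    """
    values = dict(face_values)
    for _ in range(max_rounds):
        equity = measure_equity(values)
        # Stop once the sum of absolute equity values is small enough.
        if sum(abs(e) for e in equity.values()) <= tolerance:
            break
        values = {t: v - scale * equity[t] for t, v in values.items()}
        scale *= decay  # shrink the step each round to help convergence
    return values


# Invented toy model: equity is proportional to the gap between a tile's
# face value and a made-up "balanced" value. A response above 1 is an
# assumption meant to reflect tiles usually scoring more than face value.
toy_balanced = {"I": 1.5, "Q": 11.0, "?": -6.0}
RESPONSE = 3.0


def toy_equity(values):
    return {t: RESPONSE * (v - toy_balanced[t]) for t, v in values.items()}


start = {"I": 1.0, "Q": 10.0, "?": 0.0}  # current face values; "?" is the blank
result = rebalance(start, toy_equity)
```

The loop only drives single-tile equity toward zero. Nothing in it addresses the multi-tile equity values, which is the limitation noted above.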

Curran Eggertson writes on the NASPA Facebook page:

“I don’t understand why you lead with a title of ‘catastrophic outrage’.”

It’s a reference to a sensational BBC pull-quote from an interview I did with their More or Less programme. Eggertson adds,

“I also couldn’t disagree more with the passage: ‘You’d end up with a game that was a little closer to just rolling a die to determine the winner.’”

I think I edited that thought down far beyond the point of accuracy or clarity; thank you for pointing that out. What I should have said was that it would centralize the distribution of turn scores in the game. In the current situation, an expert player who falls behind in a game can hope that the unevenness in the valuation of the tiles may end up favouring him, like the rolling of a poorly weighted die, or of one with a less evenly distributed set of numbers than 1, 2, 3, 4, 5 and 6. If tiles had values that accurately represented their utility in the game, the valuation would lack that unevenness. As a result, the game would be more like repeatedly rolling a fair die and summing its results, i.e., it would have a more centrally weighted distribution that inhibited the underdog from catching up.
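The die analogy can be checked with a small exact calculation. The sketch below is a toy model of dice only, not of SCRABBLE: it compares a fair die with a made-up uneven die that has the same average, and asks how often a trailing player out-rolls the leader by more than a given deficit. The faces, the number of rolls and the deficit are all arbitrary choices for illustration.

```python
from collections import Counter


def sum_distribution(faces, rolls):
    """Exact distribution of the total of `rolls` throws of a die with `faces`."""
    dist = Counter({0: 1.0})
    for _ in range(rolls):
        nxt = Counter()
        for total, p in dist.items():
            for face in faces:
                nxt[total + face] += p / len(faces)
        dist = nxt
    return dist


def comeback_chance(faces, rolls, deficit):
    """Chance that a trailing player out-rolls the leader by more than `deficit`."""
    dist = sum_distribution(faces, rolls)
    return sum(p * q
               for mine, p in dist.items()
               for theirs, q in dist.items()
               if mine - theirs > deficit)


fair_die = [1, 2, 3, 4, 5, 6]
uneven_die = [1, 1, 1, 1, 1, 16]  # invented die with the same mean of 3.5

fair_chance = comeback_chance(fair_die, rolls=12, deficit=20)
uneven_chance = comeback_chance(uneven_die, rolls=12, deficit=20)
print(f"fair die: {fair_chance:.3f}, uneven die: {uneven_chance:.3f}")
```

With the same average per roll, the uneven die gives the trailing player a noticeably better chance of making up the deficit, which is the sense in which evenly valued tiles would inhibit the underdog.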

Now on to the good stuff: the arguments.

Joshua Lewis, who ignited this “outrage” with his blog post in December, posted a response on January 16. In addition to correctly catching me out on the points that Beevers and Eggertson made, he argues that my criticism of Valett is invalid because the things I said it should do are not the things that it was intended to do. I disagree, but before I dig into why, I should preface what I am about to write by saying that I respect Lewis’ appreciation for the SCRABBLE game and its strategy, as well as his level of skill in the game, bear him no personal ill will, and am in fact grateful for the public scrutiny that he has brought to bear on what I had thought were some pretty abstruse and obscure aspects of the game.

Now there are many reasons why the SCRABBLE game works and is popular, among them what I usually describe as its personal appeal to me: the competitive game consists of a series of about a dozen mental puzzles, each of which takes a minute or two to solve, and for which you receive instant gratification in the form of points added to your score. It’s like doing a crossword puzzle, but less predictable. It’s important that the reward you receive for each play, the points, be commensurate in some way with the difficulty of the challenge. Play all seven tiles and get 50 extra points. Play the Q or the Z and get a base 10 points for it. If the Q were worth just 1 point, the game would be no fun.

Lewis wrote in December that he got started on this train of thought because the addition of “words like QI and ZA” changed the game because “the values for Scrabble tiles were set when such words weren’t acceptable, and they make challenging letters much easier to play.” He noted later in that post that he weights two- and three-letter words more heavily in the input to Valett because they are “valuable for playability (one can more easily play alongside tiles on the board)”.

So we all agree that fair tile values are important, and that fairness means that tiles that are difficult to play ought to be worth more points.

But Lewis and I differ on the question of what tile playability means. Lewis says that it should be determined on the basis of letter and di-gram statistics in the lexicon; I say it should be on the basis of empirically observed statistics in play. Lewis wrote Valett to analyze a lexicon, but provides a way to tweak its results by specifying the empirical frequency of word lengths in the game. Why? Because if you base the tile values just on what’s in the dictionary and don’t adjust for word length playability, you don’t get good results. And if you’re going to mess with the statistics a little to try to get them to match what happens in real life on the board, why not just calculate the statistics based on what you see on the board in play?

Lewis explains that you can divide the analysis of the game into “the structure of the game” and “the play of the game”, and says (notwithstanding his frequent references to the play of the game) that Valett is actually to be used only for adjusting “the structure of the game” by putting “in harmony [...] the word list and tile values”. Harmony sounds nice, but the game isn’t about drawing words at random from the dictionary; it’s about actually finding places to play them on the board. Lexicon-tile value harmony improves the correlation of play effort and play reward only insofar as the statistics of the lexicon correlate with the statistics of played words.

Suppose we accept for the moment, though, Lewis’ claim that Valett’s method, using his particular arbitrary weightings for word lengths (which, for what it’s worth, underestimate the effect of four- and five-letter words in the game), is an improvement on Alfred Butts’ original method and tile values. He claims that “it’s nice that the distribution changes are minor from TWL06 to SOWPODS, as they should be for word lists based on the same language”, and that only one tile (the G) would differ at all, and that by one point. This is directly at odds with the observed data on playability of tiles in the two lexica, where the G is among the 7 tiles whose equity value differs by less than half a point, while 8 tiles differ by more than a full point.

It’s possible that this discrepancy might be resolved by a more judicious choice of word length weightings in the input to Valett. Their arbitrary nature does not, however, give me confidence, and if I were to try to choose word length weights to make Valett generate tile values that resembled actual difficulty of play, I would rather compute those latter values directly.

On the question of whether or not the tile values ought to be adjusted at all, I reiterate that the tile values were chosen to make an interesting game, not to accurately represent the statistical properties of a particular lexicon. I agree that every time the lexicon changes, the strategy of the game shifts together with the induced changes in the hidden equity values of the tiles. I don’t think that the game needs a shift toward more skill and less luck at this point; if anything, I think a shift in the opposite direction might make it even more popular.

I disagree in particular with Lewis’ statement “Tournament players benefit from a system with a little less luck because it makes tournaments more accurate.”

Every player enters a SCRABBLE tournament thinking at least for a moment that they have a chance to win the event. If they get the good tiles. If they make the right plays. If their opponent makes a few mistakes. If they’re lucky. If you shift the luck/skill balance toward skill, this benefits the players at the top of the field at the expense of those at the bottom of the field. If you do this too much, then those players at the bottom of the field will stop going to tournaments, and the tournament experience will end up being poorer for the top players. What you get is strong players having a slightly more accurate idea as to their relative ranking, but fewer players overall.

And if I were to accept Lewis’ point of view that changed tile values that take a little luck out of the game are a good thing, how could I not think that changed tile values that take a lot more luck out of the game would be a better thing?