QSSIM - Qualifying System Simulation
Robert Parker
November 1997

---Tourney data---
The data set consists of about 110,000 games played between about 3700 players over a four-year period (Nov. '92 to Sept. '96).

---Robots---
32 of the top players have been replaced by robots with fixed, known abilities. Each game involving a robot is simulated, with

    P(Win) = CumulativeLogistic(parameter=B, diff=RobotAbility-OpponentRating).

The parameter B has been estimated from the '92-'96 data to be B=1/156.

---Method---
a. Randomly assign the 32 abilities 2100, 2095, ... 1945 to the 32 robots.
b. In all '92-'96 games, replace 32 players with robots.
c. Sequentially rate games from '92 to 950600. (That's June 0, 1995.)
d. Start Qualifying Period.
e. Sequentially rate games from 950600 to 960600.
f. End Qualifying Period.
g. Calculate the following QS methods:

   A. Iterate only when #games >= 50
       1. OPRmleI  (iteration for robots only, ratings curve stddev = estimated from data)
       2. IOPR     (iteration, ratings curve stddev = 200*sqrt(2))
       3. IOPRmle  (iteration, ratings curve stddev = estimated from data)
   B. If iterating, iterate only when #games >= 30
       4. OPRmleI  (iteration for robots only, ratings curve stddev = estimated from data)
       5. IOPR     (iteration, ratings curve stddev = 200*sqrt(2))
       6. IOPRmle  (iteration, ratings curve stddev = estimated from data)
   C. Iteration for robots only
       7. OPRmleHI (iteration when #games >= 30, ratings curve stddev = estimated from data, Opp Strength = Max rating during Qualifying Period)
   D. No iteration
       8. OPR      (no iteration, ratings curve stddev = 200*sqrt(2))
       9. OPRmle   (no iteration, ratings curve stddev = estimated from data)
      10. HI       (Peak rating during Qualifying Period)
      11. RAT      (Current rating at end of Qualifying Period)

---Statistics---
Two statistics are calculated:
1. Kendall's tau, a measure of the correlation between the known robot ranks and the QS-assigned ranks. Higher is better.
2. n-out-of-10: The number of the known Top 10 robots who are ranked among the Top 10 by the QS. Higher is better.

---Points about the simulation---
Robot abilities are kept fixed over the entire 4-year period.
Of the 8695 games involving Robots over the 4-year period, 958 were Robot vs. Robot.
The Qualifying Period was 950600 to 960600. Each robot played at least 50 games within that period.
The problem of assigning initial ratings for the Robots was avoided: Each was treated as a new player starting in '92.

---A note on calculation of OPR---
I've actually used the logistic distribution instead of the normal distribution as the ratings curve in the calculation of OPR. There is very little difference in the rankings from the two methods: using both methods to calculate OPR for WSC97, we see three sets of rankings that are affected: #'s 23 and 24 are switched, #'s 44, 45, and 46 are jumbled, and #'s 39 and 40 are switched. More on that later.

---Conclusions---
Here are averages and 90% confidence intervals for the two statistics and the 11 methods. See a graph of this at http://www.math.unm.edu/~rparker

                    tau                    n-out-of-10
           Upper   Lower   Ave       Upper   Lower   Ave
OPRmleI    0.5821  0.5669  0.5745    7.1309  6.8791  7.0050
IPR        0.5777  0.5618  0.5697    7.0813  6.8187  6.9500
IPRmle     0.5778  0.5620  0.5699    7.0776  6.8224  6.9500
OPRmleI    0.5821  0.5669  0.5745    7.1309  6.8791  7.0050
IPR        0.5748  0.5588  0.5668    7.0416  6.7884  6.9150
IPRmle     0.5742  0.5581  0.5661    7.0230  6.7670  6.8950
OPRmleHI   0.5740  0.5578  0.5659    7.0362  6.7738  6.9050
OPR172     0.5817  0.5661  0.5739    7.1003  6.8497  6.9750
OPRmle     0.5806  0.5651  0.5729    7.1061  6.8539  6.9800
HI         0.5804  0.5621  0.5713    7.2825  7.0375  7.1600
RAT        0.6325  0.6171  0.6248    7.5460  7.3240  7.4350

The only measure that distinguishes itself is RAT, the rating at the end of the qualifying period. Looking at n-out-of-10, HI has a slight (but insignificant) lead over all except RAT.
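
For readers who want to reproduce the numbers in the table, here is a minimal Python sketch of the win model from the Robots section and the two statistics from the Statistics section. It is not the code used for the simulation. The post does not spell out the form of CumulativeLogistic or which variant of Kendall's tau was used, so the sketch assumes P(Win) = 1 / (1 + exp(-B * diff)) and tau-a on rankings without ties. All function names are hypothetical.

```python
import math
from itertools import combinations

B = 1 / 156  # logistic parameter quoted in the post


def robot_win_probability(robot_ability, opponent_rating, b=B):
    """Assumed form of CumulativeLogistic: 1 / (1 + exp(-b * diff))."""
    diff = robot_ability - opponent_rating
    return 1.0 / (1.0 + math.exp(-b * diff))


def kendall_tau(true_ranks, qs_ranks):
    """Kendall's tau-a for two rankings of the same robots (assumes no ties)."""
    n = len(true_ranks)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: the pair is ordered the same way in both rankings.
        sign = (true_ranks[i] - true_ranks[j]) * (qs_ranks[i] - qs_ranks[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def n_out_of_10(true_ranks, qs_ranks, top=10):
    """Count robots ranked in the top `top` by both the truth and the QS."""
    return sum(1 for t, q in zip(true_ranks, qs_ranks) if t <= top and q <= top)


if __name__ == "__main__":
    # Equal ability and rating gives an even game under the assumed model.
    print(robot_win_probability(2000, 2000))  # 0.5
    true_ranks = list(range(1, 33))           # robot i has true rank i
    qs_ranks = true_ranks[:]                  # a QS that swaps ranks 10 and 11
    qs_ranks[9], qs_ranks[10] = qs_ranks[10], qs_ranks[9]
    print(kendall_tau(true_ranks, qs_ranks), n_out_of_10(true_ranks, qs_ranks))
```

Under this reading, an n-out-of-10 of about 7 means that roughly three of the true Top 10 robots were displaced by lower-ability robots in the QS ranking.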
The n-out-of-10 statistic is probably more pessimistic than we would see in the real-world situation, due to the assignment of the Robot abilities. (I conjecture that the top players are actually separated by more than 5 points of ability. ??) That is, I suspect that a QS will actually be able to do better than correctly choosing 7 of the top 10.

It seems that the particular details of implementation of OPR (iterate vs. no iterate, how to measure Opp Strength, stddev of ratings curve) have little effect on the overall performance of the system. This conclusion depends on the tournament participation behavior of the 32 top players who were replaced with robots, however.

Whether a player might, through clever scheduling, manipulate one of these systems is the subject of further study.

--END POST--