MORE ON QUALIFYING SYSTEMS: A REPORT OF THE NSA RATINGS COMMITTEE
Release version, January 13, 1998

Steven Alexander
John Chew
Jerry Lerman
Robert Parker
Dan Stock, chair
Jeff Widergren

------------------------------------------------------------------
Part I: Overview
------------------------------------------------------------------

In November of 1997, the NSA Ratings Committee (RC) released a report detailing our months of study on the subject of building a robust qualifying system (QS). A qualifying system chooses a subset of players to play in certain elite tournaments. A QS consists of certain logistical rules together with a qualifying system algorithm (QSA).

In our earlier report, we recommended a QSA called Iterated Overall Performance Rating (IOPR). IOPR's advantages, enumerated in that report, made it far more appealing as a selection basis than any of the other systems we explored. Seven criteria were considered in the earlier recommendation of IOPR. One criterion, Feasibility, was judged vital; however, all the competing QSs were considered feasible, so they were judged on the six remaining criteria. IOPR fared particularly well on two of the most important criteria, Accuracy and Appearance of Fairness.

We had decided to have members judge possible QSAs in two ways: by their own weighting of the criteria, and by a compromise weighting scheme. We consider it important that the final QSA not only be good when judged by the compromise scheme, but also that a majority of the members find it good when judged by their own favorite scheme. IOPR did well in both ways in our earlier report.

In that report, we mentioned that there was a significant last step that we wished to perform: finalizing the details of the specific IOPR algorithm to be used, by simulation using data from actual tournament schedules. It has taken a few months of hard, inventive, and highly intelligent work by committee member Robert Parker to put the simulation (QSSIM) together. (Robert's simulation process is explained in some detail in Part II of this document.) While he was at it, Parker also simulated results from the other QSAs we had considered.

The results were quite surprising, and did not agree with the results of the earlier, simpler simulations that we had run. Most of the variations of IOPR performed similarly. Overall, IOPR ended up being slightly less accurate than the other systems, at least in terms of this particular simulation. The Final Rating QSA surfaced as the best predictor by a small margin; the Average of Three Samples and the qualifying system used by the US for the 1997 Worlds also performed better than IOPR, but by an even smaller margin; Peak Rating finished about the same as the IOPR variants. Some of the margins were small enough that they may not be statistically significant, though the difference between Final Rating and IOPR did turn out to be significant, if still small. For reasons that will be explained in Parts II and III, we ended up favoring an uniterated version of IOPR -- that is, OPR.

The qualifying system simulation (QSSIM) was quite sophisticated, but it was not without its limitations (just as we see in the simulations used to play Scrabble Brand Crossword Game). One of the most salient of these may be that the simulation could not account for the likelihood that competitors will alter their playing itineraries to obtain the best results under any QSA which might be so manipulable.
There might therefore be a tendency for this simulation to overrate most of the QSAs and underrate the OPR system, which has the least chance of being "gamed" by tournament selection. (A case in point: some top experts will only play in tourneys where there are many other top experts, which reflects a perception that the current rating system is more favorable to them with such a schedule.) And indeed, as one committee member put it, "If you have only a hundred or so of a player's games to look at, then the systems that give the most accurate estimate of the player's skill [using the simulation] are unfortunately those which would succumb most easily to manipulation by careful choice of tournament attendance."

Following the simulation results, the RC members reranked the QSAs on the characteristics agreed to be most important. The revote was much closer (see Appendix 1 for full details), but OPR (or IOPR) retained its status as the preferred choice. Since the vote was close and the choice of OPR (or IOPR) was not nearly as clear as in the earlier voting, we decided, by a 4-2 vote, to partially rescind our earlier recommendation solely for IOPR and replace it with the following recommendation:

RECOMMENDATION: The QSA should be either OPR (as detailed in Part III of this report) or Final Rating, with a preference for OPR. (Note: the two dissenting voters preferred recommending OPR only.)

The Final Rating QSA came in second in our voting by a fairly small margin, largely due to its strong performance in the QSSIM simulation. Overall, the committee found that OPR's superiority in Appearance of Fairness (especially because it is difficult to manipulate) outweighed Final Rating's small edge in accuracy as judged by QSSIM. However, if some logistical or other problem made adoption of OPR difficult or expensive, the NSA might consider using Final Rating instead.

The RC also wants to make clear that the other recommendations from our earlier report still stand. These include a number of recommendations for the logistics of the system, including such important ideas as: the length of the qualification period; a minimum number of games to be played in that period; a Currency requirement that some games be played in the final months of the period; requalification rights for previous strong performers; a qualification tourney; and others. See our previous report for details. Of particular note is the Currency requirement. With this requirement, the Final Rating system described here is not the same system as that used in 1993, for at that time there was no such requirement: players were allowed to "sit" on ratings for an unlimited period.

------------------------------------------------------------------
Part II: The QSSIM Simulator
------------------------------------------------------------------

The QSSIM simulator was designed and implemented by Robert Parker to compare QSAs, especially QSAs related to OPR. It uses a combination of real game and schedule data with artificial "robots". The reason for the robots is that one can set their actual strength and then see how well a given QSA ranks the robots in comparison to that strength. The same approach could not be taken with real players, because it is impossible to say accurately what a player's strength really is: after all, judging a player's strength is just what a QSA is trying to do.
The simulator uses a data set consisting of over 100,000 games played among over 3000 players over a four-year period between 1992 and 1996. So that the robots would have realistic schedules, the simulation replaced 32 high-rated players with robots. (Forgive the straight line!) The robots were divided into two groups according to the average rating of the opponents of the player each robot replaced. The top group of 16 had average opponent ratings over 1900; the other group had average opponent ratings in the 1800s or the high 1700s. Within each group of sixteen, the robots were assigned distinct actual strengths ranging from 1950 up to 2100 (on the tens). In place of each of the replaced players' games, a simulated game was played; a curve that approximates the observed results of real games was used to determine the winner probabilistically. (A sketch of this sampling step appears later in this Part.)

The data from 1992 to 1995 were used to supply each robot with a playing rating which intentionally only approximated its actual strength. Then the last year of data (1995-1996) was treated as a qualifying period. Each of the replaced players had played enough games in the qualifying period for the statistics to be meaningful. After all the games involving a robot had been simulated (over 8000 of them, including nearly 1000 games in which two robots played, simulated 800 times), various statistics were gathered as to how each of many different QSAs and their variants ranked the robots, and how those rankings compared to the robots' actual strengths. One of these statistics was how many of the top ten robots (by actual strength, as simulated) made the top ten (according to each QSA). Another statistic indicated which group of sixteen robots did better and by how much; that statistic allowed us to judge whether a method was biased toward those who play stronger or weaker players.

A dozen different QSA variants were tested, about half of which were variations of OPR. One or more variants of each of the other five QSAs in our initial report were also tested. Surprisingly, the results showed that all of the QSAs did quite similarly with regard to accuracy. For example, all of them averaged between 7 and 7.5 of the strongest robots finishing in the top ten. The Final Rating method finished with the highest accuracy, but showed a clear preference for players with stronger schedules. All of the OPR variations finished with no statistically significant difference in accuracy among them. However, only two of the QSAs did not show a significant bias toward either group: the version of OPR that we are recommending and a time-weighted version of the Average of Three Samples method. That version of the Average of Three Samples method also finished second in accuracy, just a tiny amount behind the Final Rating method. (One committee member, who has supported the Average of Three Samples method from the beginning, wonders why this new evidence does not seem to impress other members of the committee.)

Applying iteration to the OPR method (IOPR), as we had initially recommended, gave a significant bias in favor of players with weaker schedules. The Peak Rating method was comparable in accuracy to the OPR variants, with a slight bias in favor of those with stronger schedules. The QSA used for USA qualifying for the 1997 Worlds was about halfway between OPR and the Final Rating method in accuracy, with a significant preference for players having stronger schedules. For more complete QSSIM results, see Appendix 3.
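To make the sampling step concrete, here is a minimal sketch (in C, to match Appendix 2) of how a single simulated game might be decided. The function names are ours, not QSSIM's, and the win-probability curve shown is the logistic approximation described in Appendix 3 (constant c = 1/172); QSSIM's exact curve and random-number handling may differ.

/* Minimal sketch of deciding one simulated game.  Names are
 * illustrative, not taken from QSSIM; the curve is the logistic
 * approximation from Appendix 3 with c = 1/172. */
#include <math.h>
#include <stdlib.h>

/* Probability that a player of strength s1 beats a player of strength s2. */
double win_probability(double s1, double s2)
{
    const double c = 1.0 / 172.0;
    return 1.0 / (1.0 + exp(-c * (s1 - s2)));
}

/* Returns 1 if the robot wins its game, 0 if it loses.  robot_strength is
 * the robot's assigned actual strength (1950..2100); opp_strength is the
 * strength estimate used for the human opponent. */
int simulate_game(double robot_strength, double opp_strength)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);  /* uniform in [0,1) */
    return u < win_probability(robot_strength, opp_strength);
}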
The results of QSSIM were interpreted in various ways by the committee. One committee member (#4 in the Appendix) used it to give all of the methods a grade of 7 for Accuracy. Two committee members (#1 and #3) used it with a finer grain, giving Accuracy ratings from 5 to 8, mostly based on QSSIM results. The remaining three committee members based their ratings for Accuracy only partially or not at all on QSSIM, some citing its inability to allow for players deciding whether to attend a tourney based on how attending would affect their chances of qualifying (e.g., the tendency to sit on a high rating in a Final Rating system). For more details on QSSIM, see .

------------------------------------------------------------------
Part III: Details of OPR, our Recommended QSA
------------------------------------------------------------------

Before giving the technical details of our specific version of OPR, we discuss some of the options that were considered. In general, OPR-like systems attempt to solve, for each Player i, the following equation:

    Expected_Wins[i] = Actual_Wins[i]

Expected_Wins[i] depends on:

1. Player i's OPR, the independent variable.
2. Opponent Strength: the skill level of Player i's opponents.
3. Ratings Curve: the function which gives the probability that a player will win a game, given the difference between that player's rating and the opponent's rating.

There have been several suggested values for #2 and #3. For Opponent Strength, possibilities include:

- Pre-Tourney Rating: the opponent's rating when the tourney starts.
- Post-Tourney Rating: the opponent's rating after the tourney ends.
- Peak rating during the qualifying period.
- Average rating during the qualifying period.
- Opponent's Performance Rating from an earlier iteration of the OPR process. (This defines the Iterated OPR process.)
- Some weighted average of the above methods, or a mixture of the above methods depending on the number of games played during the qualifying period.

For the Performance Ratings Curve, possibilities include:

- The Ratings Curve: the same probability formula used in determining ratings.
- The Empirical Ratings Curve: a probability formula estimated from actual game data.

As mentioned in Part II, QSSIM suggests that the performance of the OPR system, as measured by its ability to select the best players, is about the same regardless of which choices are used for Opponent Strength and Performance Ratings Curve. It should be noted, however, that the simulation depended on rather non-responsive behavior by the players: it assumed that players would not vary their tournament schedules in response to the implementation of the OPR qualifying system. For this reason, the particular choices of Opponent Strength and Performance Ratings Curve were made with several criteria (Cheatproofing, Appearance of Fairness, ...) in mind. In particular, the choice we made seemed to do the best job of assuring that a stronger or weaker playing schedule did not affect one's chances of qualifying. Interestingly, the Iteration technique seemed to introduce more biases than it solved, so we are now recommending an uniterated technique (OPR rather than IOPR).

We have chosen the following OPR implementation: Opponent Strength will be measured by the post-tourney rating, and the Performance Ratings Curve will be the usual NSA Ratings Curve. This specification of OPR is the same as the Performance Rating already familiar to players, with these exceptions:

1. The "tournament" in this case is the entire qualification period.
2. The measure of Opponent Strength is the opponent's rating after each individual tournament (not after the whole qualification period), instead of the opponent's pre-tourney rating. (Note that this means Opponent Strength is usually not constant during the qualification period for any given opponent, as the opponent's rating will typically vary over the course of a year.)

Here, then, are the details of our recommended QSA.

To qualify, a player must play at least 70 games during the one-year qualification period. At least 15 rated games must be played within the final three months of the period, OR at least 25 rated games must be played within the final five months of the period. Publication of a player's qualification statistic should not occur unless the player has played at least 30 games in the qualifying period, since the data might not be significant with a smaller sample.

A player with no losses is given a qualification statistic 500 points greater than the highest of his/her opponents' ratings. A player with no wins is given a qualification statistic 500 points lower than the lowest of his/her opponents' ratings. Otherwise, to calculate the qualification statistic for Player:

1. List all games played by Player (other than byes and forfeits) during the qualifying period, including:
   a. The post-tourney rating for each opponent played by Player, and
   b. A numeric value for the Outcome of each game: 1 for a win, 0.5 for a tie, and 0 for a loss.

2. Find the total Actual Number of Wins by Player by adding all the Outcomes from step 1b.

3. Use a search method to find the rating (the qualification statistic) at which Player would have to be for his/her Expected Number of Wins to equal his/her Actual Number of Wins (from step 2). For a given rating, the Expected Number of Wins is calculated by adding the probabilities of winning each game. These probabilities are determined, in turn, from the difference between the given rating and the opponent's rating (from step 1a), using the same "ratings curve" used by the current rating system. (A short sketch of this search appears at the end of this Part.)

Note that the qualification statistic is generally not an integer, but a real number. The extra precision of real numbers effectively guarantees that no two players will have the same qualification statistic. (When reporting qualification statistics to the public, it may be convenient to round or truncate the qualification statistic to an integer, except in cases where more precision is needed to break a tie between two players.) The C programmed formula for P(Win) is listed in Appendix 2 and is available at .
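As an illustration of step 3, here is a minimal sketch of the search. The function names are ours; a production implementation would call the P(Win) code in Appendix 2. The ratings_curve() below is a direct closed form of the same cumulative normal (mean 0, standard deviation 200*sqrt(2)), whereas Appendix 2 computes it via an erfc approximation. Because the Expected Number of Wins increases steadily as the trial rating increases, a simple bisection search suffices.

/* Sketch of step 3 (illustrative names, not production code). */
#include <math.h>

/* P(win) for a given difference between the trial rating and an opponent's
 * post-tourney rating: cumulative normal, mean 0, stddev 200*sqrt(2).
 * Appendix 2 computes the same curve via an erfc approximation. */
double ratings_curve(double diff)
{
    return 0.5 * (1.0 + erf(diff / 400.0));
}

/* Expected Number of Wins at a trial rating against n opponent ratings. */
double expected_wins(double trial_rating, const double *opp_rating, int n)
{
    double e = 0.0;
    for (int i = 0; i < n; i++)
        e += ratings_curve(trial_rating - opp_rating[i]);
    return e;
}

/* The qualification statistic: the rating at which the Expected Number of
 * Wins equals the Actual Number of Wins.  Expected wins rises monotonically
 * with the trial rating, so bisection over a wide bracket converges.  The
 * undefeated and winless cases are handled separately, as described above. */
double opr_statistic(const double *opp_rating, int n, double actual_wins)
{
    double lo = -1000.0, hi = 4000.0;   /* generous bracket */
    while (hi - lo > 0.001) {           /* finer than any reported precision */
        double mid = 0.5 * (lo + hi);
        if (expected_wins(mid, opp_rating, n) < actual_wins)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}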
------------------------------------------------------------------
Appendix 1: Summary of Committee Opinions of QSAs
------------------------------------------------------------------

Here is a summary of voting on QSAs as done after the QSSIM simulation results were available. This voting should replace that in our original report. Most of the changes were in the Accuracy column, but some members decided to make small changes elsewhere as well.

Column headings:

ACC:  Accuracy
AOF:  Appearance of Fairness
ENC:  Encouragement
CHT:  Cheatproofing
EXP:  Expense
CAL:  Calculability
OVER: Overall
CON:  Weighted average of the first six columns, using consensus weights
WHO:  Number of the RC member giving said ratings (listed anonymously in a random order to avoid harassment of individual committee members), or AVG to designate the average of all six members

QSAs:

PEAK:     Peak Rating
1997:     QSA used for US selections to 1997 Worlds
FINAL:    Final Rating
OPR:      Iterated or Uniterated Overall Performance Rating
3 SAMPLE: Average of Three Samples

PEAK
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 6     3     8     5     8     7     7     5.54   #1
 4     3     10    5     9     6     4     4.37   #2
 6     3     10    1     9     10    6     5.51   #3
 7     2     10    5     10    10    7     6.2    #4
 0     0     10    0     9     9     1     0.95   #5
 6     4     10    1     10    10    6     5.75   #6
 4.83  2.5   9.67  2.83  9.17  8.67  5.17  4.72   AVG

1997
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 6     3     8     6     1     2     6     5.26   #1
 3     2     4     5     2     3     2     2.91   #2
 6     3     10    3     3     6     5     5.33   #3
 7     2     7     5     10    10    6     6.05   #4
 3     4     10    0     3     7     4     3.44   #5
 4     3     10    1     4     5     5     3.96   #6
 4.83  2.83  8.17  3.33  3.83  5.5   4.67  4.492  AVG

FINAL
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 8     7     6     6     8     10    8     7.62   #1
 7     6     6     5     10    10    7     6.8    #2
 8     3     5     9     10    10    6     7      #3
 7     2     4     5     10    10    6     5.9    #4
 3     3     4     10    10    10    3     3.75   #5
 8     6     6     8     10    10    8     7.6    #6
 6.83  4.5   5.17  7.17  9.67  10    6.33  6.445  AVG

OPR
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 5     6     7     9     4     3     5     5.44   #1
 8     8     8     6     5     2     7     7.72   #2
 5     6     5     10    2     0     5     5.28   #3
 7     5     7     9     10    9     7     6.84   #4
 9     8     8     9     7     1     8     8.59   #5
 9     9     5     6     9     4     9     8.6    #6
 7.17  7     6.67  8.17  6.17  3.17  6.83  7.078  AVG

3 SAMPLE
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 7     4     8     10    3     5     6     6.42   #1
 7     6     7     5     7     6     6     6.69   #2
 7     6     6     9     8     10    7     6.92   #3
 7     4     5     5     10    10    6     6.35   #4
 4     4     4     10    10    10    5     4.6    #5
 6     3     3     8     8     6     5     5.43   #6
 6.33  4.5   5.5   7.83  7.67  7.83  5.83  6.068  AVG

Summary of above statistics:

method   avg. overall   avg. consensus-weighted   1st-places   avg. rank
------   ------------   -----------------------   ----------   ---------
OPR      6.83           7.08                      3            2.42
FINAL    6.33           6.45                      1.5          2.5
AVERAGE  5.83           6.07                      1            3.0
PEAK     5.17           4.72                      0.5          3.0
1997     4.67           4.49                      0            4.08

Comparing the top three systems head to head:

FINAL vs OPR: 3.5 to 2.5 preferring OPR
OPR vs AVG:   4 to 2 preferring OPR
AVG vs FINAL: 3.5 to 2.5 preferring FINAL

------------------------------------------------------------------
Appendix 2: Programming for the NSA Ratings Curve
------------------------------------------------------------------

(Also available at )

// C functions for calculating the NSA Ratings Curve
// (the cumulative normal distribution).
//
// To find P(Win|Rating1,Rating2), as used in the NSA ratings
// formula (as of Jan. 10, 1998), call
//   CumNormal(Rating1-Rating2, 0, 200.*sqrt(2))              if abs(Rating1-Rating2) <= 700
//   CumNormal(signof(Rating1-Rating2)*700, 0, 200.*sqrt(2))  otherwise

#include <math.h>

double CumStandardNormal(double x);
double erfcc(double x);

double CumNormal(double x, double mean, double stddev)
// The cumulative normal distribution
{
    return CumStandardNormal((x - mean) / stddev);
}

double CumStandardNormal(double x)
// The cumulative standard normal distribution
{
    return 1.0 - 0.5 * erfcc(x / sqrt(2.0));
}

// erfcc() constants, from the book __Numerical Recipes__:
#define ERFCC_1  -1.26551223
#define ERFCC_2   1.00002368
#define ERFCC_3   0.37409196
#define ERFCC_4   0.09678418
#define ERFCC_5  -0.18628806
#define ERFCC_6   0.27886807
#define ERFCC_7  -1.13520398
#define ERFCC_8   1.48851587
#define ERFCC_9  -0.82215223
#define ERFCC_10  0.17087277

double erfcc(double x)
// Complementary error function, from the book __Numerical Recipes__,
// modified by Robert L. Parker
{
    double t, z, ans;

    z = fabs(x);
    t = 1.0 / (1.0 + 0.5 * z);
    ans = t * exp(-z*z + ERFCC_1 + t*
                 (ERFCC_2 + t*
                 (ERFCC_3 + t*
                 (ERFCC_4 + t*
                 (ERFCC_5 + t*
                 (ERFCC_6 + t*
                 (ERFCC_7 + t*
                 (ERFCC_8 + t*
                 (ERFCC_9 + t*
                  ERFCC_10)))))))));
    return x >= 0.0 ? ans : 2.0 - ans;
}
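The short wrapper below is our addition, not part of the NSA code. It shows how the functions above might be called to obtain P(Win|Rating1,Rating2) following the convention in the comment at the top of this appendix, assuming rating differences beyond 700 points are capped at +/-700.

// Illustrative wrapper (ours, not part of the NSA code) using the
// functions above.  Assumes rating differences beyond 700 points are
// capped at +/-700, per the calling convention described above.
double p_win(double rating1, double rating2)
{
    double diff = rating1 - rating2;
    if (diff >  700.0) diff =  700.0;
    if (diff < -700.0) diff = -700.0;
    return CumNormal(diff, 0.0, 200.0 * sqrt(2.0));
}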
------------------------------------------------------------------
Appendix 3: Some of the Results from QSSIM
------------------------------------------------------------------

Of the QSAs tested with QSSIM, we present here the following eleven:

1. OPRmleHI (iteration for robots only, ratings curve stddev = estimated from data, Opponent Strength = max rating during the Qualifying Period)
2. OPR (no iteration, ratings curve stddev = same as in the ratings formula)
3. OPRmle (no iteration, ratings curve stddev = estimated from data)
4. OPRmleI (iteration for robots only, ratings curve stddev = estimated from data)
5. IOPR (iteration for all players with #games >= 50, ratings curve stddev = same as in the ratings formula)
6. IOPRmle (same as IOPR, but ratings curve stddev = estimated from data)
7. HI (Peak rating during the Qualifying Period)
8. AVG3_g (Average of 3 ratings, weighted by the number of games played in each period)
9. AVG3_123 (Average of 3 ratings, weights 1, 2, 3 for successive periods)
10. PAR (Peak average rating: the NSA's 1997 WSC qualifying system)
11. RAT (Current rating at the end of the Qualifying Period)

For reasons to be explained in Robert Parker's thesis, the curve used in the calculations varied slightly from the usual ratings curve. The difference was shown to be extremely minor. (In particular, he used the logistic curve described by P() = 1/(1 + exp(-c*(r1-r2))), where r1 and r2 are ratings, exp is the exponential function, and c is a constant set to 1/172 to closely approximate the usual ratings curve.)

Of the several statistics reported by QSSIM, we present here the following two:

1. n-out-of-10: The number of the known Top 10 robots who are ranked among the Top 10 by the QSA. Higher is better.
2. The Rank Difference (RD): The robots are ranked 1, ..., 32. The average rank for Group1 (robots playing more difficult schedules) is calculated, and likewise for Group2. Then RD = Group2 average - Group1 average. A positive RD indicates that Group1 was ranked better than Group2; a negative RD indicates the reverse. Nearer to zero is better.

More complex and complete statistics were also collected; the results with those statistics are essentially identical to those with the simpler statistics given above. (A short sketch of how the two simple statistics are computed appears at the end of this appendix.)

Here, then, are averages and 90% confidence intervals for the two statistics for each of the eleven methods.

            n-out-of-10               Rank Difference           Favors
            Upper   Lower   Ave       Upper    Lower    Ave
OPRmleHI    7.163   7.019   7.091     -3.71    -8.46    -6.09   2
OPR         7.154   7.011   7.083      2.86    -1.79     0.53   -
OPRmle      7.157   7.013   7.085     11.14     6.43     8.78   1
OPRmleI     7.157   7.013   7.085      7.22     2.47     4.84   1
IOPR        7.150   7.005   7.078     -6.24   -11.00    -8.62   2
IOPRmle     7.151   7.006   7.079     -4.44    -9.20    -6.82   2
HI          7.158   7.015   7.086      4.77     0.04     2.40   1
AVG3_g      7.361   7.219   7.290      5.16     0.75     2.96   1
AVG3_123    7.456   7.284   7.370      5.14    -0.74     2.20   -
PAR         7.352   7.158   7.255      7.64     1.06     4.35   1
RAT         7.551   7.411   7.481      9.06     4.91     6.98   1

The "Favors" column tells which of Group1 (stronger schedules) or Group2 (weaker schedules) is favored by the QSA.
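For concreteness, here is a minimal sketch (our names, not QSSIM's) of how the two statistics above could be computed for a single simulation run, given each robot's assigned actual strength and the rank (1..32) the QSA gives it, with robots 0..15 forming Group1 and robots 16..31 forming Group2.

/* Sketch of the two QSSIM summary statistics; illustrative only.
 * strength[i] = robot i's assigned actual strength
 * qsa_rank[i] = robot i's rank (1..32) under the QSA being tested
 * Robots 0..15 are Group1 (stronger schedules); 16..31 are Group2. */
#define NROBOTS    32
#define GROUP_SIZE 16

/* n-out-of-10: how many robots ranked 1..10 by the QSA are also among the
 * 10 strongest by actual strength.  (Ties in strength, which do occur with
 * the real robot assignments, would need a tie-breaking rule.) */
int n_out_of_10(const double strength[NROBOTS], const int qsa_rank[NROBOTS])
{
    int count = 0;
    for (int i = 0; i < NROBOTS; i++) {
        if (qsa_rank[i] > 10)
            continue;                      /* not in the QSA's top 10 */
        int stronger = 0;                  /* robots strictly stronger than i */
        for (int j = 0; j < NROBOTS; j++)
            if (strength[j] > strength[i])
                stronger++;
        if (stronger < 10)                 /* i is among the true top 10 */
            count++;
    }
    return count;
}

/* Rank Difference: Group2 average rank minus Group1 average rank.
 * Positive means Group1 (stronger schedules) was ranked better. */
double rank_difference(const int qsa_rank[NROBOTS])
{
    double sum1 = 0.0, sum2 = 0.0;
    for (int i = 0; i < NROBOTS; i++) {
        if (i < GROUP_SIZE) sum1 += qsa_rank[i];
        else                sum2 += qsa_rank[i];
    }
    return sum2 / GROUP_SIZE - sum1 / GROUP_SIZE;
}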