MORE ON QUALIFYING SYSTEMS: A REPORT OF THE NSA RATINGS COMMITTEE
Release version, January 13, 1998

Steven Alexander
John Chew
Jerry Lerman
Robert Parker
Dan Stock, chair
Jeff Widergren

------------------------------------------------------------------
Part I: Overview
------------------------------------------------------------------

In November of 1997, the NSA Ratings Committee (RC) released a report detailing our months of study on the subject of building a robust qualifying system (QS). A qualifying system chooses a subset of players to play in certain elite tournaments. A QS consists of certain logistical rules together with a qualifying system algorithm (QSA).

In our earlier report, we recommended a QSA called Iterated Overall Performance Rating (IOPR). IOPR's advantages, enumerated in that report, made it far more appealing as a selection basis than any of the other systems we explored. Seven criteria were considered in the earlier recommendation of IOPR. One criterion, Feasibility, was judged vital; however, all the competing QSs were considered feasible, so they were judged on the six remaining criteria. IOPR fared particularly well on two of the most important criteria, Accuracy and Appearance of Fairness.

We had decided to have members judge possible QSAs in two ways: by their own weighting of the criteria, and by a compromise weighting scheme. We consider it important that the final QSA not only be good when judged by the compromise scheme, but also that a majority of the members find it good when judged by their own favorite scheme. IOPR did well in both ways in our earlier report.

In that report, we mentioned that there was a significant last step that we wished to perform: finalizing the details of the specific IOPR algorithm to be used, by simulation using data from actual tournament schedules. It has taken a few months of hard, inventive, and highly intelligent work by committee member Robert Parker to put the simulation (QSSIM) together. (Robert's simulation process is explained in some detail in Part II of this document.) While he was at it, Parker also simulated results from the other QSAs we had considered.

The results were quite surprising, and did not agree with the results of the earlier, simpler simulations that we had run. Most of the variations of IOPR performed similarly. Overall, IOPR ended up being slightly less accurate than the other systems, at least in terms of this particular simulation. The Final Rating QSA surfaced as the best predictor by a small margin; the Average of Three Samples and the qualifying system used by the US for the 1997 Worlds also performed better than IOPR, but by an even smaller margin; Peak Rating finished about the same as the IOPR variants. Some of the margins were small enough that they may not be statistically significant, though the difference between Final Rating and IOPR did turn out to be significant, if still small. For reasons that will be explained in Parts II and III, we ended up favoring an uniterated version of IOPR -- that is, OPR.

The qualifying system simulation (QSSIM) was quite sophisticated, but it was not without its limitations (just as we see in the simulations used to play Scrabble Brand Crossword Game). One of the most salient of these may be that the simulation could not account for the likelihood that competitors will alter their playing itineraries to obtain the best results under any QSA which might be so manipulable.
There might therefore be a tendency for this simulation to overrate most of the QSAs and underrate the OPR system, which has the least chance of being "gamed" by tournament selection. (A case in point: some top experts will only play in tourneys where there are many other top experts, which reflects a perception that the current rating system is more favorable to them with such a schedule.) And indeed, as one committee member put it, "If you have only a hundred or so of a player's games to look at, then the systems that give the most accurate estimate of the player's skill [using the simulation] are unfortunately those which would succumb most easily to manipulation by careful choice of tournament attendance."

Following the simulation results, the RC members reranked the QSAs on the characteristics agreed to be most important. The revote was much closer (see Appendix 1 for full details), but OPR (or IOPR) retained its status as the preferred choice. Since the vote was close and the choice of OPR (or IOPR) was not nearly as clear as in the earlier voting, we decided, by a 4-2 vote, to partially rescind our earlier recommendation solely for IOPR and replace it with the following recommendation:

RECOMMENDATION: The QSA should be either OPR (as detailed in Part III of this report) or Final Rating, with a preference for OPR. (Note: the two dissenting voters preferred recommending OPR only.)

The Final Rating QSA came in second in our voting by a fairly small margin, largely due to its strong performance in the QSSIM simulation. Overall, the committee found that OPR's superiority in Appearance of Fairness (especially because it is difficult to manipulate) outweighed Final Rating's small edge in accuracy as judged by QSSIM. However, if some logistical or other problem made adoption of OPR difficult or expensive, the NSA might consider using Final Rating instead.

The RC also wants to make clear that the other recommendations from our earlier report still stand. These include a number of recommendations for the logistics of the system, including such important ideas as: the length of the qualification period; a minimum number of games to be played in that period; a Currency requirement that some games be played in the final months of the period; requalification rights for previous strong performers; a qualification tourney; and others. See our previous report for details. Of particular note is the Currency requirement. With this requirement, the Final Rating system described here is not the same system as that used in 1993, for at that time there was no such requirement: players were allowed to "sit" on ratings for an unlimited period.

------------------------------------------------------------------
Part II: The QSSIM Simulator
------------------------------------------------------------------

The QSSIM simulator was designed and implemented by Robert Parker to compare QSAs, especially QSAs related to OPR. It uses a combination of real game and schedule data with artificial "robots". The reason for the robots is that one can set their actual strength and then see how well a given QSA ranks the robots in comparison to that strength. The same approach could not be taken with real players, because it is impossible to say accurately what a player's strength really is: after all, judging a player's strength is just what a QSA is trying to do.
The simulator uses a data set consisting of over 100,000 games played among over 3000 players over a four-year period between 1992 and 1996. So that the robots would have realistic schedules, the simulation replaced 32 high-rated players with robots. (Forgive the straight line!) The robots were divided into two groups according to the average rating of the opponents of the player each robot replaced. The top group of 16 had average opponent ratings over 1900; the other group had average opponent ratings in the 1800s or the high 1700s. Within each group of sixteen, the robots were assigned distinct actual strengths ranging from 1950 up to 2100 (on the tens). In place of each of the replaced players' games, a simulated game was played; a curve that approximates the observed results of real games was used to determine the winner probabilistically. (A sketch of this sampling step appears later in this Part.)

The data from 1992 to 1995 were used to supply each robot with a playing rating which intentionally only approximated its actual strength. Then the last year of data (1995-1996) was treated as a qualifying period. Each of the replaced players had played enough games in the qualifying period for the statistics to be meaningful. After all the games involving a robot had been simulated (over 8000 of them, including nearly 1000 games in which two robots played, simulated 800 times), various statistics were gathered as to how each of many different QSAs and their variants ranked the robots, and how those rankings compared to the robots' actual strengths. One of these statistics was how many of the top ten robots (by actual strength, as simulated) made the top ten (according to each QSA). Another statistic indicated which group of sixteen robots did better and by how much; that statistic allowed us to judge whether a method was biased toward those who play stronger or weaker players.

A dozen different QSA variants were tested, about half of which were variations of OPR. One or more variants of each of the other five QSAs in our initial report were also tested. Surprisingly, the results showed that all of the QSAs did quite similarly with regard to accuracy. For example, all of them averaged between 7 and 7.5 of the strongest robots finishing in the top ten. The Final Rating method finished with the highest accuracy, but showed a clear preference for players with stronger schedules. All of the OPR variations finished with no statistically significant difference in accuracy among them. However, only two of the QSAs did not show a significant bias toward either group: the version of OPR that we are recommending and a time-weighted version of the Average of Three Samples method. That version of the Average of Three Samples method also finished second in accuracy, just a tiny amount behind the Final Rating method. (One committee member, who has supported the Average of Three Samples method from the beginning, wonders why this new evidence does not seem to impress other members of the committee.)

Applying iteration to the OPR method (IOPR), as we had initially recommended, gave a significant bias in favor of players with weaker schedules. The Peak Rating method was comparable in accuracy to the OPR variants, with a slight bias in favor of those with stronger schedules. The QSA used for USA qualifying for the 1997 Worlds was about halfway between OPR and the Final Rating method in accuracy, with a significant preference for players having stronger schedules. For more complete QSSIM results, see Appendix 3.
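To make the sampling step concrete, here is a minimal sketch (in C, to match Appendix 2) of how a single simulated game might be decided. The function names are ours, not QSSIM's, and the win-probability curve shown is the logistic approximation described in Appendix 3 (constant c = 1/172); QSSIM's exact curve and random-number handling may differ.

/* Minimal sketch of deciding one simulated game.  Names are
 * illustrative, not taken from QSSIM; the curve is the logistic
 * approximation from Appendix 3 with c = 1/172. */
#include <math.h>
#include <stdlib.h>

/* Probability that a player of strength s1 beats a player of strength s2. */
double win_probability(double s1, double s2)
{
    const double c = 1.0 / 172.0;
    return 1.0 / (1.0 + exp(-c * (s1 - s2)));
}

/* Returns 1 if the robot wins its game, 0 if it loses.  robot_strength is
 * the robot's assigned actual strength (1950..2100); opp_strength is the
 * strength estimate used for the human opponent. */
int simulate_game(double robot_strength, double opp_strength)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);  /* uniform in [0,1) */
    return u < win_probability(robot_strength, opp_strength);
}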
The results of QSSIM were interpreted in various ways by the committee. One committee member (#4 in the Appendix) used it to give all of the methods a grade of 7 for Accuracy. Two committee members (#1 and #3) used it with a finer grain, giving Accuracy ratings from 5 to 8, mostly based on QSSIM results. The remaining three committee members based their ratings for Accuracy only partially or not at all on QSSIM, some citing its inability to allow for players deciding whether to attend a tourney based on how attending would affect their chances of qualifying (e.g., the tendency to sit on a high rating in a Final Rating system). For more details on QSSIM, see .

------------------------------------------------------------------
Part III: Details of OPR, our Recommended QSA
------------------------------------------------------------------

Before giving the technical details of our specific version of OPR, we discuss some of the options that were considered. In general, OPR-like systems attempt to solve, for each Player i, the following equation:

    Expected_Wins[i] = Actual_Wins[i]

Expected_Wins[i] depends on:

1. Player i's OPR, the independent variable.
2. Opponent Strength: the skill level of Player i's opponents.
3. Ratings Curve: the function which gives the probability that a player will win a game, given the difference between that player's rating and the opponent's rating.

There have been several suggested values for #2 and #3. For Opponent Strength, possibilities include:

- Pre-Tourney Rating: the opponent's rating when the tourney starts.
- Post-Tourney Rating: the opponent's rating after the tourney ends.
- Peak rating during the qualifying period.
- Average rating during the qualifying period.
- Opponent's Performance Rating from an earlier iteration of the OPR process. (This defines the Iterated OPR process.)
- Some weighted average of the above methods, or a mixture of the above methods depending on the number of games played during the qualifying period.

For the Performance Ratings Curve, possibilities include:

- The Ratings Curve: the same probability formula used in determining ratings.
- The Empirical Ratings Curve: a probability formula estimated from actual game data.

As mentioned in Part II, QSSIM suggests that the performance of the OPR system, as measured by its ability to select the best players, is about the same regardless of which choices are used for Opponent Strength and Performance Ratings Curve. It should be noted, however, that the simulation depended on rather non-responsive behavior by the players: it assumed that players would not vary their tournament schedules in response to the implementation of the OPR qualifying system. For this reason, the particular choices of Opponent Strength and Performance Ratings Curve were made with several criteria (Cheatproofing, Appearance of Fairness, ...) in mind. In particular, the choice we made seemed to do the best job of assuring that a stronger or weaker playing schedule did not affect one's chances of qualifying. Interestingly, the Iteration technique seemed to introduce more biases than it solved, so we are now recommending an uniterated technique (OPR rather than IOPR).

We have chosen the following OPR implementation: Opponent Strength will be measured by the post-tourney rating, and the Performance Ratings Curve will be the usual NSA Ratings Curve. This specification of OPR is the same as the Performance Rating already familiar to players, with these exceptions:

1. The "tournament" in this case is the entire qualification period.
2. The measure of Opponent Strength is the opponent's rating after each individual tournament (not after the whole qualification period), instead of the opponent's pre-tourney rating. (Note that this means Opponent Strength is usually not constant during the qualification period for any given opponent, as the opponent's rating will typically vary over the course of a year.)

Here, then, are the details of our recommended QSA.

To qualify, a player must play at least 70 games during the one-year qualification period. At least 15 rated games must be played within the final three months of the period, OR at least 25 rated games must be played within the final five months of the period. Publication of a player's qualification statistic should not occur unless the player has played at least 30 games in the qualifying period, since the data might not be significant with a smaller sample.

A player with no losses is given a qualification statistic 500 points greater than the highest of his/her opponents' ratings. A player with no wins is given a qualification statistic 500 points lower than the lowest of his/her opponents' ratings. Otherwise, to calculate the qualification statistic for Player:

1. List all games played by Player (other than byes and forfeits) during the qualifying period, including:
   a. The post-tourney rating for each opponent played by Player, and
   b. A numeric value for the Outcome of each game: 1 for a win, 0.5 for a tie, and 0 for a loss.

2. Find the total Actual Number of Wins by Player by adding all the Outcomes from step 1b.

3. Use a search method to find the rating (the qualification statistic) at which Player would have to be for his/her Expected Number of Wins to equal his/her Actual Number of Wins (from step 2). For a given rating, the Expected Number of Wins is calculated by adding the probabilities of winning each game. These probabilities are determined, in turn, from the difference between the given rating and the opponent's rating (from step 1a), using the same "ratings curve" used by the current rating system. (A short sketch of this search appears at the end of this Part.)

Note that the qualification statistic is generally not an integer, but a real number. The extra precision of real numbers effectively guarantees that no two players will have the same qualification statistic. (When reporting qualification statistics to the public, it may be convenient to round or truncate the qualification statistic to an integer, except in cases where more precision is needed to break a tie between two players.) The C programmed formula for P(Win) is listed in Appendix 2 and is available at .
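As an illustration of step 3, here is a minimal sketch of the search. The function names are ours; a production implementation would call the P(Win) code in Appendix 2. The ratings_curve() below is a direct closed form of the same cumulative normal (mean 0, standard deviation 200*sqrt(2)), whereas Appendix 2 computes it via an erfc approximation. Because the Expected Number of Wins increases steadily as the trial rating increases, a simple bisection search suffices.

/* Sketch of step 3 (illustrative names, not production code). */
#include <math.h>

/* P(win) for a given difference between the trial rating and an opponent's
 * post-tourney rating: cumulative normal, mean 0, stddev 200*sqrt(2).
 * Appendix 2 computes the same curve via an erfc approximation. */
double ratings_curve(double diff)
{
    return 0.5 * (1.0 + erf(diff / 400.0));
}

/* Expected Number of Wins at a trial rating against n opponent ratings. */
double expected_wins(double trial_rating, const double *opp_rating, int n)
{
    double e = 0.0;
    for (int i = 0; i < n; i++)
        e += ratings_curve(trial_rating - opp_rating[i]);
    return e;
}

/* The qualification statistic: the rating at which the Expected Number of
 * Wins equals the Actual Number of Wins.  Expected wins rises monotonically
 * with the trial rating, so bisection over a wide bracket converges.  The
 * undefeated and winless cases are handled separately, as described above. */
double opr_statistic(const double *opp_rating, int n, double actual_wins)
{
    double lo = -1000.0, hi = 4000.0;   /* generous bracket */
    while (hi - lo > 0.001) {           /* finer than any reported precision */
        double mid = 0.5 * (lo + hi);
        if (expected_wins(mid, opp_rating, n) < actual_wins)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}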
------------------------------------------------------------------
Appendix 1: Summary of Committee Opinions of QSAs
------------------------------------------------------------------

Here is a summary of voting on QSAs as done after the QSSIM simulation results were available. This voting should replace that in our original report. Most of the changes were in the Accuracy column, but some members decided to make small changes elsewhere as well.

Column headings:

ACC:  Accuracy
AOF:  Appearance of Fairness
ENC:  Encouragement
CHT:  Cheatproofing
EXP:  Expense
CAL:  Calculability
OVER: Overall
CON:  Weighted average of the first six columns, using consensus weights
WHO:  Number of the RC member giving said ratings (listed anonymously in a random order to avoid harassment of individual committee members), or AVG to designate the average of all six members

QSAs:

PEAK:     Peak Rating
1997:     QSA used for US selections to 1997 Worlds
FINAL:    Final Rating
OPR:      Iterated or Uniterated Overall Performance Rating
3 SAMPLE: Average of Three Samples

PEAK
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 6     3     8     5     8     7     7     5.54   #1
 4     3     10    5     9     6     4     4.37   #2
 6     3     10    1     9     10    6     5.51   #3
 7     2     10    5     10    10    7     6.2    #4
 0     0     10    0     9     9     1     0.95   #5
 6     4     10    1     10    10    6     5.75   #6
 4.83  2.5   9.67  2.83  9.17  8.67  5.17  4.72   AVG

1997
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 6     3     8     6     1     2     6     5.26   #1
 3     2     4     5     2     3     2     2.91   #2
 6     3     10    3     3     6     5     5.33   #3
 7     2     7     5     10    10    6     6.05   #4
 3     4     10    0     3     7     4     3.44   #5
 4     3     10    1     4     5     5     3.96   #6
 4.83  2.83  8.17  3.33  3.83  5.5   4.67  4.492  AVG

FINAL
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 8     7     6     6     8     10    8     7.62   #1
 7     6     6     5     10    10    7     6.8    #2
 8     3     5     9     10    10    6     7      #3
 7     2     4     5     10    10    6     5.9    #4
 3     3     4     10    10    10    3     3.75   #5
 8     6     6     8     10    10    8     7.6    #6
 6.83  4.5   5.17  7.17  9.67  10    6.33  6.445  AVG

OPR
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 5     6     7     9     4     3     5     5.44   #1
 8     8     8     6     5     2     7     7.72   #2
 5     6     5     10    2     0     5     5.28   #3
 7     5     7     9     10    9     7     6.84   #4
 9     8     8     9     7     1     8     8.59   #5
 9     9     5     6     9     4     9     8.6    #6
 7.17  7     6.67  8.17  6.17  3.17  6.83  7.078  AVG

3 SAMPLE
 ACC   AOF   ENC   CHT   EXP   CAL   OVER  CON    WHO
 7     4     8     10    3     5     6     6.42   #1
 7     6     7     5     7     6     6     6.69   #2
 7     6     6     9     8     10    7     6.92   #3
 7     4     5     5     10    10    6     6.35   #4
 4     4     4     10    10    10    5     4.6    #5
 6     3     3     8     8     6     5     5.43   #6
 6.33  4.5   5.5   7.83  7.67  7.83  5.83  6.068  AVG

Summary of above statistics:

method   avg. overall   avg. consensus-weighted   1st-places   avg. rank
------   ------------   -----------------------   ----------   ---------
OPR      6.83           7.08                      3            2.42
FINAL    6.33           6.45                      1.5          2.5
AVERAGE  5.83           6.07                      1            3.0
PEAK     5.17           4.72                      0.5          3.0
1997     4.67           4.49                      0            4.08

Comparing the top three systems head to head:

FINAL vs OPR: 3.5 to 2.5 preferring OPR
OPR vs AVG:   4 to 2 preferring OPR
AVG vs FINAL: 3.5 to 2.5 preferring FINAL

------------------------------------------------------------------
Appendix 2: Programming for the NSA Ratings Curve
------------------------------------------------------------------

(Also available at )

// C functions for calculating the NSA Ratings Curve
// (the cumulative normal distribution).
//
// To find P(Win|Rating1,Rating2), as used in the NSA ratings
// formula (as of Jan. 10, 1998), call
//   CumNormal(Rating1-Rating2, 0, 200.*sqrt(2))              if abs(Rating1-Rating2) <= 700
//   CumNormal(signof(Rating1-Rating2)*700, 0, 200.*sqrt(2))  otherwise

#include <math.h>

double CumStandardNormal(double x);
double erfcc(double x);

double CumNormal(double x, double mean, double stddev)
// The cumulative normal distribution
{
    return CumStandardNormal((x - mean) / stddev);
}

double CumStandardNormal(double x)
// The cumulative standard normal distribution
{
    return 1.0 - 0.5 * erfcc(x / sqrt(2.0));
}

// erfcc() constants, from the book __Numerical Recipes__:
#define ERFCC_1  -1.26551223
#define ERFCC_2   1.00002368
#define ERFCC_3   0.37409196
#define ERFCC_4   0.09678418
#define ERFCC_5  -0.18628806
#define ERFCC_6   0.27886807
#define ERFCC_7  -1.13520398
#define ERFCC_8   1.48851587
#define ERFCC_9  -0.82215223
#define ERFCC_10  0.17087277

double erfcc(double x)
// Complementary error function, from the book __Numerical Recipes__,
// modified by Robert L. Parker
{
    double t, z, ans;

    z = fabs(x);
    t = 1.0 / (1.0 + 0.5 * z);
    ans = t * exp(-z*z + ERFCC_1 + t*
                 (ERFCC_2 + t*
                 (ERFCC_3 + t*
                 (ERFCC_4 + t*
                 (ERFCC_5 + t*
                 (ERFCC_6 + t*
                 (ERFCC_7 + t*
                 (ERFCC_8 + t*
                 (ERFCC_9 + t*
                  ERFCC_10)))))))));
    return x >= 0.0 ? ans : 2.0 - ans;
}
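The short wrapper below is our addition, not part of the NSA code. It shows how the functions above might be called to obtain P(Win|Rating1,Rating2) following the convention in the comment at the top of this appendix, assuming rating differences beyond 700 points are capped at +/-700.

// Illustrative wrapper (ours, not part of the NSA code) using the
// functions above.  Assumes rating differences beyond 700 points are
// capped at +/-700, per the calling convention described above.
double p_win(double rating1, double rating2)
{
    double diff = rating1 - rating2;
    if (diff >  700.0) diff =  700.0;
    if (diff < -700.0) diff = -700.0;
    return CumNormal(diff, 0.0, 200.0 * sqrt(2.0));
}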
------------------------------------------------------------------
Appendix 3: Some of the Results from QSSIM
------------------------------------------------------------------

Of the QSAs tested with QSSIM, we present here the following eleven:

1. OPRmleHI (iteration for robots only, ratings curve stddev = estimated from data, Opponent Strength = max rating during the Qualifying Period)
2. OPR (no iteration, ratings curve stddev = same as in the ratings formula)
3. OPRmle (no iteration, ratings curve stddev = estimated from data)
4. OPRmleI (iteration for robots only, ratings curve stddev = estimated from data)
5. IOPR (iteration for all players with #games >= 50, ratings curve stddev = same as in the ratings formula)
6. IOPRmle (same as IOPR, but ratings curve stddev = estimated from data)
7. HI (Peak rating during the Qualifying Period)
8. AVG3_g (Average of 3 ratings, weighted by the number of games played in each period)
9. AVG3_123 (Average of 3 ratings, weights 1, 2, 3 for successive periods)
10. PAR (Peak average rating: the NSA's 1997 WSC qualifying system)
11. RAT (Current rating at the end of the Qualifying Period)

For reasons to be explained in Robert Parker's thesis, the curve used in the calculations varied slightly from the usual ratings curve. The difference was shown to be extremely minor. (In particular, he used the logistic curve described by P() = 1/(1 + exp(-c*(r1-r2))), where r1 and r2 are ratings, exp is the exponential function, and c is a constant set to 1/172 to closely approximate the usual ratings curve.)

Of the several statistics reported by QSSIM, we present here the following two:

1. n-out-of-10: The number of the known Top 10 robots who are ranked among the Top 10 by the QSA. Higher is better.
2. The Rank Difference (RD): The robots are ranked 1, ..., 32. The average rank for Group1 (robots playing more difficult schedules) is calculated, and likewise for Group2. Then RD = Group2 average - Group1 average. A positive RD indicates that Group1 was ranked better than Group2; a negative RD indicates the reverse. Nearer to zero is better.

More complex and complete statistics were also collected; the results with those statistics are essentially identical to those with the simpler statistics given above. (A short sketch of how the two simple statistics are computed appears at the end of this appendix.)

Here, then, are averages and 90% confidence intervals for the two statistics for each of the eleven methods.

            n-out-of-10               Rank Difference           Favors
            Upper   Lower   Ave       Upper    Lower    Ave
OPRmleHI    7.163   7.019   7.091     -3.71    -8.46    -6.09   2
OPR         7.154   7.011   7.083      2.86    -1.79     0.53   -
OPRmle      7.157   7.013   7.085     11.14     6.43     8.78   1
OPRmleI     7.157   7.013   7.085      7.22     2.47     4.84   1
IOPR        7.150   7.005   7.078     -6.24   -11.00    -8.62   2
IOPRmle     7.151   7.006   7.079     -4.44    -9.20    -6.82   2
HI          7.158   7.015   7.086      4.77     0.04     2.40   1
AVG3_g      7.361   7.219   7.290      5.16     0.75     2.96   1
AVG3_123    7.456   7.284   7.370      5.14    -0.74     2.20   -
PAR         7.352   7.158   7.255      7.64     1.06     4.35   1
RAT         7.551   7.411   7.481      9.06     4.91     6.98   1

The "Favors" column tells which of Group1 (stronger schedules) or Group2 (weaker schedules) is favored by the QSA.
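For concreteness, here is a minimal sketch (our names, not QSSIM's) of how the two statistics above could be computed for a single simulation run, given each robot's assigned actual strength and the rank (1..32) the QSA gives it, with robots 0..15 forming Group1 and robots 16..31 forming Group2.

/* Sketch of the two QSSIM summary statistics; illustrative only.
 * strength[i] = robot i's assigned actual strength
 * qsa_rank[i] = robot i's rank (1..32) under the QSA being tested
 * Robots 0..15 are Group1 (stronger schedules); 16..31 are Group2. */
#define NROBOTS    32
#define GROUP_SIZE 16

/* n-out-of-10: how many robots ranked 1..10 by the QSA are also among the
 * 10 strongest by actual strength.  (Ties in strength, which do occur with
 * the real robot assignments, would need a tie-breaking rule.) */
int n_out_of_10(const double strength[NROBOTS], const int qsa_rank[NROBOTS])
{
    int count = 0;
    for (int i = 0; i < NROBOTS; i++) {
        if (qsa_rank[i] > 10)
            continue;                      /* not in the QSA's top 10 */
        int stronger = 0;                  /* robots strictly stronger than i */
        for (int j = 0; j < NROBOTS; j++)
            if (strength[j] > strength[i])
                stronger++;
        if (stronger < 10)                 /* i is among the true top 10 */
            count++;
    }
    return count;
}

/* Rank Difference: Group2 average rank minus Group1 average rank.
 * Positive means Group1 (stronger schedules) was ranked better. */
double rank_difference(const int qsa_rank[NROBOTS])
{
    double sum1 = 0.0, sum2 = 0.0;
    for (int i = 0; i < NROBOTS; i++) {
        if (i < GROUP_SIZE) sum1 += qsa_rank[i];
        else                sum2 += qsa_rank[i];
    }
    return sum2 / GROUP_SIZE - sum1 / GROUP_SIZE;
}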