Mathematical flaws in the ICC ratings
There is something almost fun about the way the ICC ratings reset every August. For a system designed to be simple, the random wanderings of discarded results cause endless sniping and confusion. This is compounded by the T20 ratings, which use so little data, with teams so evenly matched, that any number of odd results turn up. In a previous post on this topic, I looked at some of the operational quirks inherent to the ratings.
In this one I want to demonstrate the mathematical flaws that cause those quirks. Lest that seem petty, it is worth noting that the ICC commissioned a (secret) report that determined that the ratings were fit for purpose when deciding qualification for major events. What I want to demonstrate here is that although it is actually quite hard to make a rating system flawed enough to produce really odd results, the ICC ratings are absolutely not fit for purpose, should they be used for qualification. And that is leaving aside the more general point that things like qualification ought to be decided on the field, not via the questionable calculations of statistics nerds.
Problem 1: Linear predictions
It is worth recapping how the ratings are calculated, before deconstructing the problems with that approach. The ICC description over-complicates the formula, which can be simplified into two parts:
- The win points: the percentage of games won, regardless of opposition.
- The opposition bonus: for teams separated by less than 40 points, the opposition rating minus 50; otherwise, a team's own rating minus 90 (for the higher placed team) or a team's own rating minus 10 (for the lower).
Each year a team accumulates points according to that formula, divided by the number of games. Series results count as an extra game, so they don't affect the basic formula. As with all ratings systems, old results are diminished with time, but we'll get to that.
Using the standard formula, if a team has an accurate rating, then their expected win percentage against another side can be calculated from the ratings difference: in percentage terms, 50 plus the difference. This is a linear formula. Two sides rated 100 will score 50 opposition points and therefore need to win 50% of games to maintain stability. A side rated 20 points higher will expect to win 70% of games; one 40 points higher, 90% of games. At which point the system breaks down. Any ratings difference greater than 50 would predict more than a 100% win rate. This being impossible, and therefore likely to cause ratings to decline even when a side wins every game, the ICC ratings have a second clause that projects an entirely arbitrary win expectation of 90% for games played outside those otherwise narrow bounds.
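The arithmetic in that paragraph can be checked with a short sketch. It follows the simplified two-part formula described above, not the ICC's own wording, and the function names are mine. It assumes that points per game are the win percentage plus the opposition bonus, and that a rating is stable when points per game equal the rating.

```python
def opposition_bonus(own, opp):
    """Opposition bonus per game, per the simplified formula in this post."""
    gap = own - opp
    if abs(gap) < 40:
        return opp - 50      # close contest: depends on the opponent's rating
    if gap > 0:
        return own - 90      # higher placed team, 40 or more points ahead
    return own - 10          # lower placed team, 40 or more points behind


def break_even_win_rate(own, opp):
    """Win rate at which points per game equal the team's own rating.

    Points per game = 100 * win_rate + bonus, so stability requires
    win_rate = (own - bonus) / 100.
    """
    return (own - opposition_bonus(own, opp)) / 100


if __name__ == "__main__":
    for own, opp in [(100, 100), (110, 90), (120, 80), (150, 50), (50, 150)]:
        print(own, opp, break_even_win_rate(own, opp))
```

Inside the 40 point band the break-even rate rises by one percentage point per rating point. Outside it, the rate is pinned at 90% (or 10% for the weaker side) however wide the gap.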
The root of these difficulties lies in the linear projection from ratings difference to expected wins. It ought to be obvious that as the ratings difference increases, the probability of the weaker team winning will diminish towards zero, but never reach it. The ICC ratings (blue line) are too variable for evenly matched teams, and compress the ratings of sides around the 90% win mark, eventually flat-lining when it ought to be a slow curve. An examination of the distribution of margins shows that they are basically normal around the expectation, and therefore win probability is better calculated with a cumulative normal function, based on ratings difference (the red line) - or for simplicity a curved or segmented linear approximation.
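For readers without the graph, the two curves can be compared with a minimal sketch. The spread parameter SIGMA is a placeholder I have picked purely for illustration; it is not fitted to any match data, and the right value would have to be estimated from actual margins.

```python
import math

SIGMA = 40.0  # hypothetical spread in rating points, NOT fitted to real results


def icc_expected_win(diff):
    """Implied ICC expectation: linear inside +/-40, flat at 90% / 10% outside."""
    if diff >= 40:
        return 0.9
    if diff <= -40:
        return 0.1
    return 0.5 + diff / 100


def normal_expected_win(diff, sigma=SIGMA):
    """Cumulative normal expectation: approaches 0 and 1 without reaching them."""
    return 0.5 * (1 + math.erf(diff / (sigma * math.sqrt(2))))


if __name__ == "__main__":
    for diff in (0, 20, 40, 60, 100):
        print(diff, icc_expected_win(diff), round(normal_expected_win(diff), 3))
```

Whatever value is chosen for the spread, the cumulative normal keeps distinguishing a 60 point gap from a 100 point gap, which the capped linear form cannot do.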
A ratings formula that used this property would be stable for all teams included in the ratings. The ICC ratings are, by contrast, not stable with respect to predictions.
Problem 2: Infinite and Zero Bounds
Once the ratings difference crosses 40, the expected win percentage for the better team is capped at 90%. This has two effects. Firstly, it makes it impossible to accurately rate a team within this no-man's-land of mediocrity (or greatness). There is no difference in expectation between a side rated 50 behind and one rated 100 behind, so there is no way to know if a side should move up or down from their rated position given a string of results with a sub-10% win percentage. Where teams have more evenly spaced ratings - such as the T20 table - it is theoretically possible to adjust a team in relation to its immediate rivals, but in test cricket, where Bangladesh are well below a 10% win percentage, their rating is mathematically meaningless.
Secondly, the upshot of expecting a 90% win percentage when a team is actually winning more than 90% of games is that the higher team's rating will increase indefinitely; and vice versa, the lower ranked team will be driven to zero. It would fall further still, but the ratings are artificially bounded at zero. This is itself a problem when rating associates, because the weakest (zero-bound) associate will almost certainly be more than (a cumulative) 100 point rating difference below the major test teams, meaning the 100 centring is probably too low.
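The drift is easy to reproduce in a toy simulation. The sketch below makes a deliberately crude assumption that is not the ICC's method: each year's rating is simply replaced by that year's points per game, with no weighting of older results. The win rate of 95% and the starting ratings are made-up inputs.

```python
def yearly_points(own, opp, win_rate):
    """Points per game under the simplified formula described in this post."""
    gap = own - opp
    if abs(gap) < 40:
        bonus = opp - 50
    elif gap > 0:
        bonus = own - 90
    else:
        bonus = own - 10
    return 100 * win_rate + bonus


def simulate_drift(strong, weak, strong_win_rate, years):
    """Replace each rating with that year's points (a simplification), floor at zero."""
    history = []
    for _ in range(years):
        new_strong = yearly_points(strong, weak, strong_win_rate)
        new_weak = max(0.0, yearly_points(weak, strong, 1 - strong_win_rate))
        strong, weak = new_strong, new_weak
        history.append((strong, weak))
    return history


if __name__ == "__main__":
    for year, (s, w) in enumerate(simulate_drift(130.0, 70.0, 0.95, 20), start=1):
        print(year, round(s, 1), round(w, 1))
```

With a true win rate of 95% against an expectation of 90%, the stronger side gains 5 points every year without limit, and the weaker side loses 5 a year until it hits the floor at zero.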
Problem 3: Oscillations caused by adjustments
The combination of infinite bounds and linear predictions makes the system highly unstable for the lower-ranked test teams and associate nations. But it is not the cause of the more recent random results. That is related to the method used for removing old results.
To understand this, consider the simplified case of a two-team system playing one game/series per year. Imagine that prior to year 1, the first team wins exactly 90% of all games played between the two sides. Over time, their ratings will converge to 120 (for the stronger side) and 80 (for the weaker), a difference of 40. Now suppose that, in year 1, and every year after, the two sides are equal, winning 50% of games played. To accurately represent this change, the ratings ought to converge to 100, although it is a matter of preference how quickly this occurs.
The ratings do this by giving the stronger team fewer points because of the opposition bonus, which drags them down below the 100 mark. That is, the higher rated team gets (80-50) opposition points and 50 win points, equal to 80. And the lower rated team gets (120-50) opposition points and 50 win points, equal to 120.
The ICC reduce the power of old results by halving the points value after 1-2 years, and removing results after 3-4 years, calculated each August. This causes some significant problems. Four methods of diminishing old results are presented below to demonstrate the problem.
Method 1: 50% averaging. Averaging the most recent year with the rating at the start of the year. Weirdly enough, this works perfectly adequately. Because the latest game mirrors the change required, both teams move directly to 100, and then stay there (blue line).
Method 2: with removal. Alternatively, a simple method that reflects the ICC approach is to average the points accumulated over only the previous two years, removing results older than that. What happens here (red line) is that in the first year, the results average out (120,80 for one team and 80,120 for the other), which gives the correct result of 100. However, in the second year, where both teams accumulate 100 points, the removed result (120) pushes the higher rated team down to 90 (80,100), before oscillating back to 105 (100,110), and continuing for several years.
Recalling that the ICC effectively uses a year to year points calculation, the prospect of it oscillating is real. Though to be fair, it is not quite that pronounced.
Method 3: 33% averaging. Even though 50% averaging is perfectly reasonable, most people would consider that it gives too much prominence to recent results, effectively only the previous year. An alternative is shown here (blue line) that uses 2/3 of the rating at the start of the year, and 1/3 of the points from the previous year. It converges more slowly on the 100 point mark, taking 4 years to complete the move (107,102,101,100).
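Methods 1 and 3 are the same calculation with a different weight on the latest year, which a small sketch makes plain. It assumes the two-team thought experiment above: the two ratings always sum to 200, both sides win 50% of games, and only the originally stronger side is tracked. The function names are mine.

```python
def equal_strength_points(rating):
    """Points earned in a year when both sides win 50% of games.

    In the two-team system the opponent's rating is 200 - rating, so the
    points are 50 win points plus an opposition bonus of (opponent - 50).
    """
    opponent = 200 - rating
    return 50 + (opponent - 50)


def smoothed_path(start, new_weight, years):
    """Blend the start-of-year rating with the year's points each year."""
    rating = start
    path = []
    for _ in range(years):
        points = equal_strength_points(rating)
        rating = (1 - new_weight) * rating + new_weight * points
        path.append(rating)
    return path


if __name__ == "__main__":
    print([round(r) for r in smoothed_path(120, 1 / 2, 4)])  # Method 1
    print([round(r) for r in smoothed_path(120, 1 / 3, 4)])  # Method 3
```

Because nothing is ever removed outright, the rating approaches 100 from above and never overshoots it.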
Method 4: ICC averaging. The ICC takes a more complex, and needless to say broken, approach to its averaging. Over the three year period calculated each August, it applies 50% weighting to the results from the first two recorded years and the full weighting to the most recent. This produces a very odd result. In years 1 and 2, the results average out to 100 (120,120,80x2) and (120,80,100x2). But in year 3, when the last of the good results is removed, the team drops below 100 to 95, despite never playing at that level (80,100,100x2). That induces an oscillation that is still distinguishable until fully nine years after the change in playing strength.
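The two windowed schemes (Methods 2 and 4) can be reproduced with one routine that takes the weights as a parameter. This is a sketch of the thought experiment only, not of the ICC's actual implementation: it assumes the two-team system above, one block of points per year, ratings that sum to 200, and a window that starts full of 120s from the years before the change.

```python
def equal_strength_points(rating):
    """Points for a year of 50% wins against an opponent rated 200 - rating."""
    opponent = 200 - rating
    return 50 + (opponent - 50)


def windowed_path(start, weights, years):
    """Rate the team as a weighted mean of the last len(weights) years of points.

    `weights` runs oldest to newest; results older than the window are dropped.
    """
    window = [start] * len(weights)   # history before the change in strength
    rating = start
    path = []
    for _ in range(years):
        window = window[1:] + [equal_strength_points(rating)]
        rating = sum(w * p for w, p in zip(weights, window)) / sum(weights)
        path.append(rating)
    return path


if __name__ == "__main__":
    print(windowed_path(120, (1, 1), 6))          # Method 2: two years, equal weight
    print(windowed_path(120, (0.5, 0.5, 1), 9))   # Method 4: ICC-style weighting
```

In both cases the dip below 100 arrives in the year an old high-scoring result leaves the window, not in a year when anything changed on the field.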
This is exactly what we see in the recent rating change. Australia get a fourth year bounce oscillating up from the knock-on effect of their rating drop in 2009. India, and more particularly, South Africa receive little kicks from the same period, while England receive a small bounce for the Ashes the following year. None of which has any relation to recent results. It is nothing but artifacts of old changes, insufficiently balanced out.
Problem 4: Over-sensitivity
The final issue, which is particularly important in relation to qualifying, is that, as the averaging shows, while the ratings include results from as long as three years ago, they are so sensitive that the previous 12 months of results are the overwhelming contributor to the ratings. The oscillation, while potentially important, only appears on the graph because the ratings are being held stable.
Thus, a team with a particularly difficult or easy tour in the lead-up to the qualifying cut-off will gain or lose a significant advantage. Recalling that the ICC makes no allowance for home advantage, the order of tours and the exact (as yet unknown) date for qualifying might have a significant impact on who makes the ICC's major tournaments. Add in an element of randomness via mathematical nonsense and there is absolutely no way they should be used for the purpose being proposed. The fact that a study was conducted that has supposedly concluded the opposite ought to raise some serious questions about the quality of their independent advice.
There are very good, very simple ratings systems available. The IRB have a very sensible approach*, and include all their members - the ongoing absence of the ICC's own members from its ratings ought to be an embarrassment. But even they are not so unwise as to use the ratings for global qualifying. At some point, putting faith in ratings that are broadly untrusted and produce odd results will cause the ICC more headaches than they are worth.
* Actually the IRB also use a linear projection, but introduced a different quirk, capping rating changes for wins against teams more than 10 points apart. This too is an unnecessary flaw, as it holds teams within the orbit of those they regularly play, suppressing New Zealand's rating, and keeping Italy, Scotland (and maybe now Argentina) from dropping as far as they sometimes might. It is partially mitigated by rugby's (slightly) more open policy towards playing weaker sides.
26th July, 2012 02:51:16