Fixing the ICC ratings
Russell Degnan

In my post on the fundamental sameness of ratings I implied some criticism of the ICC ratings. Most choices in constructing a ratings system are either design choices - home advantage doesn't matter with a large sample and an even schedule - or relate to what the system is trying to achieve. The decay rate will differ if a rating is supposed to reflect the last two months rather than the previous two years.

The ICC ratings go towards a championship trophy and should therefore reflect the previous 12 months, but with scheduling so uneven that is near impossible, and different choices have been made in favour of a relatively simple system.

As discussed in a previous post, however, the ICC ratings have some genuine problems. The choice to cap the implied probability at 90% means that for a large number of matches the ratings are a poor reflection of the quality of the sides. Similarly, the choice of decay - reducing and then dropping previous results - causes other issues, given the quality of opposition has already been accounted for in the points awarded.

Both of these issues are relatively easy to fix, and this post discusses the benefits of doing so, particularly in a new world where nations of wildly different abilities must be included in the same ratings - as opposed to the full-member-oriented system where all teams were broadly at the same level.

Changing the implied probability

As noted, the basic issue with the ICC ratings' implied probability is that once teams are more than 40 ranking points apart, the ratings assume the stronger side will win 90% of matches. This pushes the ratings apart - particularly when one side is significantly weaker than their opponents. It also means that the points on offer for wins over strong sides are lower for bad sides than for good ones - which limits the ability of the ratings to adapt to changes in ability.

As the graph above shows (the blue ICC lines), once the gap between teams gets above 40 points, the points a team gains relative to its current rating remain the same. The value of a win therefore declines as the probability of winning decreases. At its most extreme, when sides are rated more than 180 points apart, a strong side will get more points for losing a match than the weaker team will get for winning it.

The solution is to adjust the points on offer in proportion to the ratings gap between the two teams, as per the red lines in the graph. These eventually settle on the stronger side receiving no additional points (i.e. their current rating) for a win - an implied probability of 100% - and the weaker team receiving half the ratings gap plus 80 in the unlikely event they win.

The formulas would therefore be as follows:

Ratings gap: 0-40
  ICC, stronger and weaker team alike:      Win: OppRat + 50; Loss: OppRat - 50
  Proposed, stronger and weaker team alike: Win: OppRat + 50; Loss: OppRat - 50

Ratings gap: 40-90
  ICC, stronger team:      Win: OwnRat + 10; Loss: OwnRat - 90
  ICC, weaker team:        Win: OwnRat + 90; Loss: OwnRat - 10
  Proposed, stronger team: Win: 0.1 * OppRat + 0.9 * OwnRat + 14; Loss: 0.6 * OppRat + 0.4 * OwnRat - 66
  Proposed, weaker team:   Win: 0.6 * OppRat + 0.4 * OwnRat + 66; Loss: 0.1 * OppRat + 0.9 * OwnRat - 14

Ratings gap: 90-180
  ICC, stronger team:      Win: OwnRat + 10; Loss: OwnRat - 90
  ICC, weaker team:        Win: OwnRat + 90; Loss: OwnRat - 10
  Proposed, stronger team: Win: 0.05 * OppRat + 0.95 * OwnRat + 9; Loss: 0.55 * OppRat + 0.45 * OwnRat - 71
  Proposed, weaker team:   Win: 0.55 * OppRat + 0.45 * OwnRat + 71; Loss: 0.05 * OppRat + 0.95 * OwnRat - 9

Ratings gap: 180 plus
  ICC, stronger team:      Win: OwnRat + 10; Loss: OwnRat - 90
  ICC, weaker team:        Win: OwnRat + 90; Loss: OwnRat - 10
  Proposed, stronger team: Win: OwnRat; Loss: 0.5 * OppRat + 0.5 * OwnRat - 80
  Proposed, weaker team:   Win: 0.5 * OppRat + 0.5 * OwnRat + 80; Loss: OwnRat

They look more complicated than they are. The existing ICC ratings use either a team's own rating or the opposition's. Blending the two allows the much more gradual increase in points shown above (optimally the area between 0 and 40 would also be curved, but I have chosen to leave it as is).
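
For concreteness, here is the proposed scheme as a short Python sketch. The function name and structure are mine, and draws are ignored for simplicity; it is a direct transcription of the table above.

    def proposed_points(own, opp, won):
        """Points credited to a team rated `own` against a team rated `opp`
        under the proposed formulas (draws ignored for simplicity)."""
        gap = abs(own - opp)
        stronger = own >= opp
        if gap < 40:
            return opp + 50 if won else opp - 50
        if gap < 90:
            w_win, c_win = (0.1, 14) if stronger else (0.6, 66)
            w_loss, c_loss = (0.6, -66) if stronger else (0.1, -14)
        elif gap < 180:
            w_win, c_win = (0.05, 9) if stronger else (0.55, 71)
            w_loss, c_loss = (0.55, -71) if stronger else (0.05, -9)
        else:
            w_win, c_win = (0.0, 0) if stronger else (0.5, 80)
            w_loss, c_loss = (0.5, -80) if stronger else (0.0, 0)
        w, c = (w_win, c_win) if won else (w_loss, c_loss)
        # Each formula is a blend of the two ratings plus a constant.
        return w * opp + (1 - w) * own + c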

The changed implied probability shows the benefits of this approach:

Whereas previously teams were either closely matched or pegged at a 90% chance of victory, now an approximate chance of victory can be determined across the full range of ratings gaps.
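
This can be checked directly: the implied probability is the win probability at which a match is expected to leave a team's rating unchanged, i.e. the p that solves p * win + (1 - p) * loss = own. A sketch, where icc_points is my transcription of the existing formulas and proposed_points is reused from the sketch above:

    def icc_points(own, opp, won):
        # Existing ICC formulas: beyond a 40-point gap, points depend only
        # on a team's own rating, which is what caps the probability at 90%.
        if abs(own - opp) < 40:
            return opp + 50 if won else opp - 50
        if own >= opp:
            return own + 10 if won else own - 90
        return own + 90 if won else own - 10

    def implied_probability(own, opp, points):
        # Solve p * win + (1 - p) * loss = own for p.
        win, loss = points(own, opp, True), points(own, opp, False)
        return (own - loss) / (win - loss)

    for gap in (0, 20, 40, 90, 180, 250):
        p_icc = implied_probability(100 + gap, 100, icc_points)
        p_new = implied_probability(100 + gap, 100, proposed_points)
        print(f"gap {gap:3d}: ICC {p_icc:.2f}, proposed {p_new:.2f}")

The ICC column sits at 0.90 for every gap beyond 40; the proposed column rises smoothly to 1.00 at a gap of 180.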

This change would make only subtle differences to the ratings. Bangladesh's improvement a few years ago would have given them a more rapid (and noticeable) boost, reflecting their actual ability rather than their long period of tepid performances. The odd associate upset would have been better reflected in their ratings - when they are included. But as these results are rare, the broader outline of the ratings would be the same. The more important change is to the decay rate.

Changing the decay rate

As a matter of basic maths, if points were allowed to accumulate indefinitely then new matches would have a decreasing effect on the ratings. The ICC works around this in the simplest way - by reducing the previous two years' results by 50% and excluding anything before that. But it has an unfortunate side effect: at each exclusion date the ratings jump, sometimes substantially, and often in strange directions.

The effect of this choice can be seen in a simple example. Here a team plays (and wins or loses matches) at different levels over the course of several years. The true rating of the team in each year (which, nominally, the ratings should reflect) is as follows: 100, 80, 100, 120, 120, 120, 100. The graph shows this shift (at the start of each year) and the impact of the ICC decay formula (at the end of each year).

Notice that, because the previous year is reduced to 50% in preparation for a new year, the rating shifts away from the true rating at the end of the second and third years, as old results are re-weighted upwards relative to the past year. The ICC rating eventually meets the true rating only if the team has maintained the same level for two years; otherwise it is often a long way from correct.

The oddity of this simple choice of decay is that it is also unnecessary. The "natural" way to ensure old results do not impact the rating, without unseemly jumps, is to merely divide both the points accumulated and the number of matches by some amount. In the graph above this was 3, reducing the weight of old results to a third each year (and to a ninth the year after).

The proposed system never quite matches the yellow line - though arguably nor should it - but it is consistently closer than the ICC rating and gradually gets closer the longer a team stays at the same level (in the third year rated at 120 it reaches 119).

More importantly, there are no jumps. As both points and matches are reduced by the same factor, a team stays on the same rating until they play. Which is exactly how it should be.
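
The worked example can be reproduced with a short simulation. It assumes, for illustration, 10 matches per year, each worth the team's true rating in points on average; the 50% window below is a simplification of the ICC's actual cut-off dates. With those assumptions it also reproduces the 119 figure in the third year at 120:

    TRUE_RATINGS = [100, 80, 100, 120, 120, 120, 100]
    MATCHES = 10  # assumed matches per year, each worth the true rating on average

    def icc_ratings(levels):
        # End-of-year rating: current year at full weight, previous two
        # years at 50%, anything older excluded.
        out = []
        for t in range(len(levels)):
            pts = levels[t] * MATCHES
            games = float(MATCHES)
            for back in (1, 2):
                if t - back >= 0:
                    pts += 0.5 * levels[t - back] * MATCHES
                    games += 0.5 * MATCHES
            out.append(pts / games)
        return out

    def proposed_ratings(levels, divisor=3):
        # Divide accumulated points and matches by the same amount each
        # year: old results fade smoothly and the rating never jumps
        # between games.
        out, pts, games = [], 0.0, 0.0
        for level in levels:
            pts = pts / divisor + level * MATCHES
            games = games / divisor + MATCHES
            out.append(pts / games)
        return out

    print([round(r, 1) for r in icc_ratings(TRUE_RATINGS)])
    # [100.0, 86.7, 95.0, 105.0, 115.0, 120.0, 110.0]
    print([round(r, 1) for r in proposed_ratings(TRUE_RATINGS)])
    # [100.0, 85.0, 95.4, 112.0, 117.4, 119.1, 106.4]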

Idle Summers 23rd October, 2018 23:33:04   [#] [0 comments] 

The fundamental sameness of ratings
Russell Degnan

Two things generally hold for any half-way reputable ratings system:

  1. Some observers will criticise a particular ranking because they've forgotten a set of results occurred.
  2. The basic result will look the same as everybody else's. The good teams will be good, the bad teams will be bad, and there will be a big blob in the middle.

It is hard to fuck up a ratings system.

People do. But since, in its most basic form, a league table is a ratings system, and a league table will usually give a decent approximation of the best and worst teams, it is quite hard to do worse.

It is worth considering the constituent elements of a ratings system though, for anyone planning to construct one, to outline the basic issues of design, and the problems in scheduling structures that bring systems undone (or need to be corrected for). To do this I'll look at five systems: a basic league table (which, as noted, is a form of rating), the ICC system for cricket, the IRB system for rugby, Elo (which is widely used, but I'll focus on football) and my own cricket ratings.

Margin of victory

Your standard results-based, backward-looking rating system (there may be others but they aren't widely used) brings together a set of results and projects the quality (and possibly future performance) of teams based on games played. There are two basic options for the input: the result alone, or the margin of victory.

A league table and the ICC use just the result. The IRB has a little each way, providing a bonus for a larger margin of victory, while the Elo soccer ratings and my own cricket ratings use the margin of victory. The benefit of using the margin is that it provides more information, particularly in sports where the result is relatively predictable but where it might reasonably be assumed there is a Pythagorean relationship between points differential and eventual wins.

One thing I have considered but not implemented is to account for Test draws by examining "distance to victory". Hence a drawn match with one side 100 runs in arrears and the other needing 2 wickets would have a margin of 100 runs minus 2 wickets (nominally 50 runs in my ratings). A non-linear relationship for wickets is another margin option to consider. Note that the added accuracy of these changes is marginal at best.
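
As a sketch only - the 25 runs per wicket below is an assumption implied by the nominal example, not a published method:

    RUNS_PER_WICKET = 25  # assumed: makes "100 runs - 2 wickets" come to 50

    def draw_margin(runs_in_arrears, wickets_needed):
        # Margin for the side pressing for victory in a draw: the chasing
        # side's shortfall in runs, less the runs-equivalent of the
        # wickets the fielding side still needed.
        return runs_in_arrears - RUNS_PER_WICKET * wickets_needed

    print(draw_margin(100, 2))  # 50, the nominal example above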

Strength of opposition

Here lies the first leap forward from a basic league table: measuring the strength of schedules. In many leagues this is not strictly necessary as the schedules are relatively even. In cricket it is nonsense to forgo it.

Accounting for opposition means having an implied probability of victory. That is: if an average team (one with a 50% probability of beating another average team) is expected to win less than half of the matches against their schedule, then the points awarded are increased to make up the deficit. All ratings have an implied probability of victory, but the results vary.

My ratings are margin-based - though there are adjustments for wins/draws - and use a "normal" implied probability with a standard deviation of 180 runs. Basic Elo ratings are based on three results (win/loss/draw) and therefore use a different function. Both of these are smooth curves, as per the image.
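
For illustration, the two curve shapes side by side. The 400-point scale is the standard Elo convention and the 180-run standard deviation is as above; note the two functions take different units (rating points versus expected margin in runs):

    from math import erf, sqrt

    def elo_expected(rating_gap):
        # Logistic curve used by standard Elo (400-point convention).
        return 1 / (1 + 10 ** (-rating_gap / 400))

    def normal_expected(margin_gap, sd=180):
        # Normal CDF over the expected margin, in runs.
        return 0.5 * (1 + erf(margin_gap / (sd * sqrt(2))))

    for gap in (0, 50, 100, 200, 400):
        print(gap, round(elo_expected(gap), 3), round(normal_expected(gap), 3))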

The ICC and IRB use linear models. The downside to linear models is that eventually teams hit a limit of 0% or 100% probability of victory, at which point their rating cannot increase (or decrease) in proportion to their ability versus their peers. For the IRB, this caps the maximum rating difference at 10 points (100% likely to win) - bonuses have got New Zealand up to this limit, but from there the only way is down.

The ICC does something quite strange - on which more details can be found in this older post - which is to cap the implied probability at 90%. Teams above this threshold can theoretically keep increasing their rating to infinity, and vice versa (which is why a number of teams have been marooned around zero). There are ways this could be fixed, but they will be the subject of a later post.

Strength at home

Adjustments to the implied probability are often made for home advantage. This is not strictly necessary, as the difference is marginal, but if a schedule is sufficiently unbalanced it can be.

A league table (needless to say) doesn't do so, and nor does the ICC. The other systems being examined do, and the implied increase in probability for a given ratings gap is shown above. Note that the IRB line is linear, as it merely shifts the rating difference by 3 points. For Elo and my own ratings the greatest increase in probability is in the centre (around 50%), as that is where the variability is highest. A 1% probability of victory doesn't tend to shift much regardless of home advantage.
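
A sketch of why, using the Elo curve with an assumed flat home bonus (100 points is a common choice in football Elo implementations; the exact figure varies by system):

    def elo_expected(rating_gap):
        return 1 / (1 + 10 ** (-rating_gap / 400))

    HOME_BONUS = 100  # assumed; common in football Elo implementations

    for gap in (-400, -200, 0, 200, 400):
        lift = elo_expected(gap + HOME_BONUS) - elo_expected(gap)
        # The lift peaks near even ratings and shrinks at the extremes.
        print(gap, round(lift, 3))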

Some systems extend this further by having a "home" and "away" rating that allows variation in the quality of home-field advantage. The upside is that some teams are substantially better at home (teams at altitude, for example) or poor away (isolated teams needing to travel), and this allows that to be accounted for. The downside is that it halves the amount of data - which for cricket is already sparse - unless some combination of recent home and away results is made. The standard method of treating home advantage as if it is always the same isn't perfect, but no rating system is. There are always trade-offs, and the biggest is yet to come.

Recency of results

The fundamental difference between most systems is the speed and method by which they exclude results. Prediction models generally show that the more data they are given the better they perform, which would imply that results should decay very slowly. However, no team stays the same: personnel change, improve and decline; there is an element of form (perhaps indistinguishable from luck); and injuries will subtract and then add to quality just as the rating adjusts. There is no right answer.

Seasons offer a simple method, and a league table that resets to zero is as good a method as any if you don't want to predict results during the season. Conversely, how to carry ratings over from the previous season is a complex and open question. FiveThirtyEight's Elo models revert to the mean, producing strange zig-zags for persistently strong sides. Leaving the rating as it ended the previous season is not any better though. The most promising method, if seasonal boundaries are fixed, is to substantially lower the weight of old results, such that new results drive fast change, then get embedded in.
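
A sketch of the two season-boundary options: the one-third reversion is roughly what FiveThirtyEight describe for their NFL Elo, while the 50% down-weight is purely illustrative.

    ELO_MEAN = 1505  # FiveThirtyEight's NFL Elo reverts toward this value

    def revert_to_mean(rating, fraction=1/3):
        # Season-boundary mean reversion: persistently strong sides are
        # dragged down every off-season, producing the zig-zags above.
        return rating + fraction * (ELO_MEAN - rating)

    def downweight_history(points, games, weight=0.5):
        # The alternative: keep the rating where it finished, but shrink
        # the weight of old results so early-season games move it quickly.
        return points * weight, games * weight

    print(revert_to_mean(1700))  # 1635.0: a 65-point drop with no games played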

My ratings had a series of more complex issues to solve, and therefore decay in strange ways. Firstly, some teams historically played relatively infrequently - South Africa in the 1960s being the canonical example - which meant that when their ability moved, a differential system (as used by Elo or the IRB) would shift the teams with well-known ratings just as much as the team with few results. The first change was therefore to weight each shift by the number of results.

Secondly, the sparseness of matches and a clumped schedule, where teams play the same side five times in a row, mean there are times when a rating needs either to shift rapidly or to return to its previous spot after one bad (or good) tour. The solution was to keep a "form" variable that adds to the change if the direction of change is aligned. The "decay" of my ratings is therefore not a straight line but an area: the top line being relatively slow and the bottom relatively quick, but converging after a couple of years.
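
Purely as an illustration of the mechanism - every constant below is invented, as the actual parameters aren't published here:

    def rate_with_form(rating, form, expected_margin, actual_margin,
                       k=0.05, boost=0.5, memory=0.7):
        change = k * (actual_margin - expected_margin)
        # If this result moves the rating the same way recent results did,
        # amplify the shift: one bad tour barely dents the rating, while
        # a sustained decline moves it quickly.
        if change * form > 0:
            change += boost * form
        form = memory * form + change  # decaying memory of recent movement
        return rating + change, form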

As noted previously, the ICC ratings make strange and unnecessary choices regarding result recency, and the shift in the ratings each year when no games have been played makes that clear. That aside, the choice of decay is driven largely by the number of matches being played (and therefore the amount of data) and a personal preference for monitoring form. There is no correct answer: even if the aim is to predict future results, there is unlikely to be a high level of consistency from one year to the next between models.

Baselining

The final element of a rating system is actually the most complex of all. For many leagues, where continuity is taken for granted, a baseline only matters as a point of historical interest. For others, such as the Elo chess ratings, where the volume of participants entering and leaving is high, there can be an impact on inflation, but not on relative ability.

Cricket, and to an extent football, have their issues with baselining though, and it is worth considering them. Firstly, the introduction of new participants into a small closed system (like full member cricket) means adding a team at a level below the others that may (at some future point) be level with them. I rebaselined my ratings to 1000 for each of the first ten full members (using first-class ratings for Ireland and Afghanistan), and set each new entrant's first rating to that of the lowest-rated current member.

This is definitely wrong, as seen in the sharp drop in Bangladesh's rating on entry to Test cricket. The alternative is to run a new entrant forward until they settle and only then add them. But because of the accelerated decay detailed above, the wait for Bangladesh to find their level was short.

The second issue is more complex, and more likely to matter in other sports. In cricket, most teams don't play each other. To a degree even the full members don't play each other, but the standard rating system works by transitivity: A > B and B > C means A > C. As long as each subset of teams plays a random selection of teams from the overall set, a rating system will work.

But this doesn't always happen. In cricket the full members play, and the associate members play (also split into tiers and regions), and on rare occasions there is a small set of matches between subsets of each of those sets.

Thus while the relative strength of each column of teams within a "league" is set, there are rarely enough games between teams in different leagues to baseline the leagues against each other. In the diagram above each tier could be shifted up or down relative to the others, as only one or two teams play in each. Recently (and finally) the ICC extended their rating system to all Women's T20 International teams. It is a welcome change, but some teams (such as Argentina) have not played outside their region for some years, while others (such as PNG) play a handful of matches outside their region every few years. If EAP were to improve, then PNG (as their only representative) would gain points internationally, then lose them to their regional rivals. The imbalance will correct itself, but slowly, as PNG shuffle points back and forth like fetching water from a well.

Football suffers a similar issue, with relatively few inter-regional tournaments internationally (the World Cup) or domestically (the Champions League). Any ranking system that combines different subsets should be approached warily, though a correction is nearly impossible to provide.

The solution, probably, is to provide a regional subset adjustment system, whereby teams are weighted by their games within each subset and the whole subset shifted as the ranking of any member in the set changes. This would maintain the relative distance between all teams using the information we know - the relative rank of each team in their subset, and the results of a representative team in one subset against a representative of another - and adjust the information we don't - the relative rank of each team in the whole set. Unfortunately, in Test cricket, the total number of matches between the Test and Associate subsets is two, and the Associate subset wasn't too well known anyway.
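
A sketch of the shape of such an adjustment (teams, regions and numbers invented; this shows the idea, not a worked system):

    # A team's global rating is its within-region rating plus a region
    # offset; inter-regional results move the offsets, not the teams, so
    # relative ranks within each region are preserved.
    region_offset = {"EAP": 0.0, "Americas": 0.0}
    teams = {"PNG": ("EAP", 60.0), "Argentina": ("Americas", 45.0)}

    def global_rating(team):
        region, within = teams[team]
        return within + region_offset[region]

    def interregional_result(team_a, team_b, swing):
        # swing > 0: team_a outperformed expectations against team_b.
        region_a, region_b = teams[team_a][0], teams[team_b][0]
        region_offset[region_a] += swing / 2
        region_offset[region_b] -= swing / 2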

For the ICC, the benefits of a global ranking system will outweigh the complaints from the small handful of people who follow Associate cricket closely enough to notice oddities in the rankings. And if you do want to complain, remember: ratings systems involve a lot of choices, and in most cases it isn't clear which are better - though I obviously have my opinion.

And anyway, they are all much the same. It is hard to fuck up a ratings system.

Idle Summers 14th October, 2018 22:16:14   [#] [0 comments] 

Americas WT20 Qualifier Review, Associate Cricket Podcast
Russell Degnan

As we switch from northern to southern summers there are tournaments across the world, and Andrew Nixon (@andrewnixon79) joins Russell Degnan (@idlesummers) to cover them all. Hong Kong gave India a scare in the Asia Cup (0:20), USA and Canada qualified for the next stage of the Americas WT20 qualifier (6:20), there were improved associate performances in the Africa T20 Cup (9:30) and we look at the East Asian Cup (13:00), and Stan Nagaiah Trophy (15:40). There is ICC news on the Olympics and associate status (16:20) and a preview of the Asia (East) WT20 qualifiers and Hong Kong tour of PNG (23:00).

Direct Download Running Time 29min. Music from Martin Solveig, "Big in Japan"

The associate cricket podcast is an attempt to expand coverage of associate tournaments by obtaining local knowledge of the relevant nations. If you have been to or intend to go to a tournament at associate level - men's, women's, ICC, unaffiliated - then please get in touch in the comments or by email.

Idle Summers 1st October, 2018 21:46:18   [#] [0 comments]