The fundamental sameness of ratings
Russell Degnan

Two things generally hold for any half-way reputable ratings system:

  1. Some observers will criticise a particular ranking because they've forgotten a set of results occurred.
  2. The basic result will look the same as everybody else's. The good teams will be good, the bad teams will be bad and there will be a big blob in the middle.

It is hard to fuck up a ratings system.

People do. But since in its most basic format, a league table is a ratings system, and a league table will usually give a decent approximation of the best and worst teams, it is quite hard to do worse.

It is worth considering its constituent elements though, for anyone planning to construct one, to outline the basic issues of design, and problems in scheduling structures that bring them undone (or need to be corrected for). To do this I'll look at five systems: a basic league table (which, as noted, is a form of rating), the ICC system for cricket, the IRB system for rugby, Elo (which is widely used, but I'll focus on football) and my own cricket ratings.

Margin of victory

Your standard results-based backward-looking rating system (there may be others but they aren't widely used) brings together a set of results and projects the quality (and possibly future performance) of teams based on games played. There are two basic options for what counts as the outcome: the result; or the margin of victory.

A league table and the ICC use just the result. The IRB has a little each way by providing a bonus for a larger margin of victory, while the Elo soccer rating and my own cricket ratings use the margin of victory. The benefit of using the margin is that it provides more information, particularly in sports where the result is relatively predictable and where it can reasonably be assumed that there is a Pythagorean relationship between the points differential and eventual wins.
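The Pythagorean relationship can be sketched in a few lines. This is the generic form of the formula rather than anything taken from the systems discussed here, and the exponent is an assumption: it is normally fitted per sport, with 2 being only the classic starting value.

```python
def pythagorean_win_fraction(points_for: float, points_against: float,
                             exponent: float = 2.0) -> float:
    """Expected share of wins implied by points scored and conceded.

    The exponent is an assumption; it would be fitted separately per sport.
    """
    if points_for < 0 or points_against < 0:
        raise ValueError("points must be non-negative")
    if points_for == 0 and points_against == 0:
        return 0.5  # no information either way
    scored = points_for ** exponent
    conceded = points_against ** exponent
    return scored / (scored + conceded)
```

A side that scores twice what it concedes comes out at 80% of wins with an exponent of 2, which is the sense in which the margin carries more information than the bare result.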

One thing I have considered but not implemented was to account for Test draws by examining "distance to victory". Hence a drawn match with one side 100 runs in arrears and the other needing 2 wickets would have a margin of 100 - 2 wickets (nominally 50 runs in my ratings). A non-linear relationship for wickets is another margin option to consider. Note that the added accuracy of these changes is marginal at best.
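A rough sketch of the "distance to victory" idea, for anyone wanting to try it. The function name and the `runs_per_wicket` conversion are hypothetical; one reading of the example above is that each outstanding wicket is worth about 25 runs, which gives the 50 run margin, but that rate is an assumption and is left as a required parameter.

```python
def draw_margin(runs_behind: int, wickets_needed: int,
                runs_per_wicket: float) -> float:
    """Nominal margin of a drawn Test, from the leading side's point of view.

    runs_behind: how far the trailing side finished from its target.
    wickets_needed: wickets the leading side still had to take.
    runs_per_wicket: hypothetical linear conversion, chosen by the modeller.
    """
    if wickets_needed < 0 or wickets_needed > 10:
        raise ValueError("wickets_needed must be between 0 and 10")
    return runs_behind - wickets_needed * runs_per_wicket
```

A non-linear wicket value would replace the single multiplication with a lookup or a curve; the linear version is only the simplest case.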

Strength of opposition

Here lies the first leap forward from a basic league table: the measuring of schedules. In many leagues this is not strictly necessary as the schedules are relatively even. In cricket it is nonsense to forgo it.

Accounting for opposition means having an implied probability of victory. That is: if an average team (50% probability of winning against an average team) is expected to win less than half of the matches against their schedule, then the points awarded are increased to make up the deficit. All ratings have an implied probability of victory, but the results vary.
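As a minimal sketch of that idea: sum the implied probabilities an average team would have against a schedule, and credit a team for wins beyond that. The function names, the average rating of 1000 and the use of an Elo-style curve are all assumptions for illustration; any implied probability function could be passed in.

```python
from typing import Callable, Sequence


def elo_probability(rating_gap: float, scale: float = 400.0) -> float:
    """Conventional Elo logistic curve, used here only as a placeholder."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def expected_wins(team_rating: float, opponent_ratings: Sequence[float],
                  win_probability: Callable[[float], float] = elo_probability) -> float:
    """Sum of implied win probabilities over a schedule."""
    return sum(win_probability(team_rating - opp) for opp in opponent_ratings)


def wins_above_average(wins: float, opponent_ratings: Sequence[float],
                       average_rating: float = 1000.0,
                       win_probability: Callable[[float], float] = elo_probability) -> float:
    """Wins beyond what an average team would manage against this schedule."""
    return wins - expected_wins(average_rating, opponent_ratings, win_probability)
```

Three wins from four against average opposition scores +1.0; the same three wins against a stronger schedule scores more, which is the deficit being made up.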

My ratings are margin based - though there are adjustments for wins/draws - and use a "normal" implied probability around a standard deviation of 180 runs. Basic Elo ratings are based on three results (win/loss/draw) and therefore use a different function. Both of these are smooth curves, as per the image.
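The two smooth curves can be sketched as below. Only the 180 run standard deviation comes from the text; treating the rating gap as an expected margin in runs is an assumption, as is the conventional Elo scale of 400, and neither function includes the win/draw adjustments mentioned above.

```python
import math


def normal_win_probability(expected_margin: float, sd_runs: float = 180.0) -> float:
    """P(margin > 0) when the margin is normally distributed around
    expected_margin with standard deviation sd_runs."""
    return 0.5 * (1.0 + math.erf(expected_margin / (sd_runs * math.sqrt(2.0))))


def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Standard Elo logistic curve; a draw counts as half a win."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))
```

Both curves pass through 50% at a gap of zero and flatten towards 0% and 100% without ever reaching them, which is the property the linear models lack.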

The ICC and IRB use linear models. The downside to linear models is that eventually teams hit a limit of 0% or 100% probability of victory, in which case their rating cannot increase (or decrease) in proportion to their ability versus their peers. For the IRB, this caps the maximum rating difference at 10 points (100% likely to win) - bonuses have got New Zealand up to this limit, but the only way is down.
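A linear implied probability with a cap looks something like this. The 10 point cap is the IRB figure quoted above; the straight line between the caps, and the function name, are assumptions made for illustration rather than the IRB's published arithmetic.

```python
def linear_win_probability(rating_gap: float, max_gap: float = 10.0) -> float:
    """Straight-line implied probability, clamped to the range [0, 1].

    max_gap is the rating gap at which the favourite is treated as
    certain to win; beyond it, extra rating points change nothing.
    """
    clamped = max(-max_gap, min(max_gap, rating_gap))
    return 0.5 + clamped / (2.0 * max_gap)
```

A gap of 10 and a gap of 25 return the same 100%, so a team far ahead of its peers has nothing left to gain from beating them.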

The ICC does something quite strange - on which more details can be found in this older post - which is to cut the implied probability at 90%. Teams above this threshold can theoretically keep increasing their rating to infinity, and vice versa (which is why a number of teams have been marooned around zero). There are ways this could be fixed, but that will be the subject of a later post.

Strength at home

Adjustments to the implied probability are often made for home advantage. This is not strictly necessary as the difference is marginal, but if a schedule is sufficiently unbalanced it can be necessary.

A league table (needless to say) doesn't do so, and nor does the ICC. The other systems being examined do and the implied increase in probability given a ratings gap is shown above. Note that the IRB is linear as it merely shifts the system by 3 points. For Elo and my own ratings the greatest increase in probability is in the centre (around 50%) as that is where the variability is highest. A 1% probability of victory doesn't tend to shift regardless of home advantage.
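The difference in shape can be checked with a short sketch. The 3 point shift and 10 point cap are the IRB figures from the text; the Elo home bonus of 100 points and scale of 400 are assumed values, and the function names are hypothetical.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Conventional Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def linear_win_probability(rating_gap: float, max_gap: float = 10.0) -> float:
    """Straight-line implied probability, clamped to [0, 1]."""
    clamped = max(-max_gap, min(max_gap, rating_gap))
    return 0.5 + clamped / (2.0 * max_gap)


def home_lift_smooth(rating_gap: float, home_points: float = 100.0) -> float:
    """Extra win probability from playing at home under the logistic curve."""
    return elo_expected_score(rating_gap + home_points) - elo_expected_score(rating_gap)


def home_lift_linear(rating_gap: float, home_points: float = 3.0) -> float:
    """Extra win probability from playing at home under the linear model."""
    return (linear_win_probability(rating_gap + home_points)
            - linear_win_probability(rating_gap))
```

Under these assumptions the smooth lift is roughly 14 percentage points between evenly matched sides and closer to 1 point when one side is 600 ahead, while the linear lift is a flat 15 points until the cap is reached.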

Some systems extend this further by having a "home" and "away" rating that allows variation in the quality of home field advantage. The upside to this is that some teams are substantially better at home (teams at altitude for example) or poor away (isolated teams needing to travel) and this allows that to be accounted for. The downside is that it halves the amount of data - which for cricket is already sparse - unless some combination of recent home and away results is made. The standard method of adjusting for home advantage as if it is always the same isn't perfect but no rating system is. There are always trade-offs and the biggest is yet to come.

Recency of results

The fundamental difference between most systems is the speed and method by which they exclude results. Prediction models generally show that the more data put in the more closely the predictions run, which would imply that recent results should decay very slowly. However, no team stays the same: personnel change, improve and decline; there is an element of form (perhaps indistinguishable from luck); and injuries will subtract and then add to quality just as the rating adjusts. There is no right answer.

Seasons offer a simple method, and a league table that resets to zero is as good a method as any if you don't want to predict results during the season. Conversely, how to adjust ratings from the previous season is a complex and unsettled question. FiveThirtyEight's Elo models converge to the mean, producing strange zig-zags for persistently strong sides. Leaving the rating where it finished the previous season is not any better though. The most promising method if seasonal boundaries are fixed is to substantially lower the weight of old results, such that new results drive fast change, then get embedded in.
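The two seasonal approaches can be contrasted in a small sketch. Both functions are illustrative only: the `keep` fraction, the `prior_weight`, and the idea of a per-game "performance" number on the rating scale are assumptions, not parameters of any system named here.

```python
from typing import Sequence


def revert_to_mean(rating: float, league_mean: float, keep: float = 2.0 / 3.0) -> float:
    """Start-of-season rating that keeps a fraction of the gap to the mean."""
    return league_mean + keep * (rating - league_mean)


def blend_with_prior(prior_rating: float, prior_weight: float,
                     new_performances: Sequence[float]) -> float:
    """Weighted mean of last season's rating and this season's games.

    A small prior_weight lets early results move the rating quickly;
    as games accumulate, each new one moves it less.
    """
    total_weight = prior_weight + len(new_performances)
    if total_weight <= 0:
        raise ValueError("need a positive prior weight or at least one game")
    return (prior_weight * prior_rating + sum(new_performances)) / total_weight
```

With a prior weight of 2, two games at a 1000 level pull a 1200 rating down to 1100, whereas mean reversion moves the rating before a ball is bowled.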

My ratings had a series of more complex issues to solve, and therefore decay in strange ways. Firstly, historically some teams played relatively infrequently - South Africa in the 1960s being the canonical example - which meant that when their ability moved, a differential system (as used by Elo or the IRB) would shift the teams with well-known ratings just as much as the team with few results. The first change was to add a weighted shift for number of results.
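One way such a weighting could work, as a sketch only: split the correction between the two sides in inverse proportion to how many results each already has. The function, the inverse-count weights and the `+ 1` smoothing are all hypothetical and are not a description of the actual ratings.

```python
from typing import Tuple


def split_shift(total_shift: float, games_a: int, games_b: int) -> Tuple[float, float]:
    """Split a correction to the gap between A and B.

    total_shift is how much the gap (A minus B) should widen; it is
    positive when A outperformed expectation. The side with fewer prior
    results absorbs more of the movement.
    """
    weight_a = 1.0 / (games_a + 1)  # +1 avoids division by zero for debutants
    weight_b = 1.0 / (games_b + 1)
    share_a = weight_a / (weight_a + weight_b)
    return total_shift * share_a, -total_shift * (1.0 - share_a)
```

With equal histories this reduces to the usual even exchange; a side with 4 prior results against one with 49 takes about 90% of the movement.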

Secondly, the sparseness of matches and clumped schedule where teams would play the same side five times in a row means that there are times when a rating needs to either shift rapidly or return to a spot after one bad (or good) tour. The solution was to keep a "form" variable that would add to the change if the direction of change was aligned. The "decay" of my ratings is therefore not a straight line, but an area: the top-line being relatively slow, and the bottom relatively quick, but converging after a couple of years.
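A form variable of this kind might be sketched as follows. This is a hypothetical illustration of the mechanism, not the formula used in the ratings; in particular the `form_decay` factor and the rule for updating form are assumptions.

```python
from typing import Tuple


def apply_with_form(rating: float, form: float, change: float,
                    form_decay: float = 0.5) -> Tuple[float, float]:
    """Apply a rating change, boosted by form when both point the same way.

    Returns (new_rating, new_form). Form is carried as a decayed share of
    the change just applied, so a run of results in one direction
    accelerates while a reversal starts again from the bare change.
    """
    aligned = change * form > 0
    applied = change + form if aligned else change
    return rating + applied, form_decay * applied
```

Two successive gains of 10 move the rating by 10 and then 15; a loss of 10 straight after moves it by only 10, since form and change no longer agree.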

As noted previously, the ICC ratings make strange and unnecessary choices regarding their result recency and every year the shift in ratings when no games have been played makes that clear. That aside though, the choice of decay is driven largely by the number of matches being played (and therefore the amount of data) and a personal preference for monitoring form. There is no correct answer, as even if the aim was to predict future results, there is unlikely to be a high level of consistency from one year to the next between models.

Baselining

The final element to ratings is actually the most complex element of all. For many leagues, where continuity is taken for granted, a baseline only matters as a point of historical interest. For others, such as the Elo chess ratings, where the volume of participants entering and leaving is high, there can be an impact on inflation, but not relative ability.

Cricket, and to an extent football, have their issues with baselining though, and it is worth considering them. Firstly, the introduction of new participants into a small closed system (like full member cricket) means adding a team at a level below the others that may (at some future point) be level with them. I rebaselined my ratings to 1000 for each of the first ten full members (using first-class ratings for Ireland and Afghanistan), and took their first rating as the lowest current member.

This is definitely wrong, as seen by the sharp drop in the rating of Bangladesh on entry to Test Cricket. The alternative is to run a new entry forward until they settle and then add them. But because of the accelerated decay detailed above, the wait for Bangladesh to find their level was short.

The second issue is more complex, and more likely to matter in other sports. In cricket, most teams don't play each other. To a degree even the full members don't play each other, but the standard rating system works by transitivity: A > B and B > C means A > C. As long as a subset of teams play a random number of teams in the overall set of teams a rating system will work.

But this doesn't always happen. In cricket the full members play, and the associate members play (also split into tiers and regions), and on rare occasions there is a small set of matches between subsets of each of those sets.

Thus while the relative strength of each column of teams within a "league" is set, there are rarely enough games between teams in other leagues to be able to baseline the leagues against each other. In the diagram above each tier could be shifted up or down relative to the others, as only one or two teams are playing in each. Recently (and finally) the ICC extended their rating system to all Women's T20 International teams. It is a welcome change but some teams (such as Argentina) have not played outside their region for some years, while others (such as PNG) play a handful of matches outside their region every few years. If EAP was to improve, then PNG (as their only representative) will gain points internationally, then lose them to their regional rivals. The imbalance will change, but slowly, as PNG shuffle points back and forwards like fetching water from a well.

Football suffers a similar issue with relatively few inter-regional tournaments (like the World Cup) internationally or (the Champions League) domestically. Any ranking system that combines different subsets should be approached warily, though a correction is nearly impossible to provide.

The solution, probably, is to provide a regional subset adjustment system, whereby teams are weighted by their games within each subset and the whole subset shifted as the ranking of any member in the set changes. This would maintain the relative distance between all teams using the information we know - the relative rank of each team in their subset and the relative results of a representative team in the subset against a representative of another subset - and adjust the information we don't - the relative rank of each team in the whole set. Unfortunately, in Test cricket, the total number of matches between Test and Associate subsets is two and the Associate subset wasn't too well known anyway.
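In its simplest form the subset adjustment might look like the sketch below, where a rating change earned in an inter-regional match is applied to every team in the representative's region. The function, the `spread` factor and the team data are hypothetical, and the weighting by games within each subset described above is deliberately left out.

```python
from typing import Dict


def apply_inter_region_change(ratings: Dict[str, float], regions: Dict[str, str],
                              team: str, change: float,
                              spread: float = 1.0) -> Dict[str, float]:
    """Shift a whole region when one member moves in an inter-regional match.

    spread is the share of the change passed on to the rest of the region
    (1.0 keeps every within-region gap exactly as it was).
    """
    region = regions[team]
    updated = dict(ratings)
    for other, other_region in regions.items():
        if other_region != region:
            continue
        share = 1.0 if other == team else spread
        updated[other] = ratings[other] + share * change
    return updated
```

With illustrative numbers, a 10 point gain by PNG outside the region lifts a regional rival by the same 10 points and leaves a team in another region untouched, so there is nothing for PNG to hand back in later regional matches.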

For the ICC, the benefits of a global ranking system will outweigh the complaints from the small handful of people following cricket in the Associates closely enough to notice oddities in the rankings. And if you do want to complain, remember, ratings systems make a lot of choices, and in most cases it isn't clear which are better - though I obviously have my opinion.

And anyway, they are all much the same. It is hard to fuck up a ratings system.

Idle Summers 14th October, 2018 22:16:14