This article is part 2 of 2.
In part 1, I noted that shot metrics, when used for evaluating individual players, are heavily influenced by teammates, coaches' usage (zone starts), and competition*.
We have decent tools for understanding the effect of teammates and zone starts – but I believe the same is not at all true for competition metrics (dubbed QoC, or Quality of Competition).
And the reality is that understanding competition is critical to using shot metrics for player evaluation. If current QoC measures are not good, then QoC is a huge weakness in the whole approach.
I believe this is the case.
Let’s see if I can make a convincing case for you!
*Truthfully, there are quite a few other contextual factors, like team quality and score state. These shot metrics have been around for a decade plus, and they've been studied (and are now often adjusted) heavily. Some of the effects that have been identified can be quite subtle and counterintuitive. But from the point of view of assessing *a* player on *a* team, it doesn't hurt us to focus on these three factors.
It Just Doesn’t Matter – You’re Kidding, Right?
If you bring up Quality of Competition with many fancystats people, they’ll often look at you and flat out tell you that “quality of competition doesn’t matter.”
This response will surprise many – and frankly, it should.
We know competition matters.
We know that a player is going to have a way harder time facing Sidney Crosby than facing Tanner Glass.
We know that coaches gameplan to face Taylor Hall, not his roommate Luke Gazdic (so long, lads). And they gameplan primarily with player matchups.
Are our eyes and the coaches that far out to lunch?
Yes, say the fancystats crowd. Because, they say, when you calculate quality of competition, you just don't see that much difference in the level of competition faced by different players. Therefore, so conventional wisdom dictates, it doesn't matter.
The Numbers Suggest Matchups Matter
I don't have to rely on the eye test alone to contradict this line of thought – the numbers do the work too. For example, here are the head-to-head matchup numbers (I trot these out as a textbook example of coaching matchups) for the three Montreal defense pairs against Edmonton from the game on February 7th, 2016:
| vs | Hall | McDavid |
| --- | --- | --- |
| Subban-Markov | ~3 mins | ~10 mins |
| Petry-Emelin | ~8 mins | ~5 mins |
| Gilbert-Barberio | ~40 seconds | ~14 seconds |
Does that look like "Quality of Competition" doesn't matter? It sure mattered for both Hall and McDavid, not to mention all three Montreal defense pairs. Ten minutes versus 14 seconds against McDavid is not a coincidence. That was gameplanned.
So how do we reconcile this?
Let's dig in and see why conventional wisdom may be just plain wrong – maybe the problem is not with quality of competition itself, but with the way we measure it.
It Would Hit You Like Peter Gabriel’s Sledgehammer
I’ll start by showing you an extremely valuable tool for assessing players in the context of zone starts and QoC, which is Rob Vollman’s Player Usage Charts, often called sledgehammer charts.
This chart is for Oiler defensemen in 2015-2016:
This shows three of the four things we’ve talked about previously:
- The bubble colour (blue is good) shows the balance of good and bad shot metrics for that individual
- The farther to the right the bubble, the more offensive-zone faceoffs the player was on the ice for – favourable zone starts, or coaches' usage, in other words
- The higher the bubble, the tougher the Quality of Competition
Notice something about the QoC though. See how it has such a narrow range? The weakest guy on there is Clendening at -0.6. The toughest is Klefbom at a shade over 1.0.
If you're not familiar with "RelCorsi" (I'll explain later), take my word for it: that's not a very meaningful range. If you told me Player A has a RelCorsi of 1.0, and another has a RelCorsi of 0.0, I wouldn't ascribe a lot of value to that difference. Yet that range easily encompasses 8 of the 11 defenders on the chart.
So no wonder the fancystatters say QoC doesn't matter. The entire range we see, for a full season across an entire defensive corps first to last, is a very small difference. By this measure, Clendening faced only barely weaker competition than Klefbom did.
Or did he? That doesn’t sound right, does it? Yeah, the Oiler D was a tire fire and injuries played havoc – but Todd McLellan wasn’t sending Clendening out to face Joe Thornton if he could help it.
To figure out what might be wrong, let’s dig in to see how we come up with these numbers that show such a thin margin of difference.
Time Weighs On Me
The process for calculating a QoC metric starts by assigning every player in the league a value that reflects how tough they are as competition.
Then when we need the QoC level faced by a particular player:
- we look at every opponent he faced, and weight each opponent's competition value by the amount of ice time the two shared
- we add it all up and divide by the total ice time – presto, a TOI-weighted average that is the QoC measure for the given player
Since the time-on-ice weights are pinned down by, well, actual time on ice, it should be clear that the validity of this QoC metric depends almost entirely on the validity of the 'competition value' assigned to each player.
If that competition value isn’t good, then you have a GIGO (garbage in garbage out) situation, and your QoC metric isn’t going to work either.
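To make the arithmetic concrete, here is a minimal sketch of that weighted average in Python. All of the names (`shifts`, `competition_value`) are hypothetical illustrations of the calculation described above, not anyone's production code:

```python
def toi_weighted_qoc(shifts, competition_value):
    """TOI-weighted QoC for one player.

    shifts: list of (opponent_id, shared_seconds) pairs
    competition_value: dict of opponent_id -> assumed 'toughness' value
    """
    total_seconds = sum(seconds for _, seconds in shifts)
    if total_seconds == 0:
        return 0.0
    # Weight each opponent's value by shared ice time, then average.
    weighted_sum = sum(competition_value[opp] * seconds
                       for opp, seconds in shifts)
    return weighted_sum / total_seconds
```

Whatever we plug in for `competition_value` completely determines the output – which is exactly the GIGO point.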
There are three different data values that are commonly used for calculating a QoC metric, so let’s take a look at each one and see if it meets the test of validity.
Using Corsi for QoC
Many fancystats people who feel that QoC doesn’t matter will point to this post by Eric Tulsky to justify their reasoning.
Tulsky (now employed by the Hurricanes) is very, very smart, and one of the pillars of the hockey fancystats movement. He's as important and influential as Vic Ferrari (Tim Barnes), JLikens (Tore Purdy), Gabe Desjardins, and mc79hockey (Tyler Dellow). So when he speaks – we listen.
The money quote in his piece is this:
Everyone faces opponents with both good and bad shot differential, and the differences in time spent against various strength opponents by these metrics are minimal.
Yet all that said – I think Tulsky’s conclusions in that post on QoC are wrong. I would assert that the problem he encounters, and the reason he gets the poor results that he does, is that he uses a player’s raw Corsi (shot differential) as the sole ‘competition value’ measure.
All his metric tells you is how a player did against other players with varying good and bad shot differentials. It actually does a poor job of telling you the quality of the players faced – and that is the leap of faith being made. The leap is unjustified, because players of much, much different ability can have the same raw Corsi score.
To test that, we can rank all the players from last season by raw Corsi, and here are a few of the problems we immediately see:
- Patrice Cormier (played two games for WPG) is the toughest competition in the league
- He’s joined in the Top 10 by E Rodrigues, Sgarbossa, J Welsh, Dowd, Poirier, Brown, Tangradi, Witkowski, and Forbort.
- Mark Arcobello is in the top 20, approximately 25 spots ahead of Joe Thornton
- Anze Kopitar just signed for $10MM/yr while everyone nodded in agreement – meanwhile Cody Hodgson might have to look for work in Europe, and that will garner the same reaction. Yet using raw Corsi as the measure, they are the same level of competition (57.5%)
- Chris Kunitz is about 55th on the list – approximately 40 spots ahead of Sidney Crosby
- Don’t feel bad, Sid – at least you’re miles ahead of Kessel, Jamie Benn, and Nikita Nikitin – who is himself several spots above Brent Burns and Alex Ovechkin.
*Note: all data sourced from the outstanding site corsica.hockey. Pull up the league’s players, sort them using the factors above for the 2015-2016 season, and you should be able to recreate everything I’m describing above.
I could go on, but you get the picture, right? The busts I’ve listed are not rare. They’re all over the place.
Now, why might we be seeing these really strange results?
- Sample size! Poor players play little, and that means their shot metrics can jump all over the place. Play two minutes, have your line get two shots and give up one, and raw Corsi will anoint you one of the toughest players in the league. We can account for this when looking at the data, but computationally it can wreak havoc if unaccounted for (see the sketch after this list).
- Even with large sample sizes, you can get very minimal difference in shot differential between very different players because of coaches matching lines and playing “like vs like”. The best players tend to play against the best players and their Corsi is limited due to playing against the best. Similarly, mediocre players tend to play against mediocre players and their Corsi is inflated accordingly. It’s part of the problem we’re trying to solve!
- For that same reason, raw Corsi tends to overinflate the value of 3rd pairing Dmen, because they so often are playing against stick-optional players who are Corsi black holes.
- The raw Corsi number is heavily influenced by the quality of the team around a player.
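On that sample-size point, a common mitigation is to require a minimum amount of ice time before trusting a player's raw Corsi at all. A minimal sketch of that kind of filter – the 200-minute threshold and the `players` structure are arbitrary assumptions for illustration:

```python
MIN_TOI_MINUTES = 200  # arbitrary cutoff; real analyses choose their own

def cf_pct(cf, ca):
    """Corsi-for percentage: the share of all shot attempts
    (for + against) that went in the player's favour."""
    return 100.0 * cf / (cf + ca) if (cf + ca) else 0.0

def rank_by_raw_corsi(players):
    """players: hypothetical list of dicts with 'name', 'toi_min',
    'cf' (attempts for), and 'ca' (attempts against)."""
    eligible = [p for p in players if p["toi_min"] >= MIN_TOI_MINUTES]
    return sorted(eligible,
                  key=lambda p: cf_pct(p["cf"], p["ca"]),
                  reverse=True)
```

Filtering fixes the Patrice Cormier problem, but note that it does nothing about the other three bullets – those are baked into raw Corsi at any sample size.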
Corsi is a highly valuable statistic, particularly as a counterpoint to more traditional measures like boxcars. But as a standalone measure for gauging the value of a player, it is deeply flawed. Any statistic that uses raw Corsi as its only measure of quality is going to fail. GIGO, remember?
Knowing what we know – is it a surprise that Tulsky got the results he got?
So we should go ahead and rule out using raw Corsi as a useful basis for QoC.
Using Relative Corsi for QoC
If you aren't familiar with RelCorsi, it's pretty simple: instead of using a raw number, for each player we take the number 'relative' to the team's numbers.
For example, a player with a raw Corsi of 52 but on a team that is at 54 will get a -2, while a player with a raw Corsi of 48 will get a +2 if his team is at 46.
The idea here is good players on bad teams tend to get hammered on Corsi, while bad players on good teams tend to get a boost. So we cover that off by looking at how good a player is relative to their team.
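As arithmetic, it's just a subtraction. A minimal sketch, following the definition above (some sites compute the team number with the player off the ice, but the idea is the same):

```python
def rel_corsi(player_cf_pct, team_cf_pct):
    """RelCorsi: the player's on-ice CF% minus his team's CF%."""
    return player_cf_pct - team_cf_pct

# The examples from above:
assert rel_corsi(52.0, 54.0) == -2.0
assert rel_corsi(48.0, 46.0) == +2.0
```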
Using RelCor as the basis for a QoC metric does in general appear to produce better results. When you look at a list of players using RelCor to sort them, the cream seems to be more likely to rise to the top.
Still, if you pull up a table of players sorted by RelCor (the Vollman sledgehammer I posted earlier uses this metric as its base for QoC), again you very quickly start to see the issues:
- Our top 10 is once again a murderers' row of Vitale, Sgarbossa, Corey "Power Play" Potter, Rodrigues, Brown, Tangradi, Poirier, Cormier, Welsh, and Strachan.
- Of all the players with regular ice time, officially your toughest competition is Nino Niederreiter. Nino? No no!
- Top defenders Karlsson and Hedman are right up there, but they are followed closely by R Pulock and D Pouliot, well ahead of, say, OEL and Doughty.
- Poor Sid, he can’t even crack the Top 100 this time.
Again, if we try to deconstruct why we get these wonky results, two significant flaws suggest themselves:
- Coach’s deployment. Who a player plays and when they play is a major driver of RelCor. You can see this once again with 3rd pairing D men, whose RelCor, like their raw Corsi, is often inflated.
- The depth of the team. Good players on deep teams tend to have weaker RelCors than those on bad teams (the opposite of the raw Corsi effect). This is why Nicklas Backstrom (+1.97) and Sam Gagner (+1.95) can have very similar RelCor numbers while being vastly different to play against.
RelCor is a very valuable metric in the right context, but suffers terribly as a standalone metric for gauging the value of a player.
Like raw Corsi, despite its widespread use we should rule out relative Corsi as a useful standalone basis for QoC.
Using 5v5 TOI for QoC
This is probably the most widely used (and arguably best) tool for delineating QoC. This was also pioneered by the venerable Eric Tulsky.
When we sort a list of players by the aggregated TOI per game of their "average" opponent, we see the cream rise to the top even more so than with RelCor.
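Mechanically, this is the same TOI-weighted average sketched earlier – we just plug in a different competition value, namely each opponent's 5v5 TOI per game. A hedged sketch reusing `toi_weighted_qoc` from above (field names are again hypothetical):

```python
def toi_qoc(shifts, opponents):
    """TOI-based QoC: an opponent's 'toughness' is simply his
    average 5v5 ice time per game played."""
    comp_value = {
        opp_id: opp["toi_min"] / opp["games_played"]
        for opp_id, opp in opponents.items()
    }
    return toi_weighted_qoc(shifts, comp_value)
```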
And when we analyze the underlying data used to generate this QoC, our top three "toughest competition" players are now Ryan Suter, Erik Karlsson, and Drew Doughty. Sounding good, right?
But like with the two Corsi measures, if you look at the ratings using this measure, you can still see problematic results all over, with clearly poor players ranked ahead of good players quite often. For example:
- The top of the list is all defensemen.
- Our best forward is Evander Kane, at #105. Next up are Patrick Kane (123rd), John Tavares (134th), and Taylor Hall (144th). All top notch players, but the ranking is problematic to say the least. Especially when you see Roman Polak at 124th.
- Even among defensemen, is Subban really on par with Michael Del Zotto? Is Jordan Oesterle the same as OEL? Is Kris Russell really so far ahead of Giordano, Vlasic, and Muzzin?
- Poor old Crosby is still not in the Top 100, although he finally is when you look at just forwards.
- Nuge is finally living up to his potential, though, ahead of Duchene and Stamkos!
OK, I'll stop there. You get my point. This isn't the occasional cherry-picked bust; you can see odd results like this all over.
Looking at the reasons for these busts, you see at least two clear reasons:
- Poor defensemen generally get as much or more ice time than very good forwards do. Putting all players on the same TOI scale regardless of position simply doesn't work (see the sketch after this list). Just imagine if we included goaltenders – even the worst goalies would of course skyrocket to the top.
- Depth of roster has a significant effect as well. Poor players on bad teams get lots of ice time – it's a big part of what makes them bad teams, after all. Coaches also have favourites, or hand out minutes for reasons other than hockeying (Justin Schultz and the Oilers is arguably a good example of both weak roster depth and coach's favoritism).
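One way to patch the position problem would be to put forwards and defensemen on separate scales before computing QoC – for instance, standardizing each player's TOI per game within his own position group. A sketch of that idea, purely as an assumption-laden illustration (this is not a published method):

```python
from statistics import mean, stdev

def position_adjusted_toi(players):
    """Z-score each player's TOI/GP within his position group
    ('F' or 'D'), so forwards and defensemen share one scale."""
    adjusted = {}
    for pos in ("F", "D"):
        group = [p for p in players if p["position"] == pos]
        toi = [p["toi_per_gp"] for p in group]
        mu, sigma = mean(toi), stdev(toi)
        for p in group:
            adjusted[p["id"]] = (p["toi_per_gp"] - mu) / sigma
    return adjusted
```

That would stop Roman Polak from outranking Patrick Kane, though it does nothing about the roster-depth problem.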
So once again, we find ourselves concluding that the underlying measure for this QoC – TOI – tells you a lot about a player, but there are very real concerns with using it as a standalone measure.
Another problem shows up when we actually try to use this measure in the context of QoC: competition blending.
As a player moves up and down the roster (due to injuries or coaching preference), their QoC changes. At the end of the year we are left with one number to evaluate their QoC, but if this roster shuttling has happened, that one number doesn't represent very well who they actually played.
A good example of the blending problem is Mark Fayne this past year. When you look at his overall TOI QoC, he ranks 1st or 2nd on the Oilers, denoting that he had the toughest matchups.
His overall CF% was also 49.4%, so a reasonable conclusion was that “he held his own against the best”. Turns out – it wasn’t really true. He got shredded like coleslaw against the tough matchups.
Down the road, Woodguy (@Woodguy55) and I will show you why, and how it reflects a failing of TOI QoC as a metric. It tells us how much TOI a player's average opponent had, but it doesn't tell us anything more. We're left to guess, with the information often pointing us in the wrong direction.
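To see why a single season-long number can mislead this way, imagine breaking a player's shot-attempt results out by buckets of opponent quality instead of blending them all together. A purely illustrative sketch – the bucket labels and fields are hypothetical, and this is not the method we'll be presenting:

```python
def cf_pct_by_bucket(shifts):
    """Split CF% out by opponent-quality bucket rather than
    collapsing a season into one blended number.

    shifts: hypothetical list of dicts with 'opp_bucket'
            ('elite'/'middle'/'depth'), plus 'cf' and 'ca'
            shot-attempt counts for that stretch of ice time.
    """
    totals = {}
    for s in shifts:
        cf, ca = totals.get(s["opp_bucket"], (0, 0))
        totals[s["opp_bucket"]] = (cf + s["cf"], ca + s["ca"])
    return {bucket: 100.0 * cf / (cf + ca)
            for bucket, (cf, ca) in totals.items() if cf + ca}
```

A player can sit at a respectable blended 49.4% overall while the 'elite' bucket tells a very different story.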
A Malfunction in the Metric
Let’s review what we’ve discussed and found so far:
- QoC measures as currently used do not show much differentiation in the competition faced by NHL players. This is often at odds with observed head-to-head matchups.
- Even when they do show a difference, they give us no context on how to use that to adjust the varying shot metrics results that we see. Does an increase of 0.5 QoC make up for a 3% Corsi differential between players? Remember from Part 1 that understanding the context of competition is critical to assessing the performance of the player. Now we have a number – but it doesn’t really help.
- The three metrics most commonly used as the basis for QoC are demonstrably poor when used as a standalone measure of ‘quality’ of player.
- So it should be no surprise that assessments using these QoC measures produce results at odds with observation.
- Do those odd results reflect reality on the ice, or a malfunction in the metric? Looking in depth at the underlying measures, the principle of GIGO suggests it may very well be the metric that is at fault.
Which leaves us … where?
We know competition is a critical contextual aspect of using shot metrics to evaluate players.
But our current QoC metrics appear to be built on a foundation of sand.
Hockey desperately needs a better competition metric.
Now lest this article seem like one long shrill complaint, or cry for help … it’s not. It’s setting the background for a QoC project that Woodguy and I have been working on for quite some time.
Hopefully we’ll convince you there is an answer to this problem, but it requires approaching QoC in an entirely different way.
Stay tuned!
P.S.
And the next time someone tells you “quality of competition doesn’t matter”, you tell them that “common QoC metrics are built on poor foundational metrics that cannot be used in isolation for measuring the quality of players. Ever hear of GIGO?”
Then drop the mic and walk.