DEq - A Card Quality Metric

Introduction

DEq is a card quality metric for making decisions early in booster drafts. DEq directly models the average win equity added by selecting the given card, and the formula is derived using simplifying assumptions that allow us to calculate the metric from the daily data available on 17Lands. The primary inputs to the metric are GP WR, ATA, and % GP, which all contribute positively (for ATA, an earlier average pick is the positive direction). In this series of articles, I will first give a qualitative description of the method and the reasoning behind it. Next, I’ll go through my process for drafting guided by the metric and other data available on 17Lands. Finally, you can check out the mathematical derivation of the formulas, if you’re so inclined. I appreciate that some of you will find the mechanistic approach to card valuation distasteful, or will disagree with the approach. I hope and truly believe that you will get some value out of it regardless.

Rayblade Trooper earns a DEq value of +3.29%, good enough for a B+ based on strong stats across the board — it is drafted fairly early, has a high GP WR for top players, and has a high play rate.

Alpharael, Dreaming Acolyte has a slightly higher GIH WR than Rayblade Trooper, but it is picked later, played less often, and wins less when registered. Its DEq is +1.27%, or C+, which is solid but not close to the Trooper. Its relatively high GIH WR can be attributed partially to the metric's bias towards cards played in controlling decks.

Why DEq

Since developing DEq, I have had a 65% win rate in Premier Draft over about 1200 games and a solid number of top-250ish season finishes. I’m not good enough at Magic to do that without a significant leg up, which DEq has provided. Additionally, I’ve developed a unique analysis to judge the effectiveness of card quality metrics at making first picks by analyzing the results of over 2.5 million drafts, and set in, set out, DEq outperforms the other available metrics. The final reason to use DEq to judge card quality is that, unlike any other metric, it was built from the ground up for exactly the purpose of picking the best cards. Read on to find out how.

The Formula

DEq can be decomposed into four main components, which are summed together then multiplied by % GP. For the officially posted rankings, I use the top player data set when possible, and filter for the latest dates possible to leave a satisfactory sample size.

Marginal Win Rate, or MWR: this is effectively the difference between GP WR of the card and the mean of the entire data set, with a small adjustment to smooth things out for small sample sizes.
Pick Equity, the heart of DEq: Pick equity is a decreasing quadratic function of ATA, such that a card with ATA 1.0 will have 3.0% pick equity, and a card with ATA 14.0 will have 0.0% pick equity.
Bias Adjustment: Scales between the raw MWR for early picks and an archetype-adjusted MWR for late picks using pick equity and an arbitrary coefficient as scale factors.
Metagame Adjustment: We expect the archetype win rate to regress to the mean over the observed period. This adjustment attributes some of that decay to the expected DEq of the individual card.

Quantum Riddler has a GP WR of 63.4% against a mean of 59.6%. Due to the small sample size, we attenuate the GP WR to 63.3%, giving an MWR of +3.7%.
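The exact small-sample adjustment is not spelled out here, so the following is only a sketch of one common way to attenuate a win rate: shrink it toward the data set mean by adding a fixed number of pseudo-games at the mean. The function names, the `prior_games` constant, and the game count in the example are all hypothetical, chosen only so the output lines up with the Quantum Riddler numbers above.

```python
def attenuated_win_rate(gp_wr: float, games: int, mean_wr: float,
                        prior_games: float = 200.0) -> float:
    """Shrink an observed GP WR toward the data set mean.

    Assumption: additive smoothing with `prior_games` pseudo-games played
    at the mean win rate. The real DEq adjustment may differ.
    """
    return (games * gp_wr + prior_games * mean_wr) / (games + prior_games)


def marginal_win_rate(gp_wr: float, games: int, mean_wr: float,
                      prior_games: float = 200.0) -> float:
    """MWR in percentage points: attenuated GP WR minus the data set mean."""
    return attenuated_win_rate(gp_wr, games, mean_wr, prior_games) - mean_wr


# Hypothetical sample of 7400 games with the win rates quoted in the text.
print(round(attenuated_win_rate(63.4, 7400, 59.6), 1))  # 63.3
print(round(marginal_win_rate(63.4, 7400, 59.6), 1))    # 3.7
```

The smaller the sample, the more the estimate is pulled toward the mean, which is the qualitative behavior described for MWR.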

Thrumming Hivepool has an unremarkable GP WR, but due to its ATA of 1.29, it earns +2.87% pick equity, enough for a high B.
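The text gives only the endpoints of the pick equity curve (3.0% at ATA 1.0, 0.0% at ATA 14.0) and says it is a decreasing quadratic. One quadratic that satisfies both endpoints and reproduces the +2.87% quoted for ATA 1.29 places the vertex at ATA 14. The sketch below assumes that form; the function name and parameters are illustrative rather than the official definition.

```python
def pick_equity(ata: float, first_pick_value: float = 3.0,
                last_pick: float = 14.0) -> float:
    """Pick equity in percentage points for a given ATA.

    Assumption: a quadratic with its minimum (zero) at `last_pick`,
    scaled so that ATA 1.0 maps to `first_pick_value`.
    """
    # Clamp ATA to the range the curve is defined on.
    ata = min(max(ata, 1.0), last_pick)
    # Fraction of the pick range remaining: 1.0 at ATA 1, 0.0 at ATA 14.
    remaining = (last_pick - ata) / (last_pick - 1.0)
    return first_pick_value * remaining ** 2


print(round(pick_equity(1.29), 2))  # 2.87, matching Thrumming Hivepool
```

Because the curve flattens out as ATA approaches 14, the difference in pick equity between two late picks is much smaller than the difference between two early picks.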

Early in OM1, Gallant Citizen is played primarily in GW, which has a ridiculous 4.2% archetype MWR. Averaging over its participation in the remaining archetypes, its archetype MWR is 2.1%. Gallant Citizen has a pick equity of 1.1%, so using a scale factor of 1 - 1.1%/3% ≈ 0.63 and the fixed coefficient of 1.0 gives a bias adjustment of -2.1% * 1.0 * 0.63 ≈ -1.3%.
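The Gallant Citizen arithmetic can be written as a small function. This is a sketch that follows the worked numbers above: the scale factor is one minus the card's share of the maximum pick equity, and the coefficient is left as a parameter (defaulting to the 1.0 used in this example). The function and argument names are illustrative.

```python
def bias_adjustment(archetype_mwr: float, pick_equity: float,
                    coefficient: float = 1.0,
                    max_pick_equity: float = 3.0) -> float:
    """Bias adjustment in percentage points.

    archetype_mwr: participation-weighted archetype MWR for the card.
    pick_equity:   the card's pick equity (0.0 to max_pick_equity).
    """
    # Late picks (low pick equity) receive nearly the full adjustment;
    # a true first pick (pick equity 3.0) receives none.
    scale = 1.0 - pick_equity / max_pick_equity
    # The sign is flipped: strong archetypes are discounted, weak ones credited.
    return -archetype_mwr * coefficient * scale


print(round(bias_adjustment(archetype_mwr=2.1, pick_equity=1.1), 1))  # -1.3
```

A card from an underperforming archetype has a negative archetype MWR, so the same function returns a positive adjustment for it.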

Bayo, Irritable Instructor is a somewhat highly-picked rare in the worst color, red. Its archetype MWR is -3.4% and it gets 2.0% of that back from the bias adjustment, and an additional 0.1% from the expected one-day shift in the metagame in favor of red. Over time, that metagame adjustment will grow as the distance between the observations and the draft date increases.

Why GP WR

DEq was born out of a pair of observations. The first is that for pick one, pack one, the average outcome of the entire draft after that point, not just the games where the card in question was drawn, gives the proper estimation of the value of the pick. In statistical terms, the mean of the “as-picked” match win rate averaged over all P1P1 observations gives the maximum likelihood estimation of the expected win rate from that point. GP WR, along with % GP, are the fields available on 17Lands that represent that information with the least distortion. The second observation was that GIH WR, the most commonly used card quality metric, has significant issues that create systemic bias in the evaluation of certain cards. Bluntly, it is our job to win all of the games we register our deck for, not just the ones where we draw the card in question. I believe that GP WR records (most of) the necessary win/loss information to judge picks, and that a good metric can be developed by directly addressing its biases rather than by “artificially” boosting the card strength signal, as GIH WR does.

Why Pick Equity

Pick equity is intended to account for the cost, as measured in win rate, of the “unopened” pick exchanged for the card in question. A card with a lower ATA value was exchanged for more valuable picks, leaving less valuable picks available to accrue to the measured win rate with the rest of the pool. By restoring that value, we can more accurately compare cards against each other in the context of the same pack.

If that explanation isn’t intuitive to you, think of another way of obtaining the same result. A more natural way to combine the metrics might be to modify the pick order suggested by ATA by the win rate data to obtain a new pick order. Pick underrated cards, cards with a high win rate, earlier, and overrated cards later. DEq is doing the same thing, but presenting it in win rate space.

It turns out that this simple modification transforms GP WR from being a rather poor card quality metric to an extremely good one. While I believe the below adjustments are useful and make the metric more robust especially in evolving metagames, there is scant evidence they lead to higher win rates.

Briefly on the shape of the curve – the quality available in each pick is reduced by the quality of the card removed by the previous player. Since the quality of that card on average is decreasing, we should expect a convex shape to the pick equity curve, and data indeed supports that. The quadratic curve was chosen to reflect that convexity with simple assumptions while fitting the constraint that the quality removed from the second-to-last pack is nearly zero. The value of the first pick, 3.0%, is frankly not supported by evidence and was chosen because it leads to the best results. My analysis continually suggests a number closer to 2%, but 3% performs better, I think because it captures something else that is going on with early picks besides the opportunity cost itself. My best hypothesis is that it’s related to the idea that some important cards tend to fall to players in the open lane and need to be further discounted as first picks.

The Bias Adjustment

The most commonly cited issue with GP WR is that it presents the win rate of the deck, or even the archetype at large, and not so much the particular card. While pick equity goes a long way to fixing this, as the early pick cards get the additional boost they deserve for carrying the load of getting you into a winning archetype, it is still the case that late-pick cards get an undeserved boost from the quality of deck likely to be correlated with the choice. To measure the impact of a card marginal to the decks it is likely to appear in, you can take its participation rate in each archetype, and take a weighted average of the mean win rates for those archetypes to use as the baseline win rate rather than the cohort’s overall mean win rate. Then the GP WR marginal to that number would be more particular to the card than to the archetype.
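As a rough illustration of that weighted average, the sketch below works in MWR terms (archetype win rate minus the cohort mean). Apart from GW's 4.2% and the resulting 2.1% baseline, which echo the Gallant Citizen example, every number here is hypothetical, as are the function names.

```python
def weighted_archetype_mwr(participation: dict[str, float],
                           archetype_mwr: dict[str, float]) -> float:
    """Participation-weighted average of archetype MWRs, in percentage points."""
    total = sum(participation.values())
    return sum(share * archetype_mwr[name]
               for name, share in participation.items()) / total


def archetype_adjusted_mwr(raw_mwr: float,
                           participation: dict[str, float],
                           archetype_mwr: dict[str, float]) -> float:
    """Card MWR measured against its archetype baseline instead of the cohort mean."""
    return raw_mwr - weighted_archetype_mwr(participation, archetype_mwr)


# Hypothetical share of the card's games played in each archetype.
participation = {"GW": 0.5, "WB": 0.3, "WU": 0.2}
# Archetype MWRs; only the GW figure comes from the text.
mwr_by_archetype = {"GW": 4.2, "WB": 0.5, "WU": -0.75}

baseline = weighted_archetype_mwr(participation, mwr_by_archetype)
print(round(baseline, 1))  # 2.1
# A hypothetical raw MWR of +3.0 is worth about +0.9 against that baseline.
print(round(archetype_adjusted_mwr(3.0, participation, mwr_by_archetype), 1))  # 0.9
```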

For a 14th pick, I think this calculation makes the most sense. Does playing the card improve the decks it is added to or not? However, for a first pick, as mentioned, the GP WR proper is the optimal estimator because you are taking responsibility for the entire draft as you make that first pick. The bias adjustment uses pick equity as a sliding scale between ATA 1.0 and 14.0 to balance these two calculations. The earlier the pick is observed, the more responsibility it has for the archetype the observed players ended up in.

Finally, this heuristic adjustment seems to be too strong in practice. It seems that independent of individual card quality, there is evidence for biasing towards successful archetypes and away from underperforming ones. Applying the adjustment strictly would tend to work against that bias. Therefore DEq has an additional free parameter that scales the bias adjustment, which has been calibrated to 0.6 using some backtesting and intuition.

The Metagame Adjustment

The metagame changes over time. In the early format, certain decks have standout performance, and in response they are drafted more by the community over time. As a result, the cards required for the deck become spread over a larger number of decks, and the average quality of decks containing those cards decreases. On average, this is a predictable pattern. Everyone knows that the top deck on day one will not be as dominant in performance on day twelve. However, decks are not cards and it’s not obvious how to apply that decay to the rating on an individual card.

Inherent in DEq is interplay between win rate and pick order. As an individual card is picked higher, its win rate is expected to decrease, and DEq is hypothetically neutral to those adjustments. In fact, looking for that invariance in the data is another way to approach the pick equity measurement independent of the opportunity cost argument. However, that invariance assumes that the quality of the rest of the deck remains constant. In a changing metagame, the value of a card decreases as the expected quality of the deck one would play it in decreases. However, we shouldn’t expect the card quality to decrease by the full gap between the initial outperformance of the deck and average performance. While the win rate of the deck may decrease to average, the cause of that decay is the rise in pick orders for the cards, indicating that they are still premium picks, not average ones. If you are passed those same cards in the positions they were formerly available in, the deck will probably continue to outperform. Empirically, I’ve found that about 60% of that initial outperformance of the top decks can be allocated to modeling the decay in the quality of the individual cards, on average. What that means is that if, on day one, the top deck has a 2% outperformance compared to average, a characteristic card for the deck can be expected to have its DEq decline by about 1.2% as the format stabilizes.
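That last example can be written as a one-line calculation. The symbols are introduced here only for illustration: $\Delta_{\text{deck}}$ is the day-one outperformance of the deck and $f \approx 0.6$ is the empirically allocated share.

$$
\Delta \mathrm{DEq}_{\text{card}} \approx f \cdot \Delta_{\text{deck}} = 0.6 \times 2.0\% = 1.2\%
$$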

The metagame adjustment uses the same archetype outperformance calculated in the bias adjustment step. It estimates the shape of the decay of that value given the time period the data was collected over, and calculates an adjustment to DEq corresponding to its estimated value on the following day. The metagame decay is highly imprecise but I believe that on average it leads to more sensible picks.

Final Calculation

We use the simplifying assumption that a card adds no value to your deck when it is not played. In reality, there can be selection bias (as with any of the other assumptions), but in general, we should expect a card with a lower % GP stat to contribute its value commensurately less often. As a result, we take the sum of the previously mentioned terms as the “In-Deck DEq”, and multiply by the % GP to obtain the final DEq value. This is the number that represents the estimate of the value, measured in win rate, added to the average deck when the card is added to your pool relative to the alternative of a basic land. Note that in order to aggregate multiple picks to estimate a deck win rate (which would be highly imprecise due to the lack of second-order effects, but still a worthwhile exercise), one should use a non-linear model like a logistic curve, and map the DEq values into logit space. This can be done and leads to interesting results which I hope to share with you down the road sometime.
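Put together, the final step is a sum followed by a multiplication. The sketch below assumes the four components have already been computed in percentage points and that % GP is expressed as a fraction between 0 and 1; the function name and the example inputs are hypothetical.

```python
def deq(mwr: float, pick_equity: float, bias_adj: float,
        metagame_adj: float, pct_gp: float) -> float:
    """Final DEq in percentage points.

    The four components are summed into the "In-Deck DEq", which is then
    weighted by how often the card actually makes the deck (pct_gp, 0 to 1).
    """
    in_deck_deq = mwr + pick_equity + bias_adj + metagame_adj
    return pct_gp * in_deck_deq


# Hypothetical card: strong win rate, picked early, played in 80% of pools.
print(round(deq(mwr=3.7, pick_equity=2.0, bias_adj=-0.5,
                metagame_adj=-0.2, pct_gp=0.8), 2))  # 4.0
```

A card that is never played gets a DEq of zero under this assumption, regardless of its in-deck numbers.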

The letter grade is then derived from the DEq value using a fixed scale that places the C/C- boundary near (actually just below) the 0.0% mark. I use a fixed scale rather than a normalized one because it gives a sense of the relative compression of quality that can occur over a format and the differences between formats. Still, you will find that there is remarkable consistency between the distributions of quality in different formats as they stabilize, typically with just a couple cards in the A+ range down to the largest group at C-. I have found it easy to communicate my letter grades with people used to the LR system without any kind of adjustment, but of course such things are always up for debate.

One word about negative DEq values. You will note that they are theoretically impossible. You can’t make a pool worse by adding a card with text to it. Nevertheless, a large group of cards will always have negative DEq according to the model, and some will even drop into what is assigned the “D+” range, which is as low as my scale goes. Because the worst possible card would probably end up with a DEq of 0 on account of not being played, some interpretation is required here. Rather than thinking of low-DEq cards as “bad”, it is better to think of them as “dangerous”. That is, these are the cards that have damaged people’s win rates, because they were picked and played and the results obtained were worse than simply skipping the pick, even accounting for the performance of the archetype somewhat. When I draft with DEq, I effectively avoid any card below, say, -0.50% in DEq. I have almost never found a reason to play a card like that in any deck, despite tempting or popular cards frequently drifting well into that zone. (Looking at you, Breaching Dragonstorm.) Above that, I want to make sure I have a clear story to tell of why my application is different from the typical one before I take a C- over a card with solid numbers. To complete the thought, at C and above, I want to have a positive story to tell about the role the card plays in the format for me and the decks I intend to build around it. I find that if I have strategies in mind for all cards at C and above, I barely have to think about the cards below that, to my benefit.

Conclusion

Keep in mind that DEq is not infallible or complete. It is an estimate based on my best analysis so far. In the future I may update the formula, although the combination of ATA and GP WR will always be the heart of it. A metric based on broad averages will never completely capture either optimal play or your particular play style, so you should develop adjustments that go beyond what you find in the metric as you understand the set and the metagame better. In my TDM post-mortem I’ve already looked at one example where the assumptions DEq uses did not seem to hold for a subset of cards. If it’s possible to predict that phenomenon using the daily data, then that’s the kind of thing I would hope to improve down the road. As you think about both DEq and other metrics, I encourage you to reflect on what exactly the metric is measuring and how that may differ from your goals in using it.

Hopefully now you have an idea of why DEq is the way it is and why you might want it to be that way. Before we look into some examples of using the metric in practice, let’s think about what it is supposed to mean. DEq represents the average amount of win rate added for the average observation when the card was chosen. Applying it to your own situation immediately asks that you estimate the difference between those two distributions: the average observation and your own draft. Once you’ve got your starting number, it’s worth holding that number in your head as you make your future picks, and remembering that an expectation is not set in stone. As you make your next pick, that will affect the value represented by your earlier picks. Pick a card that is more synergistic than what was observed in the average data point, and the value of your earlier picks increases. Pick a card in a completely different archetype, and in some sense the value represented by the two cards might be mutually exclusive – you will either get one or the other, and the expected value of each decreases. While I don’t advocate doing any math while drafting, it’s worth holding a sense of those magnitudes in your head. In the next article I’ll walk through some picks and show exactly how I use the metric and some of the other information on 17Lands.com to learn the format and guide my drafts.