Many statistical "gold standards" aren't perfect, but that's why they're perfectly named

Statisticians and computer scientists often use the term "gold standard" for the best possible benchmark you could have when trying to estimate something. For example, Elizabeth Sweeney has worked on algorithms for tagging brain lesions in MRIs, and compares the results against a gold standard of human neurologists attempting the same task. In medicine, randomized clinical trials (RCTs) are sometimes referred to as a gold standard approach for estimating a treatment effect because they are more likely to account for unmeasured confounders.

But in these examples and in others, statisticians often question the benchmark's "gold standard" status, saying something along the lines of "it's not really a gold standard" because it isn't perfect. Neurologists are imperfect: they do not always agree with each other, and are not always even internally consistent.[1] Randomization balances unmeasured confounders on average over repeated trials, but it is not guaranteed to do so in any one specific trial.
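
To make the point about randomization concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the function names, the 40-patient trial size, and the standard normal "confounder" are hypothetical, not taken from any real trial. It randomizes many imaginary trials and records how unevenly an unmeasured confounder ends up split between the two arms.

```python
import random
import statistics


def confounder_imbalance(n_patients, rng):
    """Randomize one hypothetical trial and return the difference in the
    mean of an unmeasured confounder between the treatment and control arms."""
    # Hypothetical unmeasured confounder (say, baseline health), standard normal.
    confounder = [rng.gauss(0.0, 1.0) for _ in range(n_patients)]
    # Randomly assign half of the patients to treatment, the rest to control.
    indices = list(range(n_patients))
    rng.shuffle(indices)
    half = n_patients // 2
    treated = [confounder[i] for i in indices[:half]]
    control = [confounder[i] for i in indices[half:]]
    return statistics.fmean(treated) - statistics.fmean(control)


def simulate(n_trials=2000, n_patients=40, seed=0):
    """Repeat the randomization many times and summarize the imbalances."""
    rng = random.Random(seed)
    diffs = [confounder_imbalance(n_patients, rng) for _ in range(n_trials)]
    return {
        "mean_imbalance": statistics.fmean(diffs),
        "sd_imbalance": statistics.stdev(diffs),
        "largest_single_trial_imbalance": max(abs(d) for d in diffs),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the average imbalance across trials should sit close to zero, while the largest imbalance in a single trial can be sizable. That is the sense in which randomization works "on average" without promising anything about the one trial you actually ran.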

My take is that the term "gold standard" should still apply, though. Even more so, in fact, because the actual gold standard of matching currency to gold reserves isn't perfect either.[2] It's a minor point, but a fun one that's sometimes forgotten.[3]

This post is not meant to be political, but it's fairly well established that the gold standard has both advantages and disadvantages. Tying your currency to gold reserves helps to control inflation, but it also highly restricts your monetary policy. Prices tied to gold tend to be stable in the long run, but not always stable in the short run. For most mainstream economists, the connotations of the gold standard are pretty bad.

In some ways, gold is just another attempt at a best practice for objective value. Like paper money, it's not inherently valuable; it's only valuable by convention. In much the same way, when I hear someone talk about a statistical "gold standard," I like to think of it as a subtle reminder that, for many estimation problems, even the best benchmarks aren't perfect.

Special thanks to economists Daniel Garcia Molina and Sohini Mahapatra for their help with this post!

  1. Within the OASIS paper, section 3.3 discusses within-rater variability, and section 3.4 discusses between-rater variability. 

  2. Of course, one counterargument is that language is defined by modern usage rather than historical meaning, which is the same reason we acknowledge that "literally" can informally be used to mean "figuratively".

  3. To some extent, this point is raised on the Wikipedia page for a gold standard test.