Humans Aren’t Always the Gold Standard We Think They Are 

In machine learning, there’s a persistent belief that if you’re training a model to understand the world, your best bet is to start with data labeled by humans. After all, humans have eyes, context, and judgment, and we’ve been interpreting the world around us our entire lives. So, when we label data, especially for tasks like object detection or classification, it’s easy to assume that human annotations represent “ground truth.”

But what if the “truth” is a little messier than that? 

Back in 2009, I co-authored a paper introducing the Overhead Imagery Research Data Set (OIRDS) – a relatively large, open-access library of aerial imagery curated specifically for vehicle detection and classification research*. At the time, deep learning was taking off, and the availability of labeled datasets was a key driver of research. However, there was a gap in the availability of annotated overhead imagery, so we built a dataset from scratch and enlisted a team of human annotators to label targets (in our case, ground vehicles) in each image.

The annotators weren’t domain experts. They were untrained volunteers given some basic instructions and a set of reference images. And that was by design – we wanted to simulate the kind of annotation process that’s common in industry where labeling is often crowdsourced or performed by undergraduate or graduate students in an academic lab. 

What we learned was both humbling and instructive. 

Humans Don’t Always “See” the Same Thing 

Across the dataset, every image was reviewed by 3 to 5 independent annotators. In theory, this redundancy should have helped us establish a strong consensus about what was in each image. But in practice, the results were inconsistent. 

When it came to detecting vehicles, human annotators only agreed on about 95% of the targets. In other words, 1 in every 20 vehicles identified by one rater was missed by another. These weren’t obscure edge cases—they were everyday omissions (‘Oh, I didn’t see that’) or straightforward disagreements about what qualified as a vehicle in cluttered, occluded, or shadowed scenes.

The disagreement widened even further when we looked at classification. Annotators were asked to label vehicle type (e.g., car, truck, pickup, unknown), and only agreed about 85% of the time. That’s a significant gap for something as seemingly basic as distinguishing a sedan from a pickup truck. Worse still, several studies have shown that assigning incorrect class labels degrades model performance more severely than simply “missing” an object**.
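To make numbers like these concrete, here is a minimal sketch of how pairwise agreement could be computed. It is not the OIRDS scoring code, and the data structure (a per-target dict of rater labels, with None for a rater who missed the target entirely) is purely hypothetical; a more rigorous analysis would use a chance-corrected statistic such as Fleiss’ kappa.

```python
from itertools import combinations

# Hypothetical structure: for each target, the class label each rater assigned.
# None means that rater never marked the target at all (a missed detection).
labels_per_target = [
    {"rater_1": "car",   "rater_2": "car",  "rater_3": "pickup"},
    {"rater_1": "truck", "rater_2": None,   "rater_3": "truck"},
    {"rater_1": "car",   "rater_2": "car",  "rater_3": "car"},
]

def pairwise_agreement(targets):
    """Fraction of rater pairs, across all targets, that gave the same label.

    A pair where one rater missed the target counts as a disagreement.
    """
    agree, total = 0, 0
    for target in targets:
        for r1, r2 in combinations(sorted(target), 2):
            a, b = target[r1], target[r2]
            total += 1
            if a is not None and a == b:
                agree += 1
    return agree / total if total else 0.0

print(f"Pairwise classification agreement: {pairwise_agreement(labels_per_target):.0%}")
```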

Figure 1. Three images from OIRDS. The vehicles denoted by green polygons had 100% annotator agreement; the vehicles annotated in red had <50% agreement.

And the disagreements weren’t random. Some annotators consistently labeled more vehicles than others. Some were stricter about what counted as “occluded.” One rater, for instance, was red-green colorblind—something that subtly influenced their interpretation of vehicle color and occasionally led to “misclassifications”. 

What Causes the Inconsistency? 

Several factors contributed to the inconsistency we observed: 

  • Subjectivity in visual perception: What one person sees as a truck, another might interpret as a large SUV—especially when resolution is low or shadows obscure important features. 
  • Ambiguity in instructions: Even with drop-down menus and reference guides, annotators interpreted guidance differently. For example, the term “camouflaged” caused confusion until we clarified that it referred to intentional concealment, not just color blending. 
  • Annotation environment: Differences in monitors, lighting, and calibration introduced subtle variation in how features like color or shadow were perceived. 
  • Task fatigue: Labeling hundreds of images in sequence can lead to attention drift, especially in cluttered scenes. Missed detections and inconsistent classifications often happened late in sessions. 
  • Ontology limitations: Boolean fields like “casts shadow?” proved too coarse. Some targets clearly cast shadows, but annotators couldn’t agree whether the shadow was “large enough” to count. Without gradations or confidence levels, binary responses failed to capture meaningful nuance. 

Why This Still Matters 

You might think this is just an artifact of an older study – after all, this research is now 15 years old. But the core lesson is more relevant than ever. 

Today’s foundation models are bigger, faster, and often pre-trained on millions of weakly labeled examples. But many production AI systems – especially in defense, geospatial, and medical domains – still depend on human annotations for training, tuning, or evaluation. If those annotations are inconsistent or ambiguous, we risk drawing flawed conclusions about model performance.

In fact, annotation noise isn’t just a challenge for training – it affects evaluation too. If annotators disagree about what’s in the ground truth, how can we confidently measure recall, precision, or F1 score? It’s entirely possible that a model could detect a vehicle correctly and still be penalized because the annotator missed it. Any model developer can tell you how frustrating it is to get dinged on F1 only to discover that it was a result of errors in the validation set.     
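Here is a toy illustration of that failure mode. This is not the evaluation code from the OIRDS work; the box format, the IoU threshold, and the greedy matching are illustrative assumptions, but the effect is the same in any standard detection scorer: a correct detection of a vehicle the annotator missed shows up as a false positive and drags down precision and F1.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def detection_scores(predictions, ground_truth, iou_threshold=0.5):
    """Precision/recall/F1 with greedy one-to-one matching against ground truth."""
    unmatched_gt = list(ground_truth)
    tp = 0
    for pred in predictions:
        best = max(unmatched_gt, key=lambda g: iou(pred, g), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            tp += 1
            unmatched_gt.remove(best)
    fp = len(predictions) - tp   # detections with no matching label
    fn = len(unmatched_gt)       # labels the model never found
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The model correctly finds both vehicles, but the rater only labeled one of them.
preds = [(10, 10, 30, 30), (50, 50, 70, 70)]
gt = [(11, 11, 31, 31)]
print(detection_scores(preds, gt))  # precision drops to 0.5 even though both detections are real
```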

So What Can We Do? 

This isn’t a call to get rid of human annotators. Far from it. But it is a call to rethink how we use them and, more importantly, the process for collecting annotations (a subject for a future blog post).

Here are a few ideas drawn from the OIRDS experience and modern best practices: 

  • Use multiple judgements per image and treat agreement as a confidence metric – not an assumption (see the sketch after this list). And yes, I say ‘judgements’, not just ‘annotators’, because pre-labeling and machine-assisted workflows are now a critical part of the process.
  • Design ontologies carefully to capture meaningful differences without oversimplifying. Use sliders, probabilistic fields, or optional comments to capture ambiguity when it matters. 
  • Invest in rater training and calibration. This is especially critical for non-literal image data like Synthetic Aperture Radar (SAR). 
  • Recognize that “ground truth” is sometimes just consensus – and sometimes, that consensus still has a margin of error. 
  • Have a well-calibrated quality control process built into the pipeline. Again, “consensus” still has a margin of error.
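As a starting point for the first bullet, here is a small sketch of turning rater agreement into a per-example training weight. The vote format and the majority-vote consensus rule are assumptions for illustration, not the proprietary process mentioned below; the point is simply that contested labels can be down-weighted rather than trusted equally.

```python
from collections import Counter

def consensus_and_weight(votes):
    """Majority-vote label across raters, plus the agreement fraction as a weight."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Hypothetical class votes for two targets, each judged by several raters
# (each human label or machine pre-label counts as one judgement).
raw_votes = [
    ["car", "car", "car"],                 # unanimous -> weight 1.0
    ["pickup", "truck", "pickup", "car"],  # contested -> weight 0.5
]

for votes in raw_votes:
    label, weight = consensus_and_weight(votes)
    print(label, weight)

# Many training APIs accept per-example weights (e.g. a sample_weight argument),
# so contested labels contribute less to the loss than unanimous ones.
```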

As an aside: those last two bullets will be topics for future blog posts. Specifically, Figure-Eight has a proprietary process for quantifying rater consensus and uncertainty that model developers can use to weight training data. Further, our quality control process helps ensure that “incorrect” labels are caught before being incorporated into a dataset. Stay tuned for those posts.

Final Thoughts 

The OIRDS project was an early attempt to create a richly annotated overhead imagery dataset—and it taught us a valuable lesson: even with the best intentions, untrained humans are not always a reliable benchmark for truth. As we build ever more sophisticated AI systems, we need to be equally thoughtful about the data we feed them – and the humans behind the labels. 

In the end, the quality of your model is only as good as the data and the people it’s built on. 

* Tanner, F., Colder, B., Pullen, C., Heagy, D., Eppolito, M., Carlan, V., Oertel, C., & Sallee, P. (2009). Overhead Imagery Research Data Set – An annotated data library & tools to aid in the development of computer vision algorithms. Proceedings of the IEEE Applied Imagery Pattern Recognition Workshop (AIPR), 2009

** Agnew, C., Scanlan, A., Denny, P., Grua, E. M., Van De Ven, P., & Eising, C. (2024). Annotation Quality Versus Quantity for Object Detection and Instance Segmentation. IEEE Access, 12, 140958-140977. https://doi.org/10.1109/ACCESS.2024.3467008