What a counting horse can teach us about evaluating AI tools and systems.
Clever Hans, the horse mathematician, was a worldwide sensation in 1904. With skill comparable to that of a human grade schooler, the horse solved basic arithmetic problems.
"Ten minus six?" His owner would ask.
"<stamp>, <stamp>, <stamp>, <stamp>". Clever Hans would stamp the ground four times to give his answer.
While many were impressed by the animal's numeracy, some remained skeptical. And eventually — after years of performing for enthusiastic crowds — that skepticism proved to be well-founded.
Without realizing it, Clever Hans's owner reacted with expectant body language when the horse produced a correct answer. Clever Hans had accidentally been trained to recognize his owner's anticipation, and other subtle cues, rather than abstract mathematical concepts.
Today, more than a century later, something similar is happening with many artificial intelligence products. From overfitting to shortcut learning to the perpetuation of hidden biases, there are many ways these products can fool us, even when they appear to work.
Just as Clever Hans's owner did not know his horse was cheating, complex artificial intelligence systems can fool us with specious performance. This is a central problem in the procurement of new tools and technologies. If the developer of an AI system doesn't know exactly what the system is doing, a would-be purchaser is completely in the dark.
How do you know if your AI is a "Clever Hans"? To start, you might want to assume that it is, until proven otherwise.
Ultimately, to address AI's Clever Hans problem, we need to revolutionize the way AI systems are produced and evaluated. This won't happen overnight, but in the remainder of this post, we outline two partial solutions: AI Audits & Evaluation Authorities.
AI Audits
If you're building — or looking to purchase — an AI system, how do you know if the technology does what you think it does? To answer this question, you need to characterize the behavior of the system, identify limitations, and uncover risks. Auditing an AI system is one way to do this.
IQT Labs is developing a comprehensive approach to AI auditing, informed by the AI Ethics Framework for the Intelligence Community. Most recently, we audited a Large Language Model (LLM) called RoBERTa, focusing on how well the model identifies named entities (people, organizations, etc.) in unstructured text. During our audit, we identified a range of potential biases and harms, which we then evaluated statistically.
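To make the audit's subject concrete, the sketch below runs named entity recognition with the Hugging Face transformers pipeline. The checkpoint name is a placeholder rather than the specific model we audited, and the snippet is only a minimal illustration of the behavior an audit probes.

```python
# Minimal NER sketch (illustrative only); the checkpoint name is a placeholder,
# not necessarily the RoBERTa variant evaluated in our audit.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="example/roberta-ner",    # placeholder RoBERTa-based NER checkpoint
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

text = "Ada Lovelace worked with Charles Babbage in London."
for entity in ner(text):
    # Each prediction carries the entity span, its type (PER, ORG, LOC, ...),
    # and a confidence score: the raw material an audit examines.
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```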
Comprehensive audits like this are a good way to reveal risks and to help future users understand the limitations of AI systems. However, these audits can be time-consuming and resource intensive, requiring a significant amount of custom effort. For example, our RoBERTa audit took several months to complete, with eight people contributing to the work.
How can we do this work more efficiently? What can we automate? Can we move from this expensive form of auditing to something more cost-effective and sustainable — something that looks more like market standards for AI systems?
Evaluation Authorities for AI
An "evaluation authority" is a trusted, third-party entity that sets standards. An evaluation authority for AI could build upon the context of a completed audit, to enforce performance requirements, maintain evaluation datasets, and root out parlor tricks for a broad class of AI systems.
The risk identification process is inherently qualitative and must be conducted by humans, but the downstream statistical evaluation of those risks is captured in code, data, and requirements that can be re-run programmatically. Consequently, the first part of an audit establishes the expectations for a system, and the second evaluates whether the system is adequate to the task. While the effort for a first audit is considerable, the marginal cost of future evaluations can be effectively zero — if organizations are willing to "open-source" their auditing work in the form of an evaluation authority.
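As a rough illustration of what that reusable artifact can look like, the sketch below expresses audit findings as metric thresholds that a candidate system's scores are checked against programmatically. The metric names, thresholds, and numbers are hypothetical, not values from our RoBERTa audit.

```python
# Sketch: audit expectations captured as code + data + requirements.
# All metric names and numbers are illustrative.

# Hypothetical requirements an audit might establish for an NER system.
REQUIREMENTS = {
    "overall_f1": 0.85,         # minimum aggregate F1 on the evaluation set
    "per_entity_recall": 0.80,  # minimum recall for every entity type
}

def evaluate(system_scores: dict) -> dict:
    """Check a candidate system's scores against the audit requirements."""
    return {
        "overall_f1": system_scores["overall_f1"] >= REQUIREMENTS["overall_f1"],
        "per_entity_recall": all(
            recall >= REQUIREMENTS["per_entity_recall"]
            for recall in system_scores["recall_by_entity_type"].values()
        ),
    }

# Scores computed on the authority's maintained evaluation dataset; re-running
# this check for a new system costs essentially nothing.
candidate = {
    "overall_f1": 0.88,
    "recall_by_entity_type": {"PER": 0.91, "ORG": 0.84, "LOC": 0.79},
}
print(evaluate(candidate))  # {'overall_f1': True, 'per_entity_recall': False}
```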
Today, no organization has the primary goal of programmatically certifying AI systems, but collectively, we can establish evaluation authorities for AI in an efficient and scalable way by drawing on findings and results from completed audits. (We provide additional guidance in this recently published paper on Data Centric Governance, which outlines the technical foundation for creating an evaluation authority.)
Not all systems require audits, but evaluation authorities can be the byproduct of all AI audits.
In the coming weeks we will demonstrate this idea through a case study. Drawing on our audit of RoBERTa, we will develop an evaluation authority for named entity recognition (NER) and use it to evaluate a collection of other language models that offer different strengths for the same problem. By making this work public, others who are interested in procuring a named entity recognition system could ask a vendor to compare their solution to performance statistics from previously tested models. In effect, our RoBERTa audit could create a market standard for named entity recognition.
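As a hedged sketch of that kind of comparison, the snippet below checks a vendor's reported score against statistics a hypothetical evaluation authority has published for previously tested models; all model names and numbers are invented for illustration.

```python
# Sketch: comparing a vendor's reported F1 to published results from models the
# authority has already tested. Names and numbers are invented for illustration.
PUBLISHED_RESULTS = {
    "previously_tested_model_a": 0.86,  # F1 on the authority's evaluation set
    "previously_tested_model_b": 0.83,
}

def compare_to_baselines(vendor_name: str, vendor_f1: float) -> str:
    best_name, best_f1 = max(PUBLISHED_RESULTS.items(), key=lambda item: item[1])
    verdict = "meets or beats" if vendor_f1 >= best_f1 else "trails"
    return f"{vendor_name} (F1={vendor_f1:.2f}) {verdict} {best_name} (F1={best_f1:.2f})"

print(compare_to_baselines("vendor_model", 0.84))
# vendor_model (F1=0.84) trails previously_tested_model_a (F1=0.86)
```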
An additional benefit of developing an evaluation authority in this manner is that it can be expanded incrementally to cover additional risks as they are identified. In the RoBERTa audit, we identified erasure as a significant risk: some entities are systematically not recognized. In the future, additional tests could easily be added to tease out other risks.
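As a sketch of how such a test could be added, the example below computes recall separately for each entity category and flags categories that fall below a floor, which is one way to surface erasure; the data, categories, and threshold are illustrative.

```python
# Sketch of an erasure test: measure recall per entity category and flag
# categories the model systematically fails to recognize. Illustrative only.
from collections import defaultdict

def recall_by_category(gold_entities, predicted_entities):
    """gold_entities and predicted_entities are sets of (text_span, category) pairs."""
    found, total = defaultdict(int), defaultdict(int)
    for span, category in gold_entities:
        total[category] += 1
        if (span, category) in predicted_entities:
            found[category] += 1
    return {category: found[category] / total[category] for category in total}

gold = {("Ada Lovelace", "PER"), ("IQT Labs", "ORG"), ("London", "LOC")}
predicted = {("Ada Lovelace", "PER"), ("London", "LOC")}  # the ORG entity is missed

ERASURE_FLOOR = 0.80  # hypothetical minimum acceptable recall per category
for category, recall in recall_by_category(gold, predicted).items():
    if recall < ERASURE_FLOOR:
        print(f"Possible erasure risk: {category} recall = {recall:.2f}")
```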
Eventually, some form of evaluation authority emerges for every new technology. Stamped on the bottom of your laptop's AC adapter is the mark "UL Listed." The "UL" stands for Underwriters Laboratories, an organization that has tested and standardized electrical safety since 1894. No one wants to buy an electrical system in the absence of electrical standards; without organizations like UL, electricity would be far too dangerous to embed in the walls of our homes and businesses.
The safety and fairness problems of AI are just as profound, but today, AI products operate without any comparable regulatory standards or building codes.
The era of moving fast and breaking things is over. The best path to market for modern AI systems will run through efficient and comprehensive evaluations.
If you purchase an AI system certified by an evaluation authority, you can trust that it will perform as expected and that it supports continuous improvement. This trust is also fundamental to a well-functioning market. When we know we can trust the performance of AI systems, multiple providers can compete on the quality of their solutions rather than the skill of their marketing.
Interested in how this all works in practice? Check back here over the next few months to see our progress!
Credits and Acknowledgements
The Clever Hans framing is inspired by Ian Goodfellow and Nicolas Papernot's Clever Hans blog, which is an excellent resource for understanding the current state of adversarial machine learning. More recently, Kate Crawford's superb AI social impacts book, "Atlas of AI," makes reference to the horse.