Avoiding a New Phrenology

Seeking Ground Truth at the Crossroads of Behavioral Science and AI

Jeff Brodscholl, Ph.D.
Greymatter Behavioral Sciences

A Riddle

Four phrenologists and a machine learning algorithm are in an abandoned airplane hangar with a million human heads. The phrenologists wish to develop an AI-based tool that uses data from a live video feed to detect whether an in-store customer has the personality of a “market maven” – that is, whether they are someone with the unique constellation of high customer involvement, deep consumer knowledge, and consumer thought leadership that makes them an appealing marketing target given their level of brand engagement and potential influence on other customers [1].

  • The phrenologists develop an initial head coding scheme based on their theory of the market maven personality type and set forth to independently classify 1,000 of the heads using the coded head features as input to their classification rule.
  • After comparing notes and resolving discrepancies, the phrenologists refine the coding scheme, re-code and re-classify the 1,000 heads, and achieve a kappa (a measure of rating concordance) of 0.92 on the coding and 0.91 on the classification. They then proceed to code and classify the remaining 999,000 heads using the final coding scheme.
  • The heads now coded and classified, a machine learning algorithm is trained to classify them using the phrenologists’ coded features as its raw input. Training and model selection are conducted on 700,000 randomly selected heads, with the remaining heads set aside as a holdout sample for final testing. After training, the best-performing model is applied to the holdout sample, where it demonstrates a classification hit rate of 0.89 with an area under the ROC curve (a measure of the model’s ability to simultaneously maximize hit rates and minimize false alarms) of 0.94. (A minimal sketch of how these evaluation metrics can be computed appears just after this list.)
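
For readers who want the riddle’s evaluation step made concrete, here is a minimal sketch – using toy stand-in data, not the riddle’s actual heads – of how the agreement and holdout metrics named above are typically computed in Python. The arrays and the two-rater simplification are hypothetical; a full multi-rater analysis would use something like Fleiss’ kappa instead.

```python
# Toy stand-in data only – this is not the riddle's actual dataset.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(42)

# Two raters' binary "maven" classifications of the same 1,000 heads.
# (Pairwise Cohen's kappa shown for simplicity; four raters would call for Fleiss' kappa.)
rater_a = rng.integers(0, 2, size=1_000)
rater_b = np.where(rng.random(1_000) < 0.95, rater_a, 1 - rater_a)  # raters mostly agree
print("kappa:", round(cohen_kappa_score(rater_a, rater_b), 2))

# Holdout evaluation: rater-assigned labels vs. the model's predicted probabilities.
holdout_labels = rng.integers(0, 2, size=300_000)
model_scores = np.clip(0.4 * holdout_labels + 0.6 * rng.random(300_000), 0, 1)
model_preds = (model_scores >= 0.5).astype(int)
print("hit rate:", round(accuracy_score(holdout_labels, model_preds), 2))
print("ROC AUC:", round(roc_auc_score(holdout_labels, model_scores), 2))
```

Nothing here is the riddle’s actual data; the sketch simply shows where each of the reported numbers comes from.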

So here’s a question:

  • Do we now have a validated ML-based method for detecting market mavens?

If you know anything about phrenology, then you already know the answer to the question. It’s a gratuitous example for sure, but it’s the very obviousness of it that helps surface some of the more obscure problems that can arise when AI-based tools are used for behavioral insight – problems worth considering as such technologies start to appear with soaring claims about their ability not only to detect hidden behavioral patterns in primary and secondary research data, but to identify deep behavioral processes and biases, and to do so with stellar accuracy rates to boot. These tools may seem like a remarkably new innovation, but they share a few key features with earlier attempts to automate the detection of people’s emotional states from their moment-to-moment facial expressions using machine learning algorithms as the engine of detection. These facial expression recognition technologies, or simply “FER” for short, have captured the interest not only of industry, but also of academic researchers who want to understand how well they hold up under a variety of real-world conditions. Their work, along with the efforts of AI engineers to develop new, improved forms of FER, has left a publication paper trail that tells us something not only about FERs specifically, but also about how any attempt to use AI to infer people’s thought patterns, feelings, dispositions, or biases can run off the rails without the consumers of these technologies ever being aware of it. This post digs into that literature to try to pry out these lessons.

For the present purposes, I keep the focus on what a relatively rapid review of publicly-available data has to say about FER, using a combination of search terms and snowballing to gain decent coverage and keep the review as fair and balanced as possible. To be clear, I have no axe to grind with either FER or AI, nor do I make any predictions, good or bad, about how these technologies will likely evolve or be used in the years ahead. Most technologies start imperfectly, some become much better, and, usually, their uses can have both downsides and upsides depending on how we choose to apply and adapt to them. (I also have no axe to grind with the creators of the latest AI-based insight tools, as I assume some of them have been quick to get these tools out the door under a certain felt pressure to do so.)

But, as a behavioral scientist, I do have a vested interest in understanding how the latest tools likely perform and why, particularly when claims are made about their ability to read deeply into people, and they are then pushed out to the world as wonder tools for behavioral insight when, in their current state, their promotion as such may amount to little more than behavioral science snake oil – a carnival act that does little for behavioral science’s reputation in the long term and only makes it difficult for sober, substantive research and analytic work of any kind to gain a proper toehold. Because the versions of these technologies that have launched thus far exist in a bit of a black box state, we can currently only look at them from the outside – but it’s for this reason that what’s publicly available about their distant FER cousins can be essential in helping us anticipate what goes into building AI tools like these, the issues that can come into play with them, and the types of questions we need to be asking when they’re promoted to us as the most exciting development since the Implicit Association Test.

As you’ll see, we don’t need to invoke our fictitious phrenology riddle to be skeptical of the claims made about these emerging technologies; FER gives us a real-world analogue that teaches us plenty about what we need to be on the lookout for with them, and about what their developers need to show if we are to believe in them on the basis of anything more than a quasi-religious faith in the miracle powers of AI.

The Example of Facial Expression Recognition (FER): A Primer

Before going further, we might ask what exactly an AI-based tool for identifying the hidden biases or behavioral drivers of a particular group of people would have in common with a narrowly-focused emotion detection tool such as FER. To answer this question, we need to look at how FER does what it does, keeping in mind not only the mechanics, but also the conceptual arguments behind why the technology works as a detector for what is ultimately a private experience – something that, like the use of a heuristic in a person’s judgments, is not directly available to an outside observer, but may be inferred from a close examination of a particular behavior in which its presence happens to be reflected.

The conceptual argument is generally quite intuitive: In short, it’s that people’s facial expressions convey something about their underlying emotional states, and that they do so with a sufficient regularity that we ought to be able to accurately read a person’s emotions from a careful analysis of those expressions. This turns out to be more than just an intuition; it’s an idea that has been codified into a famous psychological theory that asserts the existence of universal basic emotions that can ostensibly be found in every culture on Earth, and for which there are corresponding facial expressions that allow someone from the United States, France, or Russia to correctly recognize them in the expressions of someone from Mongolia, Papua New Guinea, or the northernmost reaches of Canada [2,3]. Connected to this theory is the notion that we can systematize the measurement of facial expression to support the scientific study of emotion expression universals – the most famous example being Paul Ekman’s Facial Action Coding System, or “FACS”, which uses an elaborate scheme to break any given facial expression into a set of individual muscle movements, or “action units” (AUs) [4,5]. This coding scheme is supplemented with a separate set of tools that allow an analyst to create a bridge from a given pattern of AU codes to a specific emotion category [6], making it possible for an emotion inference to be based on a formal analysis of facial expression rather than on a potentially error-prone or biased subjective judgment.
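
To make the idea of that bridge concrete, here is a toy sketch – not the actual EMFACS/FACSAID rules, just an illustration of the general mechanism – of how a pattern of coded action units might be mapped to an emotion category. The specific AU combinations below are commonly cited prototype configurations; real FACS-based dictionaries are more elaborate and handle intensities, asymmetries, and partial matches.

```python
# Toy illustration of an AU-pattern-to-emotion "bridge" – not the real EMFACS rules.
PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
}

def classify_from_aus(observed_aus):
    """Return the first emotion whose prototype AUs are all present; else 'unclassified'."""
    for emotion, required in PROTOTYPES.items():
        if required.issubset(observed_aus):
            return emotion
    return "unclassified"

print(classify_from_aus({6, 12, 25}))   # -> happiness
print(classify_from_aus({4, 7, 23}))    # -> unclassified (no prototype fully matched)
```

Real systems score graded AU intensities and tolerate partial matches, but the basic move – from coded muscle movements to an emotion label – is the same.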

Given that we aren’t always well-tuned to the emotions betrayed in people’s behavioral cues, it’s not surprising that Ekman’s work should have eventually caught the eye of professionals in marketing, consumer research, and forensics who have a practical interest in emotion assessment, but who believe that some expressions may be too subtle or transient for observers to catch their meaning absent more rigorous methods. These interests have created a market for FACS-based services which, while promising in principle, have not always been well-served by the complexity of the FACS coding system:

  • Facial expressions must be reviewed by manual coders who are required to go through many hours of training before they can begin applying FACS on their own [7]
  • Coding can take anywhere from 24 minutes to 2 hours for a single one-minute facial expression video depending upon the level of rigor required [7]
  • Lastly, none of this is useful for measuring people’s real-time, real-life emotional experiences, which are often sought after both for research and for implementation in real-world applications (e.g., robotics)

Enter automated facial expression recognition. Unlike the manual application of FACS, FERs make use of advances in the accessibility and affordability of technology to turn the process of emotion detection over to machines that can do in a fraction of the time what manual coders would need hours to accomplish. These approaches use machine learning (ML) or more advanced “deep learning” algorithms to estimate the probabilities of certain emotions being present from the same types of facial feature configurations that are the target of analysis in a manual coding system such as FACS [8-9]. The algorithms are developed, in part, by being exposed to a large body of facial expression stimuli that have already been labeled with the types of emotions they express. Often the stimuli are pictures of faces that have been posed to create highly-prototypical patterns of muscle movements that would be expected of the basic emotions; sometimes, though, they are more natural expressions captured in pictures and videos that have been culled “from the wild” (e.g., the internet) [8-9]. Procedures are then implemented to find the model that does the best job of accurately classifying the stimuli, and to demonstrate its ability to continue accurately classifying facial expressions to which the algorithm hasn’t been previously exposed. When done with the proper care, the result is an algorithm that promises to do within seconds what a human coder would take far more time to accomplish, allowing people’s facial expressions to be captured using a wide variety of camera technologies and analyzed under a range of stimulus conditions [8] – all essential to enabling FER's deployment in real-world conditions where the value of emotion detection can be properly realized.
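
As a rough illustration of that training-and-evaluation loop, here is a minimal sketch in Python using a toy convolutional network in tf.keras. Everything here is a stand-in: the images are random arrays playing the role of labeled facial expression stimuli, the architecture is deliberately tiny, and no claim is made that any particular commercial FER is built this way.

```python
# Minimal sketch of a supervised FER-style pipeline with stand-in data.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

NUM_EMOTIONS = 6  # e.g., the six commonly targeted "basic" emotions

# Hypothetical pre-labeled stimuli: grayscale face crops + annotator-supplied emotion codes.
images = np.random.rand(1000, 48, 48, 1).astype("float32")
labels = np.random.randint(0, NUM_EMOTIONS, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(48, 48, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax"),  # per-emotion probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_split=0.1, verbose=0)

# "Hit rate" on expressions the model has never seen before.
_, holdout_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Holdout hit rate: {holdout_accuracy:.2f}")
```

Note where the labels come from in a sketch like this: they are supplied up front by whoever annotated the stimuli – a point that becomes central later in this post.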

Examining FER Performance: What Do the Data Tell Us?

Given the properties of these algorithms, it’s perhaps no surprise that they should have taken off to form what, by 2020, was already a $3.8 billion market – one expected to grow to $8.5 billion by 2025 [10]:

  • They leverage ML and deep learning at a time when AI and AI-related technologies are deeply respected and in high demand
  • They’re directed at measuring an aspect of human experience – emotion – that is assumed to play an important role in people’s reactions to communications, their behaviors toward products and brands, and their interactions with machines, devices, and the features of their worlds
  • They fit the belief that many important emotional experiences are so hard for human observers to detect, and so prone to self-report biases owing to blind spots and social desirability effects, that we need to turn to machines for more reliable measurement
  • And they promise to make emotion detection something that can be easily implemented in the real world, allowing for data capture that can then be used for real-time insight, experience tailoring, and intervention delivery in both physical and digital environments

But do these technologies really do what they say they do? Consider, for starters, the following:

[Figure 1]

Figure 1: Commercial FER and human facial expression classification performance observed in a sample of published studies using tests across a variety of standardized and unstandardized facial expression stimulus sets [11-23]. Color-coded lines represent average performance for a particular type of stimulus; color-coded points represent results of individual tests. Overall averages (white lines) include one test of FERs and humans that used a single stimulus set composed of facial expressions drawn from 14 standardized sets of posed and spontaneous facial expressions [13]. Averages take account of the number of stimuli used in each test (or, in the case of human performance, the number of participants by the number of stimuli per participant). Note that, with the human tests, sample sizes with UT-Dallas and AffectNet are < 15 (albeit with large numbers of ratings per participant) and should be interpreted accordingly.

This figure presents the hit rates, or percentages of ostensibly correct classifications, that are observed across six commonly-considered basic emotion categories when the performance of commercially-available FERs and humans is measured against facial expression stimulus sets ranging from static and dynamic posed expressions to spontaneous expressions generated under controlled conditions and those that appear “in the wild” [11-23]. The FER tests also include a few instances in which commercial FERs are used to generate AU intensity scores that are then fed into custom-trained open-source classifiers – a maneuver that produces patterns similar to the ones obtained when the commercial FERs alone are used for classification. I’ve not broken out the charts by the included FERs as my interest wasn’t to either elevate or “fry” any one FER, but I’ll point out that they do include some of the most popular and frequently-tested FERs that have been available since the 2010s. I’ve also restricted the FER tests to those published after 2017, and have excluded FERs that were tested on only a few challenging stimulus sets, in the interest of focusing on the most recent models and avoiding putting any one FER at an unfair disadvantage. Note that the sample of human performance tests leans more heavily toward posed images relative to the FER tests, but this only affects the overall averages, not the color-coded breakouts. I’ll also note that the results obtained with humans are quite similar to what has been reported elsewhere, including with respect to response latency data (e.g., slower processing times for basic emotions associated with lower hit rates – a result that takes on added significance as the discussion below unfolds) [24-26].
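
For the curious, the stimulus-weighted averaging mentioned in the Figure 1 caption amounts to something like the following sketch, shown here with hypothetical per-test numbers rather than the actual values behind the figure.

```python
# Stimulus-weighted vs. simple averaging of per-test hit rates (hypothetical numbers).
import numpy as np

hit_rates = np.array([0.72, 0.55, 0.81])   # hit rates from three tests of one emotion
n_stimuli = np.array([480, 120, 950])      # stimuli (or participants x stimuli) per test

print("weighted:", round(np.average(hit_rates, weights=n_stimuli), 3))
print("unweighted:", round(hit_rates.mean(), 3))
```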

What stands out when one looks at these charts is this:

  • On average, humans do as well as, and often beat, the FERs – though there are instances where human performance is hardly stellar
  • For both humans and FERs, there’s considerable performance variability both within and across emotions, with happiness and posed expressions tending to yield the best performance, and fear and spontaneous / in the wild expressions tending to yield the worst
  • Finally, and quite oddly, the pattern of variability tends to be similar for both the humans and the FERs, both in the shape of the profile across emotions and the way performance decreases as expressions become more natural

The first two simply restate what is obvious from the charts – but the third could do with some unpacking. Why, exactly, would we say that the performance similarities are “odd”? Assume we let go of the pretense that FERs are here to “beat” the humans; that has implications for firms that promote FERs as a more accurate way to capture people’s emotions relative to human observers, but it’s hardly a problem for cases where we just want FER to be close enough to human performance to make it a suitable tool for certain practical needs such as detecting people’s emotions as they take actions with objects and events in their natural environments. In these instances, we might be willing to overlook some of the poorer performance of FER relative to humans when the cost of inaccuracy is low, and we might also be open to overlooking the failure of FER to make up for the cases where, on average, people tend to perform worse than one would hope. But, from an overall pattern perspective, we’d probably welcome FERs behaving in a way that’s similar to how people perform, for wouldn’t that tell us that we’ve managed to build something that’s pretty close to the way human minds infer emotions from facial expressions? People may not be perfect at emotion inference, but their abilities are quite remarkable in the larger scheme of things; if we had to settle for a performance goal, wouldn’t we be content to turn to those abilities as a model for FER emulation?

The problem, of course, is that whatever FER is, it isn’t a model for how human minds work. That isn’t to say that the two might not turn out to share some functional similarities, but it would be a fiction to assume that the decisions developers have made to use random forests, decision trees, or, more recently, convolutional neural networks in their FERs have ever been predicated on an assumption that those models correspond to a well-articulated, well-supported theory about the mental processes involved in emotion inference. It’s for this reason that the parallels between human and FER performance are, in fact, a bit odd: They’re not what one would expect if the two were arriving at emotion inferences through relatively different routes. Similar patterns arise with the more recent deep learning-based FERs despite their being trained on stimuli that are more naturalistic than what has often been used in the past [27-40]:

[Figure 2]

Figure 2: FER facial expression classification performance observed across a convenience sample of experimental state-of-the-art models trained and tested on a variety of standardized facial expression stimulus sets [27-40]. As with the commercial FERs, only tests reported after 2017 are included. Color-coded lines represent average performance for a particular type of stimulus; color-coded points represent results of individual tests. Note that only simple averages are used, and the sample is likely skewed toward reports that used "in the wild" expressions [9]. Note also that, unlike the FERs in Fig. 1, all models were demonstrably tested and trained on the same type of stimulus set. Overall performance is better than what's seen in Fig. 1 with current commercial FERs, but with a relative loss in performance on "disgust", and a continued tendency for the best performance to occur with "happiness". Results with SFEW (called out for comparison to Fig. 1) and AFEW show that some of the sampled FERs continue to struggle with these particular "in the wild" sets.

The Common Denominator: What's In a Facial Expression?

That leaves the stimuli against which FER and human performance is assessed as the common denominator between the two, and it raises the question of what, if anything, about the stimuli could be causing lower hit rates for certain emotions and stimulus conditions relative to others.

Some of the pattern could be explained by stimulus features that one would naturally expect to cause hit rates to decline. Some, that is, but not all of it. For instance, the fact that expressions captured “in the wild” pose a challenge for both humans and, to a greater extent, FERs could simply be due to the way lighting conditions, head orientations, and other sources of real-world visual noise can bedevil any visual perceptual system – something that may potentially be less of a problem for humans but is still in need of improvement among the FERs. Similarly, performance decrements with spontaneous expressions could reflect the fact that those expressions are likely to be more diverse and include instances that do not tightly overlap with a given emotion’s prototypical expression. That’s certainly plausible given the way facial expressions are theoretically assumed to be organized – though one has to wonder at what point the resulting performance reflects confusion over peripheral cases as opposed to something fundamentally wrong with the idea that the expressions are organized around the assumed prototypes to begin with.

Harder to account for, though, is the performance variation between emotions, which, in the case of humans, doesn’t make much sense if every one of the emotions targeted by FERs is ostensibly basic with a universally understood set of expressions associated with it. Again, we could resort to within-class diversity to try to explain it, but we’d need to understand why the diversity should be so substantial for certain emotions over others if the standard theoretical account of facial expressions isn’t missing a critical ingredient. (That’s assuming we’re only talking about spontaneous expressions, since, for highly prototypical posed expressions, the explanation wouldn’t even apply.) Granted, some negative emotion expressions such as fear tend to appear less frequently in the real world, and severe class imbalances can induce biases into classification systems that acquire their skills by being exposed to stimuli in which the imbalances may manifest [41]. Yet, there’s reason to believe that this explanation doesn’t quite account for the pattern, either, as maneuvers to correct for these imbalances don’t always work well in improving FER accuracy rates [29,36], and the same performance issues can emerge even when FERs are developed with training sets that are not so severely imbalanced [39].
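
As a concrete illustration of one common correction maneuver alluded to above – and with hypothetical label counts, not those of any cited dataset – inverse-frequency class weighting looks roughly like this; as the studies cited above suggest, applying it does not guarantee better accuracy on the rarer emotions.

```python
# Inverse-frequency class weights for an imbalanced emotion training set (hypothetical counts).
import numpy as np

emotions = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
label_counts = np.array([2500, 800, 600, 9000, 3000, 2100])

# weight_c = N / (K * n_c): the rarer the class, the larger its weight in the loss.
weights = label_counts.sum() / (len(label_counts) * label_counts)
for emotion, w in zip(emotions, weights):
    print(f"{emotion:9s} weight = {w:.2f}")

# Such weights are typically handed to a weighted loss during training
# (e.g., Keras's model.fit(class_weight=...) or PyTorch's CrossEntropyLoss(weight=...)).
```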

The Deeper Problem: Questionable Validity Amongst a Shaky "Ground Truth"

Ultimately, if we want to achieve a deeper appreciation of what it is about the stimuli that might be driving the performance data, we need to take a step back to the core theoretical assumptions that underlie most FER technologies – assumptions that go back to the Facial Action Coding System and the presupposition that:

  • There are certain basic emotions that are found across cultures and contexts
  • There are universals in the way these emotions are expressed
  • These universals can be captured by decomposing a facial expression into a set of discrete muscle movements, or “action units”

These assumptions are hardly academic. In the case of FERs, they drive decisions about which tasks an FER needs to accomplish and which are never even considered because they’re simply not assumed to be relevant. They also set expectations about what, in the data available to an FER algorithm, will allow it to form a bridge to the underlying, hidden emotional states the algorithm is meant to detect. And, because FERs develop their skills by learning on training data, the assumptions further dictate which stimuli are selected for training and how they are designed, which then becomes yet one more way for the assumptions to get incorporated into the FERs themselves.

The last point is particularly noteworthy. To state the obvious, FERs do not really detect emotion at all; they merely perform a classification task in which they estimate the probability that a given emotion is associated with a particular facial expression based on the configuration of physical features that characterizes that expression. To acquire this ability, FERs need to be trained with stimuli that contain not only images of facial expressions, but also labels for the emotions that are ostensibly associated with those expressions. These labels don’t come out of nowhere; they come from a critical step in which a human judge acts as the arbiter of the emotions underlying the expressions. Sometimes the labels are obtained from judges who are naïve to the FACS coding system [42,43]; rarely, they may be provided by the very people whose expressions appear in the stimulus sets [44,45]. Yet, more often than not, the labels are the work of trained raters and stimulus developers who are explicitly guided by FACS and its associated theory of universal emotion [20,21,46-49] – the same theory that is invoked, and then relied upon, to develop stimulus sets in which expressions are posed under careful FACS-based instructions (e.g., [46,49]). Once the labels are developed, they become a “ground truth” – a gospel of sorts against which a human or FER is judged to be accurate when asked to classify the stimuli into one or more emotion categories. In effect, the labels come to be treated as if they were an objective reflection of the emotions underlying the expressions, and it is quietly forgotten that they are the product of a set of inferences that is as much at a distance from the presumed underlying emotions as are the probability estimates generated by the FERs.

But if the labels are, themselves, based on an inference, then what precisely is the validity of the labels?

To answer this question, some stimulus developers argue that their labels are valid to the extent that they correspond to the judgments of an independent group of raters who are asked to rate the stimuli without any special instruction or guidance [19,20,22,23,50,51]. Yet, these tests only work to show whether the yardstick applied by the people who developed the labels happens to be similar to the one used by everyday people when they rely on their native faculties to classify the facial expressions in the stimulus sets. The results of those tests are what we see in the bottom panel of Figure 1 – and, as that figure shows, the agreement isn’t bad, but it isn’t exactly stellar, either, dropping off for certain emotions, and becoming harder to secure the more naturalistic the expressions become. That isn’t necessarily a major problem to the extent that the discrepancies reflect reasonable differences in the methods by which the two groups arrive at their judgments – but it can portend a real problem if it’s a function of the members of either group struggling to agree even amongst themselves about what the stimuli mean.
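
Quantifying that within-group agreement is straightforward; here is a minimal sketch, using a small hypothetical rater-by-stimulus matrix rather than any real dataset’s annotations, of two simple ways to do it.

```python
# Quantifying label noise among raters (hypothetical labels; rows = raters, columns = stimuli).
import itertools
import numpy as np
from sklearn.metrics import cohen_kappa_score

labels = np.array([
    [0, 1, 2, 3, 0, 4, 1, 2],
    [0, 1, 2, 5, 0, 4, 3, 2],
    [0, 3, 2, 3, 0, 1, 1, 2],
])

# How often does each rating match the per-stimulus majority label?
majority = np.array([np.bincount(col).argmax() for col in labels.T])
print("agreement with majority label:", round((labels == majority).mean(), 2))

# Chance-corrected pairwise agreement (Cohen's kappa) averaged over rater pairs.
kappas = [cohen_kappa_score(labels[i], labels[j])
          for i, j in itertools.combinations(range(len(labels)), 2)]
print("mean pairwise kappa:", round(float(np.mean(kappas)), 2))
```

Numbers like these are what sit behind the agreement figures discussed next.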

Yet, it turns out that such disagreements may, in fact, be quite common, yielding labels that are likely quite noisy owing to the difficulties even formal raters have in arriving at a clear, mutual understanding of the emotions that underlie the expressions in the stimulus sets. An example appears in the purple line for AffectNet (Fig. 1), which reflects the interrater agreement of trained coders who were tasked with creating the labels for the images in this set [21]. Rarely does the agreement exceed 70%, and the overall pattern points to challenges with negative emotions that generally bedevil people even when they’re asked to classify posed, static expressions for which hit rates tend to be much higher. Similarly modest levels of agreement have been observed elsewhere (e.g., [47,50,51]), leading some FER developers to treat it as a nuisance that should be dealt with either by tightening up the methods for stimulus coding [42] or by applying machine-based maneuvers for “correcting” the labels at some point prior to or during FER training [40,47]. But this conveniently overlooks the fact that, if the labels are extremely noisy, they need to be treated like a rubber ruler – there may be something quite fundamentally wrong with them beyond a weak signal that happens to live amid a slew of random noise. Absent a more rigorous test of the signal and what it means, efforts to prop up the labels can end up being little more than a way of saying that the labels must be valid simply because the FER developers say that they are – an attitude that requires considerable faith in the theory on which the stimulus development process is predicated. Yet that faith can be hard to justify once the human performance data are carefully considered – for if the theory underlying the development of these stimuli is so good, then why do human judges, whether acting as formal raters or as naïve test subjects in an expression classification study, fail to perform with these stimuli at a level the theory would predict?

The answer, alas, is that the theory may not be that good – certainly not to the extent that the creators of FACS and many others take for granted. As one remarkably thorough review published in 2019 helps illustrate, the problems are several, ranging from cultural differences in the emotional meanings ascribed to facial expressions that then get reflected in the way emotional expressions are posed or interpreted, to developmental trajectories that, unlike expressions of general positive and negative states, show clear effects of experience and learning on the way action units get organized around specific emotions during infancy and childhood [7]. Running through all this is the more basic insight that people’s expressions likely aren’t wired to their emotional experiences the way tuning a radio to 98.3 FM will always bring in whatever is on 98.3 in whatever area you happen to be in, static and signal strength notwithstanding. On the contrary, they appear to be highly context-dependent, being sensitive to social rules, situation-specific goals, and cultural understandings that can cause any one emotion to be expressed many different ways, and any one expression to serve multiple meanings, beyond what the theory of universal emotion expressions can fully accommodate. This reality is almost entirely excluded from the assumptions underlying FACS and most FERs, and it is certainly without parallel in the development of most facial expression stimuli given their tendency toward extreme decontextualization – yet it is an intuitively obvious reality to almost anyone who self-regulates their emotions or is tasked with making sense of another person’s expressions in the real world.

To see just how obvious it is, compare someone trying to rate the images in an FER stimulus set to a real-world observer who, over a 2-3 second interval, notices someone put on a brief smile accompanied by a quick, slight breath out of their nose, a brief raise and lowering of the eyebrows, and a tilt of the head slightly to the side and downward with eye gaze moving in the same direction. How does the observer make sense of the bit of body language they’ve just witnessed?

  • Is the smile a smile of recognition – a reaction, perhaps, to hearing someone say something that they knew to be true all along?
  • Is it a minor bit of amusement in response to an okay-but-otherwise-dopey joke?
  • Is it actually a negative reaction – for instance, a sense of amazed disbelief in response to having been insulted (essentially, body language for “incredible – what a jerk”)?

To arrive at a proper inference, the observer can’t just look at the face in isolation; they need to be aware of the context in which the behavior occurred and have a proper understanding of how the combination of contextual cues and person characteristics likely gives rise to a particular emotion and way of expressing it. Thus, if, in our example, the aforementioned smile occurred following an insult, and if the observer has reason to believe both that the insult was likely out of line and that the target of the insult is not someone who’s likely to indulge an outburst, then the observer will have good reason to believe that the body language was an expression of disbelief, not of recognition or amusement – and they’d probably be in a good position to bet money on it without taking on much risk.

Yet, notice the chain of reasoning that goes into this judgment. It includes more than just an inference about the person’s emotional state; it also includes a set of predicate inferences about the beliefs, desires, and behavioral inclinations of both the insult-giver and the insult-taker that help the observer understand what the latter is feeling in the moment. These inferences aren’t foolproof; they’re the product of knowledge about social categories and mores, perspective-taking processes, person representations, and causal inference processes that can be systematically biased or insufficiently informed. Yet, they exist as critical components of the observer’s understanding about the other person’s emotions, for a good reason: They reflect the reality that what people feel, and how they express it, is a function of mental processes and contextual cues that we naturally take for granted, but that have no analog in the way FER algorithms and their stimulus sets are developed. And even if additional contextual features were adequately captured in an FER’s training and test sets, they’d simply fall by the wayside as the algorithms wouldn’t be designed to take that information into account to begin with.

That FER algorithms may miss the boat by failing to consider context is something that has already been acknowledged within the FER community, and it has given rise to attempts to develop emotion recognition technologies that go beyond facial expressions to leverage much more sophisticated combinations of behavioral and contextual information [52-55]. Yet, none of this changes the fact that most of the FERs developed to date rely on a theory of emotion that constrains both their architecture and the stimuli on which they are trained [9] – the latter to the potential detriment of the validity of the foundation on which these FERs rest. That isn’t trivial; it may very well introduce a kink in the FER development chain that dampens the validity of the FERs themselves as trustworthy tools for inferring real-world emotional states. In that sense, talking about FER accuracy rates can become a bit meaningless, as what the FERs are really doing is trying to match their predictions to the way stimuli with limited contextualization are rated by people who must make judgments about them absent the cues they would normally use to infer other people’s emotions – in some cases, under the explicit influence of a theory that does not have a place for context in emotion expression. (That omission may explain the patterns we see in the figures above, as it turns out that, of the basic emotions, it’s the smiles associated with happiness that tend to have the tightest universal relationship to their presumed emotional meaning [7] – and even they need proper contextualization, as our little example demonstrates.) The resulting FERs could achieve average accuracies above 95% and it wouldn’t detract from the fact that the predictions may have limited validity owing to dubious assumptions about the quality of the relationship between the emotion labels in the stimulus sets and the emotions that, in the real world, would be connected to the expressions captured in those images.

The Takeaway: With AI for Behavioral Insight, It's the Validity of the Presumed Ground Truth That Matters

What’s perhaps so striking about the FER example is that it involves the application of ML and deep learning to something that’s seemingly quite simple and straightforward. We’re not talking about trying to make inferences from a large corpus of incredibly complex data the way, e.g., ChatGPT learns the style of Mark Twain from digesting the canon of Western literature, and we wouldn’t think that face-reading would be so challenging that it would depend upon many orders of magnitude more in complexity to successfully automate. The number of emotional experiences, though quite varied and culture-bound, is still relatively finite, dimensional theories notwithstanding; so, too, are the ways in which they are likely to be expressed. And, as noted even in the 2019 review cited earlier, there are plenty of instances where, when it comes to their emotional meaning, a smile really is a smile, and a frown really is a frown [7].

Yet, if we return to the emerging AI-based technologies that originally motivated this post – the ones that claim to accurately infer even deeper, more complicated behavioral phenomena and drivers from whatever data they are fed – and we consider that these tools must necessarily stand in the same relationship to those hidden phenomena as FERs do to emotions and the facial expression stimuli on which the models are trained, then what do we think might be the hope for these AI-based tools, given what we just discussed about FER?

In fairness, the answer can’t be generated in a vacuum; much depends on what it is one wishes to detect with these methods, how they go about it, and the data on which the methods are developed. Some behavioral characteristics may lend themselves more readily to capture with well-validated measurement methods or show up more reliably in behavioral signatures that are easy to spot; others may only reveal themselves in far more complex or distal data configurations that, themselves, need to be carefully vetted for their relationship to the characteristics one is trying to detect before they can be treated as any form of “ground truth”. To give an example, consider a psychological disposition such as “need for closure” that already has validated measures available for it [56] which could conceivably be administered to people in a digital environment where their interactions with a diverse array of apps and experiences could then be tracked. These data, when pooled together, could very well be used to train an algorithm to distinguish high and low need for closure users based on their behaviors in the observed contexts, and the classifier would at least be potentially valid to the extent that the behavioral signatures are reliable and correlated with indices derived from the validated measures (a minimal sketch of this approach appears below). Now contrast that, though, with the case of a bias such as “inattentional blindness” that could plausibly be reflected in some piece of data (e.g., something someone says about their behavior in a particular past context or is observed to do in that context), but which is inferred strictly through analyst judgment without the benefit of clearly face-valid linguistic markers or the type of diagnostic information that would normally be needed to confirm the presence of the bias. These conjectural inferences might be good enough for everyday qualitative analysis or for post-hoc interpretation of behavioral patterns in a secondary dataset, but they’d hardly be a solid basis for training an ostensibly-valid algorithm to detect the bias in a larger corpus of data absent other evidence to justify the inclusion of those inferences as part of the algorithm’s training. (Note that this problem wouldn’t affect an algorithm that’s developed strictly to uncover hidden if-then, situation-behavior patterns in behavioral data, as the algorithm would simply need to be built to perform based on what’s clearly manifest in the data, unencumbered by the incorporation of inferences that may or may not be plagued with dubious validity.)
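
To illustrate the first, more defensible path, here is a minimal sketch – with randomly generated stand-in data and a hypothetical feature set – of how one might anchor a behavioral classifier to a validated need-for-closure measure and then check whether the model’s outputs actually track the scale.

```python
# Anchoring a behavioral classifier to a validated scale (all data are random stand-ins).
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users = 500
behavior_features = rng.normal(size=(n_users, 12))  # e.g., tracked app-interaction metrics
nfc_scores = rng.normal(size=n_users)               # scores on a validated need-for-closure scale

# Label users as high vs. low need for closure via a median split of the validated scale.
high_nfc = (nfc_scores > np.median(nfc_scores)).astype(int)

X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    behavior_features, high_nfc, nfc_scores, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred_prob = clf.predict_proba(X_te)[:, 1]

# Two checks: holdout discrimination, and whether model output tracks the continuous scale.
# (With random stand-ins there is no real signal; the point is the validation workflow.)
rho, _ = spearmanr(pred_prob, s_te)
print("holdout AUC:", round(roc_auc_score(y_te, pred_prob), 2))
print("Spearman rho vs. scale scores:", round(rho, 2))
```

The Spearman check is the part that does the anchoring work: it asks whether the model’s output still tracks the validated measure it was derived from, rather than treating the classifier’s own labels as self-certifying.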

In addition to considering what an AI tool is meant to detect, we also need to consider when and where the tool will be used before we can select the standard against which to evaluate it. Again, consider an FER which may be quite poor at detecting certain emotional experiences and have limited generalizability beyond the canned, canonical facial expressions on which it’s trained. That may all be true, but if the goal is simply to have some read on a person’s emotional reactions as they are doing something in the real world (e.g., scanning content on an app), and if it’s the strong, unequivocal, positive reactions that are going to matter the most, then, in that case, the shortcomings of the technology might not make that much difference relative to the gain it would yield in providing some kind of real-time emotion detection in the moments where the technology actually gets it right. The same could also be said for the use of AI in real-world behavioral intervention delivery, where an AI tool could do a perfectly fine job of correctly predicting what action to take when and with whom to encourage a particular behavior (e.g., taking medication as prescribed), yet provide little real insight into how it’s making those predictions, even if it seems to be basing them off of some behavioral principle regarding human motivation, cognition, habit formation, and so on.

Yet, these cases are all quite different from one in which the purpose of the tool isn’t to provide a crude measurement or to support effective intervention tailoring, but to be an engine for behavioral insight that’s supposedly reliable and valid, to be used the way assessments of attitudes, motivational orientations, cognitive styles, and potentially more transient phenomena, such as a tendency toward omission bias in a particular decision context, can be used to tell us about the hidden inclinations and psychological processes that characterize an individual or group of people, dispositionally or otherwise. In that case, the burden of proof isn’t satisfied by accuracy rates alone; it’s satisfied by a disclosure regarding how these technologies work and proof that the way they have been architected and trained makes them valid detectors of the phenomena they purport to surface. These disclosures aren’t unreasonable to expect; they’re, in fact, quite appropriate to ask about when you consider how quickly people may defer to AI-based technologies precisely because they make use of fancy algorithms that are trained on massive amounts of data and are ostensibly exempt from the foibles of everyday human thinking. And that inquiry is particularly justifiable when you consider how these tools may be promoted with claims about detecting behavioral phenomena – claims that leverage the aura of behavioral science but give little consideration to how the phenomena they target may evince themselves in very complex behavioral signatures that require considerable care to properly surface and interpret (and even then, with caveats).

If a provider of these technologies can give you that evidence, then great – go for it! Until then, think carefully about why you'd want to make use of these tools – and caveat emptor.

References (Where We Got All of This)

  1. Clark, R.A., & Goldsmith, R.E. (2005). Market mavens: Psychological influences. Psychology & Marketing, 22, 289-31. https://doi.org/10.1002/mar.20060.
  2. Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of Cognition and Emotion (pp. 45-60). John Wiley & Sons, Ltd. https://doi.org/10.1002/0470013494.ch3.
  3. Ekman, P., & Cordaro, D. (2011). What is meant by calling emotions basic? Emotion Review, 3, 364-370. https://doi.org/10.1177/1754073911410740.
  4. Cohn, J.F., Ambadar, Z., & Ekman, P. (2007). Observer-based measurement of facial expression with the Facial Action Coding System. In J.A. Coan & J.J.B. Allen (Eds.), Handbook of Emotion Elicitation and Assessment (pp. 203-221). Oxford University Press.
  5. Ekman, P., & Friesen, W.V. (1976). Measuring facial movement. Environmental Psychology and Nonverbal Behavior, 1, 56-75. https://doi.org/10.1007/BF01115465.
  6. Clark, E.A., Kessinger, J., Duncan, S.E., Ball, M.A., Lahane, J., Gallagher, D.L., & O’Keefe, S.F. (2020). The Facial Action Coding System for characterization of human affective response to consumer product-related stimuli: A systematic review. Frontiers in Psychology, 11, 920. https://doi.org/10.3389/fpsyg.2020.00920.
  7. Feldman Barrett, L., Adolphs, R., Marsella, S., Martinez, A., & Pollak, S.D. (2019). Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20, 1-68. https://doi.org/10.1177/1529100619832930.
  8. Li, S. & Deng, W. (2020). Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing, 13, 1195-1215. https://doi.org/10.1109/TAFFC.2020.2981446.
  9. Masson, A., Cazenave, G., Tombini, J., & Batt, M (2020). The current challenges of automatic recognition of facial expressions: A systematic review. AI Communications, 33, 113-138. https://doi.org/10.3233/AIC-200631.
  10. Ammanath, B. (2021). Facial recognition: Here’s looking at you. Deloitte AI Institute report. Accessed 8/12/2024. https://www2.deloitte.com/us/en/pages/consulting/articles/facial-recognition.html
  11. Budenbender, B., Hofling, T.T.A., Gerdes, A.B.M., & Alpers, G.W. (2023). Training machine learning algorithms for automatic facial coding: The role of emotional facial expressions’ prototypicality. PlosOne, 18(2), e0281309. https://doi.org/10.1371/journal.pone.0281309.
  12. Dupre, D., Krumhuber, E.G., Kuster, D., & McKeown, G.J. (2020). A performance comparison of eight commercially available automatic classifiers for facial affect recognition. PlosOne, 15(4), e0231968. https://doi.org/10.1371/journal.pone.0231968.
  13. Krumhuber, E.G., Kuster, D., Namba, S., & Skora, L. (2021). Human and machine validation of 14 databases of dynamic facial expressions. Behavior Research Methods, 53, 686-701. https://doi.org/10.3758/s13428-020-01443-y.
  14. Kuntzler, T., Hofling, T.T.A., & Alpers, G.W. (2021). Automatic facial expression recognition and non-standardized emotional expressions. Frontiers in Psychology, 12, 627561. https://doi.org/10.3389/fpsyg.2021.627561.
  15. Li, Y., Yeh, S., & Huang, T.R. (2023). The cross-race effect in automatic facial expression recognition violates measurement invariance. Frontiers in Psychology, 14, 1201145. https://doi.org/10.3389/fpsyg.2023.1201145.
  16. Skiendziel, T., Rosch, A.G., & Schultheiss, O.C. (2019). Assessing the convergent validity between the automated emotion recognition software Noldus FaceReader 7 and Facial Action Coding System scoring. PlosOne, 14(10), e0223905. https://doi.org/10.1371/journal.pone.0223905.
  17. Stockli, S., Schulte-Mecklenbeck, M., Borer, S., & Samson, A.C. (2018). Facial expression analysis with AFFDEX and FACET: A validation study. Behavior Research Methods, 50, 1446-1460. https://doi.org/10.3758/s13428-017-0996-1.
  18. Calvo, M.G., & Lundqvist, D. (2008). Facial expressions of emotion (KDEF): Identification under different display-duration conditions. Behavior Research Methods, 40, 109-115. https://doi.org/10.3758/BRM.40.1.109.
  19. Goeleven, E., De Raedt, R., Leyman, L., & Verschuere, B. (2008). The Karolinska Directed Emotional Faces: A validation study. Cognition and Emotion, 22, 1094-1118. https://doi.org/10.1080/02699930701626582.
  20. Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, H.J., Hawk, S.T., & van Knippenberg, A. (2010). Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24, 1377-1388. https://doi.org/10.1080/02699930903485076.
  21. Mollahosseini, A., Hasani, B., & Mahoor, M.H. (2017). AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10, 18-31. https://doi.org/10.1109/TAFFC.2017.2740923.
  22. Olszanowski, M., Pochwatko, G., Kuklinski, K., Scibor-Rylski, M., Lewinski, P., & Ohme, R.K. (2015). Warsaw set of emotional facial expression pictures: A validation study of facial display photographs. Frontiers in Psychology, 5, 1516. https://doi.org/10.3389/fpsyg.2014.01516.
  23. Wingenbach, T.S.H., Ashwin, C., & Bronson, M. (2016). Validation of the Amsterdam Dynamic Facial Expression Set – Bath Intensity Variations (ADFES-BIV): A set of videos expressing low, intermediate, and high intensity emotions. PlosOne, 11(1), e0147112. https://doi.org/10.1371/journal.pone.0147112.
  24. Calder, A.J., Young, A.W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 527-551. https://doi.org/10.1037/0096-1523.26.2.527.
  25. Palermo, R., & Coltheart, M. (2004). Photographs of facial expression: Accuracy, response times, and ratings of intensity. Behavioral Research Methods, Instruments, & Computers, 36, 634-638. https://doi.org/10.3758/BF03206544.
  26. Recio, G., Schacht, A., & Sommer, W. (2013). Classification of dynamic facial expressions of emotion presented briefly. Cognition and Emotion, 27, 1486-1494. https://dx.doi.org/10.1080/02699931.2013.794128.
  27. Ali Manzen, F.M., Nashat, A.A., Abdel, R.A., & Seoud, A.A. (2021). Real time face expression recognition along with balanced FER2013 dataset using CycleGAN. International Journal of Advanced Computer Science and Applications, 12. https://doi.org/10.14569/IJACSA.2021.0120617.
  28. Antoniadis, P., Filntisis, P.P., & Maragos, P. (2021). Exploiting emotional dependencies with graph convoluted networks for facial expression recognition [Paper presentation]. 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG-2021), Jodhpur, India. https://doi.org/10.1109/FG52635.2021.9667014.
  29. Chen, A., Xing, H., & Wang, F. (2020). A facial expression recognition method using deep convolutional neural networks based on edge computing. IEEE Access, 8, 49741-49751. https://doi.org/10.1109/ACCESS.2020.2980060.
  30. Ding, H., Zhou, P., & Chellappa, R. (2020). Occlusion-adaptive deep network for robust facial expression recognition [Paper presentation]. IEEE International Joint Conference on Biometrics (IJCB), Houston, TX. https://doi.org/10.1109/IJCB48548.2020.9304923.
  31. Fard, A., & Mahoor, M.H. (2022). Ad-Corre: Adaptive correlation-based loss for facial expression recognition in the wild. IEEE Access, 10, 26756-26768. https://doi.org/10.1109/ACCESS.2022.3156598.
  32. Farzaneh, A.H., & Qi, X. (2021). Facial expression recognition in the wild via deep attentive center loss [Paper presentation]. IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI. https://doi.org/10.1109/WACV48630.2021.00245.
  33. Jiang, J., & Deng, W. (2022). Disentangling identity and pose for facial expression recognition. IEEE Transactions on Affective Computing, 13, 1868-1878. https://doi.org/10.1109/TAFFC.2022.3197761.
  34. Khaireddin, Y., & Chen, Y., (2021). Facial emotion recognition: State of the art performance on FER2013. arXiv, 2105.03588v1. https://doi.org/10.48550/arXiv.2105.03588.
  35. Li, Y., Wang, M., Gong, M., Lu, Y., & Liu, L. (2023). FER-former: Multi-modal transformer for facial expression recognition. arXiv, 2303.12997v1. https://doi.org/10.48550/arXiv.2303.12997.
  36. Kollias, D., Cheng, S., Ververas, E., Kotsia, I., & Zafeiriou, S. (2020). Deep neural network augmentation: Generating faces for affect analysis. International Journal of Computer Vision, 128, 1455-1484. https://doi.org/10.1007/s11263-020-01304-3.
  37. Kumar, V., Rao, S., & Yu, L. (2020). Noisy student training using body language dataset improves facial expression recognition. In A. Bartoli & A. Fusiello (Eds.), Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science (vol. 12535). Springer, Cham.
  38. Pham, L., Vu, T.H., & Tran, T.A. (2021). Facial expression recognition using residual masking network [Paper presentation]. 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy. https://doi.org/10.1109/ICPR48806.2021.9411919.
  39. Rajan, S., Chenniappan, P., Devaraj, S., & Madian, N. (2020). Novel deep learning model for facial expression recognition based on maximum boosted CNN and LSTM. IET Image Processing, 14, 1373-1381. https://doi.org/10.1049/iet-ipr.2019.1188.
  40. Vo, T., Lee, G., Yang, H., Kim, S. (2020). Pyramid with super-resolution for in-the-wild facial expression recognition. IEEE Access, 8, 131988-132001. https://doi.org/10.1109/ACCESS.2020.3010018.
  41. Johnson, J.M., & Khoshgoftaar, T.M. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6, 27. https://doi.org/10.1186/s40537-019-0192-5.
  42. Barsoum, E., Zhang, C., Canton Ferrer, C., & Zhang, Z. (2016). Training deep networks for facial expression recognition with crowd-sourced label distribution. arXiv, 1608.01041v2. https://doi.org/10.48550/arXiv.1608.01041.
  43. Roy, S., Roy, C., Ethier-Majcher, C., Fortin, I., Belin, P., & Gosselin, F. (2007). STOIC: A database of dynamic and static faces expressing highly recognizable emotions. Unpublished manuscript, University of Montreal. http://mapageweb.umontreal.ca/gosselif/sroyetal_sub.pdf.
  44. Sneddon, I., McRorie, M., McKeown, G., & Hanratty, J. (2012). The Belfast Induced Natural Emotion Database. IEEE Transactions on Affective Computing, 3, 32-41. https://doi.org/10.1109/T-AFFC.2011.26.
  45. Tcherkassof, A., Dupre, D., Meillon, B., Mandra, N., Dubois, M., & Adam, J. (2013). DynEmo: A video database of natural facial expressions of emotions. International Journal of Multimedia & Its Applications, 5, 61-80. https://doi.org/10.5121/ijma.2013.5505.
  46. Cosker, D., Krumhuber, E., & Hilton, A. (2011). A FACS valid 3D dynamic action unit database with applications to 3D dynamic morphable facial modeling [Paper presentation]. 2011 International Conference on Computer Vision, Barcelona, Spain. https://doi.org/10.1109/ICCV.2011.6126510.
  47. Li, S., Deng, W., & Du, J. (2017). Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild [Paper presentation]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI. https://doi.org/10.1109/CVPR.2017.277.
  48. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., & Ambadar, Z. (2010). The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specific expression [Paper presentation]. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops, San Francisco, CA. https://doi.org/10.1109/CVPRW.2010.5543262.
  49. Yin, L., Chen, X., Sun, Y., Worm, T., & Reale, M. (2008). A high-resolution 3D dynamic facial expression database [Paper presentation]. 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, Netherlands. https://doi.org/10.1109/AFGR.2008.4813324.
  50. Banziger, T., Mortillaro, M., & Scherer, K.R. (2012). Introducing the Geneva Multimodal Expression Corpus for Experimental Research on Emotion Perception. Emotion, 12, 1161-1179. https://doi.org/10.1037/a0025827.
  51. Kaulard, K., Cunningham, D.W., Bulthoff, H.H., & Wallraven, C. (2012). The MPI facial expression database – a validated database of emotional and conversational facial expressions. PlosOne, 7(3), e32321. https://doi.org/10.1371/journal.pone.0032321.
  52. Barakova, E.I., Gorbunov, R., & Rauterberg, M. (2015). Automatic interpretation of affective facial expressions in the context of interpersonal interaction. IEEE Transactions on Human-Machine Systems, 45, 1-10. https://doi.org/10.1109/THMS.2015.2419259.
  53. Limami, F., Hdioud, B., & Thami, R.O.H. (2024). Contextual emotion detection in images using deep learning. Frontiers in Artificial Intelligence, 7, 1386753. https://doi.org/10.3389/frai.2024.1386753.
  54. Shin, S., Kim, D.Y., & Wallraven, C. (2022, November 7-11). Contextual modulation of affect: Comparing humans and deep neural networks [Paper presentation]. ICMI ’22 Companion, Bengaluru, India.
  55. Yang, V., Srivastava, A., Etesam, Y., Zhang, C., & Lim, A. (2023). Contextual emotion estimation from image captions [Paper presentation]. 11th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, MA.
  56. Roets, A., & Van Hiel, A. (2011). Item selection and validation of a brief, 15-item version of the Need for Closure Scale. Personality and Individual Differences, 50, 90-94. https://doi.org/10.1016/j.paid.2010.09.004.

© 2024 Grey Matter Behavioral Sciences, LLC. All rights reserved.