The Subtlety of Variations

Similar models trained on the same data can give very different results in real-world scenarios. Here’s why that is a problem.

With artificial intelligence now at the forefront of technology, the past few years have seen quite a transformation. AI and machine learning have gone from being ‘up-and-coming’ technologies to some of the most widely used in the world today. Yet this has come with its caveats. As with almost any new technology, there is often a gap between laboratory results and real-life performance. The more we apply AI technologies to real-life situations, the more issues we encounter. The challenge for now, and for the immediate future, is to mitigate these issues so that we can reap the benefits of AI for years to come.

Is today’s AI ‘fundamentally flawed’?

An article featured in the MIT Technology Review recently called into question the way we train our AI models, calling the process ‘fundamentally flawed’ and citing a major gap between lab-trained models and real-life results. In laboratory settings, machine learning models are often tuned and tweaked to near-perfect performance, which makes their failure in real-life situations all the more alarming.

This failure is most often put down to a phenomenon called ‘data shift’: a mismatch between the data that the AI was trained and tested on in laboratory conditions and the data it encounters in real-life situations. An example is Google Medical AI, an AI suite designed to spot signs of disease in high-quality medical images. While it performed admirably with the high-quality images provided in training, it would often struggle with the cropped or blurry images from cheap cameras that it received in real-life settings, such as a busy clinic.
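To make data shift concrete, here is a minimal Python sketch. It is a toy illustration and not the Google system: the single-number ‘image feature’, the noise levels, and all function names are assumptions made purely for this example. A simple threshold classifier is fitted on clean ‘lab’ data and then scored on both clean data and much noisier ‘clinic’ data.

```python
import random

def sample(n, rng, noise):
    """Draw n (feature, label) pairs: the feature is -1 or +1 (by class) plus Gaussian noise."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 1.0 if label == 1 else -1.0
        data.append((centre + rng.gauss(0.0, noise), label))
    return data

def fit_threshold(data):
    """Place the decision threshold halfway between the two class means."""
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def accuracy(threshold, data):
    """Fraction of examples where 'feature above threshold' matches label 1."""
    correct = sum(1 for x, y in data if (x > threshold) == (y == 1))
    return correct / len(data)

rng = random.Random(42)
lab_train = sample(1000, rng, noise=0.3)    # clean training data
lab_test = sample(1000, rng, noise=0.3)     # clean test data, same distribution
clinic_test = sample(1000, rng, noise=1.5)  # hypothetical noisy real-world data

threshold = fit_threshold(lab_train)
lab_accuracy = accuracy(threshold, lab_test)
clinic_accuracy = accuracy(threshold, clinic_test)
print(f"lab accuracy:    {lab_accuracy:.3f}")
print(f"clinic accuracy: {clinic_accuracy:.3f}")
```

The classifier itself never changes between the two scores. Only the data does, and that alone is enough to pull the accuracy down.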

Keyword: Underspecification

Apart from this, researchers from Google have identified another major cause of failure in machine learning models: ‘underspecification’, an issue said to be even more problematic and widespread than data shift. Underspecification essentially refers to a situation in which an observed effect could have many possible causes, so that it cannot be attributed to any one factor, or in which a model depends on several factors that are not accounted for. Having found underspecification in a range of AI applications, from natural language processing (NLP) to image recognition to disease prediction, the researchers are now looking to make the conditions for a model’s viability more stringent.

The researchers studied the impact of underspecification on different applications through a wide range of experiments. In each case, the same training process was used to produce multiple models, which were then put through stress tests to highlight the differences between them. In one of the experiments, 50 different versions of an image recognition model were trained on images of everyday objects. While all of them performed more or less the same, and more or less equally accurately, in the standard test, their stress test performances turned out to be vastly different.

“The stress test used ImageNet-C, a dataset of images from ImageNet that have been pixelated or had their brightness and contrast altered, and ObjectNet, a dataset of images of everyday objects in unusual poses, such as chairs on their backs, upside-down teapots, and T-shirts hanging from hooks. Some of the 50 models did well with pixelated images, some did well with the unusual poses; some did much better overall than others. But as far as the standard training process was concerned, they were all the same.” (MIT Tech Review)

And this is the primary cause for concern: models developed in the same way and with similar test performance, which should ideally produce similar results, performed vastly differently when fed real-world data.

In statistical machine learning, a model is first trained on a large ‘training’ data set and then tested on examples it has not seen, which are presumed to follow similar-enough patterns for the test to be representative. There are certain statistical criteria a model must pass and, upon doing so, it is said to be good to go. New research, however, challenges this, arguing that the bar is too low.

According to the MIT Tech Review, “The training process can produce many different models that all pass the test but—and this is the crucial part—these models will differ in small, arbitrary ways, depending on things like the random values given to the nodes in a neural network before training starts, the way training data is selected or represented, the number of training runs, and so on. These small, often random, differences are typically overlooked if they don’t affect how a model does on the test. But it turns out they can lead to huge variation in performance in the real world.

“In other words, the process used to build most machine-learning models today cannot tell which models will work in the real world and which ones won’t.”
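The effect described in the quote can be reproduced on a very small scale. The Python sketch below is a toy under stated assumptions and not the researchers’ experiment: the two-feature linear model, the data generator, and every name in it are hypothetical. In the ‘lab’ data the second feature is an exact copy of the first, so any pair of weights that sums to 1 fits perfectly, and the random starting weights decide which pair training ends up with. In the ‘stress’ data the second feature is independent of the first.

```python
import random

def make_data(n, rng, correlated):
    """Return (x1, x2, y) rows where y depends only on x1.

    correlated=True:  x2 is an exact copy of x1 (lab data).
    correlated=False: x2 is drawn independently (stand-in for real-world data).
    """
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        x2 = x1 if correlated else rng.uniform(-1.0, 1.0)
        rows.append((x1, x2, x1))
    return rows

def train(rows, seed, steps=500, lr=0.1):
    """Fit y ~ w1*x1 + w2*x2 by full-batch gradient descent from a seeded random start."""
    rng = random.Random(seed)
    w1, w2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(rows)
    for _ in range(steps):
        g1 = g2 = 0.0
        for x1, x2, y in rows:
            err = w1 * x1 + w2 * x2 - y
            g1 += 2 * err * x1 / n
            g2 += 2 * err * x2 / n
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2

def mse(weights, rows):
    """Mean squared error of the linear model on the given rows."""
    w1, w2 = weights
    return sum((w1 * x1 + w2 * x2 - y) ** 2 for x1, x2, y in rows) / len(rows)

data_rng = random.Random(0)
train_rows = make_data(200, data_rng, correlated=True)
test_rows = make_data(200, data_rng, correlated=True)     # standard test
stress_rows = make_data(200, data_rng, correlated=False)  # stress test

# Same data, same training procedure; only the random initialisation differs.
models = [train(train_rows, seed) for seed in range(5)]
test_errors = [mse(m, test_rows) for m in models]
stress_errors = [mse(m, stress_rows) for m in models]
for seed, (t, s) in enumerate(zip(test_errors, stress_errors)):
    print(f"seed={seed}  test_mse={t:.6f}  stress_mse={s:.6f}")
```

Every seed passes the standard test with an error that is effectively zero, yet the stress test errors differ widely from one seed to the next. Nothing in the standard test can tell these models apart, which is the sense in which the training process is underspecified.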