A Synthetic Solution

How the use of Synthetic Data could prove to be a saving grace for AI

The economic disruption caused by the COVID-19 pandemic hurt many global industries, perhaps none more so than the autonomous vehicle (AV) industry. With most fleets grounded, there were far fewer opportunities to accrue data over real-world miles and use it to refine internal mechanics and perception capabilities. This had the potential to massively impede the AV industry's progress toward greater levels of autonomy. Yet, it did not.

AV executives turned instead to greater use of synthetic data and simulation, as seen in high-profile tie-ups with game engine maker Unity and in Waymo’s ‘Carcraft’ simulation platform. According to tech news outlet VentureBeat, “Synthetic video datasets leveraging advanced gaming engines can be created with hyper-realistic imagery to portray all the possible eventualities in an autonomous driving scenario, whereas trying to shoot photos or videos of the real world to capture all these events would be impractical, maybe impossible, and likely dangerous. These synthetic datasets can dramatically speed up and improve training of autonomous driving systems.”

Image: Synthetic images being used to train autonomous vehicles; Source: US-based synthetic data platform Parallel Domain

This forced pivot not only proved to be a major boon to the industry but also facilitated the timely transition from real-world data to computer-generated synthetic data.

Choose wisely: Choose Synthetic?

It is not just the AV industry, of course. Synthetic data, according to many, may even have the potential to ‘save AI’ in all its applications. This comes at a time when AI has started facing some critical challenges: the requirement of huge swathes of data to ensure (i) accuracy, (ii) unbiasedness and (iii) compliance with varied global data privacy regulations. According to VentureBeat, “we have seen several solutions proposed over the last couple of years to address these challenges — including various tools designed to identify and reduce bias, tools that anonymize user data, and programs to ensure that data is only collected with user consent. But each of these solutions is facing challenges of its own.”

In such a scenario, the emerging market trend of synthetic data usage, i.e. artificial, computer-generated data that stands in for real-world data, could be the answer. The synthetic dataset that replaces a real-world one must, of course, have mathematical and statistical attributes similar to those of the real-world set, essentially acting as a sort of statistically reflective data mirror. This allows AI to be trained in completely digital environments, thereby widening the range of use cases in fields from healthcare and retail to finance, agriculture and transportation.
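
To make the "data mirror" idea concrete, here is a minimal Python sketch of one way such a check could look. It is an illustration under stated assumptions, not a description of any vendor's method: the "real" measurements are themselves simulated, a single bell-curve (Gaussian) fit stands in for a real generator, and the function names fit_gaussian, sample_synthetic and fidelity_report are hypothetical.

```python
import random
import statistics


def fit_gaussian(values):
    # Summarise the real data by its mean and standard deviation.
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(mu, sigma, n, seed=0):
    # Draw brand-new records from the fitted distribution; no real record is copied.
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def fidelity_report(real, synthetic):
    # Compare simple summary statistics of the two datasets.
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }


# Hypothetical "real" measurements (simulated here purely for illustration).
source = random.Random(42)
real = [source.gauss(50.0, 10.0) for _ in range(5000)]

mu, sigma = fit_gaussian(real)
synthetic = sample_synthetic(mu, sigma, 5000, seed=7)
report = fidelity_report(real, synthetic)
print(report)
```

Small gaps in the report suggest that the synthetic set mirrors the real one on these two statistics. A production pipeline would presumably compare much richer properties, such as correlations between columns and the behaviour of rare cases.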

Almost ironically, synthetic data development makes use of the same generative adversarial networks (GANs) that are used in the production of deepfake videos: “one network generates the synthetic data and the second tries to detect if it is real. This is operated in a loop, with the generator network improving the quality of the data until the discriminator cannot tell the difference between real and synthetic.”
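
The generator-versus-discriminator loop described above can be sketched in a few lines of plain Python. This is a deliberately tiny toy, not how production GANs are built: the "real" data is a single simulated number stream, the generator is a straight line (a * z + b), the discriminator is a one-variable logistic classifier, and every name and setting (train_toy_gan, the learning rate, the step count) is an assumption made for illustration. A discriminator this simple cannot be expected to capture the full shape of the data, and whether the loop settles depends on those settings.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=500, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: x_fake = a * z + b
    w, c = 0.1, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]  # simulated "real" data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += (d - 1.0) * x
            gc += d - 1.0
        for x in fake:
            d = sigmoid(w * x + c)
            gw += d * x
            gc += d
        w -= lr * gw / batch
        c -= lr * gc / batch

        # Generator step: minimise -log D(fake), i.e. try to fool the discriminator.
        ga = gb = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            ga += (d - 1.0) * w * z
            gb += (d - 1.0) * w
        a -= lr * ga / batch
        b -= lr * gb / batch
    return a, b, w, c


a, b, w, c = train_toy_gan()
sampler = random.Random(1)
samples = [a * sampler.gauss(0.0, 1.0) + b for _ in range(5)]
print(samples)
```

Each pass through the loop does what the quotation describes: the discriminator is first nudged to separate real from generated values, and the generator is then nudged in whatever direction makes its output look more "real" to that discriminator.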

The Roaring (Synthetic) Data Train

According to June 2020 research from Austria-based start-up research agency StartUs Insights, more than 50 vendors worldwide have already started developing synthetic data solutions. These include nascent firms such as (i) Israeli start-up DataGen, which provides “a sophisticated, photorealistic 3D reconstruction of human hands, face, body, and eyes” to aid training algorithms in medicine and biology; (ii) UK-based start-up Hazy, which focuses on fraud detection in the fintech industry; (iii) Spanish start-up ANYVERSE, which creates synthetic datasets of sensor-specific data for use in image processing functions and custom LiDAR settings for the automotive industry; and (iv) US-based Cvedia, which works on photo-realistic, high-fidelity simulators that generate “entropic scenes, conditions, and metadata to enable real-time simulations”, among several others.

The major advantage of synthetic data is that it promises to push the boundaries of AI development without any major downsides. Not only does it protect real-world personal data and correct the biases that are often ingrained in that data, it also works well in industries where data is otherwise difficult to obtain, making it (possibly) a major saving grace for several global big data applications and firms as well.