Synthetic data in AI

Synthetic data: fake is the new real

Digitally created synthetic data helps train AI models even for unprecedented conditions. 


In brief

  • Synthetic data is a solution to the time- and resource-heavy issues of gathering real-world data.
  • It allows customization, avoids privacy concerns and speeds up product testing.
  • Synthetic data can simulate the permutations and combinations that real data sets miss, especially edge cases.

A leading US-based tech company that makes AI-based chips for automobiles wanted to train its deep-learning networks for autonomous vehicle applications, such as object detection, safety monitoring, lane keeping and parking. Getting quality data (and avoiding poor data) from live tests on real cars and roads and analyzing that data would have meant gathering and curating thousands of pictures. Even that would have been inadequate, and expensive. The team found an alternative. It created thousands of data sets using micro-simulations of cars driving on virtual streets modeled on real-world data, such as road conditions, vehicle specifications, weather, and even dangerous or rare conditions. 

Many organizations are turning to synthetic data, which is data generated digitally using algorithms. Collecting and labeling high-quality data from the real world takes considerable time and resources, and real data is often unavailable, inconsistent or biased. Moreover, real data sets may not contain all possible permutations and combinations, especially for edge cases. Synthetic data solves these problems to a great extent.
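The point about permutations and edge cases can be made concrete with a small sketch. It assumes a driving scenario is described by four hypothetical parameters (weather, lighting, road surface and event), none of which come from the article, and enumerates every combination so that rare conditions are covered deliberately rather than by chance.

```python
from itertools import product

# Hypothetical scenario parameters, invented for illustration only.
WEATHER = ["clear", "rain", "fog", "snow"]
LIGHTING = ["day", "dusk", "night"]
ROAD = ["dry", "wet", "icy"]
EVENT = ["none", "pedestrian_crossing", "stalled_vehicle"]


def is_edge_case(scenario):
    """Flag rare, risky combinations: poor visibility at night with an obstacle."""
    return (
        scenario["weather"] in {"fog", "snow"}
        and scenario["lighting"] == "night"
        and scenario["event"] != "none"
    )


# Every combination of the four parameters becomes one scenario to simulate.
scenarios = [
    {"weather": w, "lighting": l, "road": r, "event": e}
    for w, l, r, e in product(WEATHER, LIGHTING, ROAD, EVENT)
]
edge_cases = [s for s in scenarios if is_edge_case(s)]

print(len(scenarios), "scenarios,", len(edge_cases), "flagged as edge cases")
```

A real simulator would turn each scenario into rendered images or sensor traces. The enumeration only shows why generated data can cover combinations that live road tests would rarely encounter.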
 

Synthetic data replicates the statistical characteristics and patterns of an existing data set by modeling its probability distribution and sampling from it, which is how it retains the predictive power of the real data. It can also be generated for unseen conditions and events. Synthetic data can train AI models, test systems and build prototypes when actual data sets lack quality, volume or variety. It allows customization, avoids privacy concerns and speeds up product testing.
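As a minimal sketch of "modeling the probability distribution and sampling from it", the example below fits a two-dimensional Gaussian to a small numeric data set and draws new rows from it. The Gaussian is an assumption chosen for simplicity, not the method used by any vendor mentioned here, and the age and income columns are hypothetical stand-ins for real data.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate the means, variances and covariance of two numeric columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy


def sample_gaussian(params, count, rng):
    """Draw new rows from the fitted two-dimensional Gaussian."""
    mean_x, mean_y, var_x, var_y, cov_xy = params
    # Cholesky factor of the 2x2 covariance matrix, so samples keep the correlation.
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        rows.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return rows


rng = random.Random(0)

# Hypothetical stand-in for a real data set: (age, income in thousands).
real_rows = []
for _ in range(500):
    age = rng.uniform(20, 60)
    income = 20 + 1.5 * age + rng.gauss(0, 5)
    real_rows.append((age, income))

params = fit_gaussian(real_rows)
synthetic_rows = sample_gaussian(params, 500, rng)
synthetic_params = fit_gaussian(synthetic_rows)

print("real mean age:", round(params[0], 1))
print("synthetic mean age:", round(synthetic_params[0], 1))
```

The synthetic rows match the means and the correlation of the original data but correspond to no individual record. Production tools use far richer models, yet the fit-then-sample pattern is the same.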
 

Generated either from real data sets or created using existing models, synthetic data is propelling AI. According to Gartner, by 2024, 60% of the data used to develop AI and analytics projects will be synthetically generated.
 

Generating data
 

Two simultaneous trends are driving the demand for synthetic data. The need for large amounts of clean data to build and train AI/ML models keeps growing, while generating high-quality synthetic data has become simpler. In addition, data privacy rules are increasing the need for de-identified data to create large aggregate databases that can be used for more accurate analytics and AI models. Research groups, the AI/ML community, businesses and government agencies are adopting different data synthesis techniques to support model building, application development and data dissemination.
 

Generally, a synthetic data set that includes binary, numerical and categorical data, or unstructured data like images and video, is generated by deep learning generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and diffusion models. Lately, transformer models are also gaining popularity in natural language processing. Users can choose data that is either fully synthetic or partially synthetic (where only the sensitive information from the original data is hidden). Compared to rule-based test data, synthetic test data is easier to generate, which also reduces the cost of generating training data.
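The difference between fully and partially synthetic data can be sketched in a few lines. The example below is a simplified illustration: the record fields and the `partially_synthesize` helper are hypothetical, and sensitive fields are replaced with generated placeholder values while everything else is copied unchanged.

```python
import random


def partially_synthesize(records, sensitive_fields, rng):
    """Replace only the sensitive fields; copy every other field unchanged."""
    synthetic = []
    for index, record in enumerate(records):
        new_record = dict(record)  # copy, so the original record is not modified
        for field in sensitive_fields:
            if field in new_record:
                # Row index plus a random suffix: unique per row and
                # unrelated to the original value.
                suffix = rng.randrange(10_000)
                new_record[field] = f"{field}-{index:04d}-{suffix:04d}"
        synthetic.append(new_record)
    return synthetic


# Hypothetical records; field names and values are invented for illustration.
records = [
    {"name": "Alice Example", "patient_id": "P-1001", "age": 54, "diagnosis": "asthma"},
    {"name": "Bob Example", "patient_id": "P-1002", "age": 37, "diagnosis": "diabetes"},
]

masked = partially_synthesize(records, ["name", "patient_id"], random.Random(42))
for row in masked:
    print(row)
```

Replacing direct identifiers alone does not guarantee privacy, because the remaining fields may still allow re-identification when combined. That is one reason fully synthetic data, where every value is sampled from a model, is sometimes preferred.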

Providing insights: from healthcare to fashion
 

Data generated through simulation environments allows users to conduct “what if” analyses and design new test scenarios. This is particularly useful when no real data is available. During the COVID-19 pandemic, many of the AI models that healthcare professionals and researchers used required advanced computation. Researchers used large quantities of synthetic data that was based on actual patient data but not directly derived from individual records. Synthetic data was also used to study the spread and impact of the pandemic over time across densely tested geographic areas.
 

Use cases are emerging in various sectors, like financial services, software testing, pharmaceuticals, manufacturing and distribution, retail, fashion and others. For instance, banks and financial services companies can use synthetic data to evaluate potential market behaviors, design algorithms for more equitable loan distribution, combat financial fraud and make new products and services.
 

In the pharmaceutical industry, synthetic data is useful when handling large but sensitive samples, where regulatory restrictions and data privacy are challenges. It enables faster and better trials as well as cross-border research.
 

In agriculture, digitally generated data can help develop computer vision applications for crop yield prediction, crop disease detection, fruit identification and plant growth prediction.
 

Natural language processing is an area where synthetic data is used widely, especially for training virtual voice assistants. In manufacturing, synthetic data is used to train AI/ML models for industrial robots, enabling factory automation and letting robots perform complex tasks on the production line. In retail, artificially generated data sets can train AI for autonomous check-out systems and cashier-less stores, and can help study customer demographics. Apart from these, advanced ML models trained on synthetic data help e-commerce companies improve warehousing and inventory management.
 

Synthetic data has multiple use cases and solves many of the problems associated with real-world data. It is, however, not a one-stop solution. There are significant risks and limitations, since the quality of the data generated largely depends on the quality of the model that created it. This means that biases can persist and the data can become obsolete quickly. Even so, advances in synthetic data generation will boost the accuracy of ML models and accelerate AI. Used with due caution, synthetic data has the potential to make software more trustworthy and to transform the economics of data.

Summary

Synthetic data generated using algorithms can train AI models, test systems, and build prototypes when actual data sets lack quality, volume, and/or variety. It has the same predictive power as real data and is currently propelling AI and solving many of the problems associated with real-world data. There are multiple use cases across sectors, including financial services, automobiles, retail, healthcare, and pharmaceuticals. The challenge is that the quality of synthetic data generated depends on the quality of the model that created it.
