Many organizations are turning to synthetic data: data generated digitally using algorithms. Collecting and labeling high-quality real-world data is time- and resource-intensive, and real data is often unavailable, inconsistent or biased. Moreover, real data sets may not contain all possible permutations and combinations, especially for edge cases. Synthetic data solves these problems to a great extent.
Synthetic data replicates the statistical characteristics and patterns of an existing data set by modeling its probability distribution and sampling from it, which gives it much the same predictive power as real data. It can also be generated for unseen conditions and events. Synthetic data can train AI models, test systems and build prototypes when actual data sets lack quality, volume or variety. It allows customization, avoids privacy concerns and enables faster turnaround for product testing.
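As a minimal sketch of that "model the distribution, then sample from it" idea (not a description of any specific vendor tool), the Python snippet below fits a Gaussian mixture model to a hypothetical two-column numeric data set and draws synthetic records from it. The feature values, seed and number of mixture components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical "real" data set: two correlated numeric features
# (stand-ins for, say, transaction amount and account age).
# In practice this would be loaded from an actual source.
rng = np.random.default_rng(seed=42)
real_data = rng.multivariate_normal(
    mean=[100.0, 5.0],
    cov=[[400.0, 30.0], [30.0, 9.0]],
    size=10_000,
)

# Fit a generative model of the data's probability distribution.
gmm = GaussianMixture(n_components=5, random_state=42)
gmm.fit(real_data)

# Sample new, synthetic records from the fitted distribution.
synthetic_data, _ = gmm.sample(n_samples=10_000)

# Sanity check: the synthetic set should mirror the real set's
# statistical characteristics.
print("Real means:     ", real_data.mean(axis=0))
print("Synthetic means:", synthetic_data.mean(axis=0))
```

The sampled records are new points drawn from the learned distribution rather than copies of real rows, which is why such data can stand in for the original in training and testing.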
Generated from real data sets or created using existing models, synthetic data is propelling AI. According to Gartner, by 2024, 60% of the data used to develop AI and analytics projects will be synthetically generated.
Generating data
Two simultaneous trends are driving the demand for synthetic data. Building and training AI/ML models requires large amounts of clean data, and generating high-quality synthetic data has become simpler. At the same time, data privacy rules are driving up the need for de-identified data that can be combined into large aggregate databases for more accurate analytics and AI models. The AI/ML community, businesses and government agencies are adopting different data synthesis approaches to support model building, application development and data dissemination.
Generally, a synthetic data set that includes binary, numerical and categorical data, or unstructured data such as images and video, is generated by deep learning generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and diffusion models. Lately, transformer models are also gaining popularity in natural language processing. Users can choose data that is either fully or partially synthetic (where only the sensitive information from the original data is replaced). Compared with rule-based test data, synthetic test data is easier to generate, which also reduces the cost of producing training data.
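To make the GAN approach concrete, here is a deliberately tiny PyTorch sketch, under assumptions chosen for brevity: the "real" data is a one-dimensional Normal(3, 1) distribution, and both networks are small multilayer perceptrons. Production generators for tabular data or images are far more elaborate; this only shows the adversarial training loop itself.

```python
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM, BATCH = 8, 1, 64

# Generator maps random noise to synthetic samples;
# discriminator outputs a real-vs-fake logit.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, DATA_DIM)
)
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 32), nn.ReLU(), nn.Linear(32, 1)
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Assumed "real" data: samples from Normal(3, 1).
    real = torch.randn(BATCH, DATA_DIM) + 3.0
    fake = generator(torch.randn(BATCH, LATENT_DIM))

    # Discriminator step: label real as 1, fake as 0.
    opt_d.zero_grad()
    d_loss = (loss_fn(discriminator(real), torch.ones(BATCH, 1))
              + loss_fn(discriminator(fake.detach()), torch.zeros(BATCH, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make fakes be labeled as real.
    opt_g.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(BATCH, 1))
    g_loss.backward()
    opt_g.step()

# The trained generator now emits synthetic samples whose mean and
# spread should approximate the real distribution's (roughly 3 and 1).
with torch.no_grad():
    samples = generator(torch.randn(1000, LATENT_DIM))
print(samples.mean().item(), samples.std().item())
```

VAEs and diffusion models fill the same role with different training objectives: a VAE learns a compressed latent representation and decodes new samples from it, while a diffusion model learns to reverse a gradual noising process.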