Data access is missing or delayed, and 81% want synthetic data now

Admin, August 21, 2023


(NewsUSA) – Companies are looking to jump on the AI hype to build their own ChatGPTs and improve processes with artificial intelligence. Yet, a recent internal survey by a global telecommunications company found that accessing the relevant data that would serve as the backbone of any AI engine is a top challenge. Eighty percent of respondents say they don’t have fast enough access to data, with typical lead times ranging from one to six months.

In another public survey conducted by synthetic data company MOSTLY AI, 28% of AI and machine learning developers cited a lack of data access as the reason for failed AI projects. Privacy and governance issues were cited in 35% of the cases. Some companies are reluctant to embrace generative AI because of data privacy issues, and understandably so. It’s a bit of a catch-22 situation. Companies can’t use AI models developed by others because they want to protect their data. They can’t develop their own AI models either due to data privacy concerns and possible future regulatory issues. Does that mean the AI revolution is over before it even started?

Far from it. While some of the most aggressive players, like OpenAI and Microsoft, forge ahead despite the data privacy and intellectual property issues flagged by regulators, artists, and consumers alike, others look for tools that enable them to develop AI engines with privacy and fairness in mind. A new generation of so-called privacy-enhancing technologies, or PETs, is offering new ways to get data.

PETs could save the day

Synthetic data generation stands out as one of the most advanced and innovative privacy-enhancing technologies (PETs) available today. By harnessing the power of synthetic data, businesses can overcome the hurdles posed by privacy concerns and data protection regulations. Artificially generated with intelligence gleaned from sample datasets, synthetic data provides a valuable resource for companies seeking to optimize their AI engines and improve processes through artificial intelligence.

With the aid of synthetic data, companies can gain unprecedented insights into the correlations, trends, and patterns present in their customers’ data without risking their privacy. Synthetic generators create proxies that faithfully mimic real data, allowing for in-depth analysis and experimentation without compromising the privacy of individual users. This unique capability empowers businesses to explore novel solutions and even use data to fine-tune their AI models, enhancing the accuracy and reliability of their predictions and decision-making processes.

Synthetic data also serves as a powerful tool for data augmentation, effectively addressing imbalances found in real-world datasets, such as those related to gender or race. By employing synthetic data in tandem with genuine data, companies can rectify biases and promote fairness in their AI applications, paving the way for more equitable outcomes.
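To make the augmentation idea concrete, here is a minimal Python sketch. It is not MOSTLY AI's method: it assumes a made-up dataset with hypothetical `gender` and `income` columns and stands in for a real generator with a simple per-group bell curve. The hypothetical helper `augment_minority` tops up the smaller group with synthetic rows until both groups are the same size.

```python
import random
from collections import Counter
from statistics import mean, stdev


def augment_minority(rows, group_key, value_key, seed=0):
    """Add synthetic rows to smaller groups until every group matches the largest.

    Illustrative assumption: each group's values follow a normal distribution.
    A production generator would model all columns jointly instead.
    """
    rng = random.Random(seed)
    by_group = {}
    for row in rows:
        by_group.setdefault(row[group_key], []).append(row[value_key])

    target = max(len(values) for values in by_group.values())
    synthetic = []
    for group, values in by_group.items():
        mu, sigma = mean(values), stdev(values)
        for _ in range(target - len(values)):
            synthetic.append(
                {group_key: group, value_key: rng.gauss(mu, sigma), "synthetic": True}
            )
    # Return a new list so the original rows are left untouched.
    return rows + synthetic


# Made-up, imbalanced "real" data: 80 records for one group, 20 for the other.
data_rng = random.Random(42)
real = [{"gender": "m", "income": data_rng.gauss(50_000, 8_000)} for _ in range(80)]
real += [{"gender": "f", "income": data_rng.gauss(52_000, 8_000)} for _ in range(20)]

balanced = augment_minority(real, "gender", "income")
counts = Counter(row["gender"] for row in balanced)
print(counts)
```

The synthetic rows are flagged so they can be told apart from genuine records, which matters when the two are used in tandem.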

Over 80% of those working with data report wanting to work with synthetic data that matches the patterns of real data without the privacy-sensitive data points. Similarly, 71% of respondents in the State of Synthetic Data Survey agreed that synthetic data is the missing piece of the puzzle required for AI/ML projects to succeed.

Seventy-two percent of respondents plan to use an AI-powered synthetic data generator within the next few years, and almost 40% plan to use one in the next three months. And 46% of practitioners cite data augmentation – a process capable of fixing imbalances of real data, such as gender or race imbalances – as their main use case. Although excitement is high, the survey also highlighted a heightened need for educating the data community about the benefits, limitations, and use cases of synthetic data.

Misconceptions about synthetic data are widespread, even among AI/ML experts

There is still a lot of confusion around the term “synthetic data”; 59% of respondents didn’t know the difference between dummy mock data and intelligent synthetic data. This suggests that synthetic data companies have a huge responsibility to educate data consumers and to help them learn firsthand what it’s like to work with synthetic versions of real datasets.
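The difference between mock data and synthetic data can be shown in a few lines of Python. The sketch below rests on stated assumptions: the "real" data is made up, with hypothetical `age` and `spend` columns, and the "generator" is a deliberately simple two-variable Gaussian fit rather than the AI-based models the survey refers to. Mock data fills each column independently, so the relationship between columns disappears; the fitted generator reproduces it.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_and_sample(xs, ys, n, rng):
    """Fit means, spreads, and correlation, then sample n new correlated pairs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / len(xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / len(ys))
    r = pearson(xs, ys)
    out_x, out_y = [], []
    for _ in range(n):
        zx, zy = rng.gauss(0, 1), rng.gauss(0, 1)
        out_x.append(mx + sx * zx)
        # Mix the two noise terms so the pair has correlation r.
        out_y.append(my + sy * (r * zx + math.sqrt(1 - r * r) * zy))
    return out_x, out_y


rng = random.Random(7)

# Made-up "real" data in which spend depends strongly on age.
age = [rng.gauss(40, 10) for _ in range(2000)]
spend = [2 * a + rng.gauss(0, 5) for a in age]

# Dummy mock data: plausible-looking values, each column filled independently.
mock_age = [rng.uniform(18, 80) for _ in age]
mock_spend = [rng.uniform(min(spend), max(spend)) for _ in spend]

# Simple synthetic data: new records sampled from a model fitted to the real data.
syn_age, syn_spend = fit_and_sample(age, spend, 2000, rng)

real_corr = pearson(age, spend)
mock_corr = pearson(mock_age, mock_spend)
syn_corr = pearson(syn_age, syn_spend)
print(f"real {real_corr:.2f}  mock {mock_corr:.2f}  synthetic {syn_corr:.2f}")
```

None of the synthetic records is a copy of a real one, yet an analyst looking at the synthetic columns would see roughly the same age-to-spend relationship as in the original.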

Tobi Hann, MOSTLY AI’s CEO, said, “We have to educate people big time. Since we work with synthetic data day in and day out, we take a lot of related knowledge for granted, and only when conversations get to a deeper level do we realize that sometimes even engineers have fundamental misunderstandings about the way synthetic data generation works and the use cases it is capable of solving. Our number one priority is to get people hands-on with synthetic data technology, so they really learn the capabilities in their day-to-day tasks and might even discover new ways of working with synthetic data that we didn’t think about.”

The synthetic data potential

When asked about the most frequently used data anonymization tools and techniques, 49% of respondents said that they use data masking to anonymize data. Twenty percent said they simply remove personal information from datasets – an approach that is not only unsafe from a privacy perspective, but can also destroy data utility needed for high-quality training data. Privacy-enhancing technologies, like homomorphic encryption, AI-generated synthetic data, and others, account for 31%.
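Why is simply removing personal information unsafe? A small Python sketch illustrates one well-known reason: the remaining columns can still single people out. The records, column names, and the helper functions `drop_direct_identifiers` and `unique_on` are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Made-up records; "name" is the only direct identifier.
records = [
    {"name": "Alice", "zip": "10001", "birth_year": 1985, "gender": "f", "diagnosis": "asthma"},
    {"name": "Bob", "zip": "10001", "birth_year": 1985, "gender": "m", "diagnosis": "diabetes"},
    {"name": "Carol", "zip": "10002", "birth_year": 1990, "gender": "f", "diagnosis": "migraine"},
    {"name": "Dan", "zip": "10003", "birth_year": 1972, "gender": "m", "diagnosis": "asthma"},
    {"name": "Frank", "zip": "10003", "birth_year": 1972, "gender": "m", "diagnosis": "flu"},
]


def drop_direct_identifiers(rows, identifiers=("name",)):
    """The 'simply remove personal information' approach: delete the obvious columns."""
    return [{k: v for k, v in row.items() if k not in identifiers} for row in rows]


def unique_on(rows, quasi_identifiers):
    """Return the rows whose combination of quasi-identifiers appears exactly once."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]


stripped = drop_direct_identifiers(records)
at_risk = unique_on(stripped, ("zip", "birth_year", "gender"))
print(f"{len(at_risk)} of {len(stripped)} records are still unique")
```

Anyone who knows a person's ZIP code, birth year, and gender from another source could match a unique record back to that person and read off the diagnosis, even though the name column is gone.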

Without a doubt, the AI revolution is underway, and it is in all of our interests to make sure that companies develop these data-hungry beasts in privacy-respecting ways. Synthetic data – due to its malleable and shareable nature – is also the key to AI explainability. When AI creates or predicts something, it has no insight into why it decided to do what it did. The concept of explainable AI tries to open this black box by giving access not only to the model but also to the data that was used to train the system. Synthetic training data functions as the window into the souls of algorithms, allowing developers, regulators, and even consumers to understand how the model came to the conclusion it did.

Algorithmic decisions have affected our lives before, but with the recent AI breakthroughs, cases like a woman being given less credit on the basis of her gender by a biased algorithm or a man getting arrested because a facial recognition algorithm misidentified him are on the rise. It turns out that data access is important not only to those who create AI but also to those who are subjected to its decisions. By leveraging synthetic data, businesses can embark on a path of responsible AI development, where data privacy is preserved, biases are mitigated, and AI models are thoroughly understood and trusted. Embracing synthetic data represents a critical step towards realizing the full potential of AI while upholding ethical standards and ensuring a positive impact on society.

About MOSTLY AI

MOSTLY AI is the pioneering leader in the creation of structured synthetic data. It enables anyone to generate high-quality, production-like synthetic data for smarter AI and smarter testing. Synthetic data teams at Fortune 100 companies and others can originate, amend, and share datasets in ways that overcome the ethical challenges of using real, anonymized, or dummy data. AI-generated synthetic data is private, provides a reduction in time-to-data, and puts more machine learning models into production. For press inquiries, email hello@mostly.ai.