Companies Could Run Out of Data to Train AI, We've Gathered Some Solutions

2023.02.06 01:41 PM By Universal Creative Solutions

Language models are a burgeoning area of AI research. Competition is heating up to release programs like OpenAI's ChatGPT, which can produce coherent articles and computer code with remarkable fluency. However, there is a looming concern that threatens to impede the progress of this field: the potential depletion of data used to train these models.


Language models are trained using texts sourced from various data sets such as scientific papers, news articles, books, and Wikipedia. Researchers aim to create more versatile and accurate models, so they have been using increasingly larger amounts of data. According to a recent study by researchers from Epoch, the data used for training AI programs may be exhausted as early as 2026. This means that as researchers continue to build more advanced models, they will need to find new sources of data or make their models more efficient and effective with the data sources they have.

Potential Solutions for Training AI with Limited Data 

The limited availability of training data can hinder the potential value of artificial intelligence, especially in niche focus areas. Running out of data to train an artificial intelligence (AI) model is a common challenge for many companies, especially startups. However, there are solutions that can improve the training process. One is to find creative ways to get more out of the same data, improving a model's efficiency and its ability to produce better results. Another is to use synthetic data: data that applies to a situation but is not obtained by direct measurement.

Improving Results with Existing Training Data 

There are several ways that AI programmers can improve the efficiency and effectiveness of their models without enlarging the training data set. One approach is transfer learning, which allows a model trained on one task to be applied to a different but related task. This can significantly reduce the amount of data required to train a new model, since the model can leverage the knowledge it has already acquired.
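
To make the idea concrete, here is a minimal, hypothetical sketch of transfer learning in Python: a "pretrained" feature extractor stands in for frozen layers (here it is just a fixed random projection, purely for illustration), and only a small linear head is fit on the new task's limited data.

```python
import numpy as np

# Hypothetical transfer-learning sketch: a frozen "pretrained" feature
# extractor (a fixed random projection standing in for layers trained on a
# large source task) plus a small linear head fit on scarce new-task data.
rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(10, 4))  # stands in for pretrained weights

def extract_features(x):
    """Frozen feature extractor -- never updated on the new task."""
    return np.tanh(x @ W_frozen)

# Tiny labeled dataset for the new task (too small to train from scratch).
X_new = rng.normal(size=(20, 10))
y_new = rng.normal(size=20)

# Only the lightweight head is trained (ordinary least squares here).
head, *_ = np.linalg.lstsq(extract_features(X_new), y_new, rcond=None)
predictions = extract_features(X_new) @ head
```

In a real setting the frozen weights would come from a model trained on a large source task; the key point is that only the small head needs fitting, which is why so little new data suffices.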

Another approach is to use different model architectures and optimization algorithms to improve the performance of the model. For example, using a more complex model architecture, such as a deep neural network or a transformer, can help the model capture complex patterns in the data. Additionally, advanced optimization techniques such as adaptive learning rates or momentum-based gradient descent can help fine-tune the model to better fit the data.
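
As an illustrative sketch (not a recommendation of any particular optimizer), the toy example below contrasts a fixed learning rate with a simplified Adagrad-style adaptive step on a one-dimensional quadratic:

```python
# Toy comparison of a fixed learning rate vs. a simplified Adagrad-style
# adaptive step on the quadratic loss (w - 3)^2. Illustrative only.
def loss_grad(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

lr = 0.1
w_fixed, w_adaptive, grad_sq_sum = 0.0, 0.0, 0.0
for _ in range(100):
    w_fixed -= lr * loss_grad(w_fixed)  # plain gradient descent

    g = loss_grad(w_adaptive)
    grad_sq_sum += g * g                # Adagrad: track gradient history
    w_adaptive -= lr * g / (grad_sq_sum ** 0.5 + 1e-8)

# Both walk toward the minimum at w = 3; the adaptive step shrinks as
# gradients accumulate, which helps on noisier, poorly scaled problems.
```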

Researchers believe it may be possible to train AI models multiple times using the same data by fine-tuning the model’s parameters based on the specific task or data it is being applied to. This can help to improve the model’s performance by allowing it to learn from the data more effectively and adapt to different scenarios. The following are specific approaches to improving results using the same training data. 

  • Feature engineering: The process of identifying and selecting relevant features in the data to train the model, with the goal of achieving good performance with less data. It involves transforming raw inputs into meaningful features that can be used as input to an AI algorithm or model, which helps extract useful information from large datasets without increasing their size.
  • Model architecture: The structure of the neural network can also affect the model’s performance. Researchers can experiment with different architectures, such as using deeper networks or using different types of layers, to improve the model’s ability to learn from the available data.
  • Regularization: A technique used to reduce overfitting in machine learning models by adding a penalty for complexity to the cost function. This can help ensure that the model does not learn from noise or randomness in the training data. Two popular regularization techniques are L1 and L2 regularization, which can add a penalty term to the loss function to discourage the model from assigning too much weight to any one feature.
  • Data augmentation: This technique involves artificially increasing the size of the training dataset by applying various transformations to the existing data. An example of this can include flipping images, rotating them, and adding noise to expand the diversity of potential inputs the model may see in the real world.
  • Transfer learning: A technique in which a model trained on one task is used as a starting point for the model on a second task. This can be particularly useful when there is a limited amount of data available for the new task.
  • Ensemble methods: Combining predictions from multiple models to improve performance. This can be achieved by either averaging the predictions of multiple models or by training a separate model to make predictions based on the predictions of the individual models. This approach can help to improve accuracy and reduce errors without the need for additional training data.
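
To illustrate the data augmentation idea above, here is a small hypothetical sketch that turns a toy batch of image-like arrays into four times as many training examples using flips, rotations, and noise:

```python
import numpy as np

# Hypothetical augmentation sketch: flips, rotations, and noise turn one
# toy batch of image-like arrays into four times as many examples.
rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))  # 8 fake grayscale "images"

def augment(batch, rng):
    out = [batch]
    out.append(batch[:, :, ::-1])                     # horizontal flip
    out.append(np.rot90(batch, k=1, axes=(1, 2)))     # 90-degree rotation
    out.append(batch + rng.normal(scale=0.05, size=batch.shape))  # noise
    return np.concatenate(out, axis=0)

augmented = augment(images, rng)  # 32 examples from 8 originals
```

Which transformations are safe depends on the task: a horizontal flip is harmless for street scenes but would corrupt a digit-recognition dataset.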

Using Synthetic Data to Train AI

Synthetic data is computer-generated data that is designed to mimic the characteristics of real data. It is often used when collecting real-world data is too costly, time-consuming, difficult, or unethical. Synthetic data can also be used to create realistic test scenarios, enabling organizations to efficiently and effectively test their applications without compromising security or privacy. This data can be generated using a variety of techniques, including computer simulations, generative models, and statistical methods. It can also be used to augment or supplement real-world data, for example, to increase the diversity of training data or to create additional examples of rare events.
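
As a toy illustration of the statistical approach, the sketch below fits a simple Gaussian to a small "real" sample and then draws an arbitrary number of synthetic rows from it (all numbers are invented for the example):

```python
import numpy as np

# Toy "statistical methods" sketch: fit a Gaussian to a small real sample,
# then draw unlimited synthetic rows that mimic its statistics. The data
# here is invented purely for illustration.
rng = np.random.default_rng(42)
real = rng.normal(loc=[170.0, 70.0], scale=[8.0, 12.0], size=(50, 2))

mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Synthetic rows share the real data's statistics, not its actual records.
synthetic = rng.multivariate_normal(mean, cov, size=500)
```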


Waymo, Alphabet’s autonomous driving company, is a clear example of improved results from synthetic data. Instead of relying solely on its real-world vehicles, Waymo created a completely fabricated world where simulated cars with sensors drive around endlessly, collecting data as they go. According to the company, it has collected data on about 15 billion miles of simulated driving, compared with just 20 million miles of real-world driving. Essentially, companies generate fake data so these machines can learn about the real world at a faster pace than existing data sets would allow.


Machine-learning algorithms are currently not as adaptable as human intelligence, and applying them to a new problem generally requires new training data specific to that situation. Startups building artificial intelligence software don’t always have access to millions of data points like the major players in artificial intelligence. While there are free or affordable data sets from universities and other public institutions, a company may also need to build the data it uses to train its algorithms, especially when the available data doesn’t align with the model’s requirements to produce the intended outcomes.


To incorporate synthetic data into your AI models, we recommend the following process. It’s also important to note that synthetic data generation is an iterative process, and these steps should be repeated multiple times until the desired performance is achieved.

1. Data generation: Synthetic data must be generated using a variety of techniques, such as simulation, bootstrapping, or generative models.

2. Data cleaning and preprocessing: Like any other data, synthetic data must be cleaned and preprocessed before it can be used for training. This may include removing outliers, normalizing or scaling the data, and removing any irrelevant features.

3. Model selection and training: A suitable model for the task must be selected and trained using the synthetic data. This may include selecting the appropriate algorithm and tuning its hyperparameters.

4. Model evaluation: The trained model must be evaluated using metrics such as accuracy, precision, recall, and F1 score to ensure that it performs well on the synthetic data.

5. Model fine-tuning: If necessary, the model can be fine-tuned by adjusting its hyperparameters, adding regularization, or using ensemble methods.

6. Data merge, retrain, reevaluate, and fine-tune: Once you’re getting results from your synthetic data, merge it with real-world data sets and verify the outcomes before putting the model into production. Retraining, reevaluating, and fine-tuning with the merged data sets is vital to understanding how the synthetic data affects your real-world outcomes.

7. Beta testing: As with all programs, before the improvements reach full production, they should be beta tested by real-world users to ensure the outcomes are consistent with the tests and in line with expectations. If there’s a way to generate an unexpected outcome from a program, users have a tendency to stumble upon it, often by using the program in unexpected ways.

8. Deployment: Once the model has been trained, fine-tuned, and beta tested, it can be deployed in a production environment.

9. Governance: Even if the outcomes in deployment seem positive, it’s important to consistently monitor the program after deployment. This is even more vital for AI programs, as the world changes constantly and may no longer reflect the training data the program was built on.
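
The early steps above can be sketched end to end on a toy regression task. Everything here, the simulator, the model, and the metric, is a stand-in chosen for illustration, not a real production pipeline:

```python
import numpy as np

# Stand-in sketch of steps 1-6 on a toy regression task; the simulator,
# model, and metric are all placeholders, not a real pipeline.
rng = np.random.default_rng(7)
true_w = np.array([2.0, -1.0])

def simulate(n, noise):
    """Step 1: generate rows from a hand-built simulator."""
    X = rng.normal(size=(n, 2))
    return X, X @ true_w + rng.normal(scale=noise, size=n)

def fit(X, y):
    """Step 3: train -- ordinary least squares as the stand-in model."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(w, X, y):
    """Step 4: evaluate with mean squared error."""
    return float(np.mean((X @ w - y) ** 2))

X_syn, y_syn = simulate(200, noise=0.5)   # plentiful but noisier synthetic data
X_real, y_real = simulate(30, noise=0.1)  # scarce "real-world" data

w_syn = fit(X_syn, y_syn)
score_syn = mse(w_syn, X_real, y_real)    # synthetic-only baseline

# Step 6: merge synthetic and real data, retrain, and re-evaluate.
X_all = np.vstack([X_syn, X_real])
y_all = np.concatenate([y_syn, y_real])
w_all = fit(X_all, y_all)
score_all = mse(w_all, X_real, y_real)
```

The same shape, generate, train, evaluate, merge, retrain, carries over to real systems, with the toy pieces replaced by your actual generator and model.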

If you don’t regularly update the training data for an AI program, the model may become outdated, and its performance may decline. As real-world conditions or the data distribution change, the model may not generalize well to new data, resulting in poor performance on unseen examples. Additionally, if the model is used in a safety-critical or mission-critical application, not updating the training data can be a significant risk; it could lead to the model making incorrect decisions with serious consequences. Furthermore, the model may also be prone to bias and discrimination, as the training data may not reflect the diversity of the population it is intended to serve.

Why Fake Data Might Be Better

Synthetic data, however it’s produced, offers a number of concrete advantages over real-world data. First, it’s easier to collect more data, since you won’t have to wait on real-world collection timelines or inherit the biases in real-world data. Synthetic data can be programmed to include the label, so you won’t need labor-intensive data teams to label it. It can also help avoid copyright and privacy claims, because the data didn’t come from an individual who could sue the company. And finally, it can significantly reduce biased outcomes.
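
One of those advantages, labels that come for free, can be illustrated with a toy generator whose labeling rule is known by construction (the task and names here are hypothetical):

```python
import numpy as np

# Hypothetical sketch: a generator that knows its own labeling rule, so
# every synthetic example arrives pre-labeled with no human labeling pass.
rng = np.random.default_rng(3)

def generate_labeled(n):
    points = rng.uniform(-1, 1, size=(n, 2))
    labels = (points[:, 0] * points[:, 1] > 0).astype(int)  # rule is known
    return points, labels

X, y = generate_labeled(1000)  # 1,000 labeled examples, zero labeling cost
```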


With AI playing a larger role in society and technology, expectations around synthetic data are quite optimistic. Gartner has estimated that 60% of training data for AI will be synthetic by 2024. Market analyst Cognilytica valued the synthetic data generation market at $110 million and expects it to grow to $1.15 billion by 2027. Data is perhaps the most valuable commodity in today’s digital age. Big tech has sat on heaps of user data, giving it a unique advantage over smaller contenders in the AI space. Synthetic data is projected to give smaller players the opportunity to eventually turn the tables.


As you may suspect, the big question regarding synthetic data revolves around how closely it resembles real-world data. While the jury is still out, research does suggest that combining synthetic data with real data provides similar results. Researchers from MIT showed that an image classifier pre-trained on synthetic data along with real data performed just as well as an image classifier trained exclusively on real data.


The Potential Dangers of Synthetic Data

Unfortunately, there are specific dangers in either solution: changes or additions to the data can have the opposite of the intended effect. One potential danger of using synthetic data to train AI models is that the models may not generalize well to real-world situations. Because synthetic data is often generated by models or algorithms, it may not accurately reflect the complexity and diversity of real-world data. Additionally, synthetic data can be biased, either intentionally or unintentionally, leading to biased models that make unfair or inaccurate predictions.


Another potential danger is that synthetic data can be used to train models that cheat on benchmarks or evade detection. This can be a problem in safety-critical applications, such as self-driving cars, where models need to be robust and reliable. Using synthetic data can also lead to overfitting, which occurs when a model becomes too specialized to its training data and performs poorly on new data. This can happen when synthetic data is generated using the same model or algorithm that is being trained, creating a positive feedback loop that leads to models that are too complex and unreliable.


Another potential danger is that synthetic data may not have the same level of privacy and security as real-world data. This can be a problem in sensitive applications, such as medical research or financial analysis, where data privacy is critical. Moreover, synthetic data may not be representative of the real-world data distribution, which can cause models to perform poorly in real-world scenarios. This can happen when synthetic data is generated using a limited set of parameters or assumptions that do not accurately reflect the real-world data distribution.


Finally, while creating synthetic data may be more affordable than the time and effort required to collect real-world data, it demands specialized skills that your team may not have. Building synthetic data may not be a good use of your time if your company doesn’t need to do it regularly. There’s no shortage of money or vendors in the market to help optimize data for machine learning, be it data curation or data labeling; DefinedCrowd, for example, has raised $50.5 million to date to advance its data curation vision.

Mitigating Synthetic Data Challenges through Better Data Curation

Every machine learning model is bounded by one critical component: the quality of the data on which it is trained. When you’re using synthetic data, curation becomes even more vital to avoid the dangers mentioned above. Consider a scenario where you’re building a self-driving car. If your synthetic data only includes scenarios with perfect lighting conditions, the car will perform poorly in the real world with its ever-changing lighting conditions. It’s important to consider all possible scenarios that the AI model will be presented with to ensure a robust and dynamic model.
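
A simple curation safeguard along these lines is an automated coverage check that verifies every scenario bucket is represented before training. The bucket names and thresholds below are invented for illustration:

```python
import numpy as np

# Invented example of a pre-training coverage check: require every scenario
# bucket (here, lighting/weather conditions) to make up at least 3% of the
# synthetic dataset before it is allowed into training.
conditions = ["day", "dusk", "night", "rain", "fog"]  # hypothetical buckets
rng = np.random.default_rng(5)
dataset = rng.choice(conditions, size=10_000, p=[0.5, 0.2, 0.15, 0.1, 0.05])

counts = {c: int((dataset == c).sum()) for c in conditions}
min_share = 0.03  # arbitrary floor for this example

gaps = [c for c in conditions if counts[c] / len(dataset) < min_share]
if gaps:
    print(f"Underrepresented scenarios, regenerate data: {gaps}")
```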


A 2021 MIT research study found systemic issues in how training data was labeled, leading to inaccurate outcomes in AI systems. A study in the journal Quantitative Science Studies that analyzed 141 prior investigations into data labeling found that 41% of models were using datasets labeled by humans. While labeled data usually equates with improved results, datasets can — and do — contain errors. The processes used to build them are inherently error-prone, which becomes problematic when these errors reach test sets, the subsets of datasets researchers use to compare progress. These errors could lead scientists to draw incorrect conclusions about which models perform best in the real world, undermining benchmarks.


There are several approaches to preparing training data for machine learning. In the supervised approach, each example (say, an image) comes with an associated label used to teach the model, and a human is required to perform the labeling. Many companies, like Lightly, have taken a self-supervised approach to machine learning on AI-generated images. In that case, there often isn’t a need for human interaction, though it’s hard to be certain of the program’s outcomes without a validation process like those used in supervised approaches. This self-supervised model requires minimal human interaction, allowing the AI to curate and label data as it learns to discern patterns from the information it’s given.


Utilizing All Solutions Simultaneously Whenever Possible

As AI becomes increasingly widespread in businesses and industries, the demand for reliable training data to feed AI algorithms has become a crucial factor in its effectiveness. There are two primary methods to enhance AI results without using new real-world data: optimizing the efficiency of AI with existing data sets, or creating synthetic data sets.


Synthetic data is becoming popular because it can generate large datasets quickly and has the potential to capture patterns not found in real-world datasets, leading to higher accuracy. However, synthetic data also carries dangers, such as overfitting and bias. Proper curation can mitigate these challenges: ensure generated samples are diverse enough not to introduce bias into the models built on them, validate generated samples against known ground truths, and create validation pipelines or automated tests that measure the quality of generated datasets before they are used in production environments.
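
A minimal sketch of such a validation gate, assuming simple summary statistics are enough of a ground-truth check for the use case, might look like this:

```python
import numpy as np

# Sketch of an automated validation gate: compare summary statistics of a
# candidate synthetic batch against the real ("ground truth") sample before
# it is used for training. Tolerances here are arbitrary placeholders.
rng = np.random.default_rng(11)
real = rng.normal(loc=5.0, scale=2.0, size=2000)       # trusted reference
synthetic = rng.normal(loc=5.1, scale=2.1, size=2000)  # candidate batch

def passes_validation(real, synth, mean_tol=0.5, std_tol=0.5):
    """Reject synthetic batches whose basic statistics drift too far."""
    return bool(abs(real.mean() - synth.mean()) < mean_tol
                and abs(real.std() - synth.std()) < std_tol)

ok = passes_validation(real, synthetic)
```

Real pipelines would check much richer properties (correlations, class balance, distributional distance), but the gate pattern, validate before use, is the same.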


As in many cases with business, if you have the resources to incorporate both alternatives, the outcomes are often best. There are clear advantages and disadvantages to both synthetic data and real-world data when it comes to training AI. Synthetic data is easier and faster to generate, but real-world data can be more accurate and representative of the target population.


Ultimately, the most effective approach to training AI is to utilize a combination of both methods. Synthetic data can be used to create models quickly, while real-world data can be used to validate the accuracy of the resulting models. By leveraging the distinct advantages of both synthetic and real-world data sets, organizations can achieve the greatest efficiency improvements when training AI. As always, you should measure and test the quality of the output from your dataset changes before use in any production environment.