AI Data Quality: Garbage In, Garbage Out

Data is the foundation of any AI system, no matter how complex or performant that system is. Because AI models take in data, process it, and make decisions or predictions based on it, data quality is critical to a model's accuracy and reliability. Poor-quality data can lead to incorrect results or negative outcomes, while high-quality data can provide actionable insights and meaningful predictions. But what determines data quality? Several factors affect it, from quantity to security.

Quantity
When it comes to AI, the quantity of data is paramount. The more data a model has access to, the better it can learn and perform. Often it is a sheer numbers game: models trained on more data outperform those trained on less. State-of-the-art AI models are now trained on hundreds of billions of data points.

That said, emerging techniques are working to reduce data quantity requirements. For example, zero-shot and few-shot learning let a model handle new tasks with little or no task-specific training data. Additionally, the advent of foundation models has brought a new perspective to the role of data in AI systems. Foundation models such as GPT-3 require a massive amount of data up front, as they are pre-trained on hundreds of billions of tokens. Once built, however, these models become quite sample-efficient, requiring far less data to adapt to specific downstream applications. This approach lets AI systems leverage the power of foundation models while minimizing the need for extensive data collection for each new task.
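To make the sample-efficiency point concrete, here is a minimal sketch in plain NumPy. Everything in it is illustrative: a fixed random projection stands in for a real pre-trained encoder, and the synthetic two-class data stands in for a downstream task. The key idea is that the "foundation model" features are frozen, and the new task is learned from just five labeled examples per class.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for a frozen, pre-trained encoder: a fixed,
# non-linear projection from 4 raw features to a 20-d embedding space.
W = rng.normal(size=(4, 20))

def encode(x):
    """'Foundation model' features -- computed once, never retrained."""
    return np.tanh(x @ W)

def make_data(n_per_class):
    """Two well-separated classes of synthetic 4-d inputs."""
    x0 = rng.normal(-1.0, 0.3, size=(n_per_class, 4))  # class 0
    x1 = rng.normal(+1.0, 0.3, size=(n_per_class, 4))  # class 1
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)

# Few-shot adaptation: only 5 labeled examples per class for the new task.
x_few, y_few = make_data(5)
emb = encode(x_few)
centroids = np.stack([emb[y_few == c].mean(axis=0) for c in (0, 1)])

# Classify unseen data by nearest class centroid in the frozen feature space.
x_test, y_test = make_data(200)
dists = np.linalg.norm(encode(x_test)[:, None, :] - centroids[None], axis=2)
accuracy = (dists.argmin(axis=1) == y_test).mean()
```

In a real pipeline the encoder would be an actual pre-trained model and the downstream head could be any lightweight classifier; the pattern is the same: heavy pre-training once, cheap adaptation per task.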

Accuracy
Though quantity often trumps quality for AI models, the accuracy of the data itself is still critical, as inaccurate data can lead to incorrect results. Many AI models are trained on internet data, and as we all know, the internet can be home to misinformation. It's vital that data is drawn from credible sources.

Bias
Related to accuracy, bias occurs when the data is not representative of the target population. It can stem from a variety of factors: the data itself, the algorithms used to process it, and the pre-processing techniques used to prepare it for the model. Bias can also be introduced by the data collection process, which may skew toward certain types of data or certain groups of users.

While collecting data and developing AI models, it is important to be aware of potential sources of bias and take steps to mitigate them. Bias can lead to unfair treatment of certain groups of people, which can have serious ethical implications and tangible outcomes. For example, if an AI system used to diagnose diseases hasn’t been trained on pediatric data, it will be biased toward adults and will likely not accurately diagnose children.
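One simple mitigation step is to audit subgroup representation before training. The sketch below is hypothetical (the record counts, group names, and target shares are invented for illustration): it flags any group whose share of the dataset falls short of its expected share in the population the model will serve.

```python
# Hypothetical patient dataset: each record tagged with an age group.
records = (["adult"] * 920) + (["pediatric"] * 80)

# Expected share of each group in the population the model will serve.
target = {"adult": 0.75, "pediatric": 0.25}

def representation_gaps(records, target):
    """Flag groups whose share in the data falls short of the target."""
    n = len(records)
    gaps = {}
    for group, expected in target.items():
        observed = records.count(group) / n
        if observed < expected:
            gaps[group] = round(expected - observed, 3)
    return gaps

gaps = representation_gaps(records, target)
```

Here the check reveals that pediatric records make up 8% of the data against a 25% target, exactly the kind of shortfall that would lead the diagnostic model above to underperform on children.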

Relevance
It's also important to consider how effectively the data can solve a given problem or use case. A diverse dataset that represents various aspects of the problem domain helps AI models generalize better and make more robust predictions. Ensuring that the data covers a wide range of scenarios, edge cases, and variations improves the AI system's ability to adapt to new or unseen data.
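A basic coverage audit along these lines compares the scenarios present in the data against the scenarios the use case must handle. The example below is hypothetical (a driving dataset with invented condition labels), but the pattern applies to any domain:

```python
# Hypothetical driving dataset: each sample labeled with its scenario.
samples = [
    {"id": 1, "condition": "daylight"},
    {"id": 2, "condition": "daylight"},
    {"id": 3, "condition": "rain"},
    {"id": 4, "condition": "night"},
]

# Conditions the deployed system must handle, including edge cases.
required = {"daylight", "night", "rain", "snow", "fog"}

covered = {s["condition"] for s in samples}
missing = sorted(required - covered)  # scenarios with zero training data
```

Any scenario in `missing` is one the model has never seen, so collection efforts can be targeted there before the gap shows up as a failure in production.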
