AI Datasets: The Importance Of Quality Data For AI Training

💡 Pro tip: Would you like to learn more about machine learning first? Check out this guide: What is Machine Learning? The Ultimate Beginner’s Guide.

And in case you are ready to start annotating, check out:

  1. V7 Model Training
  2. V7 Workflows
  3. V7 Auto Annotation
  4. V7 Dataset Management

Now, let’s get started!

What is AI training data?

Training data is the initial set of data fed to a machine learning model, from which the model is created. It is usually divided into training, validation, and test sets.

Just as humans learn better from examples, machines need a set of data to learn patterns from.

💡 Training data is the data we use to train a machine learning algorithm.

In most cases, the training data consists of pairs of input data and annotations, gathered from various sources and organized to train the model to perform a specific task at a high level of accuracy.

It may be composed of raw data, such as images, text, or sound, containing annotations, such as bounding boxes, tags, or connections.

Machine learning models learn the annotations on training data, so that they may apply them to new, unlabeled examples.

For instance, here’s how you can auto-annotate your training data with V7.

Training Data in Supervised vs. Unsupervised learning

What is the difference in training data using supervised vs. unsupervised learning?

In supervised learning, humans will label data telling the model exactly what it needs to find.

For example, in spam detection, the input is the text of a message, while the label indicates whether the message is spam or not.
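To make the pairing of input and label concrete, here is a minimal sketch of a supervised spam dataset in Python. The messages, the `LabeledExample` type, and the word-counting rule are all made up for illustration; a real spam filter would use a proper learning algorithm rather than this toy scoring.

```python
from collections import Counter
from typing import NamedTuple


class LabeledExample(NamedTuple):
    text: str      # the input
    is_spam: bool  # the human-provided label


# Hypothetical toy dataset: each input is paired with its label.
TRAINING_DATA = [
    LabeledExample("win a free prize now", True),
    LabeledExample("claim your free money", True),
    LabeledExample("meeting moved to friday", False),
    LabeledExample("lunch at noon tomorrow", False),
]


def count_words(examples, spam):
    """Count how often each word appears in examples with the given label."""
    counts = Counter()
    for example in examples:
        if example.is_spam == spam:
            counts.update(example.text.split())
    return counts


def predict_is_spam(text, examples):
    """Label new text as spam if its words were seen more often in spam examples."""
    spam_counts = count_words(examples, spam=True)
    ham_counts = count_words(examples, spam=False)
    spam_score = sum(spam_counts[word] for word in text.split())
    ham_score = sum(ham_counts[word] for word in text.split())
    return spam_score > ham_score
```

The point of the sketch is that the labels alone determine what the model treats as spam: change the labels and the same code learns a different rule.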

Supervised learning is more restrictive, as we aren’t allowing the model to derive its own conclusions from the data outside of the limits annotated by our labels.

In unsupervised learning, humans present the model with raw data containing no labels, and the model finds patterns within the data, for example by measuring how similar or different two data points are based on the features extracted from them.

This helps the model derive inferences and reach conclusions, for instance by grouping similar images into clusters.
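As a rough illustration of clustering without labels, the sketch below groups points with plain k-means. It assumes each image has already been reduced to a small numeric feature vector (the 2D tuples in the usage note are hypothetical), and it picks the first `k` points as starting centroids purely to keep the example deterministic.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters using plain k-means (Lloyd's algorithm)."""
    # Assumption: the first k points serve as initial centroids, for determinism.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments
```

Calling `k_means([(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1)], k=2)` returns one cluster index per point. No label is ever supplied; the grouping comes only from distances between feature vectors.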

Semi-supervised learning is a combination of the two learning types mentioned above, where data is partly labeled by humans with some of the predictions left to the model’s judgment.

Semi-supervised learning is often used when humans can direct the model towards the area of focus but where actual predictions become hard to annotate because they are too small or nuanced.

In reality, there is no such thing as fully supervised or unsupervised learning; there are only various degrees of supervision.

Supervised learning: training data process

All learning methods start with the collection of raw data from different sources.

Raw data can take any form, such as text, images, audio, or video. However, to tell the model what needs to be identified in this data, you must add annotations.

These annotations allow you to supervise the learning, ensuring that the model focuses on the features you point out, rather than extrapolating conclusions from other correlated (but not causal) elements in your data.

Each input should have a corresponding label that guides the machine towards what the prediction should look like. This processed dataset is produced with the help of humans, and sometimes of other ML models accurate enough to apply labels reliably.

Once a labeled dataset is ready to be fed to the AI, the training phase starts.

Here, the model tries to derive important features that are common across all the areas where you applied your labels. For example, if you segmented out a few cars in your images, it will learn that wheels, rear-view mirrors, and door handles are all features that correlate with “car”.

Models test themselves continuously against a validation set defined prior to training time.

Once complete, they will make a final check against testing data (a set never seen before by the model) which will give an idea of the model’s performance on relevant new examples.

Your training, validation, and test sets are all part of your training data. As a general rule, the more training data you have, the higher your model’s accuracy.

Now, let’s define some of the popular terms you might encounter when dealing with machine learning training data.

💡 Pro tip: Dive deeper and check out Supervised vs. Unsupervised Learning: What’s the Difference?

What is labeled data?

Labeled data is data that comes with a tag/class that provides meaningful information.

Here are a few examples of labeled data: images tagged as cat or dog, emails or messages marked as spam or not spam, historical stock prices (where the future state is your label), nodules outlined with a polygon and marked as cancerous or not, and audio files paired with the words that were spoken.

💡 Pro tip: If you are looking for a free labeling annotation tool, check out The Complete Guide to CVAT—Pros & Cons.

Accurately labeled data makes it easier for the machine to recognize the patterns relevant to the task and predict the target, which is why it is widely used for solving complex tasks.

What is human in the loop?

A human-in-the-loop (HITL) process is one in which a machine learning model is only partially able to solve a problem, and part of the task is offloaded to a human agent.

Model-assisted data labeling is an example of human in the loop, where an ML model will apply initial predictions, and a human complements them with additional tags, corrections, or other types of annotations unsupported by the model.

Humans provide continuous feedback improving the performance of the model.

To begin with, humans use annotation tools to label the raw data to help the machines learn and make predictions accordingly. They validate the output of the model and check the predictions when the machine is not sure of its output to ensure that the learning of the model progresses in the right direction.
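One common way to decide which predictions a human should check is a confidence threshold. The sketch below is a minimal illustration under assumptions: the `(item_id, label, confidence)` tuple format and the 0.9 threshold are hypothetical choices, not something prescribed by any particular tool.

```python
def route_predictions(predictions, confidence_threshold=0.9):
    """Split model predictions into auto-accepted ones and ones needing human review.

    Each prediction is an (item_id, label, confidence) tuple; this format and
    the default threshold are assumptions made for the example.
    """
    accepted, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= confidence_threshold:
            accepted.append((item_id, label))
        else:
            # The model is unsure, so a human validates or corrects the label.
            needs_review.append((item_id, label))
    return accepted, needs_review
```

Corrections made on the `needs_review` items can then be added back to the training data, which is how human feedback improves the model over time.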

Sometimes, though, humans stay in the loop permanently to label data that models can’t be fully relied on to handle.

For example, many automated medical diagnosis systems, or identity verification systems, rely on humans in the loop to avoid leaving the final decision of an important evaluation to the machine learning algorithms.

💡 Pro tip: Check out The Ultimate Guide to Medical Image Annotation.

In this loop, machines and humans go hand in hand!

Training, Validation, and Test Sets

No AI model should be trained and tested on the same data.


It’s simple—

The model’s evaluation would be biased, as the model would be tested on what it has already learned. It would be like setting an exam with the exact questions that were already answered in class. We would not know if the student memorized the answers or actually understood the concepts.

The same rules apply to the machine learning models.

Here’s an overview of the splits.

Training data: At least 60% of your data should be used for training.

Validation data: A sample (10-20%) of the total dataset is held out for validation and used to evaluate the model periodically during training. This validation set should be a representative sample of the training set.

Test data: This set is used to test the model after it has been completely trained. It is separate from both the training set and the validation set, and the model is only run on it once training and validation are finished. The model should see the test inputs without their labels, exactly as it would see real data once deployed; the labels are held back and used only to score its predictions.
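A split along these lines can be sketched in a few lines of Python. The 70/15/15 proportions and the fixed `seed` are assumptions chosen to fit the guidance above (at least 60% for training, 10-20% for validation); adjust them to your own dataset.

```python
import random


def split_dataset(items, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle items and split them into training, validation, and test lists."""
    if not 0 < train_fraction + validation_fraction < 1:
        raise ValueError("train and validation fractions must leave room for a test set")
    shuffled = list(items)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train_fraction)
    validation_end = train_end + round(len(shuffled) * validation_fraction)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],  # whatever remains becomes the test set
    )
```

Shuffling before slicing matters: if the data is ordered (for example by collection date or by class), an unshuffled split would give the three sets different distributions.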

💡 Pro tip: Read How to Split Your Machine Learning Data: Train, Validation, Test Set Split to learn more.

You may have more than one test set in a dataset.

Each test set can be used to check whether a model generalizes to a specific scenario. For example—

An autonomous vehicle model made to detect pedestrians may be trained on videos from all over the United States.

💡 Pro tip: Check out the list of 65+ datasets for machine learning.

Its main test set might be a mix of locations from all of the states. However, you might want to create dedicated test sets for specific scenarios. These can include:

  • A test set for sunset driving
  • A test set for a snowy environment
  • A test set for driving in heavy storms
  • A test set for when the camera lens is dirty or scratched

These test sets are normally stored in a dataset management solution and are hand-picked by data scientists. As such, it’s paramount that you fully understand what your data looks like and appropriately tag outlier scenarios so that you may create test sets out of them.
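If outlier scenarios are tagged consistently, building the dedicated test sets becomes a simple grouping step. The sketch below assumes each held-out item is a dictionary with an `id` and a list of `tags`; that record shape and the tag names are hypothetical, not the format of any specific dataset management tool.

```python
def build_scenario_test_sets(items, scenario_tags):
    """Group held-out items into one dedicated test set per scenario tag.

    Assumes each item looks like {"id": ..., "tags": [...]}; this record
    shape is made up for the example.
    """
    test_sets = {tag: [] for tag in scenario_tags}
    for item in items:
        for tag in item["tags"]:
            if tag in test_sets:
                # An item with several tags lands in several test sets.
                test_sets[tag].append(item["id"])
    return test_sets
```

A scenario that comes back with an empty list is a useful signal in itself: it means you have no held-out data to measure the model on for that condition yet.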

Test sets are not used exclusively to assess AI model performance. Sometimes they are used to test the performance of human annotators too.

This is known as a Gold Set.

Gold Sets—Your ideal ground truth

A selection of well-labeled images that accurately represent what perfect ground truth looks like is called a gold set.

These image sets are used as mini testing sets for human annotators, either as part of an initial tutorial or scattered across labeling tasks, to make sure that an annotator’s performance is not deteriorating, whether through lapses on their part or because of changing instructions.

Gold sets usually check for a series of things:

  • Time to complete a task
  • Accuracy of each annotation (by recall or IoU)
  • Performance increases with experience
  • Performance deterioration with new instruction changes
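For the accuracy check, IoU (intersection over union) compares an annotator’s box against the gold set box for the same object. The sketch below assumes axis-aligned bounding boxes given as `(x_min, y_min, x_max, y_max)` tuples, which is one common convention but not the only one.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping region, clamped at zero if disjoint.
    inter_width = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_height = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_width * inter_height
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

An IoU of 1.0 means the annotation matches the gold box exactly, and 0.0 means the boxes do not overlap at all. The score at which an annotation counts as acceptable is a project-level decision.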

Testing continuously against gold sets is paramount to good training data. The best labeling teams in the market maintain rigorous automated tests and use a platform that lets gold set items be placed intelligently and their results measured.

Blind Stages—Multiple passes by multiple annotators

Blind stages are annotation tasks where multiple humans (or models) place a label independently of one another, and the stage passes only if they all agree on the same outcome.

Blind stages are used to create ultra-accurate training data and to automate quality assurance checks. It’s very common for one annotator to miss an object, but it’s far less common for two of them to miss the same one.

Blind stages are labeled in parallel and each participant cannot see the progress of the others.

When all annotators have completed their version of the task, it goes through a consensus check that validates that the annotations agree. If they don’t, or don’t overlap enough with one another spatially, the task is sent to a human reviewer to apply corrections, and the annotator who made an error is notified so they may improve their work.
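In its simplest form, for tag-style labels, the consensus check is an equality test across annotators. The sketch below is a hypothetical illustration (the function name and the `"pass"` / `"review"` outcomes are made up); for spatial annotations the equality test would be replaced by an overlap measure with a minimum threshold.

```python
def consensus_check(labels_by_annotator):
    """Return ("pass", label) if every annotator chose the same label.

    Otherwise return ("review", None) so the task can go to a human reviewer.
    labels_by_annotator maps an annotator name to the label they applied.
    """
    distinct_labels = set(labels_by_annotator.values())
    if len(distinct_labels) == 1:
        return "pass", distinct_labels.pop()
    return "review", None
```

Because each annotator works without seeing the others, agreement here is independent evidence that the label is right, not one person copying another.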

How much training data do you need?

The simple answer: Enough to represent each plausible case in your scenario with at least 1,000 data samples.

Why 1,000?

If you hold out 10% of those samples as a test set, you have 100 test samples per class. Each one accounts for 1% of the result, so you can measure the accuracy of a class in steps of 1%.
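The arithmetic behind this rule of thumb is short enough to check directly. The helper below is a hypothetical illustration: it reads the 1% figure as the smallest change in accuracy a test set can register, which is one test sample out of the total.

```python
def accuracy_resolution(samples_per_class, test_fraction=0.1):
    """Return (test set size, smallest measurable accuracy step) for one class."""
    test_samples = round(samples_per_class * test_fraction)
    # One misclassified test sample changes the measured accuracy by 1 / test_samples.
    return test_samples, 1 / test_samples
```

With 1,000 samples per class, the test set has 100 samples and accuracy moves in 1% steps. With only 100 samples per class, the test set shrinks to 10 and each error shifts the measured accuracy by 10%, which is too coarse to compare models reliably.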

To put things into perspective:

1,000 examples per class is a decent dataset.

10,000 is a great dataset.

100k-1 million is an excellent dataset.

More than 1 million labeled examples of something puts you on the leaderboard among AI teams.

Some companies are now training models on billions of images, video, and audio samples. These datasets have multiple test sets and are labeled and re-labeled multiple times to increase their scope.

Yes, theoretically you can train a model using 100 examples of something. For example, V7 allows you to train a model with as few as 100 instances. However, such models will perform rather poorly on new examples.

Great models are trained on large volumes of training data items for a good reason—modern neural network architectures work brilliantly because they can store many weights (parameters) efficiently. However, if you don’t have a lot of training data, you are only using a fraction of your model’s potential.

Dataset size will also depend on the domain of your task and the variance of each class.

If you plan to identify every Mars chocolate bar in the world, you’ll probably run out of variance after 10,000 examples. The model will have seen every possible angle, lighting condition, and crumpled appearance of the candy bar.

However, if you want to make a generalized person detector, 10,000 samples are only a glimpse of the variety of sizes, appearances, poses, and clothing that humans may have. As such—a class with high variance such as “person” requires a lot more training data.

💡 Pro tip: Check out 15+ Top Computer Vision Project Ideas for Beginners to start building your own computer vision models in less than an hour!