Do you want your new deep learning application to be successful? Then handle your data with care. Every machine vision application depends on “high-quality” image data, and for deep learning applications this matters even more.
No matter which method you use (classification, object detection, segmentation, or anomaly detection), the deep learning network has to be trained with data. Keep in mind: a deep learning network can only learn what it sees!
For this reason, you should follow a few important rules when creating your training data set:
- Acquire the deep learning image data under conditions that are similar or, even better, identical to the expected scenario in the live application. Images acquired in a laboratory setup are suitable for experiments only.
- The training data must cover all variations that can occur during the online process. This includes changes in general conditions, such as illumination.
- The training data must be independent. It should not contain multiple images of the same object.
- The more training data you acquire that follows the three rules above, the better.
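The coverage and independence rules can be checked roughly before training. The following is a minimal sketch, assuming you keep a simple manifest with one entry per image. The field names `object_id` and `illumination` are hypothetical and only serve as an illustration; use whatever metadata you actually record.

```python
from collections import Counter

def audit_manifest(records):
    """Report repeated objects and per-condition image counts.

    records: list of dicts with the hypothetical keys
    'object_id' and 'illumination'.
    """
    object_counts = Counter(r["object_id"] for r in records)
    condition_counts = Counter(r["illumination"] for r in records)
    # Objects that appear in more than one image violate the independence rule.
    repeated = sorted(obj for obj, n in object_counts.items() if n > 1)
    return repeated, dict(condition_counts)

# Made-up example manifest, for illustration only.
records = [
    {"object_id": "part-001", "illumination": "bright"},
    {"object_id": "part-002", "illumination": "dim"},
    {"object_id": "part-001", "illumination": "dim"},
    {"object_id": "part-003", "illumination": "bright"},
]

repeated, per_condition = audit_manifest(records)
print("Objects with more than one image:", repeated)
print("Images per illumination condition:", per_condition)
```

A condition with very few images, or an object that appears several times, is a hint that more or different data should be acquired.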
Besides the acquired image data, the second very important part of the data set is the labeling. Of course the labeling has to be correct, but it also has to be accurate. Especially for object detection and segmentation, accurate labeling is essential for accurate localization in the online process. Again, the network can only learn the accuracy that is present in the labeled training set. It is also very important that the labeling is extremely consistent: label every object in the data set, and label all objects of one class in the same way.
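For object detection, one common way to put a number on labeling accuracy and consistency is the intersection over union (IoU) of two bounding boxes, for example when the same object has been labeled twice. This is a minimal sketch, assuming axis-aligned boxes given as `(x1, y1, x2, y2)`. The function name `box_iou` and the sample coordinates are illustrative and not part of any particular tool.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two annotations of the same object, e.g. from two labeling passes (made-up values).
first_pass = (10, 10, 50, 50)
second_pass = (12, 11, 52, 49)
print(f"IoU between the two annotations: {box_iou(first_pass, second_pass):.3f}")
```

An IoU of 1.0 means the two boxes are identical. Values well below 1.0 for the same object suggest that the labeling is not as accurate or consistent as it should be.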
Correct labeling seems obvious and simple. However, with hundreds of labeled objects, it is not uncommon for errors, i.e., wrongly labeled data, to occur. In this case, the new Review tab in the MVTec Deep Learning Tool is very well suited to finding mislabeled data quickly. So take a look at this new feature and get rid of your erroneous data.
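Independent of any particular tool, a simple heuristic for spotting candidate label errors is to look for samples where a trained model confidently disagrees with the given label. The sketch below illustrates that idea only; it does not describe how the Review tab works. The record fields (`image`, `label`, `predicted`, `confidence`) and the threshold are assumptions chosen for the example.

```python
def flag_suspicious(samples, min_confidence=0.9):
    """Return the images whose confident prediction disagrees with the label."""
    return [
        s["image"]
        for s in samples
        if s["predicted"] != s["label"] and s["confidence"] >= min_confidence
    ]

# Made-up predictions, for illustration only.
samples = [
    {"image": "img_001.png", "label": "ok", "predicted": "ok", "confidence": 0.98},
    {"image": "img_002.png", "label": "ok", "predicted": "scratch", "confidence": 0.95},
    {"image": "img_003.png", "label": "scratch", "predicted": "ok", "confidence": 0.55},
]

print(flag_suspicious(samples))
```

Flagged images are only candidates: each one still has to be reviewed by a person, since the model may be the one that is wrong.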