Anomaly Detection with Auto-Encoders



In data science, anomaly detection is the task of detecting rare items, events or observations that deviate significantly from the underlying distribution and cannot be regarded as normal behaviour. Three different categories of anomaly detection exist: supervised, semi-supervised, and unsupervised.

In this project, we will be using an unsupervised approach to detect the regions of hazelnut images that do not match the learned distribution.

1. Background

1.1 Anomaly Detection

Anomaly detection, also known as outlier analysis, is a step in data mining that identifies data points, events, and observations deviating from a dataset's normal behaviour. It is the process of finding patterns in data that do not conform to expected behaviour. Anomaly detection is increasingly employed on the big data captured by sensors, social media platforms, large networks, etc. It also has many applications, including energy systems, medical devices, banking, and network intrusion detection. Machine learning is progressively being used to automate anomaly detection.

1.2. Auto-Encoders

An auto-encoder is an unsupervised deep learning model that reconstructs high-dimensional input data using a neural network with a narrow bottleneck layer in the middle; this bottleneck contains the latent representation of the input data. Auto-encoders have two main parts, the Encoder and the Decoder.

Figure 1
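To make the structure concrete, here is a minimal sketch of an auto-encoder in PyTorch. This is purely illustrative: the fully-connected layers and their sizes are arbitrary assumptions, not our model, which is described in Section 3.

```python
import torch
import torch.nn as nn

class TinyAutoEncoder(nn.Module):
    """Conceptual auto-encoder: compress to a narrow latent vector, then reconstruct."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: high-dimensional input -> narrow bottleneck
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Decoder: bottleneck -> reconstruction of the input
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)     # latent representation
        return self.decoder(z)  # reconstruction
```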

1.3 Accuracy methods

In order to compare our results with those already published, we need a standard accuracy measure. First, we need the notions of true and predicted conditions. There are two types of correct predictions (True Positives and True Negatives) and two types of errors: a Type I error is an observation predicted positive when it is actually negative (a False Positive, also called a false alarm), and a Type II error is an observation predicted negative when it is actually positive (a False Negative).

Figure 2

2. Dataset

For our project, we used the MVTec dataset (Paul Bergmann, January 2021). This dataset contains several categories of images: Bottle, Cable, Capsule, Carpet, Grid, Hazelnut, Leather, Metal Nut, Pill, Screw, Tile, Toothbrush, Transistor, Wood and Zipper. Due to resource limitations, we were only able to train on one of these categories. We used the Hazelnut set, which contains 390 defect-free images. Deep neural networks are known for their impressive accuracy when trained on large datasets, so we used data augmentation (rotation, flipping, and adding noise) to increase the number of samples; in the end, we had approximately 2000 images. There are 70 test images covering different defect types: crack, cut, hole and print. The dataset also includes ground-truth masks for calculating the accuracy of the testing.
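As an illustration, such an augmentation pipeline could be written with torchvision as below. The exact rotation range and noise level here are assumptions for the sketch, not the tuned values we used.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add small Gaussian noise to a tensor image in [0, 1]."""
    def __init__(self, std=0.01):
        self.std = std
    def __call__(self, img):
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=180),  # hazelnuts have no fixed orientation
    transforms.ToTensor(),
    AddGaussianNoise(std=0.01),
])
```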

2.1 Train Images

For training, we have 390 defect-free images of hazelnuts; see Figure 3.
Figure 3

2.2 Test Images

There are four different testing sets: crack, cut, hole, and print. Figure 4 shows some samples of these test images.
Figure 4
In addition to the defective images, we have ground-truth segmentations of the defective regions.

3. Network Architecture

Our model has two main parts, the Encoder and the Decoder. Each consists of several convolutional layers, each followed by batch normalization with momentum 0.1 and `ϵ = 10^(-6)`. For the activation function, we used LeakyReLU with a negative slope of 0.01.

3.1 Encoder Architecture

Table 1 shows each layer and its dimensions for the Encoder.
LAYER TYPE                 IN    OUT   KERNEL  STRIDE
Conv2d                     3     3     4       2
BatchNorm2d, LeakyReLU     3
Conv2d                     3     32    4       2
BatchNorm2d, LeakyReLU     32
Conv2d                     32    32    4       2
BatchNorm2d, LeakyReLU     32
Conv2d                     32    32    3       1
BatchNorm2d, LeakyReLU     32
Conv2d                     32    64    4       2
BatchNorm2d, LeakyReLU     64
Conv2d                     64    64    3       1
BatchNorm2d, LeakyReLU     64
Conv2d                     64    128   4       2
BatchNorm2d, LeakyReLU     128
Conv2d                     128   64    3       1
BatchNorm2d, LeakyReLU     64
Conv2d                     64    32    3       1
BatchNorm2d, LeakyReLU     32
Conv2d                     32    500   8       1
BatchNorm2d                500
Table 1
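Table 1 translates almost directly into PyTorch. The sketch below is our reading of the table; the padding values are assumptions, since the table does not list them, and the batch-norm settings follow Section 3 (momentum 0.1, `ϵ = 10^(-6)`).

```python
import torch.nn as nn

def conv_block(c_in, c_out, k, s, p=1):
    """Conv2d followed by BatchNorm2d and LeakyReLU, as in Table 1."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm2d(c_out, eps=1e-6, momentum=0.1),
        nn.LeakyReLU(negative_slope=0.01),
    )

encoder = nn.Sequential(
    conv_block(3, 3, 4, 2),
    conv_block(3, 32, 4, 2),
    conv_block(32, 32, 4, 2),
    conv_block(32, 32, 3, 1),
    conv_block(32, 64, 4, 2),
    conv_block(64, 64, 3, 1),
    conv_block(64, 128, 4, 2),
    conv_block(128, 64, 3, 1),
    conv_block(64, 32, 3, 1),
    # Final projection to the 500-dimensional latent space; Table 1 lists
    # BatchNorm2d only (no LeakyReLU) after this layer.
    nn.Conv2d(32, 500, kernel_size=8, stride=1, padding=0),
    nn.BatchNorm2d(500, eps=1e-6, momentum=0.1),
)
```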

3.2 Decoder Architecture

Table 2 shows each layer and its dimensions for the Decoder.
LAYER TYPE                 IN    OUT   KERNEL  STRIDE
ConvTranspose2d            500   32    8       1
BatchNorm2d, LeakyReLU     32
ConvTranspose2d            32    64    3       1
BatchNorm2d, LeakyReLU     64
ConvTranspose2d            64    128   3       1
BatchNorm2d, LeakyReLU     128
ConvTranspose2d            128   64    4       2
BatchNorm2d, LeakyReLU     64
ConvTranspose2d            64    64    3       1
BatchNorm2d, LeakyReLU     64
ConvTranspose2d            64    32    4       2
BatchNorm2d, LeakyReLU     32
ConvTranspose2d            32    32    3       1
BatchNorm2d, LeakyReLU     32
ConvTranspose2d            32    3     4       2
BatchNorm2d, LeakyReLU     3
ConvTranspose2d            3     3     4       2
Table 2
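Likewise, Table 2 can be transcribed into PyTorch. The padding values below are assumptions chosen to mirror the encoder, since the table does not list them.

```python
import torch.nn as nn

def deconv_block(c_in, c_out, k, s, p=1):
    """ConvTranspose2d followed by BatchNorm2d and LeakyReLU, as in Table 2."""
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm2d(c_out, eps=1e-6, momentum=0.1),
        nn.LeakyReLU(negative_slope=0.01),
    )

decoder = nn.Sequential(
    deconv_block(500, 32, 8, 1, p=0),  # expand the latent vector back into a feature map
    deconv_block(32, 64, 3, 1),
    deconv_block(64, 128, 3, 1),
    deconv_block(128, 64, 4, 2),
    deconv_block(64, 64, 3, 1),
    deconv_block(64, 32, 4, 2),
    deconv_block(32, 32, 3, 1),
    deconv_block(32, 3, 4, 2),
    # Final layer outputs the reconstructed 3-channel image; Table 2 lists
    # no normalization or activation after it.
    nn.ConvTranspose2d(3, 3, kernel_size=4, stride=2, padding=1),
)
```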

4. Experiments

4.1 Training

The model was trained on the dataset described above for 1000 epochs. We used the `L2` loss and the Adam optimizer to train the network. Figure 5 shows the loss at each step (each epoch has approximately 63 steps).
Figure 5
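A minimal sketch of this training setup follows (Adam optimizer, `L2`/MSE reconstruction loss, 1000 epochs). The `train_loader` of augmented hazelnut batches and the default Adam learning rate are assumptions, and the decoder output is assumed to match the input size.

```python
import torch
import torch.nn as nn

model = nn.Sequential(encoder, decoder)  # from the sketches for Tables 1 and 2
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()  # L2 reconstruction loss

for epoch in range(1000):
    for images in train_loader:          # ~63 steps per epoch in our setup
        optimizer.zero_grad()
        reconstruction = model(images)
        loss = criterion(reconstruction, images)  # compare output to input
        loss.backward()
        optimizer.step()
```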
According to Figure 2, we have an understanding of True Positives, True Negatives, False Positives and False Negatives. Based on these, it is common to analyze four metrics:
`Precision = (TP)/(TP + FP)`
`Recall = (TP)/(TP + FN)`
`Accuracy = (TP + TN)/(TP + FP + FN + TN)`
But because Precision and Recall usually play against each other, we may rely on their harmonic mean, which is the F1-score:
`F_1 = 2/(Recall^(-1) + Precision^(-1)) = (2 * Precision * Recall)/(Recall + Precision)`
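These metrics can be computed at the pixel level from the predicted and ground-truth masks; a short sketch (zero-division guards omitted for brevity):

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """Precision, recall, accuracy and F1 from boolean masks of equal shape."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```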
Figure 6 shows the F1 score during training over the Crack and Cut validation sets.
Figure 6
Figure 7 shows the F1 score during training over the Hole and Print validation sets.
Figure 7

4.2 Methodology

We train the network to reconstruct the input image from the latent space by feeding it many normal images, so that it learns the distribution of standard, defect-free images. Since the latent space is smaller than the image, the network must learn the essential components of the image. Having learned these, whenever it sees a new image it will use what it has learned to reconstruct the image. Because a defective image has regions that deviate from the learned distribution, those regions will have a high reconstruction error (the difference between the input and the output), and this is how we find the defective pixels.
Figure 8
Figure 8 demonstrates the algorithm we used to detect the pixels carrying outliers.
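A sketch of this detection step, given the trained model and a test image tensor; averaging the error over colour channels is our assumption, as the exact aggregation is shown only in Figure 8.

```python
import torch

@torch.no_grad()
def detect_defects(model, image, tau):
    """Mark pixels whose reconstruction error exceeds the threshold tau."""
    model.eval()  # use running BatchNorm statistics
    reconstruction = model(image.unsqueeze(0)).squeeze(0)
    error = (image - reconstruction).abs().mean(dim=0)  # per-pixel error map
    return error > tau                                  # boolean defect mask
```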

4.3 Outputs

We used the algorithm to detect defects in the test images; results are shown for every defect set. From left to right: Original Image, Output Image, Predicted Pixels, Ground Truth. The result changes depending on the value of `τ`.
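For reference, a small sketch of how such panels can be laid out with matplotlib (the panel order follows the description above):

```python
import matplotlib.pyplot as plt

def show_result(original, output, predicted, ground_truth):
    """Display the four result panels side by side."""
    panels = [original, output, predicted, ground_truth]
    titles = ["Original Image", "Output Image", "Predicted Pixels", "Ground Truth"]
    fig, axes = plt.subplots(1, 4, figsize=(12, 3))
    for ax, panel, title in zip(axes, panels, titles):
        ax.imshow(panel)
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```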

5. Discussions

5.1 Effect of Threshold

The output depends on the threshold parameter `τ`. Figures 13 and 14 show its effect. If the threshold is too high (for example, `τ = 0.7` in Figure 14), the system misses some defective pixels, so the F1 score decreases. If the threshold is too low (for example, `τ = 0.1` in Figure 14), the system marks defect-free pixels as defective; in this case the F1 score also decreases, since many of the predicted pixels are not actually faulty. A sketch of a threshold sweep follows Figure 14.
Figure 13
Figure 14 shows the effect of different values of `τ`. From left to right: `τ = 0.1`, `τ = 0.2`, `τ = 0.7`.
Figure 14
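Combining the earlier sketches (`detect_defects` and `pixel_metrics`), a threshold sweep over candidate values of `τ` could look like this; the candidate list and the hypothetical `validation_set` of (image, mask) pairs are assumptions.

```python
import numpy as np

def best_threshold(model, validation_set, taus=(0.1, 0.2, 0.3, 0.5, 0.7)):
    """Return the tau with the highest mean F1 over the validation set."""
    scores = {}
    for tau in taus:
        f1s = []
        for image, gt in validation_set:
            pred = detect_defects(model, image, tau).numpy().astype(bool)
            _, _, _, f1 = pixel_metrics(pred, gt)
            f1s.append(f1)
        scores[tau] = float(np.mean(f1s))
    return max(scores, key=scores.get), scores
```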

5.2 Effect of Latent Space size

In our model, the dimension of the latent space was 500. However, this is a hyper-parameter we could change, and changing it affects the output. As mentioned, the Decoder's objective is to reconstruct the input data from the output of the Encoder (i.e., the latent vector). If the latent space is too small, the model struggles to learn small details, since it has less capacity to model the latent distribution; in some cases it will then mark pixels as defective even though they are entirely defect-free. If the latent space is too large, the model starts to memorize the input image instead of learning its features, and is likely to reconstruct an image containing the same defective parts, which can prevent it from detecting the faulty pixels at all.


6. Conclusions

In this last part, we compare our results with other existing methods and approaches.

METHOD                    RECALL  PRECISION  F1
Our method (Print)        0.22    0.82       0.35
Our method (Crack)        0.26    0.57       0.36
Our method (Cut)          0.69    0.8        0.75
Our method (Hole)         0.55    0.61       0.57
AE (SSIM)                 0.08    1          0.15
AE (L2)                   0.84    0.93       0.88
AnoGAN                    0.16    0.83       0.27
CNN Feature Dictionary    0.07    0.9        0.13


We compared our approach with several models already in use, and Paul Bergmann (January 2021) has the best performance when training their model with an L2 loss function. Still, our approach holds up well: we outperform methods such as AnoGAN, auto-encoders using SSIM, and the CNN Feature Dictionary. However, our approach has lower accuracy on all three metrics (Recall, Precision and F1) than the state-of-the-art method. In one special case, the AE approach with the SSIM loss function reached 100% precision; however, it performed poorly on Recall, and therefore, according to the F1 metric, our approach outperforms AE (SSIM).
