[Introduction]This article is the second part of a series focusing on the characteristics and applications of convolutional neural networks (CNNs), which are mainly used for pattern recognition and object classification. In the first part of the series, “Introduction to Convolutional Neural Networks: What Is Machine Learning?” – Part 1, we compared running a classical linear program on a microcontroller with running a CNN and showed the advantages of CNNs. We also explored the CIFAR network, which can classify objects such as cats, houses, or bicycles in images and can also perform simple speech recognition. This article focuses on how to train such neural networks to solve practical problems.

Neural Network Training Process

The CIFAR network discussed in the first part of this series consists of different layers of neurons. As shown in Figure 1, image data of 32 × 32 pixels is presented to the network and passed through its layers. The first step in CNN processing is to extract the features and structures of the objects to be distinguished, which is done with the help of filter matrices. Once designers have modeled the CIFAR network, it initially cannot detect any patterns or objects, because the values of these filter matrices are not yet known.

Therefore, it is first necessary to determine all the parameters of the filter matrices so as to maximize the accuracy of object detection or, equivalently, to minimize the loss function. This process is called neural network training. For the common applications described in the first part of this series, the network is trained once during development and testing, after which no further parameter tuning is needed. If the system classifies familiar objects, no additional training is required; additional training becomes necessary only when the system must classify entirely new objects.

Network training requires a training dataset and a similar test dataset for checking the accuracy of the network. For example, the CIFAR-10 dataset contains images of ten object classes: airplane, car, bird, cat, deer, dog, frog, horse, ship, and truck. These images must be labeled before the CNN is trained, which is the most complicated part of the AI application development process. The training process discussed in this article uses the principle of backpropagation: the network is shown a large number of images in succession, and each image is accompanied by a target value. In this example, the target value is the object class present in the image. Each time an image is presented, the filter matrices are optimized so that the target and actual values of the object class match. A network that has completed this process is able to detect objects in images it never saw during training.
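The labeling step can be illustrated with a small sketch. During training, each image's class label is typically turned into a target vector that the network output is compared against; the helper below is illustrative, not part of the CIFAR-10 tooling (the class list follows the order given above).

```python
# Class names for CIFAR-10, in the order listed in the text.
CLASSES = ["airplane", "car", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

def one_hot(label, classes=CLASSES):
    """Encode a class name as a target vector: 1.0 for the labeled class,
    0.0 for every other class."""
    vec = [0.0] * len(classes)
    vec[classes.index(label)] = 1.0
    return vec

# Each training image is presented together with its target vector.
target = one_hot("dog")
```

During training, the network's output vector is compared against this target vector, and the filter matrices are adjusted to reduce the difference.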

Figure 1. CIFAR CNN architecture.

Figure 2. A training loop consisting of forward and backpropagation.

Overfitting and Underfitting

Questions that often arise during the modeling of neural networks are: how many layers should the network have, and how large should its filter matrices be? Answering these questions is not easy, which is why it is crucial to discuss overfitting and underfitting. Overfitting is caused by a model that is too complex and has too many parameters. We can determine how well a predictive model fits the training dataset by comparing the loss on the training dataset with the loss on the test dataset. If the loss is low during training but increases excessively when the network is presented with test data it has never seen, this is a strong indication that the network has memorized the training data rather than learned to recognize patterns. This mainly occurs when the network has too much parameter storage or too many convolutional layers. In that case, the network size should be reduced.
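The comparison described above can be expressed as a simple check. The function and the tolerance value below are illustrative assumptions, not part of the article; in practice the acceptable gap between training and test loss depends on the application.

```python
# Illustrative overfitting check: a test loss far above the training loss
# suggests the network has memorized the training data.
def overfitting_gap(train_loss, test_loss, tolerance=0.5):
    """Return True if the test loss exceeds the training loss by more
    than the given tolerance."""
    return (test_loss - train_loss) > tolerance

# Low training loss with a much higher test loss: likely overfitting.
likely_overfit = overfitting_gap(train_loss=0.05, test_loss=1.2)
```

If the check keeps firing after training, the text's advice applies: reduce the size of the network.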

Loss Function and Training Algorithm

Learning takes place in two steps. In the first step, the network is shown an image, which is then processed by the layers of neurons to produce an output vector. The maximum value of the output vector indicates the detected object class, such as “dog” in our example; this value is not necessarily correct yet. This step is called forward propagation.

The difference between the target value and the actual value produced at the output is called the loss, and the associated function is called the loss function. All elements and parameters of the network are included in the loss function. The goal of the learning process is to define these parameters in such a way that the loss function is minimized. This minimization is achieved through backpropagation: the error produced at the output (loss = target value − actual value) is fed backward through the layers of the network until it reaches the initial layer.
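The article describes the loss as the difference between target and actual values. For output vectors, a common way to aggregate these per-element differences into a single number is the mean squared error; the sketch below uses it purely as an illustration, not as the specific loss function of the CIFAR network.

```python
# Mean squared error: one common loss function that generalizes the
# "target minus actual" difference to whole output vectors.
def mse_loss(target, actual):
    """Average of the squared per-element differences between the
    target vector and the network's output vector."""
    return sum((t - a) ** 2 for t, a in zip(target, actual)) / len(target)

# Target says class 2 of 3; the network's (imperfect) output disagrees a bit.
loss = mse_loss([0.0, 1.0, 0.0], [0.1, 0.7, 0.2])
```

As training proceeds, the parameters are adjusted so that this number shrinks toward zero.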

Thus, forward propagation and backpropagation form a loop during training that progressively determines the parameters of the filter matrices. This cycle is repeated until the loss falls below a certain level.
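The loop described above can be sketched on a toy problem. This is not the CIFAR network: a single weight w is fitted so that w · x approximates y, with the forward pass computing the loss and the backward pass computing the gradient, repeated until the loss drops below a threshold. The learning rate and threshold values are illustrative assumptions.

```python
# Toy training loop: forward pass, backward pass, parameter update,
# repeated until the loss falls below a threshold.
def train(x, y, w=0.0, lr=0.1, threshold=1e-6):
    loss = (w * x - y) ** 2           # forward pass: compute the loss
    while loss > threshold:
        grad = 2 * (w * x - y) * x    # backward pass: gradient of loss w.r.t. w
        w -= lr * grad                # update the parameter
        loss = (w * x - y) ** 2       # forward pass with the new parameter
    return w

w = train(x=1.0, y=3.0)  # converges near w = 3
```

A real CNN repeats the same cycle, but over millions of parameters and many thousands of labeled images.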

Optimization Algorithms, Gradients, and Gradient Descent

To illustrate the training process, Figure 3 shows an example of a loss function with two parameters, x and y, where the z-axis corresponds to the loss. Looking closely at the 3D plot of this loss function, we can see that the function has a global minimum and a local minimum.

A large number of numerical optimization algorithms are available for determining weights and biases; among them, gradient descent is the simplest. The idea of gradient descent is to use the gradient operator to find, step by step during training, a path to the global minimum, starting from a point chosen at random on the loss function. The gradient operator is a mathematical operator that produces a gradient vector at each point of the loss function. The vector points in the direction in which the function value changes the most, and its magnitude corresponds to the rate of that change. In the function of Figure 3, the gradient vector in the lower right corner (red arrow) has a small magnitude because the surface there is flat. The situation is quite different near the peak: there, the vector (green arrow) points sharply downhill, and because of the sharp difference in elevation, its magnitude is large.
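The gradient operator can be demonstrated numerically on a two-parameter loss function like the one in Figure 3. The surface below is a simple stand-in chosen for illustration (not the actual function plotted in the figure), and the gradient vector is estimated with central differences.

```python
# An illustrative two-parameter loss surface L(x, y).
def loss_surface(x, y):
    return x ** 2 + 2 * y ** 2

def gradient(f, x, y, h=1e-6):
    """Numerically estimate the gradient vector (dL/dx, dL/dy) at (x, y)
    using central differences."""
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return gx, gy

# The vector's direction is that of steepest change; its magnitude grows
# where the surface is steep and shrinks where the surface is flat.
g = gradient(loss_surface, 1.0, 1.0)
```

Gradient descent simply steps against this vector, so flat regions (short vectors) produce small steps and steep regions (long vectors) produce large ones.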

Figure 3. Identifying different paths to the minimum using gradient descent.

So we can use gradient descent to search iteratively, from an arbitrarily chosen starting point, for the steepest path down into the valley. The optimization algorithm computes the gradient at the starting point and takes a small step in the direction of steepest descent; it then recomputes the gradient at the new point and continues, tracing a path from the starting point down to the valley. The problem with this approach is that the starting point is not defined in advance but chosen at random. In our 3D map, a careful reader would place the starting point somewhere on the left side of the function plot, ensuring that the path ends at the global minimum (the blue path). The other two paths (yellow and orange) are either very long or end at a local minimum. But the algorithm must optimize thousands of parameters, and the choice of starting point obviously cannot be correct every time. In practice, this approach is of limited use: the chosen starting point may lead to a long path (that is, a long training time), or the endpoint may not lie at the global minimum, reducing the accuracy of the network.

Therefore, to avoid these problems, a large number of alternative optimization algorithms have been developed over the past few years. Alternatives include stochastic gradient descent, Momentum, AdaGrad, RMSProp, and Adam. Since each algorithm has specific strengths and weaknesses, the exact algorithm used in practice is chosen by the network developer.
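As one example of these alternatives, gradient descent with momentum adds a velocity term that accumulates past gradients, which can carry the search through flat regions and past shallow local minima. The sketch below shows the update rule on a one-dimensional quadratic; the learning rate and momentum coefficient are illustrative choices, not prescribed values.

```python
# One step of gradient descent with momentum: the velocity v accumulates
# past gradients, and the parameter moves along the accumulated direction.
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """Return the updated parameter and velocity after one momentum step."""
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2 * (w - 3.0))
```

Optimizers such as RMSProp and Adam build on the same idea, additionally adapting the step size per parameter.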

Training Data

During training, we feed the network images labeled with the correct object class, such as car or ship. This example uses the existing CIFAR-10 dataset. In practice, of course, AI may be applied to areas beyond cats, dogs, and cars. This may require developing new applications, such as detecting the quality of screws in a manufacturing process; in that case, the network must be trained with data that distinguishes good screws from bad ones. Creating such datasets is extremely time-consuming and is often the most expensive step in developing an AI application. The compiled dataset is divided into a training dataset and a test dataset: the training dataset is used for training, while the test dataset is used at the end of the development process to check the functionality of the trained network.
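The split described above might be sketched as follows. The 80/20 ratio is a common convention rather than something specified in the article, and the placeholder sample names are illustrative.

```python
import random

# Illustrative split of a labeled dataset into training and test sets.
def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle the labeled samples and split off a test set; returns
    (training_set, test_set)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder dataset of (image name, class label) pairs.
data = [(f"image_{i}", i % 10) for i in range(100)]
train_set, test_set = split_dataset(data)
```

Keeping the test set out of training is what makes the overfitting check described earlier meaningful: the test loss measures performance on data the network has never seen.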

Conclusion

Part 1 of this series, “Introduction to Artificial Intelligence: What Is Machine Learning?” – Part 1, introduced neural networks and discussed their design and function in detail. In this article, all the weights and biases needed for the network have been determined through training, so we can now assume that the network functions properly. In the follow-up third article, we will run the neural network on hardware to test its ability to recognize cats. For the demonstration, we will use the MAX78000 artificial intelligence microcontroller with a hardware CNN accelerator, developed by Analog Devices.
