Efficient commercial classification of agricultural products using convolutional neural networks

ABSTRACT


INTRODUCTION
Barcoding of products is very common in stores, but for some products, such as agricultural ones, it is challenging and time-consuming, and it requires special instruments for printing and reading barcodes, along with their maintenance, which results in high costs [1]. Likewise, barcodes are spoiled or distorted by seemingly minor variations such as changes in the background color, the width of the barcode, and the formation of added outlines under environmental circumstances, for example, light, moisture, and temperature, which necessitates re-encoding the products [1]-[7]. The barcode's materials are also toxic or plastic, which can damage agricultural food products or the environment [5].
Nowadays, most agricultural products available in stores carry labels with the brand of their sales company; most of these labels are made of plastic, which causes environmental damage due to its non-biodegradability. Some procedures are currently under review to replace these labels, including edible labels, edible ink, or laser labeling on the fruit [6]. Classification of plant images using neural networks and machine learning techniques is among the least expensive and most accurate methods, since it requires no special equipment (just a camera) compared with QR codes [2], [8]-[11].
The main advantage of such a system is that sensitivity to environmental conditions such as light, heat, or humidity is almost eliminated, and there is no need to barcode the products, since the visual features of the product replace the barcode. In this research, we designed a system to categorize flowers of different colors and types, insensitive to environmental conditions such as light and background patterns, using a convolutional neural network with the VGG16 design trained on ImageNet. We aimed to reduce the classification error for agricultural products without generating any waste, while increasing product identification speed and product lifetime. The obtained results demonstrate the efficiency of the proposed approach.

CONVOLUTIONAL NEURAL NETWORKS
Neural networks have led to great advances in machine learning and are the main reason for the deep learning boom. Neural networks such as convolutional neural networks (CNNs) have successfully worked with image data. Since their use in the ImageNet competition in 2012, they have been at the forefront of research and industry in working with images.
CNNs are powerful neural networks that use filters to extract features from images in a way that preserves the spatial information of the pixels. More precisely, they are neural networks in which convolution is used instead of general matrix multiplication. With this approach, the forward function performs better and the number of network parameters is meaningfully reduced. Most common frameworks, such as TensorFlow-Keras and PyTorch, support CNNs, and custom CNNs can also be designed on top of the NumPy library. Convolutional neural networks have diverse industrial applications, particularly in computer vision (face recognition, digitization of paper documents via optical character recognition (OCR), and the internet of things).
An image received by a computer is perceived as an array of numbers, whose size depends on the image size in pixels. For instance, if a colored JPEG (Joint Photographic Experts Group) image of 480*480 px is inputted to a computer, the corresponding array has 3*480*480 cells (the 3 refers to the red, green, and blue (RGB) channels). Each element of a JPEG image array is a number between 0 and 255 (because the values are stored as eight-bit variables), which indicates the pixel intensity.
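This representation can be sketched with a short NumPy snippet (NumPy is just one convenient choice for illustration):

```python
import numpy as np

# A 480x480 RGB image is stored as a 3-channel array of 8-bit
# integers; each value (0-255) is the intensity of one pixel channel.
image = np.zeros((480, 480, 3), dtype=np.uint8)

print(image.shape)   # (480, 480, 3)
print(image.size)    # 3 * 480 * 480 = 691200 cells
```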
As shown in Figure 1, the neural network's input image is processed in smaller sub-arrays called receptive fields, and the filter whose dot product is taken with the receptive field is called the kernel; it is an array of the same size as the receptive field, and the numbers in the filter are called weights or parameters. At each glance, the filter sees a portion of the image and then moves over it to scan other parts as well (i.e., it convolves). The result of convolving the image with the filter (kernel) is a 2D array called an activation map or feature map. If we use p filters instead of one, we obtain p such arrays, and since each filter extracts one feature from the image, using a higher number of filters can increase accuracy. Convolution makes it possible to apply different filters such as derivative calculation, edge detection, and blurring. Each kernel is a small matrix with odd dimensions, and its midpoint is called the anchor. The anchor is used to indicate the placement of the filter on the image. Moreover, the filter depth is equal to the image depth, i.e., an image of size y*y*3 is matched by a kernel of size x*x*3.
The filter is of the same size as the receptive field, and the result of convolving the image with any filter is a feature map. Each filter must have the same depth as the image (i.e., here the filter and the image both have three RGB color channels). Note that the filter size (height and width) should be an odd number so that the anchor sits at its center.
As shown in Figure 2, a CNN is composed of three main layers, namely the convolutional layer, the pooling layer, and the fully-connected layer [12]. At the first level, a convolution layer extracts new features from the image using various kernels. Then, max-pooling is carried out to reduce noise, prevent overfitting, and decrease the network parameters and dimensions. The combination of "convolution + max-pooling", known as the convolution layer, can be repeated many times to build a deeper network. The max-pooling layer's output, after conversion to a 1D vector, is fed to the fully-connected layer. The standard neural-network algorithms in the fully-connected layers then output a single vector whose number of elements equals the number of classes defined by the user.

CNN LAYERS

Convolution layer
The general mathematical form of convolving two 2D matrices is presented in (1) [13]:

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)   (1)

According to this equation, the kernel must first be flipped horizontally and vertically before convolving it with the image; every time the filter is located on a specific portion of the image, elements with the same indices are multiplied, and the sum of all the products is placed at the anchor position. Figure 3 indicates the steps of convolving the filter with the image. Convolving the image with the filter reduces the matrix size, because the margin elements of the image matrix are never located at the center of the filter (the anchor is never placed on the margin elements of the image array). As will be detailed in section 3.4, shrinkage of the matrix and loss of image data are prevented by padding, i.e., adding rows and columns of zeros to the image edges (zero padding). Figure 3 shows a 3*3 filter convolved with a 7*7 image, where the receptive field is indicated by the red square and the anchor by the blue square; the sum of the element-wise products of the receptive field and the filter is placed at the anchor. The resulting feature map has a size of 5*5.
Similarly, in convolving a filter with an image, a constant called the bias is added to the resulting sum; in Figure 3, the bias equals zero. After the linear operations on the input are completed, a non-linear function is applied. In this step, the activation function is applied to the result obtained from the linear components. As shown in Figure 4, this function converts the input signals into output signals. Consider n inputs, X1 to Xn, with corresponding weights W1 to Wn. The weights are first multiplied by their inputs and the products are summed together with the bias; the result is u. An activation function f is then applied to u, and finally the neuron's output yk is obtained. There are numerous activation functions, of which sigmoid, ReLU, and softmax are the most common.
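As an illustration only (not the paper's implementation), the convolution step described above, including the kernel flip, bias, and a ReLU activation, can be sketched in NumPy:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """'Valid' 2D convolution: the kernel is flipped horizontally and
    vertically, then slid over the image; each output element is the
    sum of element-wise products plus a bias."""
    k = np.flipud(np.fliplr(kernel))          # flip the kernel
    h, w = image.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))  # feature map shrinks
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(receptive_field * k) + bias
    return out

def relu(x):
    """Non-linear activation applied after the linear convolution."""
    return np.maximum(0.0, x)

image = np.arange(49, dtype=float).reshape(7, 7)  # toy 7x7 "image"
kernel = np.ones((3, 3)) / 9.0                    # a blurring filter
feature_map = relu(conv2d(image, kernel, bias=0.0))
print(feature_map.shape)   # (5, 5), as in the 7*7 example of Figure 3
```

Convolving the 7*7 toy image with the 3*3 filter yields the 5*5 feature map described above.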

Max pooling layer
The pooling layer reduces the image size by combining the adjacent pixels of a specific portion of the image into a single value. Many image-processing applications utilize the pooling technique. To perform the pooling operation, the pixels to be combined and their representative value must be determined. The adjacent pixels are usually selected as a square matrix, and the number of selected pixels varies with the problem at hand. As shown in Figure 5, the representative value is usually the mean or the maximum of the pixels [14].
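A minimal max-pooling sketch (window size and stride of 2 are typical choices, not values mandated by the paper):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Replace each size x size window with its maximum value,
    shrinking the feature map (here by a factor of 2)."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()   # representative value
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 7., 8.],
               [3., 2., 1., 0.],
               [1., 2., 3., 4.]])
print(max_pool(fm))   # [[6. 8.]
                      #  [3. 4.]]
```

Replacing `window.max()` with `window.mean()` gives the mean-pooling variant mentioned above.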

Fully connected layer
The fully-connected layer transforms the 2D feature maps into a 1D feature vector to continue the processing of the features, see Figure 6. The fully-connected layer produces a vector of a given size as the network output, which can be further used to classify the images or to carry out subsequent processing. In this way, the convolutional network transforms the raw pixels of the original image, layer by layer, into class scores at the end of the network. The main issue with these layers is their high number of parameters, which makes them costly to evaluate during training.
Hence, one commonly used approach is either to remove these layers altogether or to reduce the number of connections in them using some alternatives. Explicitly, the convolution and fully-connected layers are the layers that perform transformations that are functions both of the activations in the input and of the parameters (the neurons' weights and biases). In contrast, the pooling and ReLU layers implement a fixed function. The parameters of the convolution and fully-connected layers are trained by gradient descent so that the class scores computed by the convolutional network match the labels of each image in the training set [15].
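The flatten-then-dense step can be sketched as follows; the 7*7*512 feature-map shape matches the VGG16 output used later in this paper, while the weight values here are random placeholders, not trained parameters:

```python
import numpy as np

def fully_connected(feature_maps, weights, bias):
    """Flatten the 2D feature maps into a 1D vector, then apply the
    standard dense-layer transformation W.x + b."""
    x = feature_maps.reshape(-1)       # 2D maps -> 1D feature vector
    return weights @ x + bias

def softmax(z):
    """Convert raw class scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
maps = rng.standard_normal((7, 7, 512))           # VGG16-sized features
W = rng.standard_normal((8, 7 * 7 * 512)) * 0.01  # 8 flower classes
b = np.zeros(8)
scores = softmax(fully_connected(maps, W, b))
print(scores.shape)   # (8,) - one score per class
```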

Hyperparameter
Three hyperparameters control the size of the output: depth, stride, and zero padding. Depth is an arbitrarily selected parameter that controls the number of neurons in the convolution layer connected to one region of the input. The stride parameter determines how the depth columns are arranged along the spatial dimensions (width and height). Sometimes it is better to pad the input boundary with zeros; in other words, the image frame is filled with zeros: rows and columns of 0 are added to the beginning and end of the image, so that the image is placed in a "zero" frame, as shown in Figure 7.
The spatial size of the output is given by (2):

O = (W − F + 2P)/S + 1   (2)

where W stands for the input width, F for the filter width, P for the zero padding, and S for the stride. If the result of this equation is not an integer, the chosen stride value is not valid: the neurons cannot be arranged next to each other with this value and cannot tile the input symmetrically. For a better understanding, consider the example of Figure 8, where one column of zero padding is added to an input of width 5. Using a filter of size 3 and stride 1, the left side of the image is convolved and the output size equals 5. Using a filter of size 3 and stride 2, the right side of the image is convolved with an output size of 3; the output is reduced and data are lost even though padding is used. In general, for stride 1, zero padding is set according to (3):

P = (F − 1)/2   (3)
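The output-size relation above can be checked numerically; the function names are illustrative only:

```python
def conv_output_width(W, F, P, S):
    """Spatial output size of a convolution layer: (W - F + 2P)/S + 1.
    A non-integer result means the chosen stride does not tile the input."""
    size = (W - F + 2 * P) / S + 1
    if size != int(size):
        raise ValueError(f"stride {S} does not tile an input of width {W}")
    return int(size)

def same_padding(F):
    """Zero padding that preserves the input size for stride 1."""
    return (F - 1) // 2

# The two cases of Figure 8: input width 5, one column of zero padding.
print(conv_output_width(W=5, F=3, P=1, S=1))   # 5 (size preserved)
print(conv_output_width(W=5, F=3, P=1, S=2))   # 3 (output reduced)
print(same_padding(3))                          # 1
```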

TRANSFER LEARNING
Transfer learning is a machine learning technique in which a pre-trained model is retargeted to a second task [16], [17]. Another important application of transfer learning is the high performance achieved in distinguishing similar images using a pre-trained model on small data sets. Many pre-trained models are available, such as bidirectional encoder representations from transformers (BERT), universal language model fine-tuning (ULMFiT), and the visual geometry group network (VGG). Among these, the proper one for this study was investigated: BERT and ULMFiT are used for language modeling, while VGG16 is used for image classification. VGG16 may come with different weights, for example, VGG16 trained on ImageNet or VGG16 trained on the modified national institute of standards and technology database (MNIST). ImageNet consists of 1000 classes over 1.2 million images; some of its classes are animals, cars, stores, dogs, food, and tools. MNIST, on the other hand, consists of handwritten digits in ten classes, from 0 to 9. For the ornamental flower images considered here, VGG16 trained on ImageNet was found to be the most suitable.
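In Keras, reusing the pre-trained VGG16 as a frozen feature extractor can be sketched as below. The study used the ImageNet weights (`weights="imagenet"`); `weights=None` is used here only so the sketch runs without downloading the weight file:

```python
from tensorflow.keras.applications import VGG16

# include_top=False drops the original 1000-class head so the
# convolutional base can be reused for a new task.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False        # keep the pre-trained filters fixed

print(base.output_shape)      # (None, 7, 7, 512)
```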

VGG16
In 2014, a standard network design was proposed by VGGNet. It utilized very small 3*3 convolution filters and doubled the number of feature maps after each 2*2 pooling. The network depth was increased from 16 to 19 layers, which increased the flexibility of deep learning for learning non-linear mappings. Images are required to be of shape (224, 224, 3) in VGG16. The only pre-processing is subtracting the mean pixel intensity, computed over all pixels, from each pixel. The image passes through a stack of convolution layers with very small 3*3 filters. In one of the configurations, 1*1 filters are utilized as a linear transformation of the input channels. The filter stride is fixed at one pixel, and padding is applied for the 3*3 convolutions. The pooling layers use 2*2 windows with a stride of 2. Three fully-connected layers follow the stack of convolution layers: the first two have 4096 channels each, while the third performs the 1000-class classification. The last layer is a softmax layer. The configuration of the fully-connected layers is the same in all the networks. All hidden layers are equipped with the rectified linear unit (ReLU). The architecture of this network is shown in Figure 9; the depth of the configurations increases from left to right as more layers are added [18], [19].
An important issue in machine learning systems is overfitting of the data, which leads to errors on the test data and thus a high error on new data. To tackle this, two concepts, dropout and data augmentation, are very beneficial.

Dropout
Using all same-weighted neurons is among the causes of overfitting during the learning of the system. In a dropout layer, only a fraction of the weights is utilized at each learning phase instead of learning all the weights simultaneously. That is, a probability (0 to 1) can be assigned to neurons in the learning system: a probability of zero means using all neurons in the learning phase, while a probability of one means excluding all neurons from the learning stage [20].
With this method, the eliminated neurons no longer participate in the forward or backward pass. Then, every time an input is presented, the neural network exhibits a different architecture, but all of these architectures share the same weights. This method reduces complex co-adaptations of neurons, because a neuron cannot depend on the presence of any particular other neurons; it must therefore learn more robust features that are useful in combination with many different random subsets of neurons. In this study, the dropout value was set to 0.5.
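A minimal sketch of (inverted) dropout; the rate of 0.5 matches the value used in this study, while the rescaling of the surviving neurons is the standard inverted-dropout convention, not something the paper specifies:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Drop each neuron with probability `rate` during training and
    rescale the survivors so the expected activation is unchanged.
    rate=0 keeps every neuron; rate=1 drops them all."""
    if rate >= 1.0:
        return np.zeros_like(activations)      # probability 1: drop all
    mask = rng.random(activations.shape) >= rate   # keep with prob 1-rate
    return activations * mask / (1.0 - rate)       # rescale survivors

rng = np.random.default_rng(42)
a = np.ones(10)
dropped = dropout(a, rate=0.5, rng=rng)   # the value used in this study
print(dropped)   # roughly half the elements are 0, the rest are 2.0
```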

Data augmentation
A small data set is the other reason for overfitting. One solution to this issue is data augmentation, in which the data set is extended through small changes such as rotating and flipping, without the need to collect new data. Besides, the model can be made invariant using data augmentation. An invariant convolutional neural network can robustly classify objects in different orientations; in particular, a CNN can be made invariant to translation, viewpoint, size, or light intensity (or a combination of them).
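A sketch of such label-preserving transformations in NumPy (the specific set of operations here is illustrative; the paper does not list its exact augmentations):

```python
import numpy as np

def augment(image, rng):
    """Return a randomly chosen label-preserving variant of one image:
    the original, a horizontal or vertical flip, or a 90-degree spin."""
    ops = [
        lambda x: x,            # unchanged
        lambda x: np.fliplr(x), # horizontal mirror
        lambda x: np.flipud(x), # vertical flip
        lambda x: np.rot90(x),  # 90-degree rotation
    ]
    return ops[rng.integers(len(ops))](image)

rng = np.random.default_rng(0)
img = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)  # toy square image
variant = augment(img, rng)
print(variant.shape)   # (8, 8, 3) - the label and shape are preserved
```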

DESIGNED MODEL
We designed the model in four stages: a) loading the required libraries, b) outlining the model architecture and the kernel weights (utilizing the VGGNet design), c) pre-processing the data and extracting features from the images (herein, we utilized the pre-processing of VGG16 trained on ImageNet), and d) training the model on the training data and tuning the classification parameters on the validation data.
With numerous images in the training set, the classification model has a higher chance of functioning properly. A total of 944 images from eight classes (rose, orchid, chrysanthemum, matricaria, bird of paradise, marguerite daisy, alstroemeria, and lilium) were split 85 to 15 into training and validation sets: 807 images for training and 137 for validation. We defined 35 epochs for training the model and set the batch size to 32. According to Figure 10, the training loss decreases at each stage of training and reaches 0.0056 by the last epoch, with a training accuracy of 99.98%. The system training was completed in 5 minutes, and recognition of one image took 0.26 seconds.
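A Keras sketch of such a classification head on top of the frozen VGG16 base; the hidden-layer size (256) is an assumption for illustration, and `weights=None` replaces the study's ImageNet weights only so the sketch runs without downloading them:

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

# The study used weights="imagenet"; None keeps this sketch self-contained.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False                        # frozen feature extractor

x = layers.Flatten()(base.output)             # 7*7*512 -> 1D vector
x = layers.Dense(256, activation="relu")(x)   # head size is an assumption
x = layers.Dropout(0.5)(x)                    # dropout value from the study
out = layers.Dense(8, activation="softmax")(x)  # eight flower classes

model = Model(base.input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# Training as described: 35 epochs with a batch size of 32, e.g.:
# model.fit(train_data, epochs=35, batch_size=32, validation_data=valid_data)
print(model.output_shape)   # (None, 8)
```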

PRELIMINARY TESTS
To determine the class associated with each image taken by a HERO 4 Black camera at 4000 by 3000 pixels, the image was first resized to 224*224*3 using ImageDataGenerator, and its 7*7*512 features were extracted by the VGG network. The extracted features were then fed to the designed model, and for each image a single output vector was produced, each element of which shows the degree of similarity of the image to one flower class. The maximum element determines the flower class, and the flower image is presented on the monitor along with the similarity percentage and the recognized class.
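The final decision step can be sketched as below; the score vector here is a made-up example standing in for the model's actual output:

```python
import numpy as np

CLASSES = ["rose", "orchid", "chrysanthemum", "matricaria",
           "bird of paradise", "marguerite daisy", "alstroemeria", "lilium"]

def report(scores):
    """Pick the class with the maximum output element and return it
    together with the similarity percentage shown on the monitor."""
    k = int(np.argmax(scores))
    return CLASSES[k], 100.0 * scores[k]

# Illustrative output vector: the maximum element, 0.80, selects
# "marguerite daisy" with a similarity of 80%.
scores = np.array([0.01, 0.02, 0.03, 0.05, 0.04, 0.80, 0.03, 0.02])
name, pct = report(scores)
print(name)
```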
Because the VGG network is used, all images in the training and test phases are square. In this work, we considered eight flowers, i.e., rose, orchid, chrysanthemum, matricaria, bird of paradise, marguerite daisy, alstroemeria, and lilium. Different weather conditions, such as rainy and sunny weather, and different places, such as a home, a garden, and a store, were considered, as were different hours of the day and night in both the test and training data. For real-time testing of the designed network, we used 68 images split into three categories: single flowers (14 images), bunches (27 images), and pots (27 images).

Test 1-single flowers
As shown in Figure 11, 14 images of flower branches were classified in this test using the designed model. To test the network under different conditions, the images were taken at various times of the day and night, with stems and leaves. Based on the obtained results, 12 images were detected correctly, a model accuracy of 85%. Note that this can be improved by increasing the number of training photos taken from different angles of the flower branches. It should also be noted that the flower background is involved in the network training. Therefore, the best way to train the network was to use the same amount of data for the different classes and the same angles for each flower, so that the designed model has no bias toward a particular flower class. As can be observed, the flower conditions are very different. The matricaria flower was not recognized correctly because the training images of this flower mostly show large bunches.

Test 2-bouquet
In the second test, shown in Figure 12, 27 flowers were inputted as bunches. The challenge with flower bunches is the density and intertwining of the flowers and the change in the shape of their edges, which make it difficult for a trained system to distinguish one flower from the rest and introduce more errors than in the single-flower mode. Twenty-one images were recognized correctly, i.e., the trained network reached an accuracy of 77.78%, with the errors due to the compression and intertwining of the flowers and undetectable flower edges. Image resolution and blur also had a strong effect.
Orchids were incorrectly classified due to their intertwining, and chrysanthemums due to the paper around the bouquet. Liliums were incorrectly classified because of the presence of hands in the photo.

Test 3-flowerpots
The other challenge in the classification of flowers is the presence of other objects in the image. In this section, in addition to the particular shapes of the flowers and their intertwining, the shape and color of a vase or a pot, as well as the fact that a specific flower does not occupy the entire image frame, can affect the system response. In fact, owing to the number of flowers and the allocation of only a small, low-resolution segment of the image to the desired flower, the potential error increases for pots with a high number of small flowers or large pots with similarly colored flowers. In this test, 20 flowers out of 27 were correctly recognized, and the system accuracy was reduced to 74.07%, see Figure 13. Chrysanthemum pots were incorrectly recognized due to the large number of flowers, and alstroemeria vases due to the high density of the photo.

RESULTS AND DISCUSSION
Based on the results obtained from the preliminary tests above, all tested images were categorized under different environmental conditions such as the time of day and the flower state (with or without branches, in bunches, and/or in pots). The following conditions affect the system error within the train/test data: a. imaging from different angles and various spaces; b. balance in the number and conditions of imaging in each class, such as the packing of flowers in the image, imaging angles, and imaging environment; c. quality of the camera and image opaqueness; d. degree of similarity of the leaves; e. effect of the branches.
Certain conditions also affect the tests, including: a. the degree of compression of flowers and leaves; b. the possible presence of items other than flowers, such as covering paper, hands, or pots; c. changes of image size (drag rate). Considering the results obtained in the previous section, we changed the data set and the designed network to a data set of flowers packed closely next to each other, photographed from different angles, in which leaves and flower petals cover most of the image space with completely recognizable edges. Likewise, each image was processed ten times using the following algorithm: the outputs form a 10*8 matrix with elements x_ij, where i indexes the image view and j the flower category. As shown in Figure 14, nine crops are created with half the dimensions of the image. For instance, if the maximum matrix element is 0.99 with i=1 and j=4, it is located in the upper left corner of the image with a similarity of 99% to the marguerite daisy.
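The ten-view scheme above can be sketched as follows; the exact crop anchors (a 3x3 grid of corners, edges, and centre) are an assumption about how the nine half-size crops are placed, and `fake_predict` is a toy stand-in for the trained network:

```python
import numpy as np

def ten_views(image):
    """The whole image plus nine crops of half its height and width,
    anchored on a 3x3 grid (an assumed placement of the nine crops)."""
    h, w = image.shape[:2]
    ch, cw = h // 2, w // 2
    views = [image]
    for top in (0, (h - ch) // 2, h - ch):
        for left in (0, (w - cw) // 2, w - cw):
            views.append(image[top:top + ch, left:left + cw])
    return views   # 10 views in total

def classify_ten(image, predict):
    """Build the 10x8 score matrix x_ij (view i, class j) and pick
    the class of the overall maximum element."""
    scores = np.stack([predict(v) for v in ten_views(image)])
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(j), float(scores[i, j])

def fake_predict(view):
    """Toy predictor: one class scores highest on the zoomed crops."""
    out = np.full(8, 0.01)
    out[3] = 0.99 if view.shape[0] <= 100 else 0.5
    return out

image = np.zeros((200, 200, 3))
cls, score = classify_ten(image, fake_predict)
print(cls, score)   # 3 0.99 - a zoomed crop wins for class index 3
```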
The proposed algorithm was then added to the network. With it, only two images out of the 68 were detected incorrectly, and the system accuracy increased to 97%, as shown in Figure 15. The proposed algorithm was performed over zoomed test images, preventing overcrowding and image error: the paper around the chrysanthemum was eliminated, the marguerite daisy and matricaria were enlarged, and more details were fed into the network. However, errors remained owing to the very close similarity of the papillon to the lilium.

CONCLUSION
Currently, barcoding of products is used to identify the type and price of products in stores. However, barcoding of agricultural products is harmful to both the product and the environment and reduces the product's lifetime, owing to the need for one-by-one barcoding, the presence of toxic materials in barcodes, and costly, time-consuming procedures. The application of machine learning and image processing allows products to be identified with high accuracy and speed using a simple camera, without producing any waste and free of barcoding. In this research, we investigated the classification of eight kinds of ornamental flowers under different environmental conditions by designing a convolutional network based on VGG16, in order to increase the speed and accuracy of their detection in stores, increase the lifetime of agricultural products, and, as a result, prevent damage to the products and the environment. The obtained results demonstrated that the designed network categorizes the input images at high speed with 97% accuracy, and is able to categorize multiple inputted images simultaneously.