Citrus Fruit Quality Classification using Support Vector Machines

Abstract: The large-scale fruit selection process is still manual or semi-automatic, mainly in small industries, which can lead to errors when sorting good fruits. This paper therefore proposes an application that uses computer vision and machine learning to improve this task. The genus studied was citrus, more specifically the orange, one of the most produced fruits in Brazil. However, the methodology can be applied to any fruit whose quality can be assessed visually. The initial step was the construction of the learning space, consisting of image acquisition, pre-processing, and feature extraction. The learning phase then followed, consisting of training the support vector machine model, after which statistical methods were used to validate the model. As a final result, the model achieved an accuracy of 97.3% in fruit classification.


I. INTRODUCTION
Brazil is one of the largest fruit producers in the world. According to the Food and Agriculture Organization of the United Nations (FAO), it produced more than 40 million tons of fresh fruit in 2017, of which 17 million tons were oranges alone. In Brazil, the sorting process is still manual, which leads to errors in quality inspection due to the intensive, repetitive, and tedious work routines, resulting in low-quality fruits that affect commercial acceptance [1].
With the increasing demand for Artificial Intelligence, new areas have emerged, such as computer vision, machine learning, and, most recently, Deep Learning (DL). Computer Vision (CV) aims to mimic human perception, using image processing and analysis to achieve this goal. Both Machine Learning (ML) and DL tend to minimize the intra-class variance along the feature space for the given classes [2]; the main difference lies in the feature extraction phase. ML models often use feature extraction algorithms to find edges, corners, and descriptors such as SIFT [3] and SURF [4], which form the feature vector used as input for training the model. DL models use a hierarchical set of layers that learn representations from the data: some layers abstract the concept of edges, others contours, color information, and so on. In this approach, the model learns from the data, extracting features through the convolution and pooling operations across the connected layers [4], [5]. The main idea of feature extraction is to reduce dimensionality by using characteristic features obtained from the signals instead of the signals themselves [6].
The selection process raises the problem of assigning a quality level to the fruits, which, even under legal standards, involves a certain degree of subjectivity. Another aggravating factor is the possibility of misclassification by the human sorter, since human perception is easily deceived when knowledge is inappropriate or misapplied [7].
In this sense, this paper proposes a machine learning model based on a Support Vector Machine (SVM), used in conjunction with a computer vision system, to support a faster and more reliable sorting process. A computer vision system uses an optical device, such as a sensor or a camera, and a processing system. Image capture is followed by an analysis process in which, in general, segmentation algorithms are used to find regions of interest and feature extractors are applied to them. This builds the learning space and makes it possible to classify the image according to previously adopted criteria. It is also possible to establish well-defined sample classes, according to the judgment of specialists and the characteristics to be identified. Fig. 1 shows the proposed architecture used to build the computer vision system; in the image acquisition stage, the system captures the image. This paper does not discuss how the image acquisition system works or how it was implemented; the focus is on the processing and automatic quality inspection stages.
The database was collected from COFILAB (footnote 1) and consists of two well-defined classes: citrus with stem, a collection of oranges in a good state of maturity and quality, and oranges infected with scale. The pre-processing stage is composed of three steps: background reduction, image filtering, and segmentation. After segmentation, the learning space is built with the knowledge gathered from these steps.
The automatic quality inspection stage applies the proposed classification model, an SVM, which is a method based on machine learning theory. The feature vector built for training was inspired by [8] and uses 64 color features, 7 texture features, and 8 shape features.

II. IMAGE ACQUISITION, PRE-PROCESSING, AND SEGMENTATION
Image acquisition relied on two datasets provided by COFILAB: Citrus with stem and Oranges infected with scale. Both datasets were created under the same conditions [9], using a digital camera to acquire high-quality images. Initially, the images contained unnecessary information, such as the background; since the research focused on the quality of the fruits, a background reduction was performed. As the database is standardized by COFILAB, the reduction was a simple task: Sobel filters are first used to find contours, and a bounding box is then used to subtract the background. Fig. 2 shows the steps of the background subtraction. The Sobel filters are applied to find the contours, and a routine then selects the most significant contour; a threshold of 25% of the total area is used to dismiss small contours, so that only the fruit in the image is matched. Once the most significant contour has been found, a mask given by its minimum and maximum along each axis (width and height of the image) is used to build the bounding box, which is applied to the original image to extract the fruit and reduce the background.
Footnote 1: COFILAB: Computers and Optics in Food Inspection, http://www.cofilab.com/
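The background reduction described above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not the authors' routine: it thresholds the Sobel gradient magnitude and crops to the bounding box of all strong edge pixels, instead of selecting the most significant contour with the 25% area rule. The function names and the `edge_thresh` value are hypothetical.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude from the 3x3 Sobel kernels (edge-padded borders)."""
    g = np.pad(gray.astype(float), 1, mode="edge")
    # Horizontal derivative: right column minus left column, weights 1-2-1.
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[1:-1, :-2] - g[2:, :-2])
    # Vertical derivative: bottom row minus top row, weights 1-2-1.
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[:-2, 1:-1] - g[:-2, 2:])
    return np.hypot(gx, gy)

def crop_to_edges(image, gray, edge_thresh=50.0):
    """Crop `image` to the bounding box of pixels with strong Sobel response.

    Assumption: the fruit is the only object producing strong edges, so the
    min/max of the edge coordinates along each axis bound the fruit.
    """
    mask = sobel_magnitude(gray) > edge_thresh
    rows, cols = np.nonzero(mask)
    if rows.size == 0:          # no edges found: leave the image untouched
        return image
    return image[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

# Usage: a bright square on a dark background is cropped to its surroundings.
gray = np.zeros((20, 20))
gray[5:15, 5:15] = 200.0
cropped = crop_to_edges(gray, gray)
```

On real photographs, a contour-selection step such as the area threshold mentioned in the text would be needed to discard edges that do not belong to the fruit.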

Fig. 2: The steps of the background subtraction: (a) the Sobel filter is applied, (b) a mask is applied inside the most significant contour region, (c) the bounding box is formed by the minimum and maximum along each axis, (d) resulting image
After background subtraction, image filtering is used to reduce the noise caused by sensor faults during digital image acquisition. Vector median filters or Gaussian filters are often used for this task; however, these classical methods tend to blur image edges and details, possibly losing crucial information about the image. To mitigate this blurring, Peer Group Filtering (PGF) [10] is applied, which removes noise without losing edge information and makes the segmentation more robust.
Segmentation was performed with JSEG, an unsupervised method for segmenting color-texture regions in images and video [11]. JSEG aims to segment images and video into homogeneous color-texture regions, and it relies on three assumptions:
- Each image contains homogeneous color-texture regions;
- Each region can be represented by the quantized colors in it;
- Colors between two neighboring regions are distinguishable.
JSEG segmentation consists of two steps: color quantization, which reduces the number of colors with a clustering algorithm and replaces each pixel value with the label of its cluster color, generating a class-map; and spatial segmentation, which is applied to the texture composition of the class-map.
Initially, color quantization is carried out during the image filtering process using PGF, in which the resulting pixels receive perceptual weights; textured areas are weighted less than smooth areas. The CIELUV color space is used because it is approximately perceptually uniform, and the human eye perceives color changes more readily in smooth regions [12]. A Generalized Lloyd Algorithm (GLA) then performs the vector quantization of the pixel colors. The initial cluster positions for GLA are estimated by the popular splitting initialization algorithm. The weighted distortion $D$ is given by:
$$D = \sum_{i} D_i = \sum_{i} \sum_{x(n) \in C_i} v(n)\, \lVert x(n) - c_i \rVert^2$$
and the update rule is derived to be:
$$c_i = \frac{\sum_{x(n) \in C_i} v(n)\, x(n)}{\sum_{x(n) \in C_i} v(n)}$$
where $c_i$ is the centroid of cluster $C_i$, $x(n)$ and $v(n)$ are the color vector and the perceptual weight for pixel $n$, and $D_i$ is the total distortion for cluster $C_i$.
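As an illustration of the weighted distortion and centroid update used by GLA, the following is a minimal NumPy sketch. It assumes the initial centroids are supplied by the caller (the splitting initialization is not implemented), and the function name `weighted_lloyd` and its parameters are hypothetical.

```python
import numpy as np

def weighted_lloyd(colors, weights, centroids, n_iter=20):
    """Weighted Lloyd iterations.

    colors:    (N, 3) array of color vectors x(n)
    weights:   (N,)   array of perceptual weights v(n)
    centroids: (K, 3) array of initial centroids c_i (assumed given)
    Returns the final centroids, the cluster label of each pixel, and the
    total weighted distortion D.
    """
    colors = np.asarray(colors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()

    def squared_distances():
        diff = colors[:, None, :] - centroids[None, :, :]
        return (diff ** 2).sum(axis=2)          # shape (N, K)

    for _ in range(n_iter):
        labels = squared_distances().argmin(axis=1)
        for i in range(len(centroids)):
            members = labels == i
            total_weight = weights[members].sum()
            if total_weight > 0:                # skip empty clusters
                # c_i = sum v(n) x(n) / sum v(n) over the cluster members
                centroids[i] = (weights[members, None] * colors[members]).sum(axis=0) / total_weight

    d2 = squared_distances()
    labels = d2.argmin(axis=1)
    distortion = float((weights * d2[np.arange(len(colors)), labels]).sum())
    return centroids, labels, distortion

# Usage: four colors on a line, the second one weighted three times as much.
colors = np.array([[0.0, 0, 0], [2.0, 0, 0], [10.0, 0, 0], [12.0, 0, 0]])
weights = np.array([1.0, 3.0, 1.0, 1.0])
init = np.array([[0.0, 0, 0], [10.0, 0, 0]])
centroids, labels, distortion = weighted_lloyd(colors, weights, init)
```

Because the second pixel carries a larger weight, the first centroid is pulled toward it rather than sitting at the unweighted midpoint.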
At the completion of GLA, pixels with similar color values may end up in different clusters, so an agglomerative clustering algorithm is used to merge the clusters whose centroids are closest to each other, with the minimum distance between clusters controlled by a threshold.
After color quantization, all the information needed for segmentation is stored in a class-map. In the class-map (referred to here as the J-image), the value of each pixel is replaced by the label of its class, so that each pixel position z = (x, y) is a two-dimensional data point belonging to one class. The JSEG segmentation criterion is built on these spatial data. Let $Z$ be the set of all $N$ data points in the class-map. Let $z = (x, y)$, $z \in Z$, and let $m$ be the mean,
$$m = \frac{1}{N} \sum_{z \in Z} z$$
Suppose $Z$ is classified into $C$ classes, $Z_i$, $i = 1, \dots, C$. Let $m_i$ be the mean of the $N_i$ data points of class $Z_i$,
$$m_i = \frac{1}{N_i} \sum_{z \in Z_i} z$$
Let
$$S_T = \sum_{z \in Z} \lVert z - m \rVert^2$$
and
$$S_W = \sum_{i=1}^{C} S_i = \sum_{i=1}^{C} \sum_{z \in Z_i} \lVert z - m_i \rVert^2$$
$S_W$ is the total variance of points belonging to the same class. Define the J-value as:
$$J = \frac{S_B}{S_W} = \frac{S_T - S_W}{S_W}$$
When an image consists of several homogeneous color regions, the classes are more separated from each other and the value of $J$ is large. Conversely, if the classes are uniformly distributed over the image, the value of $J$ tends to be small. Circular windows of various scales are used to determine possible regions in the image. The value of $J$ is calculated for each region according to the window size, and the mean of these values is given by:
$$\bar{J} = \frac{1}{N} \sum_{k} M_k J_k$$
where $J_k$ is $J$ calculated over region $k$, $M_k$ is the number of points in region $k$, and $N$ is the total number of points in the class-map. The segmentation criterion is thus to find the partition of the image into regions that minimizes $\bar{J}$ [11]. Small windows are useful for locating intensity and color edges, while large windows detect texture boundaries. Region growing from seed points is therefore applied, followed by region merging to produce the segmented image. This process is controlled by a user-defined parameter named the scale factor. Empirically, scale factors below 10 detected fewer areas, while values above 10 brought no further improvement in detection. A scale factor of 10 was therefore used, as it detected more areas, whether healthy or unhealthy.
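To make the J-value concrete, the sketch below computes $J = (S_T - S_W)/S_W$ for a whole class-map with NumPy. It is only an illustration: JSEG evaluates J inside local windows, whereas this hypothetical `j_value` function treats the entire array as one region and assumes $S_W > 0$.

```python
import numpy as np

def j_value(class_map):
    """J = (S_T - S_W) / S_W computed over a whole 2-D array of class labels."""
    ys, xs = np.indices(class_map.shape)
    # Each pixel position z = (x, y) is one data point.
    z = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    labels = class_map.ravel()

    s_t = ((z - z.mean(axis=0)) ** 2).sum()       # total scatter S_T
    s_w = 0.0                                     # within-class scatter S_W
    for c in np.unique(labels):
        zc = z[labels == c]
        s_w += ((zc - zc.mean(axis=0)) ** 2).sum()
    return (s_t - s_w) / s_w

# Two classes occupying separate halves give a large J ...
halves = np.zeros((4, 4), dtype=int)
halves[:, 2:] = 1
# ... while the same two classes interleaved as a checkerboard give J = 0.
checker = np.indices((4, 4)).sum(axis=0) % 2

j_separated = j_value(halves)
j_mixed = j_value(checker)
```

The two toy class-maps reproduce the behavior described in the text: well-separated classes yield a high J, and uniformly mixed classes yield a J close to zero.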
A threshold is used to establish how the seeds are created over the image, given by:
$$T_J = \mu_J + a\, \sigma_J$$
where $\mu_J$ is the mean of the local J values, which represent the homogeneity over the image, $\sigma_J$ is their standard deviation, and $a$ is a constant chosen from a set of preset values according to the number of seeds it produces. Pixels with local J values less than $T_J$ are candidate seed points. The connectivity used in the JSEG algorithm is 4-connectivity, (x+1, y), (x-1, y), (x, y+1), (x, y-1), where (x, y) is the position of the pixel.
As post-processing, a color reduction is applied to the segmented areas to reduce the color information. The objective of this phase is to increase the color disparity between areas, enhancing possible rotten areas while preserving healthy ones; Fig. 4 illustrates the color reduction.
An SVM constructs a hyperplane, or a set of hyperplanes, in a space of high or infinite dimension, which can be used for classification or regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training points of any class. Fig. 6 exemplifies the problem of finding the largest distance between the separable classes; this distance is called the functional margin. In general, the larger the margin, the smaller the generalization error.
The margin can be determined by calculating the distance between any two points, one on each of the two translated hyperplanes, both lying along the normal vector $\mathbf{w}$. Denoting by $\mathbf{x}_1$ and $\mathbf{x}_2$ the points along $\mathbf{w}$ that belong to the upper and lower hyperplanes, respectively, the margin is simply the length of the line segment connecting $\mathbf{x}_1$ and $\mathbf{x}_2$, that is, $\lVert \mathbf{x}_1 - \mathbf{x}_2 \rVert_2$.
Given that the two vectors $\mathbf{x}_1 - \mathbf{x}_2$ and $\mathbf{w}$ are parallel to each other, we can solve for the margin directly in terms of $\mathbf{w}$. With the translated hyperplanes written in the usual canonical form $\mathbf{w}^{\top}\mathbf{x} + b = \pm 1$, subtracting the two equations gives $\mathbf{w}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) = 2$, so that:
$$\lVert \mathbf{x}_1 - \mathbf{x}_2 \rVert_2 = \frac{2}{\lVert \mathbf{w} \rVert_2}$$
The margin problem is extensively discussed in statistical learning theory, including its treatment through kernel machines. The kernel functions chosen were the most used in the literature:
- Linear Function;
- Polynomial Function;
- Radial Basis Function;
- Sigmoid Function.
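The four kernel functions listed above, together with the margin width $2/\lVert \mathbf{w} \rVert_2$, can be written down in a few lines of NumPy. This is a sketch using the common textbook parameterizations; the parameter names (`gamma`, `coef0`, `degree`) and their default values are assumptions, since the paper does not report the hyperparameters used.

```python
import numpy as np

def linear_kernel(x, y):
    """k(x, y) = <x, y>"""
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """k(x, y) = (gamma * <x, y> + coef0) ** degree"""
    return float((gamma * np.dot(x, y) + coef0) ** degree)

def rbf_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    """k(x, y) = tanh(gamma * <x, y> + coef0)"""
    return float(np.tanh(gamma * np.dot(x, y) + coef0))

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / float(np.linalg.norm(w))

# Usage with two small vectors.
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
k_lin = linear_kernel(a, b)
k_poly = polynomial_kernel(a, b, degree=2)
k_rbf_same = rbf_kernel(a, a)
width = margin_width(np.array([3.0, 4.0]))
```

A shorter weight vector gives a wider margin, which is why maximizing the margin amounts to minimizing $\lVert \mathbf{w} \rVert_2$.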

IV. PROPOSED METHOD
The proposed method uses the image processing routine described in Section II to process the input, and a feature space composed of 64 color features, 7 texture features, and 8 shape features to create the feature vector. The initial dataset was unbalanced, with 125 images of oranges infected with scale and 210 images of citrus with stem, so a data augmentation procedure was used to balance the two classes.
The final dataset contained 300 images per class. Operations such as rotation, random noise, random crop, perspective skewing, and elastic distortion were applied during the augmentation. Fig. 6 exemplifies the operations: the original images are on the left and the augmented results are on the right. The images may look similar, but the features generated from them are completely different.
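The balancing step can be sketched as follows. This is a reduced illustration, assuming only three of the listed operations (rotation by multiples of 90 degrees, random crop, and additive Gaussian noise); perspective skewing and elastic distortion are omitted, and the function names, crop ratio, and noise level are hypothetical.

```python
import numpy as np

def augment(image, rng):
    """Return one randomly transformed copy of an (H, W, 3) uint8 image."""
    # Rotation by a random multiple of 90 degrees.
    out = np.rot90(image, k=int(rng.integers(0, 4)))
    # Random crop keeping 90% of each side.
    h, w = out.shape[:2]
    ch, cw = max(1, int(0.9 * h)), max(1, int(0.9 * w))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    out = out[top:top + ch, left:left + cw].astype(float)
    # Additive Gaussian noise, clipped back to the valid pixel range.
    out += rng.normal(0.0, 5.0, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)

def balance(images, target, rng):
    """Append augmented copies of the original images until `target` is reached."""
    images = list(images)
    n_original = len(images)
    while len(images) < target:
        source = images[int(rng.integers(0, n_original))]
        images.append(augment(source, rng))
    return images

# Usage: grow a class of 5 images to 12 (the paper grows each class to 300).
rng = np.random.default_rng(0)
originals = [np.full((20, 30, 3), 128, dtype=np.uint8) for _ in range(5)]
balanced = balance(originals, 12, rng)
```

Augmented copies are always derived from original images, never from other augmented copies, so distortions do not accumulate.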
Since the images do not all have the same size, a dynamic filter was created to output the 64 color features regardless of image size; the color features were built in both the RGB and HSV color spaces. The texture features are the mean, contrast, homogeneity, energy, variance, correlation, and entropy, based on the sum and difference histogram measures proposed by Unser [14]. The shape features, or morphology-based measures, are the area, perimeter, Euler number of the object, convex, solidity, minor length, major length, and eccentricity. In total, the feature vector has 79 dimensions. Normalization is applied to the feature vector so that all features contribute on a comparable footing during learning; its main objective is to bring the values of every dimension to a common, uniform scale.
With the feature vectors built, the training process uses 70% of the dataset for training and 30% for testing, with both classes uniformly distributed in each subset. The metrics chosen to evaluate the model were the F1-score, accuracy, and confusion matrix, which are among the most used metrics for evaluating pattern recognition models [1], [2], [8], [15].
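One possible way to obtain 64 color features that do not depend on image size, followed by normalization to a common scale, is sketched below. The paper does not specify its dynamic filter, so the 4 x 4 x 4 joint histogram used here is an assumption, as are the function names; min-max scaling is likewise only one common choice of normalization.

```python
import numpy as np

def color_histogram_64(image):
    """64-bin joint color histogram (4 bins per channel) of an (H, W, 3) uint8 image.

    Dividing by the pixel count makes the result independent of image size.
    """
    bins = (image.astype(np.intp) * 4) // 256            # each channel mapped to 0..3
    codes = bins[..., 0] * 16 + bins[..., 1] * 4 + bins[..., 2]
    hist = np.bincount(codes.ravel(), minlength=64).astype(float)
    return hist / codes.size

def min_max_scale(features):
    """Scale every column of an (n_samples, n_features) matrix to [0, 1]."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0                                # constant columns map to 0
    return (features - lo) / span

# Usage: images of different sizes yield feature vectors of the same length.
red = np.zeros((5, 7, 3), dtype=np.uint8)
red[..., 0] = 255
h_small = color_histogram_64(red)
h_large = color_histogram_64(np.zeros((40, 60, 3), dtype=np.uint8))
scaled = min_max_scale([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]])
```

In a train/test setting, the minimum and range would normally be estimated on the training portion only and then reused for the test portion.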
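For reference, the three evaluation metrics can be computed from the predicted and true labels as in the following plain-Python sketch for the two-class case. The function names and the choice of label `1` as the positive class are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Usage on six hypothetical predictions.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

The four counts are exactly the cells of the 2 x 2 confusion matrix, so the matrix and both scores come from the same pass over the predictions.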

Fig. 7: Flow chart of the proposed method
Fig. 7 illustrates the proposed method as a flow chart. To evaluate the model, a cross-validation methodology with 10 folds was applied in the training process. Cross-validation results in a less biased evaluation of the model because it ensures that every observation in the dataset has the chance of appearing in both the training and test sets [15]. The data are split into 10 sets of 60 images; in each iteration, 70% is used for training and 30% for testing, and the classes are balanced across the folds. At the completion of each fold iteration, the F1-score and accuracy are computed to evaluate that fold, and a confusion matrix is created at the end of the iterations.
This paper analyzes two color spaces in the creation of the color features: the additive color space, RGB, and the perceptually oriented color space, HSV. In each color space, the SVM is trained and the chosen metrics are generated. In Fig. 8, the best classifier in the RGB color space uses a linear kernel and reaches an accuracy of 97.3%. In Fig. 9, the classifier generated in the HSV color space uses a radial basis function and achieves an accuracy of 94%.
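The fold construction described above admits more than one reading. The sketch below follows the text literally: the 600 images are partitioned into 10 class-balanced sets of 60, and each set is split 70/30 into training and test indices. This differs from conventional k-fold cross-validation, in which each fold serves once as the test set; the function name and the dictionary layout are hypothetical.

```python
import random

def balanced_folds(labels, n_folds=10, train_frac=0.7, seed=0):
    """Partition sample indices into class-balanced folds with a train/test split.

    Returns a list of dicts, one per fold, with "train" and "test" index lists.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [{"train": [], "test": []} for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for f in range(n_folds):
            chunk = indices[f::n_folds]            # this class's share of fold f
            n_train = round(train_frac * len(chunk))
            folds[f]["train"].extend(chunk[:n_train])
            folds[f]["test"].extend(chunk[n_train:])
    return folds

# Usage with the dataset size reported in the paper: 300 images per class.
labels = [0] * 300 + [1] * 300
folds = balanced_folds(labels)
```

With 300 images per class, every fold holds 42 training and 18 test images, with both classes equally represented in each part.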

VI. CONCLUSION
This paper proposed an image processing method composed of image filtering, segmentation, and feature extraction, and presented an analysis of SVM variants for citrus fruit quality classification. A very good result was observed with the color features represented in the RGB color space and a linear kernel, which obtained an accuracy of 97.3%, shown in Table 1. This is 3.3% higher than the radial basis function classifier in the HSV color space, seen in Table 2.

ACKNOWLEDGMENTS
I would like to thank the Federal University of Tocantins for funding the execution of this research.