Comparison between Quantitative and Qualitative Theme-Feature Forest Biomass Estimation Models built over SAR Data

International organizations still need methodologies that accurately measure forest above-ground biomass (AGB). Among remote sensing technologies, Synthetic Aperture Radar (SAR) stands out in the modeling of forest biomass because of its ability to characterize the geometry of the imaged region. Semantic representation through thematic maps is one of the main means of geospatial situational understanding. However, there is a knowledge gap regarding models built by analyzing quantitative and qualitative theme-features in a complementary way. This article aims to develop and compare forest biomass estimation models, through an innovative methodology, over quantitative and qualitative theme-features. To this end, extracted SAR data and specific machine learning (ML) and feature selection techniques are applied in each case. The models are based on forest inventories with 128 plots located in two different areas of the Brazilian Amazon Forest and were built over 231 extracted independent variables. The methodology uses techniques to categorize numeric data and then comparatively evaluates the numeric quantitative and categorized qualitative results. The models were built with ML algorithms such as Multilayer Perceptron, Support Vector Machine and Random Forest. The results showed that the different study areas had very different vegetation characteristics, significantly impacting the feature selection and ML algorithms. The different biomes of the Amazon Forest and their respective characteristics demanded specific models and techniques, not fitting into a single pattern.

Among remote sensing technologies, Synthetic Aperture Radar (SAR) stands out in the modeling of forest biomass because of its ability to characterize the geometry of the imaged region [1,2,6,8-12]. It also allows the monitoring and verification of the type, direction, intensity and extent of degradation in different areas, whether caused by human influence or by natural forest fires [6,13-16]. Given the good results obtained by researchers, new projects that aim to use SAR data to estimate biomass are under execution or planning [6]. The Japan Aerospace Exploration Agency (JAXA) ALOS PALSAR 2 project has been underway since 2014 and is a significant source of data for recent research [14,17-20].
In Brazil, among the projects that aim to generate SAR images usable for biomass estimation, the Amazon Radiography Project, developed by the Geographic Service of the Army (DSG), stands out. By 2022, a total area of 1,800,000 km² of the Amazon region will be covered with airborne sensors in the X and P bands [21]. In addition to the 1:50,000 scale mapping, the project also has the potential to generate data to support infrastructure projects and the sustainable exploitation of natural resources in the region [22][23][24].
Because of the large amount of data that the available SAR sensors can produce, it is necessary to apply techniques that organize and analyze quantitative and qualitative features in an intelligent and automated way [20,25-27]. Machine learning (ML) techniques are able to model knowledge and make associations between different types of quantitative or qualitative information [28][29]. According to [30], the main advantages of ML are: accuracy, since the optimal algorithm is selected from the characteristics of the data and of the problem to be solved; automation in learning, which adjusts the models according to the success or failure of the results; processing speed; customization, being suitable for any type of problem; and scalability, since the processes adapt to data growth.
One possible application of ML is the development of models that involve thematic issues and result in qualitative theme-attributes [28][29]. In these cases, the theme-attribute is commonly used to build thematic maps that cover different areas of human geography, from the spatial representation of health and social geography [31][32][33] to characteristics related to forest biomass stocks [2,12,13,16-18]. Semantic representation through thematic maps is growing in importance, being one of the main means of geospatial situational understanding and, consequently, of implementing public administration [34][35].
Recently published research on biomass estimation presents ML models whose outputs are quantitative, that is, numerical, theme-attributes [1,16,18,19]. However, no studies were found that build and analyze quantitative and qualitative theme-attribute models simultaneously. Research is therefore needed to fill this knowledge gap and to build thematic map models that use quantitative and qualitative theme-attributes in a complementary way.
This article aims to develop and compare forest biomass estimation models built over quantitative and qualitative theme-feature based on extracted SAR data. To this end, machine learning and feature selection techniques are specifically selected and applied for each case.

Study Area and data
The study areas are located in different geographical regions of the Brazilian Amazon rain forest: São Gabriel da Cachoeira (SGC), a municipality located on the banks of the Rio Negro, in the northwest of the state of Amazonas; and the Unini River Extractive Reserve (Unini River ExRes) located in the Unini River basin, in the municipality of Barcelos. The areas, in white, are highlighted in Figure 1, together with the location of some of the inventoried plots, in green.  The areas were selected for two reasons: the distinct phytoecological and land use and occupation situations and the availability of data. The SGC area has hybrid characteristics, composed of anthropized regions together with dense vegetation. In contrast, the Unini River ExRes area is composed only of primary virgin forest vegetation.
According to [31], the vegetation found in the study areas is of forest formation. More specifically, [32] indicates that the vegetation found in the São Gabriel da Cachoeira area is composed of regions of phytoecological contact between forest and edaphic formations (campinaranas). These regions are characterized in three ways: (1) dense submontane forests with dissected relief ([32] states that the average AGB volume in the area is 107.4 m³/ha); (2) dense submontane forests with undulating relief; and (3) dense lowland forests on relief with the presence of plateaus.
The Unini River ExRes, in turn, is an extractive conservation unit with an area of about 833 hectares, characterized in [32] as: (1) dense tropical forest, referring to the sub-region of the low plateaus of the Amazon; and (2) areas of ecological tension with dense alluvial presence.
The remote sensing data were obtained from the ALOS PALSAR 2 sensor and the Amazon Radiography Project. The working areas lie between latitudes 0° and 1° south and longitudes 67° and 68° west, for the region of São Gabriel da Cachoeira, and between latitudes 1° and 2° south and longitudes 62° and 63° west, for the Unini River ExRes.
The data from ALOS PALSAR 2 were provided by IBAMA and are Level 1.1 Single Look Complex (SLC) images in the quadri-polarized strip-map imaging mode.
The data from the Amazon Radiography Project comprise: (1) amplitude orthoimages in X band HH polarization and P band quadri-polarized, all with 16-bit radiometric resolution and 5-meter spatial resolution; (2) digital surface models (DSM) and digital terrain models (DTM) generated, respectively, from the interferometric processing of X and P data, with 32-bit radiometric resolution and 5-meter spatial resolution.
The AGB data were provided by the National Institute of Amazon Research (INPA) and follow the methods developed by [33] and described by [34]. In addition to covering exactly the same geographical position as the images, the inventories had to be close to the imaging date of the region, in order to avoid major changes in the analyzed vegetation.
The biomass data provided comprised 128 inventoried plots, 58 in São Gabriel da Cachoeira and 70 in the Unini River ExRes, with the AGB values (ton/ha) and the UTM coordinates of the start and end points of each plot. As pointed out by [35][36], different allometric equations were used to calculate the inventoried plots because of the characteristics of the region. Figure 2 illustrates the format, the start (P1) and end (P2) points and the arbitrary coordinates of each arboreal individual within the plot.

Methodological approach
The research was structured according to the flowchart shown in Figure 3. Each step is described in the following subitems.

Forest Biomass Data Processing
Using analytical geometry techniques, the UTM coordinates of the four corners of each inventoried plot were calculated and the respective vector files for each region of interest (ROI) were generated.
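The corner calculation can be illustrated with a short sketch. It assumes, hypothetically, that P1 and P2 lie on the central axis of a rectangular plot and that the plot width is known; neither the plot geometry nor the function and parameter names below are taken from the inventory protocol.

```python
import math

def plot_corners(p1, p2, width):
    """Return the four corners of a rectangular plot as (easting, northing).

    p1, p2 : UTM coordinates (metres) of the start and end points, assumed
             here to lie on the central axis of the plot.
    width  : full plot width in metres (hypothetical parameter).
    """
    (e1, n1), (e2, n2) = p1, p2
    length = math.hypot(e2 - e1, n2 - n1)
    if length == 0:
        raise ValueError("P1 and P2 must be distinct points")
    # Unit vector perpendicular to the axis P1 -> P2 (axis rotated by 90 degrees).
    ux, uy = -(n2 - n1) / length, (e2 - e1) / length
    half = width / 2.0
    # Corners listed in order around the rectangle, so they can be written
    # directly as a polygon ring in a vector file.
    return [
        (e1 + half * ux, n1 + half * uy),
        (e2 + half * ux, n2 + half * uy),
        (e2 - half * ux, n2 - half * uy),
        (e1 - half * ux, n1 - half * uy),
    ]
```

For a plot running due north from (0, 0) to (0, 100) with a width of 20 m, the function returns corners 10 m to each side of the axis at both ends.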

SAR Data Processing
In this stage, the ALOS PALSAR 2 images, obtained in SLC format, were processed and the features of the available X, L and P bands were extracted. All processing steps were performed using the Polarimetric SAR Data Processing and Educational Tool (PolSARpro), version 6.0 (Biomass Edition), from the European Space Agency (ESA).
The ALOS PALSAR 2 images were processed according to the flowchart shown in Figure 4. The following parameters were used:
• multilook processing with 2 looks for the rows and 1 look for the columns, as suggested by [19];
• Refined Lee speckle filter with 2 looks and a 7x7 window;
• calculation of the 3x3 covariance [C] and coherency [T] matrix images;
• geocoding of the coherency matrix image [T], performing the Range-Doppler terrain correction and the respective georeferencing with the digital elevation model automatically extracted from the Shuttle Radar Topography Mission (SRTM), with 90 m spatial resolution;
• polarimetric calibration and conversion to sigma nought (σ0) using Equation 1, where DN is the Digital Number, in amplitude, and CF is the calibration factor, in dB, for the channels [37]. The value applied for CF was -83; and
• application of target decomposition techniques.
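The calibration step can be sketched as follows. Equation 1 is not reproduced in this text, so the sketch assumes the form commonly published for PALSAR products, σ0 = 10·log10(DN²) + CF, with DN in amplitude and CF = -83 dB as stated above. The function names are illustrative only.

```python
import math

CF_DB = -83.0  # calibration factor in dB, as reported in the text


def sigma_nought_db(dn, cf_db=CF_DB):
    """Convert an amplitude Digital Number to sigma nought in dB.

    Assumes sigma0 = 10 * log10(DN^2) + CF; this form is an assumption,
    since Equation 1 itself is not shown in the text.
    """
    if dn <= 0:
        raise ValueError("DN must be positive to take the logarithm")
    return 10.0 * math.log10(dn ** 2) + cf_db


def db_to_linear(value_db):
    """Convert a dB value back to linear power units."""
    return 10.0 ** (value_db / 10.0)
```

Under this assumption, an amplitude of 1000 corresponds to 60 dB - 83 dB = -23 dB.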
At the end of the SAR data processing, the interferometric, incoherent and coherent features were extracted according to Table 1.

Table 1. Features extracted from the SAR data.

Hint: Interferometric height. The difference in altitude between the Digital Surface Model (DSM), obtained with the X band, and the Digital Terrain Model (DTM), obtained with the P band. It represents the height of the vegetation.
Decliv: Declivity. The slope of the land surface in relation to the horizontal, obtained from the DTM.

Incoherent SAR Features
Xhh: Amplitude image of the X band in the HH polarization. Represents the backscatter of the forest canopy.
Lhh, Lhv, Lvv: Amplitude image of the L band in the HH, HV or VV polarizations. Represents the main geometric characteristics of arboreal individuals.
Phh, Phv, Pvv: Amplitude image of the P band in the HH, HV or VV polarizations. Associated with the main geometric characteristics of the terrain.
Pvv-Phv: Subtraction between the amplitude images of the P band polarizations.
PC1L, PC2L, PC3L: Principal components of the amplitude images of the L band polarizations.
CR_L, CR_P: Ratio between crossed polarizations (Crossed Ratio, CR) in the L or P bands (CR_Band = Band_hv / Band_hh). Refers to the volumetric backscatter of the target.
TotPow_L, TotPow_P: Total power of the L or P bands (TotPow_Band = Band_hh + Band_vv + 2 * Band_hv). Represents the sum of all backscatter mechanisms occurring in the forest.
VSI_L, VSI_P: Volumetric scattering index in the L or P bands (VSI_Band = Band_hv / (Band_hv + BMI_Band)). Related to the density of the canopy, being directly proportional to the number of elements that cause multiple scattering.

Haralick Textural Features [41]
The co-occurrence texture features analyze the relationship between the values of pixel pairs within a window and build a Grey Level Co-occurrence Matrix (GLCM). In the texture equations, P(i, j) is the co-occurrence probability of each pixel value in column i and row j; Ng is the number of distinct grey levels in the quantized image; µ is the average value of P; σ is the standard deviation in x or y of the image.
J_Me_Band: Mean. The intensity difference between the reference pixel and its neighbors in the GLCM.
J_Di_Band: Dissimilarity. The amplitude difference between the reference pixel and its neighbors in the GLCM.
J_En_Band: Entropy. Represents the randomness between the elements of the GLCM.
J_Se_Band: Second Moment. The angular second moment between the elements of the GLCM.
J_Cor_Band: Correlation. The statistical difference between the reference pixel and its neighbors in the GLCM.

TPsi_Sm: Orientation angle (ψ). Associated with the target's angle of inclination.
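The band-combination features of Table 1 (CR, TotPow and VSI) follow directly from the formulas given there, as the sketch below shows for a single pixel or plot mean. The table does not define BMI_Band; the sketch assumes the commonly used biomass index BMI = (HH + VV) / 2, and the function names are illustrative.

```python
def cross_ratio(hh, hv):
    """CR_Band = Band_hv / Band_hh."""
    return hv / hh


def total_power(hh, hv, vv):
    """TotPow_Band = Band_hh + Band_vv + 2 * Band_hv."""
    return hh + vv + 2.0 * hv


def biomass_index(hh, vv):
    """BMI_Band, ASSUMED here to be (Band_hh + Band_vv) / 2.

    The definition is not given in the text; this is a common convention.
    """
    return (hh + vv) / 2.0


def volume_scattering_index(hh, hv, vv):
    """VSI_Band = Band_hv / (Band_hv + BMI_Band)."""
    return hv / (hv + biomass_index(hh, vv))
```

With hypothetical backscatter values HH = 0.4, HV = 0.1 and VV = 0.2, the cross ratio is 0.25, the total power is 0.8 and, under the BMI assumption, the VSI is 0.25.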

Data Structuring
The data extracted from SAR and the AGB data were organized in a single structured spreadsheet, having the features represented in columns and the instances, referring to each inventoried forest biomass plot, as rows.
The AGB feature was defined as the theme-feature (or "result" or "output" feature) of the structured spreadsheet.
For each of the extracted features, the arithmetic mean of the pixels' value corresponding to the areas of the inventoried AGB plots was calculated.
The numerical data were used in two different ways. First, using the original values of the explanatory feature set x = (x1, x2, …, xp)^T, so that the multiple regression model is as shown in Equation 2. Second, using the logarithm of the original values, as in Equation 3. In all cases, p is the number of variables, β = (β0, β1, …, βp)^T is the parameter set, y is the dependent AGB variable and ε is the random error.
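Equations 2 and 3 are not reproduced in this text. The sketch below therefore assumes the usual linear form y = β0 + β1x1 + … + βpxp + ε for Equation 2 and, for the logarithmic variant, that the natural logarithm is applied to the explanatory features. It estimates β by ordinary least squares through the normal equations; the function names are illustrative and this is not the WEKA implementation used in the study.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, y, log_features=False):
    """Return [beta0, beta1, ..., betap] fitted by ordinary least squares.

    rows: one list of p feature values per plot; y: AGB value per plot.
    log_features=True applies ln() to every feature first (assumed variant).
    """
    design = []
    for row in rows:
        values = [math.log(v) for v in row] if log_features else list(row)
        design.append([1.0] + values)  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)
```

The normal equations are adequate for a handful of features; with hundreds of correlated features, as in the full feature set, a numerically more stable solver would be preferable.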

Categorization
The numerical data of the AGB quantitative feature were categorized and associated with one of 5 (five) biomass categories: "Low", "Medium-Low", "Medium", "Medium-High" and "High". Two categorization methods were used to transform quantitative into qualitative features: equal intervals and quantiles.
According to [47], the equal intervals method divides the range of the theme-feature values by the number of categories of interest. In Equation 4, K is the number of categories defined by the user, xmin and xmax are, respectively, the minimum and maximum values observed in the theme-feature, and δ is the width of each category interval.

$$\delta = \frac{x_{\max} - x_{\min}}{K} \qquad (4)$$
In the quantile method, categorization is performed by dividing the total number of instances N by the number of categories of interest K. Therefore, at the end of this method each category will have the same number of objects.
At the end of the categorization stage, the theme-feature was represented in one of three ways: numeric (NumThFe), categorical by the "equal intervals" method (EqIntThFe) and categorical by the "quantile" method (QuThFe). All other steps were then performed for each of these cases.
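The two categorization methods can be sketched in a few lines. The function names and the handling of values lying exactly on a class boundary are illustrative choices, not taken from the study.

```python
LABELS = ["Low", "Medium-Low", "Medium", "Medium-High", "High"]


def equal_interval_categories(values, k=5):
    """Assign each value to one of k classes of equal width (Equation 4).

    Returns the class index (0..k-1) of each value and the width delta.
    """
    x_min, x_max = min(values), max(values)
    delta = (x_max - x_min) / k
    if delta == 0:
        raise ValueError("all values are identical; no interval can be built")
    # The maximum value would fall in class k, so it is clamped to k - 1.
    classes = [min(int((v - x_min) / delta), k - 1) for v in values]
    return classes, delta


def quantile_categories(values, k=5):
    """Assign each value to one of k classes with (almost) equal counts."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    classes = [0] * n
    for rank, i in enumerate(order):
        classes[i] = rank * k // n  # the rank decides the class, not the value
    return classes
```

Usage: `[LABELS[c] for c in quantile_categories(agb_values)]` turns a hypothetical list `agb_values` into category names. With one outlier among otherwise similar values, equal intervals leave most classes empty, whereas the quantile method keeps the class sizes balanced.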

Feature Selection
Tests were performed using filter-type feature selection, in comparison with the exhaustive search including all features extracted from the SAR data. The objective was to verify the impact of the feature selection process on the quality of the final AGB models.
The feature selection technique used was the Correlation-based Feature Subset (CFS) Selection, as described by [48]. In this case, the search method was the greedy Best First, which applies the "hill climbing" heuristic in the "forward" direction.
According to [49], the CFS feature selection method is adequate to identify features that are related to the AGB by using the Pearson correlation coefficient method.
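To make the selection criterion concrete, the sketch below scores a subset S of k features with the CFS merit, Merit(S) = k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean absolute feature-target correlation and r_ff the mean absolute feature-feature correlation. For brevity it uses a plain greedy forward search that stops at the first non-improving step, which is a simplification of the Best First search with backtracking available in WEKA; the function names are illustrative.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(subset, features, target):
    """CFS merit of a subset; `features` maps a feature name to its values."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def forward_cfs(features, target):
    """Greedy forward selection: add the best feature while the merit grows."""
    selected, best = [], 0.0
    remaining = set(features)
    while remaining:
        merit, name = max(
            (cfs_merit(selected + [f], features, target), f)
            for f in sorted(remaining)
        )
        if merit <= best:
            break  # no candidate improves the merit
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected, best
```

The denominator penalizes redundancy: a feature that is highly correlated with one already selected adds little to the merit, so it is left out even when its own correlation with the AGB is high.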

Modeling
In the specific cases in which the models were built on numerical quantitative data, that is, when the theme-feature was not categorized, simple statistical regression (SR) and multiple statistical regression (MR) were used. For the qualitative categorized data, on the other hand, logistic statistical regression (LR) and ordinary decision tree (ODT) were applied.
In addition to these methods, Multilayer Perceptron (MLP), Support Vector Machine (SVM) and Random Forest (RF) were used in all cases.
The feature selection and model development steps were carried out entirely in the WEKA (Waikato Environment for Knowledge Analysis) system, version 3.8.4, and followed the algorithms described by [50].

Development and Evaluation of a Biomass Estimation Model
After the development of the models, the evaluation stage is carried out. For models based on numerical data, such as those of statistical regression, several parameters can be observed to reflect the assessment. The parameter used in this case was the correlation coefficient (r), described by [51].
In the case of the models based on categorized qualitative data, the assessment was made by building a confusion matrix and calculating the respective Kappa coefficient of agreement [52]. Due to the reduced number of instances, the process of cross-validation divided into 10 folds was used, as suggested by [53].
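The assessment of the categorized models can be sketched as follows: a confusion matrix is built from the observed and predicted categories, and Cohen's Kappa is computed as (po - pe) / (1 - pe), where po is the observed agreement and pe the agreement expected by chance. The fold assignment shown is a simple deterministic interleaving, an assumption made for illustration; WEKA's cross-validation randomizes and stratifies the folds.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are the observed classes, columns the predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def cohen_kappa(matrix):
    """Cohen's Kappa coefficient of agreement from a square confusion matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(k)) / n
    pe = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(k)
    ) / n ** 2
    return (po - pe) / (1 - pe)


def k_fold_indices(n, k=10):
    """Split instance indices 0..n-1 into k interleaved folds (illustrative)."""
    return [list(range(start, n, k)) for start in range(k)]
```

In each round, one fold serves as the test set and the remaining nine as the training set; the predictions of all ten test folds are pooled into a single confusion matrix before Kappa is computed.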

Comparative Analysis between Biomass Estimation Models
Initially, the selected models were those that obtained the best correlation coefficient, in the case of the numerical quantitative data, and best Kappa coefficient, for the models based on categorized qualitative data.
In order to compare these different types of models, the numerical values resulting from the AGB models follow the process described in the flowchart presented in Figure 5. In this process, the numerical quantitative values are categorized using the equal intervals method and then assessed by building the confusion matrices and calculating the respective Kappa coefficients.

Forest Biomass Data Processing
From the AGB data granted by INPA, 3 sample sets were defined according to the region inventoried: São Gabriel da Cachoeira, Unini River ExRes and the joint regions. The statistics for each set, referring to the number of pixels and AGB in each plot, are shown in Table 2.

SAR Data Processing
Together with the features detailed in Table 1, the textural features were extracted for all available polarimetric bands, that is, Xhh, Phh, Phv, Pvv, Lhh, Lhv and Lvv, for 3x3, 5x5 and 7x7 window sizes.
At the end of the SAR data processing, 231 features, or independent variables, were extracted, in addition to the theme-feature.
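As an illustration of how the co-occurrence textures could be computed for a single window, the sketch below builds a symmetric, normalized GLCM and derives the five textural measures. The quantization to a small number of grey levels, the horizontal neighbor offset and the function names are assumptions; the formulas are the commonly used Haralick-type definitions and may differ in detail from those applied in the study.

```python
import math


def glcm(window, levels, offset=(0, 1)):
    """Symmetric, normalized GLCM of a window of integers in [0, levels)."""
    dr, dc = offset
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(window), len(window[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = window[r][c], window[r2][c2]
                counts[i][j] += 1  # pair counted in both directions,
                counts[j][i] += 1  # which makes the matrix symmetric
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def haralick_features(p):
    """Mean, dissimilarity, entropy, second moment and correlation of P."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mean = sum(i * v for i, _, v in cells)
    var = sum((i - mean) ** 2 * v for i, _, v in cells)
    cov = sum((i - mean) * (j - mean) * v for i, j, v in cells)
    return {
        "mean": mean,
        "dissimilarity": sum(abs(i - j) * v for i, j, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        "second_moment": sum(v * v for _, _, v in cells),
        # For a uniform window the variance is zero; 1.0 is used by convention here.
        "correlation": cov / var if var > 0 else 1.0,
    }
```

Because the matrix is symmetric, the row and column means and variances coincide, which is why a single mean and variance are enough for the correlation. Repeating the calculation for each band and for the 3x3, 5x5 and 7x7 windows yields one texture value per measure, band and window size.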

Categorization
The categorization by the equal intervals technique obtained a δ of 52 t/ha. The AGB categories were therefore defined as: Low (below 100 t/ha); Medium-Low (between 100 and 200 t/ha); Medium (between 200 and 250 t/ha); Medium-High (between 250 and 300 t/ha); and High (above 300 t/ha). The number of categorized instances was 2 (two) for the Low class, 38 (thirty-eight) for Medium-Low, 42 (forty-two) for Medium, 40 (forty) for Medium-High and 6 (six) for High.
The categorization by the quantile method obtained 25 (twenty-five) or 26 (twenty-six) instances for each category.

Feature Selection
The process was carried out separately for the numerical quantitative and the categorized qualitative data. The 5 (five) selected features, in decreasing order of relevance, are shown in Table 3. The same table reports Pearson's correlation between each selected feature and the respective theme-feature, quantitative or qualitative.
In general, the selected features showed low correlation with the biomass theme-feature. The highlight was the Hint feature, which achieved a good correlation with the quantitative data, in addition to being selected for both cases.

Development of Biomass Estimation Models
The ML techniques applied in the biomass estimation modeling had the following specific configurations: (1) SVM: the model applied to numerical quantitative data was SMOreg, specific for statistical regression, as described by [54]. The complexity parameter c was 1.0 and the Radial Basis Function (RBF) kernel used a gamma of 0.01; (2) MLP: the models not submitted to the feature selection process were built with one hidden layer (of 50 nodes) or two hidden layers (of 50 and 10 nodes). The models submitted to the feature selection process were built with one hidden layer (of 5 nodes) or two hidden layers (of 5 and 5 nodes); (3) RF: 100 trees were used in the construction of the model; (4) ODT: a minimum of 2 instances per node was applied.
The correlation and Kappa coefficients resulting from the tests are shown in Tables 4, 5, 6 and 7, which have the following characteristics: (1) Tables 4 and 5 refer to models based on numerical quantitative theme-features and Tables 6 and 7 to models based on categorized qualitative theme-features; (2) Tables 4 and 6 refer to the original values and Tables 5 and 7 to the log values of the features; (3) the values before the slash (/) were obtained by models not submitted to the feature selection process, while the values after the slash refer to models with selected features; (4) the MLP results marked with an asterisk (*) were obtained with 2 (two) hidden layers and were superior to those of a single hidden layer; (5) the results in bold are the best obtained, with 2 (two) results highlighted for each type of region and for each type of data (quantitative or qualitative).

Comparative Analysis between Biomass Estimation Models
In the case of the numerical quantitative theme-feature, presented in Tables 4 and 5, only the MLP and SR techniques showed outstanding results. The MR technique was not able to increase r with the input of new features.
The models developed for the categorized qualitative theme-feature, Tables 6 and 7, showed an increase in results for non-parametric techniques, including MLP, RF and ODT.
The models submitted to the feature selection process showed improvement in 73% of the numerical quantitative theme-feature cases. Only 10% of these cases showed worse results, all of which refer to the SVM technique.
On the other hand, for the categorized qualitative theme-feature submitted to the feature selection process, the percentages of improved, worsened and unchanged results were, respectively, 35%, 10% and 55%. In this case, no relationship with the ML technique was observed.
Regarding the categorization method, all the best results were obtained using the method of equal intervals. Despite this, considering all cases, there was not a conclusive difference in the results between the categorization methods.
The different areas analyzed also presented different results. For the numerical quantitative theme-feature, the São Gabriel da Cachoeira region obtained the best results and the Unini River ExRes region the worst. The opposite was obtained for the categorized qualitative theme-feature. In both cases, the results for the joint regions, which aggregate data from both study areas, were intermediate.
In order to carry out the comparative analysis, the process shown in Figure 5 was applied. The comparative analysis was performed on data from the same regions (Joint Regions, SGC or Unini River ExRes), separately for quantitative or qualitative data. The results obtained are shown in Tables 8, 9, 10, 11, 12 and 13. In all cases, 3 (three) types of Z hypothesis tests were performed, with a significance level (α) of 0.05:
• a test of the hypothesis that Kappa * (the value for the first selected model) is equal to zero;
• a test of the hypothesis that Kappa ** (the value for the second selected model) is equal to zero; and
• a test of whether the difference between Kappa * and Kappa ** is significantly greater (or lower) than zero, that is, whether the two values are significantly different [55].
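The three tests can be sketched with Z statistics computed from the confusion matrices. The sketch uses the simple large-sample approximation var(κ) ≈ po(1 - po) / (N(1 - pe)²) for the Kappa variance and a two-sided critical value of 1.96 for α = 0.05; both are assumptions made for illustration, and the variance estimator actually used in the study may be a more elaborate one.

```python
import math

Z_CRITICAL = 1.96  # two-sided test at alpha = 0.05 (assumed)


def kappa_and_variance(matrix):
    """Kappa and its approximate large-sample variance from a confusion matrix."""
    k = len(matrix)
    n = sum(map(sum, matrix))
    po = sum(matrix[i][i] for i in range(k)) / n
    pe = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(k)
    ) / n ** 2
    kappa = (po - pe) / (1 - pe)
    variance = po * (1 - po) / (n * (1 - pe) ** 2)  # simple approximation
    return kappa, variance


def z_single(kappa, variance):
    """Z statistic for the hypothesis that Kappa is equal to zero."""
    return kappa / math.sqrt(variance)


def z_pair(kappa1, var1, kappa2, var2):
    """Z statistic for the hypothesis that two independent Kappas are equal."""
    return abs(kappa1 - kappa2) / math.sqrt(var1 + var2)


def is_significant(z):
    """True when the null hypothesis is rejected at the assumed level."""
    return abs(z) > Z_CRITICAL
```

For a hypothetical two-class matrix [[20, 5], [10, 15]], Kappa is 0.4 with an approximate variance of 0.0168, so Z is about 3.09 and the hypothesis of Kappa equal to zero is rejected.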

Conclusion
The present work aimed to develop and compare forest biomass estimation models, for different regions of the Amazon forest, built over numerical quantitative or categorical qualitative theme-features. To this end, ML techniques were applied to features extracted from polarimetric and interferometric SAR data in the X, L and P bands, generating models that were analysed and compared.
In an innovative way, the work presents a methodology that involves:
• the process of feature selection and development of AGB estimation models over quantitative and qualitative theme-features. It is noteworthy that, for each case, the feature selection and ML techniques were specific and configured to obtain the best results;
• comparative analyses between quantitative and qualitative results. In this case, the post-modeling categorization process and the construction of the respective confusion matrices were performed, followed by the comparison using hypothesis tests.
The results showed that the different study areas had very different characteristics, significantly impacting the feature selection and ML algorithms. The SGC area, because of the greater variation in the inventoried AGB values (between 92.21 and 351.73 t/ha), obtained better results with the numeric quantitative theme-features. On the other hand, the Unini River ExRes area, whose AGB values varied less (between 153.32 and 311.57 t/ha), was better suited to categorized qualitative data modelling.
The different biomes of the Amazon Forest and their respective characteristics demanded specific models and techniques, not fitting into a single pattern. This conclusion agrees with the research of [2], which affirms that the heterogeneity of tropical forests is one of the main factors in the increasing uncertainty regarding the measurement of biomass stocks in the region.
The process of feature selection was unanimous in selecting the interferometric height (Hint) as the most relevant feature for all areas of study, both in the case of qualitative and quantitative theme-features, in agreement with the results obtained by [23][24][56][57]. Likewise, there was an emphasis on features obtained by target decomposition techniques on the L band, from the ALOS PALSAR 2 sensor. The textural features, on the other hand, did not show significant correlation with the AGB values, different from the results obtained by [58].
As a conclusion of the presented methodology, there was no significant improvement in the AGB estimation process, since the results obtained from Kappa varied between fair and moderate. Likewise, the post-modeling categorization process did not achieve the expected results, keeping the Kappa value stable and not being able to generalize the AGB values into categories. The result obtained may have occurred due to the low correlation between the biomass theme-feature and the extracted SAR features.
In order to develop more suitable AGB models for different regions of the Amazon Forest, further studies will be carried out aiming to adjust the training parameters of ML techniques. In this case, the possibility of applying search methods and deep learning, commonly used in the Artificial Intelligence area to define such parameters, will be verified.
Analysing the possible reasons that led to the limited results, two factors were identified that may contribute to new research in the area in focus.
The first factor refers to the inventoried forest management plots used as samples. In agreement with [59][60][61][62][63][64][65], a larger number of plots, including areas with greater variation in AGB values, allows a more reliable sample representation and a more in-depth statistical analysis.
The second factor is related to the processing of SAR data and the possibility of extracting new polarimetric and interferometric features. Accessing data in SLC format of polarimetric X and P bands would enable the extraction and analysis of the respective target decomposition features. Likewise, through the construction of a digital elevation model in the L band, it would be possible to obtain new interferometric heights involving the