1. Introduction
Land use and land cover (LULC) information is essential for forest monitoring, climate change studies, and environmental and urban management [1,2,3,4]. Remote sensing techniques are widely used for LULC investigation because of their capability to observe land surfaces routinely and on a large scale. The most frequently used remotely sensed data are optical images, such as those from Landsat [5,6,7]. Synthetic aperture radar (SAR) images are also used for LULC classification because of their weather independence [8,9,10,11,12]. Unlike optical data, which contain spectral information, SAR data characterize the structural and dielectric properties of ground targets [13]. The combination of optical and SAR data yields a comprehensive observation of ground targets and therefore has great potential to improve the accuracy of LULC classification [14].
The potential of combining optical and SAR data has been increasingly explored for LULC classification, especially after the European Space Agency (ESA) initiated the Sentinel mission for Earth observation, which provides free-of-charge optical and SAR data [15,16]. Thanks to the weather independence of radar remote sensing, Reiche et al. [17] improved forest mapping in a tropical region with heavy cloud coverage by fusing optical and time-series SAR imagery. Kussul et al. [18] applied the multi-layer perceptron (MLP) classifier for crop mapping in Ukraine and achieved accuracies of over 90% for major crop types using multitemporal optical and SAR data. Zhang et al. [19] reduced classification confusion among impervious surface, bare soil, shaded areas, and water by fusing optical and SAR images with the random forest (RF) classifier. Zhang and Xu [20] concluded that the fusion of optical and SAR data for LULC mapping may be classifier-dependent: they found that the support vector machine (SVM) and RF performed better than the maximum likelihood classifier and artificial neural networks when using multisource data.
Object-based image analysis (OBIA), together with advanced machine learning algorithms, has been widely used for LULC classification, as it can delineate object boundaries and produce compact classification maps [21]. The spatial, textural, and contextual features extracted by OBIA have shown a great ability to boost classification performance. For example, Wang et al. [22] produced a global map of built-up areas with hierarchical object-based gray-level co-occurrence matrix (GLCM) textures derived from Landsat images and reported a 2.8% improvement in overall accuracy (OA) over spectral bands alone. Ruiz Hernandez and Shi [23] applied both GLCM texture metrics and spatial indices in a geographic OBIA framework with RF for urban land use mapping. Recently, Franklin and Ahmed [24] applied RF and object-based analysis to tree species classification on multispectral images captured by unmanned aerial vehicles (UAVs). Many studies have concluded that object-based spatial and textural features can significantly improve classification [25,26,27,28,29,30].
Pixel-based spatial and textural features, rather than object-based textures, have also proven very useful for LULC mapping. Huang et al. [31] found that pixel-based morphological profiles significantly outperformed object-based GLCM textures for forest mapping and species classification. Wang et al. [32] tested the completed local binary pattern (CLBP) textures originally designed for face recognition and found them suitable for classifying wetland vegetation using SVM. With recent advances in machine learning, deep learning models such as convolutional neural networks (CNNs) have achieved great success in computer vision and pattern recognition. Like OBIA, CNNs learn spatial, textural, and contextual information from images, which has proven very useful for LULC mapping [33,34,35,36,37,38,39]. Therefore, the integration of deep learning and OBIA is worth exploring.
Deep learning for remote sensing image classification falls into two categories [40]. One is conventional LULC classification, in which a single satellite image is obtained and labeled samples are randomly collected on it. The other is semantic segmentation, in which a set of fully annotated images from the same sensor is collected and a CNN is trained to classify new images without any annotation. Semantic segmentation is based on a special kind of CNN, the fully convolutional network (FCN) [41].
As FCNs do not require any annotation of the predicted image, they are extremely suitable for large-scale LULC classification [35,42,43,44,45,46]. Maggiori et al. [35] proposed a CNN with a fully convolutional architecture using a two-step pre-training method to produce large-scale building maps. Kampffmeyer et al. [43] measured the uncertainty of FCNs, applied median frequency balancing to adjust FCNs for imbalanced classes, and improved the segmentation of small objects. Yu et al. [45] proposed an FCN with a pyramid pooling module to capture features at multiple scales and thus achieved accurate segmentation of multiple ground objects. As semantic segmentation often leads to blurry LULC boundaries, Marmanis et al. [46] designed a deep CNN that combined boundary detection and segmentation networks to delineate object boundaries.
FCNs have shown great potential for large-scale LULC classification, but they depend heavily on annotated images. For this reason, previous studies are often limited to high-resolution benchmarks with only RGB channels, such as the ISPRS Vaihingen and Potsdam datasets. When annotated maps are unavailable, for example, when using Sentinel data for local climate zone (LCZ) classification [2,47,48], it is difficult to apply these semantic segmentation models. One exception is the study of Liu et al. [49], who successfully applied FCNs using training samples from a single image, where each training sample was the minimum bounding box of an object. However, pixels outside the object but inside the bounding box had to be manually labeled to boost classification performance. Thus, patch-based CNNs are more suitable than FCNs for LULC classification with single or few images.
CNNs were originally designed for image recognition, and their input must be a rectangular image [49,50,51]. Zhao et al. [52] applied a five-layer CNN to extract spatial features within an 18 × 18 window; these features were then combined with OBIA to produce the classification map based on the tanh classifier. Zhang et al. [50] carefully designed a novel object-based CNN that locates the convolutional center of an image object using its minimum bounding box. In their study, each image object was represented by a 128 × 128 image patch. However, objects delineated from satellite images vary widely in size, and a large object results in a large minimum bounding box. Such a large fixed representation may fail to capture small ground targets. For example, bridges over water are very slender, and a large part of the background is water, which might mislead a CNN into classifying such an image patch as water.
In this study, we present a novel yet simple method, namely object-based post-classification refinement (OBPR), to obtain object-based thematic maps produced by a CNN using Sentinel multispectral and SAR data with very small input patches (e.g., 5 × 5). With small input patches, small ground targets (e.g., high-rise buildings and roads) can be effectively captured. Through post-classification processing, the classification maps are refined by object boundaries using majority voting. The proposed method was evaluated on two optical-SAR datasets and one hyperspectral dataset with diverse spatial resolutions: the Sentinel Guangzhou dataset with 10 m spatial resolution, the Zhuhai-Macau LCZ dataset with 100 m spatial resolution, and the University of Pavia dataset with 1.3 m spatial resolution. The remainder of this paper is organized as follows. Section 2 introduces the study area and the datasets. Section 3 explains the methodology, including details of the CNNs and the proposed OBPR. Section 4 presents the results and discussion. Conclusions are drawn in Section 5.
4. Results and Discussion
The proposed method was evaluated on three datasets: two optical-SAR datasets with diverse spatial resolutions and one hyperspectral dataset. OBIA-SVM and OBIA-RF were selected as the benchmark methods. The experiments were conducted on a machine equipped with a 3.5 GHz Intel Xeon E3-1241 v3 CPU and 8 GB of RAM.
4.1. Results on the Optical-SAR Sentinel Guangzhou Dataset
The classification results on the Sentinel Guangzhou dataset are shown in Table 4. The proposed method achieved the highest classification accuracy, with an OA of 95.33% and a Kappa of 0.94, considerably higher than those achieved by OBIA-SVM (OA of 90.22% and Kappa of 0.89) and OBIA-RF (OA of 88.20% and Kappa of 0.86). The classification accuracy obtained by the standard CNN (OA of 91.10% and Kappa of 0.90) was already higher than those of OBIA-SVM/RF, indicating that the spatial information extracted by the CNN was helpful for LULC classification. Among the LULC classes, the urban categories, especially new town and roads, were better classified by the proposed method. New town consists of well-planned mid-rise to high-rise buildings; both new town and roads occupy very small areas in the image and are surrounded by complicated urban structures, so spatial information is essential for their accurate classification. Therefore, the ability of the CNN to extract spatial features considerably helped the classification task.
The proposed OBPR strategy improved the OA of the CNN by a remarkable 4.23%, indicating that the spatial constraint imposed by object boundaries was very useful for LULC classification. When OBPR was combined with SVM/RF, the performance (OA of 89.73% and 89.64%, respectively) was as competitive as that of OBIA-SVM/RF (OA of 90.22% and 88.20%). Most previous studies argued that the effectiveness of OBIA came from two aspects: obtaining object-based classification maps and generating textural features. Although the classification results in this study confirmed that OBIA-SVM/RF outperformed pixel-based SVM/RF, the superiority actually came from the spatial constraint that pixels inside one object share the same label.
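For clarity, the OBPR step can be written in a few lines. The sketch below is a minimal illustration, assuming a pixel-wise label map and a co-registered segmentation map of the same shape; the array names are ours, not taken from the study's code.

```python
# Minimal OBPR sketch: replace every pixel label with the majority label
# of the object (segment) it belongs to.
import numpy as np

def obpr_refine(pred: np.ndarray, segments: np.ndarray) -> np.ndarray:
    refined = pred.copy()
    for obj_id in np.unique(segments):
        mask = segments == obj_id
        labels, counts = np.unique(pred[mask], return_counts=True)
        # Majority vote; on ties, argmax keeps the smallest label
        # (Section 4.7.2 compares tie-breaking choices).
        refined[mask] = labels[np.argmax(counts)]
    return refined

# Toy example: two objects; stray pixel labels are voted away.
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [0, 0, 1, 0],
                 [0, 0, 1, 1]])
segments = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [0, 0, 1, 1]])
print(obpr_refine(pred, segments))
```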
To evaluate the robustness of the proposed method, we constructed two subsets of training samples; the results are presented in Table 5. The classification results with these subsets were consistent with those using 150 object samples per class. The proposed method obtained an OA of 93.76% with 50 object samples per class, 4.65% and 7.26% higher than OBIA-SVM and OBIA-RF, respectively. When only 10 labeled objects per class were available, the proposed method significantly outperformed the other algorithms, achieving an OA of 89.81%, 6.22% higher than OBIA-SVM and 7.38% higher than OBIA-RF. The margin between OBPR-CNN and OBIA-SVM/RF widened as the training samples became limited.
Previous studies demonstrated that sufficient samples (at least 50 per class) are needed to construct a classification system for remote sensing image classification; otherwise, classifier performance degrades significantly. With only 10 labeled objects per class, the OBIA classification was therefore unsatisfactory. However, 10 objects contained at least 163 pixels in our study (Table 2). When pixel samples were used instead, this number (at least 163 per class) was sufficient for the classifiers to construct powerful classification systems. Therefore, we observed OA increases of 3.03% and 3.62% from OBIA-SVM/RF to OBPR-SVM/RF.
4.2. Results on the Zhuhai-Macau LCZ Dataset
The second experiment was conducted on the Zhuhai-Macau LCZ dataset, and the results are shown in Table 6. The proposed method outperformed the other competitors in all cases. With the full optical-SAR features, OBPR-CNN obtained an OA of 77.64%, whereas the best non-CNN method, OBPR-MLP, obtained an OA of only 70.94%, and the best OBIA method, OBIA-RF, achieved an OA of 68.09%.
The best OA on this dataset was below 80%, reflecting the difficulty of LCZ classification [47]. One crucial problem is that different LCZs may be built of the same material and thus yield the same spectral information in satellite imagery, so spatial information is essential to distinguish them. The comparison between OBPR-CNN and the best non-CNN method (77.64% versus 70.94%) indicated the advantage of the CNN, and the comparison between the proposed method and OBIA (77.64% versus 68.09%) illustrated the effectiveness of OBPR.
4.3. Results on the University of Pavia Dataset
A popular hyperspectral dataset, the University of Pavia, was used to test the proposed method on high spatial resolution imagery. For this dataset, we randomly selected 5, 10, and 100 pixel samples per class, and the remaining samples served for validation. The image objects in which these pixels lay served as training samples in OBIA, and all the pixels inside the training objects were used for training. In this manner, the OBPR strategy in fact performs semi-supervised learning based on superpixels [64]. The OAs are presented in Table 7.
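The superpixel-based sample expansion described above can be sketched as follows; the function and variable names are illustrative, assuming a segmentation map and a list of sparsely labeled pixels.

```python
# Propagate each labeled pixel's class to every pixel of its object,
# turning a handful of pixel samples into full object samples.
import numpy as np

def expand_to_objects(segments, seeds):
    labels = np.full(segments.shape, -1, dtype=int)  # -1 = unlabeled
    for r, c, cls in seeds:
        labels[segments == segments[r, c]] = cls
    return labels

segments = np.array([[0, 0, 1],
                     [0, 1, 1],
                     [2, 2, 1]])
seeds = [(0, 0, 5), (1, 2, 7)]             # two labeled pixels
print(expand_to_objects(segments, seeds))  # objects 0 and 1 fully labeled
```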
The proposed OBPR-CNN outperformed the other methods on all sample sets. When training samples were sufficient (i.e., 100 per class), OBPR-CNN obtained an OA of 96.32%, whereas the best non-CNN method, OBPR-RF, achieved an OA of 94.90%. As the number of training samples decreased, OBPR-CNN achieved an OA of 95.70% using 10 samples per class and 85.88% using five samples per class, whereas OBPR-RF obtained OAs of 78.82% and 67.28%, respectively. OBPR-CNN was thus even more advantageous when samples were limited, a finding that contradicts the common belief that deep learning models such as CNNs need large amounts of training samples.
Not only did OBPR-CNN obtain higher OAs than the conventional OBIA methods, but the non-CNN OBPR methods also outperformed their OBIA counterparts. Take RF as an example, as it is less sensitive to noisy features: OBPR-RF consistently obtained higher OAs than OBIA-RF, e.g., 94.90% versus 89.92% using 100 samples per class. Such results suggest that the OBIA strategy should be rethought, since classification maps with higher OAs can be obtained with OBPR. As mentioned above, OBPR is in effect a kind of superpixel-based semi-supervised learning that increases the number of training samples, which explains why OBPR outperformed OBIA.
4.4. Contribution of SAR Data to LULC Classification
The results on the two optical-SAR datasets indicated the effectiveness of SAR data for LULC classification. To analyze these effects, the differences in producer's accuracy (PA) and user's accuracy (UA) between the optical-only and optical-SAR data on the Sentinel Guangzhou dataset are presented in Figure 9. We found that the effect of SAR data depended on the classifier. When the RF classifier was used, the PAs and UAs of all LULC classes increased, especially for urban LULC types such as villas, roads, port areas, and new town. The improvement was less apparent when SAR data were combined with the 10 m and 20 m optical data, but the differences in the PAs and UAs of roads, port areas, and new town were still near 10%. Similar improvements were found in the SVM results. The improvements made by the CNN were not as significant as those by RF, because the CNN can extract spatial information from the patch-based samples, whereas SVM and RF are pixel-based classifiers and lack the ability to exploit spatial features.
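For reference, PA and UA follow directly from the confusion matrix; a minimal sketch, assuming reference and predicted label vectors, is given below. The Figure 9 values are simply the per-class differences between the optical-SAR and optical-only runs.

```python
# With rows as reference and columns as prediction, producer's accuracy
# is per-class recall and user's accuracy is per-class precision.
import numpy as np
from sklearn.metrics import confusion_matrix

def pa_ua(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred).astype(float)
    pa = np.diag(cm) / cm.sum(axis=1)  # producer's accuracy (recall)
    ua = np.diag(cm) / cm.sum(axis=0)  # user's accuracy (precision)
    return pa, ua

# Toy example with three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
pa, ua = pa_ua(y_true, y_pred)
print(pa, ua)
```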
The contribution of SAR data to LULC classification was partly due to the side-looking imaging mode of radar remote sensing and the long wavelength of the C-band. Together, they resulted in low backscatter intensity for some ground targets, such as roads, which belong to the impervious surface class and usually show high albedo in optical remote sensing images. In Figure 10, the SAR backscatter from roads was low, providing physical information beyond that of optical remote sensing. The classification maps in Figure 10 show that with the addition of SAR data, roads were identified accurately. Moreover, the classification maps produced by the combination of SAR and optical data exhibited minimal salt-and-pepper effects, probably because the data from different sensors exhibited different noise characteristics and the radar signals might have effectively denoised the optical image.
In Figure 11, the optical data suffered from shadow effects, which were severe in urban centers with city skylines. When the optical images were used alone, most of the shadows were mistakenly classified as water, because shadow effects are inevitable with a single optical data set. The radar backscatter from water was markedly lower than that from urban areas, owing to the side-looking imaging mode of radar remote sensing and the complicated structure of the urban center. The differences in radar backscatter and textural features between water and urban areas could be extracted by the CNN, resulting in accurate LULC classification and demonstrating the advantage of the proposed OBPR-CNN.
4.5. Feature Importance of the Sentinel Optical and SAR Data
The feature importance estimated by RF is presented in Figure 12 to illustrate the aforementioned conclusion that OBIA outperformed pixel-based algorithms mainly because it can produce object-based thematic maps, not because of its textural features. The most important feature was the mean value of Band 12, a middle-infrared band with a spatial resolution of 20 m. The mean values of the other infrared spectral bands also played a significant role in LULC classification, perhaps because the signals of the infrared bands are less affected by the atmosphere and are informative for LULC mapping. The mean values of the 10 m spectral bands and the SAR backscatter (VH and VV polarizations) were crucial for classification as well. However, the GLCM textures were of little importance (less than 2%) as estimated by RF, indicating that they mattered far less than object boundaries for LULC classification using Sentinel optical and SAR data. Instead of relying on such hand-crafted features, CNNs learn spatial features automatically; these features are optimized by back-propagation and performed better than the hand-crafted GLCM textures.
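The RF-based ranking behind Figure 12 can in principle be reproduced as sketched below, assuming a feature matrix with named columns; the data and feature names here are placeholders, not the study's actual inputs.

```python
# Rank features by RF impurity-based importance (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 6)                     # placeholder feature matrix
y = np.random.randint(0, 3, 500)               # placeholder labels
names = ["B12_mean", "B8_mean", "B4_mean", "VV", "VH", "GLCM_contrast"]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, rf.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")
```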
4.6. CNN as Feature Extractor
The power of a CNN lies in its capability to extract spatial features and fuse the spatial-spectral features into a high-dimensional feature space in which a classifier can distinguish the different classes well. If the extracted features serve as input to SVM and RF, then their results should be as competitive as those of the CNN with the softmax classifier. In Table 8, CNN-RF and CNN-SVM denote the classification results of RF and SVM based on the spatial-spectral features extracted by the CNN, whereas CNN denotes the results based on the softmax classifier. The OAs using the CNN as a feature extractor for SVM, RF, and softmax were all competitive. Interestingly, the best OA on each dataset, for both optical-only and optical-SAR data, was obtained by RF, which may reflect the excellent generalization ability of RF: it can handle noisy inputs and thousands of features without feature selection.
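A minimal sketch of this setup follows, assuming a trained patch-based CNN whose penultimate layer yields the spatial-spectral feature vector; the architecture shown is illustrative, not the exact network used in this study.

```python
# Use the body of a patch-based CNN as a feature extractor and feed the
# learned features to RF instead of the softmax head.
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class SmallCNN(nn.Module):
    def __init__(self, bands=12, classes=10):
        super().__init__()
        self.body = nn.Sequential(                 # feature extractor
            nn.Conv2d(bands, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, classes)         # softmax classifier

    def forward(self, x):
        return self.head(self.body(x))

net = SmallCNN()                                   # assume already trained
patches = torch.rand(100, 12, 5, 5)                # placeholder 5x5 patches
labels = torch.randint(0, 10, (100,))
with torch.no_grad():
    feats = net.body(patches).numpy()              # spatial-spectral features
rf = RandomForestClassifier(n_estimators=200).fit(feats, labels.numpy())
```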
4.7. Sensitivity Analysis
4.7.1. Sensitivity Analysis of the Segmentation Scale
We conducted experiments to analyze the sensitivity of OBPR to the segmentation scale; the results are presented in Figure 13. For the Sentinel Guangzhou dataset (Figure 13a), OBPR is effective. The optimal scale lies in the range of 40 to 80, for which the average number of pixels per object varies approximately from 80 to 200 (8000–20,000 m²). When the scale is very small and each object contains very few pixels, the improvement made by OBPR is limited yet observable. When the scale is very large (e.g., greater than 150), the performance of OBPR degrades. Nevertheless, the improvement by OBPR is stable and effective, as the range of scales from 30 to 120 is quite wide and safe. From the segmentation images (Figure 14), we can observe that a scale of 30 produces a very fragmented segmentation, whereas a scale of 120 leads to under-segmentation. With a heuristic process, one can easily find a proper segmentation scale in this range.
For the Zhuhai-Macau LCZ dataset (Figure 13b), the improvement by OBPR is not as pronounced as on the Sentinel Guangzhou dataset. The effective scale varies from 15 to 30 (0.1–0.4 km² per object). Since this dataset has a very low spatial resolution (100 m), it might not be well suited for object-based classification.
The result on the University of Pavia dataset (Figure 13c) confirms that OBPR can achieve very satisfactory performance on high spatial resolution imagery. For the classification maps produced by CNN and RF, OBPR consistently improves the OA. For the classification map produced by SVM, OBPR degrades the result once the scale reaches 60. A segmentation scale of 60 is extremely large here, as each object contains almost 2000 pixels and the whole image is segmented into no more than 200 objects; moreover, the SVM's OA of 63.04% is quite low, and applying majority voting to an inaccurate classification can amplify the error. Nevertheless, with reasonable heuristic processing, it is easy to find a proper segmentation scale. The sensitivity analysis indicates that OBPR is not very sensitive to the scale and can be effective over a wide range of segmentation scales for high to medium spatial resolution imagery.
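The multiresolution segmentation used here is proprietary, but the scale sweep itself is easy to script; as a hedged stand-in, the sketch below uses scikit-image's felzenszwalb segmentation, whose scale parameter plays a loosely analogous role, on a placeholder image.

```python
# Heuristic scale sweep: inspect object counts and average object size.
import numpy as np
from skimage.segmentation import felzenszwalb

image = np.random.rand(128, 128, 3)            # placeholder RGB composite
for scale in (30, 60, 120):
    segments = felzenszwalb(image, scale=scale, min_size=20)
    n = segments.max() + 1
    avg_px = image.shape[0] * image.shape[1] / n
    print(f"scale={scale}: {n} objects, {avg_px:.0f} px/object on average")
```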
4.7.2. The Choice of Three Majority Voting Strategies
The results of the three majority voting strategies are presented in Figure 15. The left side of Figure 15 shows that the choice of strategy has a very limited effect (less than 0.5%) on OA. This result is expected because many pixels are present inside each object, and only a few objects encountered the situation in which at least two majority labels were detected (right side of Figure 15). In addition, the randomness of the dominant LULC type eases the problem.
4.7.3. The Effect of Patch Size
Classification maps for diverse patch sizes are shown in Figure 16. A large patch size (35 × 35) results in inaccurate classification of the small roads between mulberry fish ponds, whereas a small patch size (5 × 5) better captures small objects in the image. In addition, using a small patch size is computationally efficient, giving users the opportunity to apply deep learning models on their personal laptops without expensive GPUs.
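Extracting such small patches is straightforward; the sketch below is a minimal illustration, assuming an image cube of shape (rows, cols, bands) and using reflect padding so that 5 × 5 patches exist even at the border. The names are ours, not from the study's code.

```python
# Extract a size x size patch centered on a pixel, with reflect padding
# so border pixels also get full patches.
import numpy as np

def extract_patch(img, row, col, size=5):
    half = size // 2
    padded = np.pad(img, ((half, half), (half, half), (0, 0)),
                    mode="reflect")
    return padded[row:row + size, col:col + size, :]

img = np.random.rand(100, 100, 12)   # placeholder 12-band image cube
patch = extract_patch(img, 0, 0)     # 5x5x12 patch at the image corner
print(patch.shape)
```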
4.7.4. Number of Trees in RF
Previous studies have widely discussed the optimal number of trees for the RF classifier [65]. We tested the RF classifier on the optical-SAR Sentinel Guangzhou dataset with the number of trees ranging from 20 to 500 (Figure 17). The classification accuracy is insensitive to the number of trees, as pointed out by Du et al. [59], especially once it reaches 60. In addition, OBPR significantly outperforms OBIA regardless of the number of trees.
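Such a sweep takes one fit per setting with scikit-learn; the sketch below uses placeholder data rather than the Sentinel Guangzhou features.

```python
# Sweep the number of trees and report held-out OA for each setting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)            # placeholder features
y = np.random.randint(0, 9, 1000)       # placeholder labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for n in (20, 60, 100, 200, 500):
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    rf.fit(X_tr, y_tr)
    print(f"{n} trees: OA = {rf.score(X_te, y_te):.4f}")
```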
5. Conclusions
In this study, we developed a new method that equips CNNs with the ability to produce object-based thematic maps for LULC classification. Compared with the benchmark methods, the proposed OBPR-CNN delivers promising results with limited labeled samples. Our method was tested on three datasets with diverse spatial resolutions and different classification systems. Using Sentinel multispectral and SAR data, it obtained a remarkable result on the Sentinel Guangzhou dataset (OA of 95.33% and Kappa of 0.94) and a satisfactory result on the Zhuhai-Macau LCZ dataset (OA of 77.64%) despite limited and imbalanced labeled samples. It also achieved a very competitive result (OA of 95.70%) on the popular University of Pavia hyperspectral dataset with only 10 labeled samples per class. These results outperformed the traditional OBIA methods.
Through further analysis, we found that object-based GLCM textures were of little importance for LULC mapping in this study. The strength of OBIA lies mainly in its capability to produce object-based classification maps rather than in generating textural features, and the hand-crafted GLCM textures were inferior to the features learned by CNNs. Therefore, OBPR-CNN is better suited than OBIA for obtaining object-based thematic maps. The benefit of the combined use of optical and SAR data depended on the classifier: when CNNs were used, the addition of SAR data brought limited improvement for LULC mapping, whereas it played a significant role in distinguishing urban ground targets with the one-dimensional classifiers, i.e., SVM, RF, and MLP. This study is the first to evaluate the performance of combined optical and SAR data using CNNs. From the results, we may conclude that in the era of deep learning, the spatial information extracted by CNNs is more crucial for LULC mapping than the combined use of optical and SAR data. Nevertheless, the addition of SAR data together with the spatial information extracted by the CNN helped distinguish urban LULC classes such as roads, new town, and port areas.
Future studies may explore high spatial resolution SAR imagery (e.g., TerraSAR-X) with the proposed method. The fusion of multimodal, multisource, and multitemporal data for complicated classification tasks such as LCZ classification is also worth investigating.