Skip to main content
  • Research Paper
  • Open access
  • Published:

Combining Weibull distribution and k-nearest neighbor imputation method to predict wall-to-wall tree lists for the entire forest region of Northeast China

Abstract

Key message

We propose a coupled framework to combine the strengths of the Weibull function in modeling diameter distributions and the ability of the k-nearest neighbor (kNN) method to impute spatially continuous forest stand attributes for the prediction of wall-to-wall tree lists (lists of stems per hectare by species and diameter at breast height (DBH)) at regional scales. The tree lists of entire Northeast China’s forests predicted by the above framework reasonably reflect the species-specific tree density and diameter distributions.

Context

Detailed tree lists provide information about forest stocks disaggregated by species and size classes, which are crucial for forest managers to accurately characterize the current forest stand state to formulate targeted forest management strategies. However, regional tree list information is still lacking due to limited forest inventory.

Aims

We aimed to develop a coupled framework to enable the prediction of wall-to-wall tree lists for the entire forest region of Northeast China, then analyze the species-specific diameter distributions and reveal the spatial patterns of tree density by species.

Methods

A two-parameter Weibull function was used to model the species-specific diameter distributions in the sample plots, and a maximum likelihood estimation (MLE) was used to predict the parameters of the Weibull distributions. The goodness-of-fit of the predicted species-specific Weibull diameter distributions in each plot was evaluated by Kolmogorov-Smirnov (KS) test and an error index. The kNN model was used to impute the pixel-level stand mean DBH.

Results

Weibull distribution accurately described the species-specific diameter distributions. The imputed stand mean DBH from the kNN model showed comparable accuracy with earlier studies. No difference was detected between predicted and observed tree lists, with a small error index (0.24–0.58) of diameter distributions by species. The fitted species-specific diameter distributions generally showed a right-skewed unimodal or reverse J-shaped pattern.

Conclusion

Overall, the coupled framework developed in this study was well-suited for predicting the tree lists of large forested areas. Our results evidenced the spatial patterns and abundance of tree species in Northeast China and captured the forest regions affected by disturbances such as fire.

1 Introduction

As knowledge of the total stand volume, aboveground biomass, or basal area is insufficient, forest managers cannot accurately characterize the current stand state, project the future growth and yield, formulate the harvest schedules, and prioritize the stands for treatments (e.g., thinning or regeneration harvest) (Mauro et al. 2019). Accordingly, detailed information about forest stocks disaggregated by species and size class is of primary importance for effective forest resource management (Brosofske et al. 2014). Notably, detailed tree lists [lists of stems per hectare by species and diameter at breast height (DBH)] (Temesgen et al. 2003) can be used to satisfy the above requirements. Therefore, the prediction of spatially continuous tree lists is urgently needed for these evaluations and is an important output of a complete forest inventory.

Tree lists can be used as inputs for tree and stand projection models (growth and yield models) to predict tree growth (Lamb et al. 2018) and assess stand-level structures (Temesgen et al. 2003). In particular, regional tree lists provide key information for the initialization of forest landscape models (e.g., LANDIS PRO) to simulate the long-term effects of succession, climate change, and disturbances (e.g., fire and pests) on forest composition over large areas (Duan et al. 2022; Huang et al. 2021). Therefore, spatially explicit tree lists are required to develop tactical forest management in response to various disturbances. The generated maps of wall-to-wall tree lists can be used as valuable indicators by forest managers to intuitively understand the species-specific density distribution at different DBH classes, evaluate the necessity of thinning, and plan the location and intensity of the next selective cuttings (Valbuena et al. 2014).

The collection of detailed tree list information contiguously across large forest regions is often not practical as it is time-consuming, labor-intensive, and spatially constrained (Zhang et al. 2019). In Northeast China, forests cover approximately 45 million hectares, and most forested lands are located in mountains with few roads. Due to the large forested land base and difficult access, detailed tree lists are only available for limited sample stands or plots. Stand-level inventory information, such as stand DBH, stand height, and species composition, collected from the national forest inventory (Zhang and Liang 2014), is available for thousands of stands in each forestry bureau. This information can be imputed to large areas by integrating remotely sensed data with complete spatial coverage (Fu et al. 2019; Zhang et al. 2018a). Therefore, an emerging need exists to develop a cost-effective methodology that integrates forest inventory data, remotely sensed data, and other ancillary data sets to predict wall-to-wall tree lists for forest regions based on limited sample plot data.

Tree lists can be approximated using species-specific diameter distributions for mixed forest stands (Brosofske et al. 2014). Therefore, many studies have used stand- or plot-level information measured on the ground to generate tree lists using a parametric diameter distribution modeling approach (Bankston et al. 2021; Olofsson and Olsson 2018). This approach usually uses a probability density function (PDF) (e.g., gamma, beta, exponential, log-normal, Johnson’s SB, Weibull) to fit the diameter distribution for each stand and then predicts the PDF parameters using stand attributes (Liu et al. 2009; Temesgen et al. 2003). The two-parameter Weibull function has been reported to be the simplest and most accurate PDF for diameter distribution modeling for tree species worldwide, owing to its flexibility in describing different shapes of diameter distributions and fewer parameters (Diamantopoulou et al. 2015; Schmidt et al. 2020; Schütz and Rosset 2020). Schmidt et al. (2020) modeled the diameter distributions of clonal eucalyptus stands in Brazil using a Weibull function and re-measured information from 56 permanent sample plots as predictor variables. Schütz and Rosset (2020) compared different methods of predicting Weibull distribution parameters to better describe the diameter distributions of Norway spruce (Picea abies (L.) Karst.) and European beech (Fagus sylvatica L.) in monospecific, regular temperate European forests. However, the diameter distribution modeling approach is only suitable for simple stands with few species and small contiguous areas (Temesgen et al. 2003). For large regions, the application of the Weibull diameter distribution is limited by the availability of the stand attributes that are assumed to have strong correlations with the PDF parameters, which can be predicted by nonparametric imputation methods (Fu et al. 2019).

The nonparametric k-nearest neighbor (kNN) imputation model has been widely applied to predict wall-to-wall forest stand attributes across large regions as it can simultaneously predict a suite of response variables, such as species-specific volume (Breidenbach et al. 2010), species-level biomass (Fu et al. 2019), basal area (Wilson et al. 2012), or stand mean DBH (Zhang et al. 2018b). kNN imputation is also a reliable strategy for the attribute prediction of complex forest stands as it can maintain both the spatial and attribute covariance structures of the stand attributes (Temesgen et al. 2003). The process of kNN imputation for generating wall-to-wall stand attributes (response variables) involves computing the statistical distance between the unsampled target pixel and the reference samples (neighbors) and then assigning the mean value of the kNNs’ attributes to the target pixel (Brosofske et al. 2014). In the development of a kNN imputation model, in addition to the suitable selection of distance metrics and k values, the selection of predictor variables can be more crucial. The predictor variables should be functionally or statistically related to the response variables (Eskelson et al. 2009). The predictor variables used in the kNN model built for stand attributes imputation usually include soil, climate, topographic variables, and vegetation indices derived from optical remote sensing imagery, such as Landsat (Gjertsen 2007) and Moderate Resolution Imaging Spectroradiometer (MODIS) (Fu et al. 2019; Zhang et al. 2018a). The use of airborne light detection and ranging (LiDAR) in the kNN model for the prediction of forest attributes has been rapidly increasing and is promising for application in specific forest management (Hudak et al. 2008; Mauro et al. 2019). For example, the accurate prediction of fine-scale (e.g., tree-level or stand-level) attributes might require predictor variables with three-dimensional and high-resolution information. However, LiDAR lacks imaging capabilities and cannot provide spatially continuous maps (Chi et al. 2015). Therefore, optical imagery is more suitable for use in the kNN model to generate the wall-to-wall stand attributes across large regions.

The objective of this study was to develop a coupled framework combining the strengths of the Weibull function in modeling diameter distributions and the ability of the kNN method to impute spatially continuous forest stand attributes for the prediction of wall-to-wall tree lists at regional scales based on limited sample plot data via the integration of national forest inventory, remotely sensed, and other ancillary data. Further, the coupled framework was tested and assessed, and the wall-to-wall tree lists at a 250-m spatial resolution were predicted for the entire forest region of Northeast China. Finally, the species-specific diameter distributions were analyzed and the spatial patterns of tree density by species were revealed.

2 Material and methods

2.1 Study area

The study area is located in northeastern China (38° 42′–53° 35′ N, 115° 32′–135° 09′ E) and covers 124 million hectares, encompassing three provinces (Heilongjiang, Jilin, and Liaoning) and the eastern part of the Inner Mongolia Autonomous Region. The area is divided into seven major subregions that reflect diverse climates, terrains, soils, and vegetation (Fu et al. 2019): the Greater Khingan Mountains (GKM), Lesser Khingan Mountains (LKM), Changbai Mountains (CBM), Sanjiang Plain (SJP), Songnen Plain (SNP), Liaohe Plain (LHP), and Hulun Buir Plateau (HBP) (Fig. 1). Approximately 40% of the region is forested (Pan et al. 2011), with forests mainly distributed over the Greater Khingan, Lesser Khingan, and Changbai Mountains. The forests in Northeast China are mixed, spanning various forest types, such as cold-temperate conifer mixed forests and temperate coniferous and broad-leaved mixed forests. Cold-temperate conifer mixed forests are distributed in the northern part of the Greater Khingan Mountains and are dominated by larch (Larix gmelinii (Rupr.) Kuzen), white birch (Betula platyphylla Suk.), aspen (Populus davidiana Dode), spruce (Picea koraiensis Nakai), Asian black birch (Betula davurica Pall., hereafter black birch), Mongolian oak (Quercus mongolica Fisch. ex Ledeb.), Mongolian Scotch pine (Pinus sylvestris var. mongolica Litv., hereafter Scotch pine), and willow (Chosenia arbutifolia (Pall.) A. Skv.). Temperate coniferous and broad-leaved mixed forests are distributed in the Lesser Khingan and Changbai Mountains, mainly composed of Korean pine (Pinus koraiensis Sieb.et Zucc.), larch, fir (Abies nephrolepis (Trautv.) Maxim.), spruce, Mongolian oak, white birch, aspen, ribbed birch (Betula costata Trautv.), basswood (Tilia amurensis Rupr.), mono maple (Acer mono Maxim.), elm (Ulmus pumila L.), Manchurian walnut (Juglans mandshurica Maxim., hereafter walnut), Manchurian ash (Fraxinus mandschurica Rupr., hereafter ash), and Amur corktree (Phellodendron amurense Rupr.).

Fig. 1
figure 1

Seven subregions, forest cover fraction, and field data distribution in Northeast China

2.2 Field data

Two types of field data ( plot and stand) were collected for this study. The limited sample plot data with detailed tree lists were used to build species-specific Weibull distribution parameter prediction models (WDPPMS), and the stand inventory data without tree lists were used as response variables in the kNN imputation model to predict the wall-to-wall stand attributes.

The detailed tree lists of 641 rectangular (20 m × 30 m) sample plots were measured from 2009 to 2013 in the study area. The sample plots were mainly distributed over the Greater Khingan Mountains, Lesser Khingan Mountains, and Changbai Mountains (Fig. 1) and contained 17 major tree species in Northeast China. The plot number, GPS coordinates, elevation, slope, forest type, and canopy density for each plot were recorded. Within each plot, tree species, diameter at breast height (DBH), and tree height for every tree with a DBH larger than 5 cm were measured. The arithmetic mean DBH by species in each plot was calculated by dividing the sum of tree DBHs by the number of corresponding tree species.

The stand inventory data were derived from the National Secondary Forest Resources Continuous Inventory in China and organized as stand polygons. A total of 25,000 forest stand polygons (average polygon size: 20.6 ha) from Northeast China surveyed in the early 2000s were collected from the National Forestry and Grassland Data Center (http://www.cfsdc.org/). Each stand polygon was a contiguous area ranging from a few hectares to tens of hectares with relatively homogeneous forest attributes, such as stand mean DBH, stand height, stand age, stand volume density, tree species composition (volume proportion by species), and dominant species. The stand mean DBH is the DBH corresponding to the average basal area of the stand, which differs from the arithmetic mean DBH. In this study, owing to the lack of precise conversion relationships between stand mean DBH and mean arithmetic DBH of major tree species in Northeast China, the stand mean DBH was used instead of the mean arithmetic DBH by species to predict the wall-to-wall Weibull diameter distribution parameters by species. The geometric center sites of 25,000 stand polygons are shown in Fig. 1.

Polygon boundaries were generated by interpreting aerial photos in accordance with the technical regulations for forest management planning and design inventory in China. Within each polygon, stand attributes were estimated using angle-count sampling. Each plot was more than 50-m away from the stand boundary, and the distance between the two angle-count plots was at least 100 m (Zhang et al. 2018a). Trees with a DBH larger than 5 cm were counted in each angle gauge plot (basal area factor = 1), and the DBH of each tree was converted to volume according to the species-specific DBH-volume relationships summarized by the National Forestry and Grassland Data Center (http://www.cfsdc.org/). The volume density of each angle gauge plot was derived by aggregating all single-tree volume estimates counted by species. Further, the stand volume density was estimated by averaging the volume density of all angle gauge plots in each polygon (Zhang et al. 2018a). The species-level biomass of each stand polygon was transformed from the species-level volume based on species-specific biomass-volume relationships (Fang et al. 1998).

2.3 MODIS and environmental data

The selection of predictor variables is crucial when the kNN imputation method is employed to predict wall-to-wall stand attributes (e.g., stand mean DBH or species-level biomass). Forest stand attributes can be characterized by rich spectral information from multispectral remote sensing images and are influenced by environmental factors. In this study, seven Moderate Resolution Imaging Spectroradiometer (MODIS) surface reflectance bands from MOD09Q1 (b1–b2; 250 m) and MOD09A1 (b3–b7; 500 m) from 2000 were selected as basic data to derive the predictor variables for further stand attribute imputation. By averaging all reflectance values of each month in 2000, the above seven reflectance bands were processed into monthly data and resampled to 250 m using nearest-neighbor interpolation. Several spectral indices that correlate with the stand characteristics (Fu et al. 2019) were calculated from the monthly reflectance bands (Additional file 1: Table S1). As the surface reflectance in Northeast China is largely affected by snow or frost cover from January to April and November to December (Zhang et al. 2013), only the monthly MODIS reflectance bands and spectral indices from May to October were used as predictor variables. To avoid imputing stand attributes to non-forest areas, forest areas that were defined as tree cover greater than 10% (Schmitt et al. 2009) were extracted using the MODIS Vegetation Continuous Fields (VCF) product MOD44B (250 m) for the year 2000.

Environmental data (e.g., topographic, climate, and soil data) that correlate with the stand attributes were selected as auxiliary predictor variables (Additional file 1: Table S1) to reduce prediction uncertainties due to environmental heterogeneity. Topographic data (elevation, slope, and cosine of aspect) were derived from the Shuttle Radar Topography Mission (SRTM) digital elevation model (DEM), with a 90-m spatial resolution (http://srtm.csi.cgiar.org/). Climate data (mean annual temperature and cumulative precipitation during 1982–2015; 250 m) were interpolated from the 103 meteorology stations in Northeast China (https://data.cma.cn/) using ANUSPLIN 4.3 software, in which a thin-plate spline function with the resampled SRTM DEM (250 m) as the covariate was used (Fu et al. 2019). Other climate data, such as mean annual radiation and potential evapotranspiration from 1982 to 2015 with a spatial resolution of 0.1° × 0.1°, were derived from the China Meteorological Forcing Dataset (http://westdc.westgis.ac.cn). Soil data with a 1-km spatial resolution were derived from the Harmonized World Soil Database v1.2 (Fao/Iiasa/Isric/Isscas/Jrc 2012). To match the spatial resolution of the MODIS data, the topographic data and other environmental data were resampled to a 250-m spatial resolution using bilinear interpolation and nearest-neighbor interpolation, respectively.

2.4 Overall framework for wall-to-wall tree list prediction

The framework for wall-to-wall tree list prediction included the following three steps (Fig. 2): (1) building Weibull distribution parameter prediction models (WDPPMS) with limited tree list data collected from sample plots to predict species-specific diameter distributions in each pixel; (2) mapping spatially continuous stand attributes (e.g., species-level biomass and stand mean DBH) using the kNN imputation models based on stand inventory data, MODIS, and environmental data, and providing the independent variables for WDPPMs in each pixel; and (3) predicting the wall-to-wall tree lists over Northeast China by combining the Weibull diameter distributions predicted by the WDPPMS and forest attributes imputed by the kNN models.

Fig. 2
figure 2

Workflow adopted for predicting the wall-to-wall tree lists over the forest region of Northeast China. WDPPMS, Weibull distribution parameter prediction models; DBH, diameter at breast height; kNN, k-nearest neighbor. The maps of species-level biomass were obtained from our previous study (Fu et al. 2019)

2.4.1 Building Weibull distribution parameter prediction models (WDPPMS)

The two-parameter Weibull function was used to model the diameter distribution of the 641 plots by species; the function is expressed as follows:

$$f(x)=\frac{b}{c}{\left(\frac{x}{c}\right)}^{b-1}\exp \left[-{\left(\frac{x}{c}\right)}^b\right]$$
(1)

where f(x) represents the probability density function for x, x represents the DBH, b and c are the shape and scale parameters of the Weibull function, respectively.

The maximum likelihood estimation (MLE) was used to predict the parameters of the two-parameter Weibull function for each species in each plot. Owing to its good statistical characteristics and applicability, MLE is generally considered to be the most frequently used method for predicting distribution parameters (Diamantopoulou et al. 2015). The MLE equation is as follows:

$$L\left({x}_{1,}{x}_{2,}\dots, {x}_n;\theta \right)=\prod\limits_{i=1}^nf\left({x}_i;\theta \right)$$
(2)

where x1,x2,…, xn are the sample observations, L(x1,x2,…, xn; θ) is the MLE function, and f(xi; θ) is the probability density function based on the unknown parameter θ. The maximum probability of the sample observations (x1,x2,…, xn) is necessary when the MLE method is used to predict the value of θ. If the MLE function (L) is differentiable, the value of θ must satisfy the following formula:

$$\frac{d L}{d\theta}=0$$
(3)

The unknown parameter θ can be predicted according to (3). The MLE method was implemented using the R package, MASS. Once the shape (b) and scale (c) parameters of the Weibull diameter distributions for each species in each plot were obtained, the relationship between these two parameters and stand variables could be established. In this study, the arithmetic mean DBH by species was selected as the independent variable to build WDPPMS for each species using the ordinary least squares (OLS) method.

2.4.2 Mapping stand attributes

The wall-to-wall stand attributes, including species-level biomass and stand mean DBH, were mapped by integrating the forest attributes extracted from stand polygons (as response variables) with spatially continuous predictor variables (e.g., MODIS and environmental data), as described in Section 2.3, using the kNN imputation model. By comparing the prediction accuracy of 630 kNN models in imputing stand attributes, our previous study mapped the wall-to-wall species-level biomass in 2000 for forests in Northeast China using the optimal kNN model based on the random forest distance metric, single-month (September) MODIS predictor variables, and k = 7 (Fu et al. 2019). In this study, the kNN imputation model was used to map the stand mean DBH in 2000 for the forested region of Northeast China. The imputation process was implemented using the yaImpute package in R.

The kNN imputation method involves the identification of the k-nearest reference samples in a feature space defined by the predictor variables for each target pixel. The values of each response variable within these k-nearest samples are averaged and assigned to the target pixel. Formally, the nearest neighbors prediction, \({\overset{\sim }{\mathrm{y}}}_i\), for the ith target pixel is calculated as follows (McRoberts 2012):

$$\tilde{y}_i=\sum_{j=1}^k{w}_{ij}{y}_j^i$$
(4)

where {\({y}_j^i\)j = 1, 2, …, k} is the set of response variable observations for the k reference pixels that are nearest to the ith target pixel in feature space with respect to a given distance metric, and wij is the weight assigned to the jth nearest neighbor with \(\sum_{j=1}^k{w}_{ij}=1\). The wij is defined as follows:

$${w}_{ij}=\frac{1/\left(1+{d}_{ij}\right)}{\sum_{j=1}^k\left[1/\left(1+{d}_{ij}\right)\right]}$$
(5)

where dij is the distance between the jth nearest-neighbor reference pixel and ith target pixel.

To execute the kNN imputation models, a forest stand polygon (average polygon size of 20.6 ha) was used as the unit of observation. The MODIS and environmental predictor variables of each stand polygon were calculated as the mean values of the raster cells (250-m spatial resolution) with more than 50% of the pixel area covered by the stand polygon. The 25,000 stand polygons containing both response variables (stand attributes) and predictor variables (MODIS and environmental variables) were split into training and testing datasets using a 7:3 split ratio, 70% of which were used to build the kNN imputation model to predict wall-to-wall stand attributes (Fig. 2); the remaining 30% of the stand polygons were used to evaluate the prediction accuracy.

2.4.3 Predicting wall-to-wall tree lists

Maps of species-level biomass for the 17 major species in the study area in 2000 were derived from our previous study, where the prediction accuracy of the species-level biomass (Additional file 1: Fig. S1) was provided (Fu et al. 2019). In this step, maps of species-level biomass were used to predict the tree lists in each pixel by combining the Weibull diameter distribution by species. Once the parameters of the Weibull diameter distribution by species for each pixel were predicted by WDPPMS, the proportion of tree density at each diameter class by species (PTDS) could be calculated for each pixel. Thereafter, the proportion of the species-level biomass at each diameter class (PSB) was calculated by combining PTDS and single-tree biomass by species at each diameter class (Dong et al. 2014; Dong et al. 2015). Finally, the wall-to-wall species-level biomass was divided into each diameter class according to the PSB and converted to the tree density at each diameter class (wall-to-wall tree lists) by dividing them by the single-tree biomass at each diameter class (Fig. 2).

2.5 Accuracy assessment

To guarantee the reliable accuracy of our coupled framework, each step of the framework was assessed using various accuracy metrics.

In step 1, the Kolmogorov-Smirnov (KS) test (Riemann et al. 2010) and the error index proposed by Packalén and Maltamo (2008) were applied to assess the goodness-of-fit of the two-parameter Weibull function for predicting the diameter distribution by species in each plot from the MLE. The KS test does not make any assumption about the distribution of data and quantifies the consistency between the distributions of the two data sets (e.g., observed and predicted samples) based on the maximum difference (D) in their empirical distribution functions [e.g., F(x) and G(x)], which is defined as follows:

$$\mathrm{D}=\max \mid F(x)-G(x)\mid$$
(6)

The larger the P value of the KS test, the higher the prediction accuracy of the DBH Weibull distribution parameters of each tree species obtained by MLE. For each tree species, the plots that passed the KS test were selected as training data to establish the relationship between the stand mean DBH and the two parameters of the Weibull distribution and further calculate the error index of the DBH distribution of each species in each plot. The error index for each plot was calculated as follows:

$$e=\sum_{i=1}^k0.5\left|\frac{f_i}{N}-\frac{{\hat{f}}_i}{\hat{N}}\right|$$
(7)

where fi and \({\hat{f}}_i\) are the observed and predicted stem numbers of class i by species, respectively, k is the number of diameter classes, and N and \(\hat{N}\) are the observed and predicted stem numbers of all diameter classes by species, respectively. The class interval was 2 cm. The value of the error index ranges from 0 to 1, with 0 indicating a perfect fit and 1 indicating no overlap of the distributions (Packalén and Maltamo 2008).

The WDPPMS was evaluated using R2, the root mean square error (RMSE), and bias. The three metrics were calculated based on the parameters of the Weibull diameter distribution function using leave-one-out cross-validation (LOOCV):

$${R}^2=1-\frac{\sum_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2}{\sum_{i=1}^n{\left({y}_i-{\overline{y}}_i\right)}^2}$$
(8)
$$\mathrm{RMSE}=\sqrt{\frac{\sum_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2}{n}}$$
(9)
$$\mathrm{Bias}=\frac{\sum_{i=1}^n\left({y}_i-{\hat{y}}_i\right)}{n}$$
(10)

where n is the number of samples, yi is the observed value for sample i, \({\hat{y}}_i\) is the predicted value for sample i, and \({\overline{y}}_i\) is the mean value of all observed values. To assess the WDPPMS accuracy, n is the number of plots used to fit WDPPMS for each tree species, yi is the predicted shape (b) or scale parameter (c) value of plot i based on MLE, and \({\hat{y}}_i\) is the predicted shape (b) or scale parameter (c) value of plot i based on WDPPMS.

In step 2, the accuracy of the predicted wall-to-wall stand mean DBH over the forest region of Northeast China was evaluated at two scales (stand and ecotype). The study area was classified into 239 ecotypes based on seven subregions, nine landforms (Additional file 1: Fig. S2), and five forest types (Additional file 1: Fig. S3). The seven subregions reflect differences in temperature and moisture conditions. Nine landforms (ridge, upper slope, sunny slope, semi-sunny slope, semi-shady slope, shady slope, flat slope, lower slope, and valley) were classified from DEM using a Topographic Position Index (Zhang et al. 2009). Five forest types (evergreen coniferous, evergreen broad-leaved, deciduous coniferous, deciduous broad-leaved, and mixed forest) were extracted from the resampled MODIS land cover product MCD12Q1 (250 m) in 2000.

At the stand scale, the predicted stand mean DBH of the remaining 30% of stand polygons was calculated by averaging the stand mean DBH values of all pixels with over 50% of their area located in each forest stand polygon. Thereafter, both the observed and predicted stand mean DBH of the testing dataset were averaged to 65 ecotypes at the ecotype scale. The accuracy of the predicted wall-to-wall stand mean DBH over the forest region of Northeast China at stand and ecotype scales was evaluated by calculating the R2, RMSE, bias, empirical cumulative distribution functions (ECDFs), and KS statistic (Eq. (6)) between the predicted and observed stand mean DBH. The R2, RMSE, and bias provide an overall assessment of the prediction accuracy. The ECDFs and KS statistics can quantify the discrepancy in the distributions of the predicted and observed stand mean DBH.

In step 3, to assess the accuracy of the predicted wall-to-wall tree lists in 2000, the tree lists of the 641 sample plots in 2000 were first obtained as the validation data by transforming the tree lists of 641 sample plots from 2009 to 2013 by subtracting the DBH growth of the 17 species in the past 9–13 years using the age-DBH relationships summarized in previous studies (Xu et al. 2020; Zhang et al. 2018b). The central points of the 641 sample plots were used to extract the predicted wall-to-wall tree lists at the corresponding pixel locations. Thereafter, the observed and predicted tree lists of the 641 sample plots in 2000 included the tree density (trees ha-1) of each diameter class by species was used to calculate the error index (Eq. (7)), which was used to evaluate the prediction accuracy of the wall-to-wall tree lists over Northeast China. We separately calculated the error indices by species for each plot and for all plots as a whole.

3 Results

3.1 Weibull distribution parameter prediction models (WDPPMS)

Based on the KS test, the diameter distributions of 17 major tree species in the forest region of Northeast China could be well fitted by the two-parameter Weibull functions predicted by the MLE method, especially for the white birch (B. platyphylla Suk.), basswood (T. amurensis Rupr.), black birch (B. davurica Pall.), mono maple (A. mono Maxim.), elm (U. pumila L.), walnut (J. mandshurica Maxim.), Korean pine (P. koraiensis Sieb. et Zucc.), ribbed birch (B. costata Trautv.), and fir (A. nephrolepis (Trautv.) Maxim.) (Table 1; agreement > 90%).

Table 1 Kolmogorov-Smirnov (KS) test (P > 0.05) of the DBH distribution based on the two-parameter Weibull function on 17 tree species in the forest region of Northeast China

The calculated error indices revealed that the diameter distributions by species in each plot were well-fitted, with the mean error index value ranging from 0.24 to 0.48 and standard deviation of the error index ranging from 0.09 to 0.2. When the number of trees in each diameter class of all plots as a whole by species is summed to calculate the error index, the diameter distributions of all species were well represented with a lower value of error index (0.1–0.38) (Table 2).

Table 2 Error index of the DBH Weibull distribution for 17 tree species in all plots

The accuracy assessment of WDPPMs by LOOCV demonstrated that the scale parameter had a strong correlation with the arithmetic mean DBH for all 17 tree species, with high R2 (close to 1) and low RMSE (0.06–0.52) (Table 3). Although the relationship between the shape parameter and arithmetic mean DBH of each tree species was relatively weak (low R2 and large RMSE), the shape parameter prediction models for all species passed the significance test (P < 0.05), with bias = 0 (Table 3).

Table 3 Equations of shape parameter (b) and scale parameter (c) of Weibull distribution based on arithmetic mean DBH (d, unit: cm) for 17 tree species and accuracy assessment using LOOCV

3.2 Map of stand mean DBH

The average value of the stand mean DBH over the forest region of Northeast China was 13.22 cm. The predicted stand mean DBH of the severely burned area in 1987 was the lowest, with an average value of 8.86 cm (Fig. 3). Subregion CBM, especially the CBM Nature Reserve, had the highest prediction of stand mean DBH. In fact, the average stand mean DBH of subregion CBM was 15.20 cm. The predicted stand mean DBH of the other two major forest regions (subregion LKM and GKM) was lower than that of subregion CBM, with an average value of 12.67 cm and 12.25 cm, respectively.

Fig. 3
figure 3

Map (250-m spatial resolution) of the predicted stand mean DBH in the forest region of Northeast China for the year 2000

The accuracy of the predicted stand mean DBH improved substantially from the stand scale to the ecotype scale, the R2 increased from 0.56 to 0.88, the RMSE decreased from 3.33 to 1.08 cm, and the bias decreased from 0.13 cm to 0.01 cm (Fig. 4a, c). Although the KS distance (D = 0.09) was the same at the ecotype and the stand scale, a higher P value (0.94 vs. 0) for KS distance at the ecotype scale revealed that the predicted and observed stand mean DBH ECDFs became more similar (Fig. 4b, d).

Fig. 4
figure 4

a Density scatterplot between the observed and predicted stand mean DBH at the stand scale, b cumulative distribution functions of the observed and predicted stand mean DBH at the stand scale, c density scatterplot between the observed and predicted stand mean DBH at the ecotype scale, and d cumulative distribution functions of the observed and predicted stand mean DBH at the ecotype scale. The dotted line in a and c is the 1:1 line and the solid line in a and c is the regression line of geometric mean function

3.3 Prediction of the tree lists

The predicted wall-to-wall tree lists revealed the DBH distributions of the 17 major tree species in the forest region of Northeast China (Fig. 5). The predicted DBH of most tree species was concentrated in the range of 6–38 cm, except for spruce (P. koraiensis Nakai), ash (F. mandschurica Rupr.), and Amur corktree (P. amurense Rupr.), which was distributed at a DBH greater than 40 cm. The DBH distributions of aspen (P. davidiana Dode), Scotch pine (P. sylvestris var. mongolica Litv.), spruce (P. koraiensis Nakai), ash (F. mandschurica Rupr.), Amur corktree (P. amurense Rupr.), and willow (C. arbutifolia (Pall.) generally showed a reverse J-shape, and the density of the above tree species was the highest at a DBH of 6 cm. The DBH of white birch (B. platyphylla Suk.), larch (L. gmelinii (Rupr.) Kuzen), Mongolian oak (Q. mongolica Fisch. ex Ledeb.), basswood (T. amurensis Rupr.), black birch (B. davurica Pall.), mono maple (A. mono Maxim.), elm (U. pumila L.), walnut (J. mandshurica Maxim.), Korean pine (P. koraiensis Sieb. et Zucc.), ribbed birch (B. costata Trautv.), and fir (A. nephrolepis (Trautv.) Maxim.) displayed a right-skewed unimodal distribution, and the tree density of these tree species reached its peak at a DBH of 12, 14, or 16 cm.

Fig. 5
figure 5

Predicted tree lists by species over the forest region of Northeast China in 2000

The predicted wall-to-wall tree lists in 2000 had a comparable accuracy to the predicted tree lists of 641 plots from 2009 to 2013 (Table 2, Figs. 6 and 7), with a mean value for the error index of DBH distribution of each species between 0.24 and 0.58 (Fig. 6). When all plots were combined as a whole, the error index of the DBH distribution of each tree species was lower than the mean value of the error index calculated based on all plots, indicating improved accuracy of the predicted tree lists for most species (Fig. 7). Except for aspen (P. davidiana Dode), walnut (J. mandshurica Maxim.), and Amur corktree (P. amurense Rupr.), the predicted density of the other tree species in the smallest DBH class (6 cm) was lower than the observed density. Notably, the predicted and observed DBH distribution curves for black birch (B. davurica Pall.) was the most consistent, showing a left-skewed unimodal distribution, and the error index was the smallest (0.13). Therefore, the prediction accuracy of the DBH distribution of black birch (B. davurica Pall.) was deemed to be the highest (Fig. 7).

Fig. 6
figure 6

Error index calculated based on the predicted wall-to-wall and transformed inventory tree lists in 2000 by species for each plot. The blue diamond represents the mean value of the error index calculated based on all plots

Fig. 7
figure 7

DBH distribution frequency by species of predicted wall-to-wall tree lists versus observed tree lists in 2000 based on the transformed sample plot data. The error index was calculated by combining all transformed plots as a whole

3.4 Maps of species-specific tree density from tree lists

For the maps of species-specific tree density in the forest region of Northeast China in 2000 (Fig. 8), pixels with a tree density greater than 0 indicated the location of tree species growth. The total tree density of the 17 species mapped by summing the species-level tree density within each pixel mainly ranged from 700 trees ha-1 to 3000 trees ha-1, and the average total tree density for the forest region of Northeast China was approximately 1462 trees ha-1 (Fig. 8). Among the three major forest distribution areas of Northeast China (subregions GKM, LKM, and CBM), the average total tree density in subregion LKM (1510 trees ha-1) was higher than that in subregions GKM (1506 trees ha-1) and CM (1393 trees ha-1). In the northeastern part of ecoregion SJP, owing to the large number of newly planted Scotch pine (P. sylvestris var. mongolica Litv.) forests, the total tree density in this region was high (Fig. 8).

Fig. 8
figure 8

Maps of total and species-level tree density in the forest region of Northeast China in 2000

By counting the mean value of pixels with tree density greater than 0 for each tree species, the average tree density of larch (L. gmelinii (Rupr.) Kuzen) (1009 trees ha-1) in its existing area was found to be the highest. In terms of spatial distribution, the tree density of larch (L. gmelinii (Rupr.) Kuzen) in each pixel was significantly higher than that of the other tree species (Fig. 8). White birch (B. platyphylla Suk.) had an opposing trend to larch (L. gmelinii (Rupr.) Kuzen) in the spatial distribution of tree density, especially in the severely burned area of the GKM subregion. Although the distribution range of willow (C. arbutifolia (Pall.) A. Skv.) was the smallest among the 17 tree species, with a high average tree density (766 trees ha-1) in its existing area, second only to that of larch (L. gmelinii (Rupr.) Kuzen), indicating that the distribution of willow (C. arbutifolia (Pall.) A. Skv.) was relatively dense. The tree density of Scotch pine (P. sylvestris var. mongolica Litv.) in the eastern part of the LKM subregion and the northeastern part of the SJP subregion was significantly higher than that in the other regions. The tree density of spruce (P. koraiensis Nakai) and fir (A. nephrolepis (Trautv.) Maxim.) in subregion LKM was higher than that in subregion CM. For other tree species, the predicted wall-to-wall tree density for most pixels was less than 700 trees ha-1. For the Mongolian oak (Q. mongolica Fisch. ex Ledeb.), black birch (B. davurica Pall.), mono maple (A. mono Maxim.), walnut (J. mandshurica Maxim.), ash (F. mandschurica Rupr.), and ribbed birch (B. costata Trautv.), the predicted wall-to-wall tree density for most pixels was less than 300 trees ha-1.

4 Discussion

4.1 Selection of the coupled framework for predicting wall-to-wall tree lists

With the increasing need for detailed forest attributes (e.g., tree lists) to answer regional questions, including understanding the current structure and succession status of a forest, formulating targeted forest management strategies for different forest regions, or predicting tree species distribution shifts under climate change (Ferraz et al. 2020; Wang et al. 2019), the development of efficient and accurate methods to obtain such regional tree list information has attracted significant research interest. Although previous studies have applied both parametric diameter distribution modeling (e.g., Weibull function) and nonparametric imputation (e.g., kNN model) to predict tree lists (Packalén and Maltamo 2008; Schütz and Rosset 2020; Shang et al. 2017), the integration of these two methods proposed in our study is novel for generating wall-to-wall tree lists over large areas with limited forest inventory data.

The Weibull function, especially the two-parameter Weibull function, is extensively used to describe the diameter distributions of single- and mixed-species stands (Hao et al. 2022; Liu et al. 2014; Schmidt et al. 2020; Zhang et al. 2018b). Although Liu et al. (2014) found that a finite mixture model (FMM) of Weibull functions was more flexible for describing irregular and highly skewed diameter distributions for the whole plot in the mixed-species forest stands of the Greater Khingan Mountains in Northeast China than a single Weibull function to fit the whole plot only, a single Weibull function to fit each species component separately was more accurate than the FMM. FMM is often difficult to apply to large regions owing to the many unknown parameters that require good fitting (Zasada and Cieszewski 2005).

A prerequisite for applying a single Weibull function to fit each species component separately to large regions is the available information on wall-to-wall species composition, which is difficult to obtain through traditional forest resource inventory alone (Liu et al. 2014). Our coupled framework can use a Weibull function to fit each species component separately and provide information regarding the species composition across large regions using kNN imputation models based on MODIS data. MODIS data have broader spatial coverage and higher temporal resolution than other high spatial resolution multispectral satellite imagery (e.g., Landsat), provide abundant and near real-time spectral information, and have been widely used to map spatially explicit forest attributes over large regions (Beaudoin et al. 2018; Hayes et al. 2008; Zhang et al. 2018a; Zhang et al. 2013). Therefore, our coupled framework is feasible and effective for predicting species-specific diameter distributions and tree density on a regional scale.

4.2 Accuracy and reliability of the coupled framework’s predictions

Accuracy is an important issue in regional forest attributes predictions. The accuracy of the wall-to-wall tree lists predicted in this study was mainly determined by the WDPPMS and kNN imputation models. For a more general assessment of our coupled framework, we compared the accuracies of the WDPPMS and kNN models with those of similar studies. The WDPPMS used in our study produced an accuracy comparable to that of previous studies. For example, the scale parameter was usually strongly correlated with the stand attributes (Zhang et al. 2018b), consistent with previous studies, and the prediction accuracy of the scale parameter of the 17 tree species in this study was better than the results (R2 = 0.96–0.99) obtained in the natural stands of the Crimean Juniper in the southern and southwestern (Mediterranean) region of Turkey (Diamantopoulou et al. 2015). Furthermore, our studies led to a conclusion similar to that previously reported by Hao et al. (2022), Wang et al. (2006), and Zhang et al. (2018b), who found that the shape parameter is often more difficult to accurately predict than the scale parameter. This finding might be due to the remarkable complexity of the shape parameter in the mixture distribution to allow characterization by the stand mean DBH (Wang et al. 2006). The average error index (0.32–0.34) of the predicted diameter distribution by tree species in a typical managed boreal forest area of Finland (Packalén and Maltamo 2008) was close to that obtained herein for the dominant tree species in Northeast China. The above comparisons indicate that the WDPPMS developed in this study can be used at least as accurately as previous studies to predict species-specific diameter distributions.

The accuracy assessment at different scales for our predicted stand attributes (e.g., species-level biomass and stand mean DBH) could provide useful information for different applications. For example, stand projection models (growth and yield models) require detailed stand attributes at the stand scale to set the initial conditions (Scolforo et al. 2019), whereas forest ecosystem process research may require forest attributes at the ecotype scale (Dijak et al. 2017). In this study, the stand mean DBH predicted using the kNN imputation model over Northeast China at the stand scale had a higher accuracy than that of similar studies conducted in Chinese boreal forests (Zhang et al. 2018b). Owing to the relatively homogeneous temperature, humidity, topography, forest type, and species composition of each ecotype, the prediction accuracy of stand mean DBH at the ecotype scale was significantly improved, indicating that the map of stand mean DBH generated in this study is more suitable for application to forest ecosystem models (e.g., LINKAGES) operated at the ecotype scale. The areas with lower predictions of stand mean DBH in our study were consistent with the burned areas in 1987, further indicating that our predicted stand attributes were realistic and captured the significant disturbance processes. The predicted mean values of total and species-level tree density derived from the tree lists over the Greater Khingan Mountains, Lesser Khingan Mountains, and Changbai Mountains were similar to those of previous studies conducted in Northeast China (Chen et al. 2003; Dai et al. 2013; Liu et al. 2020), suggesting that our predicted tree lists were reasonable.

4.3 Ecological implications of the prediction results

The DBH distribution of tree species exhibits variable patterns owing to site quality, regional climate, and disturbances (Chhin et al. 2008; Lin et al. 2016; Rodrigo et al. 2022). Generally, consistent with the results of previous studies fitting the similar DBH distribution of the corresponding tree species in Northeast China (Hao et al. 2022; Liu et al. 2014; Xu and Jin 2012), the overall DBH distribution of most tree species predicted in this study displayed a right-skewed unimodal pattern (Fig. 5), suggesting that at the regional scale, a continuous regeneration and relatively stable community structure occurred in most tree species in Northeast China (Fang et al. 2012). Further, the regional DBH distribution of a few tree species, such as aspen (P. davidiana Dode), Scotch pine (P. sylvestris var. mongolica Litv.), spruce (P. koraiensis Nakai), ash (F. mandschurica Rupr.), Amur corktree (P. amurense Rupr.), and willow (C. arbutifolia (Pall.), which were fitted in this study, followed a reverse J-shape (Fig. 5), which is a general pattern observed in virgin or near-natural forests. Such findings indicate that the stands of the above tree species were mature and reached an equilibrium state in Northeast China (Král et al. 2010; Westphal et al. 2006). Our predicted species-specific diameter distributions could provide a reference for species conservation and strategic forest management in the entire distribution range of tree species.

Our spatial predictions of tree density by species extracted from tree lists can capture the dominant ecological patterns and processes. In particular, for the high-intensity burned areas of the Greater Khingan Mountains in 1987, the predicted tree density of white birch (B. platyphylla Suk.) was higher than that of larch (L. gmelinii (Rupr.) Kuzen) in 2000, which was lower than that of larch (L. gmelinii (Rupr.) Kuzen) in low-intensity burned areas (Fig. 8). This spatial pattern was attributed to the strong seed dispersal and establishment abilities of white birch (L. gmelinii (Rupr.) Kuzen) and the strong fire resistance of larch (L. gmelinii (Rupr.) Kuzen) (Chang et al. 2007; Wang et al. 2017). Therefore, the high-resolution maps of total and species-level tree density for the entire Northeast China generated in our study could reveal the tree species composition and regeneration patterns of forest regions affected by large disturbance events (Crowther et al. 2015). These maps would also be useful for modeling broad-scale biological and biogeochemical processes (Slik et al. 2010).

4.4 Limitations of the current study and future directions

In this study, we proposed a coupled framework for using MODIS data as predictor variables to impute wall-to-wall stand attributes and predict species-specific diameter distributions using the unimodal Weibull model for continuous forest regions. This framework provides an innovative means of generating detailed wall-to-wall tree lists across large areas based on limited field data. However, the coupled framework of the current study has limitations, and some potential improvements must be addressed in future work. For instance, the limited and unevenly distributed forest inventory data over the study area may introduce error sources to forest stand attribute predictions. Advances in LiDAR techniques have demonstrated great potential for providing accurate estimates of tree metrics, such as tree height (Su et al. 2016). Moreover, as LiDAR footprints are distributed evenly and widely (Popescu et al. 2011), the derived tree-level variables from these footprints can be input to the kNN imputation models to compensate for the lack of field data. Thus, future work should consider the synergy of LiDAR and MODIS data to develop links between inherent forest structural attributes and their spectral information to enhance the reliability of predicting wall-to-wall stand attributes.

Owing to the lack of accurate conversions between stand mean DBH and mean arithmetic DBH of major tree species in Northeast China, the stand mean DBH was used instead of mean arithmetic DBH by species to predict pixel-level Weibull distribution parameters. This replacement will lead to great uncertainties when applied to forests with complex stand structures, especially for uneven-aged natural mixed forests. Some studies revealed that a unimodal pattern may not adequately describe the diameter distribution modality of irregular, uneven-aged stands (Hao et al. 2022; Thomas et al. 2008; Wang et al. 2006). Therefore, in future work, on the premise of collecting sufficient forest ground inventory and LiDAR data, the accurate conversion relationships between stand mean DBH and mean arithmetic DBH of each species must first be established. Thereafter, an appropriate modeling approach must be applied to fit the multimodal diameter distributions of mixed species or uneven-aged forest stands at the regional scale.

5 Conclusion

This study revealed the appropriateness of a novel framework using limited forest inventory data to predict wall-to-wall tree lists across large regions by integrating the two-parameter Weibull function and kNN imputation method. The Weibull function displayed a strong ability to describe the diameter distribution of tree species in Northeast China. The Weibull distribution parameter prediction models (WDPPMS) of 17 species for predicting the parameters of Weibull functions, and the random forest-based kNN model for imputing stand mean DBH built produced considerable or higher accuracy than those of previous studies. Generally, the fitted DBH distribution of 17 tree species across Northeast China in 2000 displayed a right-skewed unimodal or reverse J-shaped pattern, suggesting that most tree species exist in a state of continuous regeneration and relatively stable structure at the regional scale. Additionally, the total and species-level tree density maps derived from the tree lists captured the forest regions influenced by fire disturbance and are prerequisites for ecological modeling and assessment of the entirety of Northeast China. However, this study presented regional tree lists across Northeast China for only 1 year (2000). To better analyze the natural succession of forests and assess the influence of disturbances and climate change on tree species composition, the temporal and spatial variation of tree lists over Northeast China should be further studied.

Availability of data and materials

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

Download references

Acknowledgements

We thank Qinglong Zhang from the Shandong University of Technology for the technical support.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 32101327), and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (Grant No. 111-G1323522043).

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: Yuanyuan Fu and Hong S. He; Methodology: Yuanyuan Fu; Data curation: Yuanyuan Fu; Writing—original draft preparation: Yuanyuan Fu; Writing—review and editing: Hong S. He, Shaoqiang Wang, and Lunche Wang; Supervision: Hong S. He. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Hong S. He.

Ethics declarations

Ethics approval and consent to participate

The authors declare that they follow the rules of good scientific practice.

Consent for publication

All authors gave their informed consent to this publication and its content.

Competing interests

The authors declare that they have no competing interests.

Additional information

Handling editor: Andreas Bolte

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

Candidate predictor variables used in the kNN imputation models (Fu et al. 2019). The growing season was defined as May to October. Fig. S1. Maps of total AFB and 17 species-level AFB in Northeast China generated by Fu et al. (2019). AFB: Aboveground forest biomass. Fig. S2 Nine landforms in Northeast China classified from the DEM using Topographic Position Index (Zhang et al. 2009). Fig. S3 Five forest types in Northeast China extracted from the resampled MODIS land cover product MCD12Q1 (250 m) in 2000. R codes for generating wall-to-wall species-level biomass. R codes for generating wall-to-wall stand mean DBH. R codes for generating wall-to-wall tree lists.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Fu, Y., He, H.S., Wang, S. et al. Combining Weibull distribution and k-nearest neighbor imputation method to predict wall-to-wall tree lists for the entire forest region of Northeast China. Annals of Forest Science 79, 42 (2022). https://doi.org/10.1186/s13595-022-01161-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13595-022-01161-9

Keywords