  • Research Paper

The assignment of relevés to pre-existing vegetation units: a comparison of approaches using species fidelity

Abstract

• Key message

Total fidelity value index can be used for the assignment of new relevés to existing vegetation units and it can be used to refine classifications derived from unsupervised clustering.

• Context

Diagnostic species is an important concept in vegetation classification. Apart from its usefulness to characterize species niche preferences, the diagnostic species concept is used in vegetation classification: (1) for the assignment of new relevés to the vegetation units of an existing classification; (2) to refine vegetation classifications by reassigning relevés that sustain the definition of vegetation units.

• Aims

The main aims were to evaluate the relative predictive performance of different statistical fidelity measures for the reassignment of relevés to existing vegetation units, and to determine in which cases reassignments improve the quality of the original classification.

• Methods

We took the classifications produced by three commonly used unsupervised classification methods, and all relevés were reassigned to the closest vegetation unit according to the total fidelity value index (TFVI), where fidelity value had been calculated using one of eight distinct statistical measures, and according to the frequency-positive fidelity index (FPFI). Classifications obtained after relevé reassignments were compared to the initial ones using the Adjusted Rand Index. The quality of all classification solutions, including the initial ones, was evaluated using thirteen different evaluator statistics.

• Results

The predictive performance of IndVal was the best among all eight fidelity indices in the TFVI framework, and also outperformed FPFI. The TFVI framework based on group-equalized fidelity indices produced better results than other assignment rules in terms of the chosen evaluator statistics. Re-assignments based on IndVal, r, or FPFI produced classifications with the best quality, when combining the results of all evaluators.

• Conclusion

We conclude that TFVI based on IndVal and r performs best for assigning new relevés to existing vegetation units, and that it can also be used to refine classifications derived from unsupervised clustering. Consequently, our results indicate that TFVI, which is new in vegetation science, can be a good alternative to FPFI, the index most commonly used for assigning vegetation plots (relevés) to predefined vegetation types in large datasets.

1 Introduction

Diagnostic species is an important concept in vegetation classification (Whittaker 1962; Westhoff and van der Maarel 1973). It refers to those species that, due to their niche preferences, concentrate their occurrence or abundance in a single vegetation unit or in a few vegetation units (Dengler et al. 2008). Under the Braun-Blanquet approach to vegetation classification, the degree of concentration is usually called fidelity. There are several statistical techniques for the determination of fidelity and diagnostic species. Among them, those that assess the strength of association between species and groups of relevés (i.e., vegetation plot records) using correlation or indicator value indices are the most widely used (Chytrý et al. 2002; De Cáceres and Legendre 2009; De Cáceres et al. 2012).

Apart from its usefulness for characterizing species niche preferences, the diagnostic species concept is used in vegetation classification: (1) for the assignment of new relevés to the vegetation units of an existing classification; (2) to refine vegetation classifications by reassigning relevés that sustain the definition of vegetation units (Tichý 2005; Dai et al. 2006). To conduct these tasks, phytosociologists traditionally (re)assigned relevés informally using the diagnostic species concept. However, nowadays fidelity values can be formally incorporated into the calculation of the numerical similarity between the relevé to be assigned and each of the vegetation units (van Tongeren et al. 2008). Similarity indices are often used to assign relevés to vegetation units based on complete species composition. For example, Hill (1989) described a method based on a modified Czekanowski coefficient of similarity between observed and expected numbers of species in each constancy class. Two assumptions underpin the incorporation of fidelity values into the calculation of the similarity between relevés and vegetation units: (a) some species may be more informative than others for the assignment of relevés to vegetation units; (b) since diagnostic species are indicators of their preferred habitats (De Cáceres et al. 2012), fidelity values are a reasonable way of defining species weights in the calculation of similarity to vegetation units (Chytrý and Tichý 2003). Following this rationale, Tichý (2005) developed four similarity indices using the phi coefficient as a fidelity measure. Among them, the frequency-positive fidelity index (FPFI) has often been used for the assignment of relevés that were misclassified or classified to more than one group in supervised classifications (Boublík et al. 2007; Douda 2008; Boublík 2010; Janišová et al. 2010; Svitková and Šibík 2013; Landucci et al. 2013; Rodríguez-Rojo et al. 2014; Chytrý and Tichý 2018; Maciejewski et al. 2020).
Another approach employing the diagnostic species concept in a similarity index was proposed by Dai et al. (2006), who suggested using the total indicator value index (TIVI) to test the validity of a TWINSPAN classification and to refine the initial classification by reassigning relevés. While Dai et al. (2006) based relevé assignments on fidelity values calculated using the indicator value (IndVal) index (Dufrêne and Legendre 1997), Esmailzadeh and Asadi (2014) recently suggested replacing IndVal with the phi coefficient, calling the resulting approach total phi fidelity index (TPFI). Gégout and Coudun (2012) also developed the fidelity index of relevés (FI) to reassign relevés according to fidelity values, but in contrast to previous indices FI is defined as the sum of phi values, regardless of species abundance or frequency.

Here, we propose that TIVI and TPFI can be unified in a single assignment framework that can be called the total fidelity value index (TFVI). For each relevé and vegetation unit, TFVI is defined as the sum, across all species occurring in the relevé, of the species fidelity values for the vegetation unit multiplied by the species abundance values in the relevé. Each relevé is then assigned to the vegetation unit for which TFVI is highest. Several alternatives can be used as a choice of fidelity measure in the TFVI framework (Tichý and Chytrý 2006; De Cáceres and Legendre 2009), and it is unknown how this choice may affect the results. Therefore, in this paper, we compare the performance of eight different statistical fidelity measures for their use within the TFVI framework. Because relevé assignments are commonly based on the FPFI, we also include this assignment rule in our evaluation. We take vegetation data from the Hyrcanian Box tree (Buxus hyrcana Pojark.) and common yew (Taxus baccata L.) forests and the result of three frequently-used unsupervised methods as initial vegetation classification. We evaluate the nine different assignment rules: (a) in terms of the predictive performance; (b) in terms of the quality of the vegetation units after re-assignment and also in terms of several evaluator indices. Evaluating predictive performance (a) implies the assumption that the number of groups is known and the original classification is the “truth” to be reproduced by the assignment rule. In contrast, evaluating the quality of classification after reassignment (b) assumes that the original classification may be improved. We conduct this evaluation without fixing the number of groups, but we penalize those cases when the classification obtained after reassignment reduces the number of relevés sustaining the definition of some vegetation units to the point of compromising their statistical validity.

2 Methods

2.1 Study area

The Hyrcanian region is a narrow green belt covering an area of 1.9 million ha in northern Iran and 20,000 ha in the Republic of Azerbaijan. In Iran, this region is located between the Caspian Sea and the northern foothills of Alborz mountains in three northern provinces: Gilan (western part), Mazandaran (middle part), and Gorgan (eastern part). Unlike most of Iran, the Hyrcanian region is relatively humid, with an average annual rainfall that ranges between 530 mm in the east and 1350 mm in the west (with an occasional record over 2000 mm in some locations). Rainfall mostly occurs during late fall, winter, and spring. The average annual temperature varies from 15 °C in the west to 17.5 °C in the east. The average temperature of the warmest month ranges from 28 to 35 °C while that of the coldest month is between 1.5 and 4 °C (Sagheb-Talebi et al. 2014). Brown soils (i.e., calcareous, forest acidic, podzolic, and non-podzolic soils) are the most important soil type, comprising approximately 90% of the region (Habibi Kasseb 1992). Hyrcanian forests are dominated by combinations of oriental beech (Fagus orientalis Lipsky), Caucasian oak (Quercus castaneifolia C.A.Mey.), hornbeam (Carpinus betulus L.), and Persian ironwood (Parrotia persica C.A.Mey.), and depending on the site Acer velutinum Boiss., Tilia rubra DC., Fraxinus excelsior L., Alnus subcordata C.A.Mey., and B. hyrcana (Marvie Mohadjer 2005).

In Hyrcanian forests, plant communities of B. hyrcana are the remnants of broad masses that formerly occupied lowlands (along with Q. castaneifolia, C. betulus, and P. persica) and steep slopes of wet valleys (along with F. orientalis) but are now restricted to limited areas. They are characterized by a low species richness and variability in species composition and are adapted to grow on sites within a range of edaphic conditions, as long as they are exposed to adequate air moisture in the southern parts of the Caspian Sea, Northern Iran (Asadi et al. 2011; Esmailzadeh et al. 2014; Soleymainpour and Esmailzadeh 2015). The broad ecological niche and the high sociability of B. hyrcana make the composition of box tree stands in Hyrcanian forests highly variable across the study area. Box trees co-occur with species ranging from Zelkova carpinifolia (Pall.) Dippel and Celtis australis L., drought-tolerant tree species of the eastern lowland Hyrcanian forests, to Pterocarya fraxinifolia (Lam.) Spach and Populus caspica (Bornm.) Bornm., hygrophilous tree species of the western Hyrcanian lowland forests. Along elevation gradients, box trees co-occur with Q. castaneifolia at low elevations and F. orientalis in highland forests (up to 1700 m). Box tree stands are distributed across a wide range of slopes: on steep, north-facing slopes, they are accompanied by T. baccata, Prunus laurocerasus L., and Danae racemosa (L.) Moench, hygrophilous indicator species of Hyrcanian highland forests. However, on slopes with slightly lower air humidity, they are accompanied by T. rubra, Acer cappadocicum Gled., and Sorbus torminalis (L.) Crantz.

The Hyrcanian T. baccata communities’ dataset was included as a second dataset to reinforce the results. T. baccata is the only coniferous species distributed across the main Hyrcanian forest types. It is often individually scattered or forms small populations in humid sites (i.e., with high air humidity, but not humid soils) on steep northern slopes as well as on hillslopes of northern valleys throughout Hyrcanian mountainous forests (Sagheb-Talebi et al. 2014). In much more humid sites of Hyrcanian forests, it forms large, dense, pure or mixed populations (Sagheb-Talebi et al. 2014).

2.2 Sampling

Habitats containing B. hyrcana forests were searched from the Cheshmeh-Bolbol protected area (west of Golestan Province) to Lire-sar in the west of Mazandaran Province (Fig. 1). These habitats range from 50 m a.s.l. in the Sisangan protected area to 1750 m a.s.l. in Farim, the highest altitude of B. hyrcana in Hyrcanian forests. After this initial survey, vegetation was sampled using the Braun-Blanquet relevé method in the summers of 2010 to 2014. Vegetation plots (with an area of 400 m²) were established in stands considered representative, to provide an appropriate representation of the variability of B. hyrcana forests (Mueller-Dombois and Ellenberg 1974). To capture any change in vegetation indicating variations in habitat conditions, while respecting the principle of a representative stand, we defined transects as 400-m stretches set systematically along the altitudinal gradient, and relevés were recorded whenever a floristic or environmental (especially topographical) change was perceived. In each vegetation plot, all vascular plant species were recorded, and their percentage cover was visually assessed using a modification of the ordinal van der Maarel (1979) cover-abundance scale (0, absent; 1, 0–1%; 2, 1–2.5%; 3, 2.5–5%; 4, 5–12.5%; 5, 12.5–25%; 6, 25–50%; 7, 50–75%; 8, 75–100%). Cover-abundance class values were replaced by the mean cover of each cover class. The resulting dataset (referred to here as the B. hyrcana dataset) included 484 relevés and 157 species.
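The replacement of cover-abundance classes by mean cover values can be illustrated with a minimal sketch. It assumes that the "mean cover" of a class is the midpoint of its percentage range, which the text does not state explicitly; the function name and the example relevé are hypothetical.

```python
# Class boundaries (percent cover) as listed in the text for classes 1-8.
CLASS_BOUNDS = {
    1: (0.0, 1.0),
    2: (1.0, 2.5),
    3: (2.5, 5.0),
    4: (5.0, 12.5),
    5: (12.5, 25.0),
    6: (25.0, 50.0),
    7: (50.0, 75.0),
    8: (75.0, 100.0),
}


def class_to_mean_cover(cover_class):
    """Return the assumed mean percent cover of an ordinal class (0 = absent)."""
    if cover_class == 0:
        return 0.0
    low, high = CLASS_BOUNDS[cover_class]
    # Assumption: the mean cover of a class is the midpoint of its range.
    return (low + high) / 2.0


# Hypothetical relevé recorded as ordinal classes.
releve_classes = {"Buxus hyrcana": 7, "Carpinus betulus": 4, "Danae racemosa": 0}
releve_cover = {sp: class_to_mean_cover(c) for sp, c in releve_classes.items()}
```

Under this assumption, class 7 (50–75%) becomes 62.5% and class 4 (5–12.5%) becomes 8.75%.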

Fig. 1

Map of the two Iranian provinces where Hyrcanian forests were sampled. Dots indicate sampling locations corresponding to the B. hyrcana (circles) and T. baccata (triangles) datasets

In addition, 408 relevés were sampled in habitats containing T. baccata in the central and eastern parts of the Hyrcanian forests in the summers of 2015, 2017, and 2018. These forests were distributed from the Siah-roudbar Valleys in the east of Golestan Province to Mazga in the west of Mazandaran Province (Fig. 1). These habitats range from 1000 m a.s.l. in Gazoo to 2000 m a.s.l. in Afrathakhteh. This dataset is referred to here as the T. baccata dataset.

2.3 Initial classifications

We used three unsupervised classification methods to classify the compositional structure of both the B. hyrcana and T. baccata datasets into vegetation units: (1) modified TWINSPAN (Roleček et al. 2009), (2) partitioning around medoids (PAM, Kaufman and Rousseeuw 1990), and (3) k-means (MacQueen 1967). Modified TWINSPAN is a hierarchical divisive method that combines the classical TWINSPAN algorithm (two-way indicator species analysis; Hill 1979) with an analysis of heterogeneity of the clusters prior to each division. Unlike the original version, modified TWINSPAN does not enforce a dichotomy of classification but instead, at each step, divides only the most heterogeneous cluster of the previous hierarchical level. Thus, the application of modified TWINSPAN results in vegetation units of similar internal heterogeneity (Luther-Mosebach et al. 2012). We applied total inertia (i.e., the sum of all eigenvalues in correspondence analysis) as the measure of cluster heterogeneity (Roleček et al. 2009), and pseudo-species cut levels were set to 0%, 1%, 2.5%, 5%, 12.5%, 25%, 50%, 75%, and 100%. K-means and PAM are commonly used non-hierarchical clustering methods (Legendre and Legendre 2012; Tichý et al. 2014). Both of them require the number of clusters and the initial group members to be specified by the user. The main difference between k-means and PAM is that in the former each cluster is represented by its centroid, a multivariate mean, whereas in the latter each cluster is represented by its medoid, the cluster member that has the minimum sum of distances to all the other members of the cluster. Both methods have an objective error function that is progressively minimized by iteratively reassigning objects (i.e., relevés) to their nearest cluster center (centroid or medoid). Iterations terminate when no further reassignments are possible. We ran k-means and PAM starting from 100 random initial configurations to reduce the risk of ending in a local minimum of the error function.
The floristic resemblance between pairs of relevés was assessed using the Hellinger distance, which can be emulated by transforming relevé data prior to calculation of Euclidean distances (Legendre and Gallagher 2001; De Cáceres et al. 2008). All classification methods were executed using the JUICE software (Tichý 2002) on both vegetation datasets.
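The emulation of the Hellinger distance mentioned above can be sketched as follows: each relevé is first Hellinger-transformed (square root of relative abundances), and the Euclidean distance is then computed on the transformed values. This is only an illustrative sketch in Python, not the JUICE implementation; the function names are hypothetical.

```python
import math


def hellinger_transform(row):
    """Square root of each species' share of the relevé total."""
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]
    return [math.sqrt(x / total) for x in row]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hellinger_distance(row_a, row_b):
    """Euclidean distance between Hellinger-transformed relevés."""
    return euclidean(hellinger_transform(row_a), hellinger_transform(row_b))
```

Two relevés with the same relative composition have distance 0, and two relevés sharing no species reach the maximum distance of √2.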

We used the OptimClass procedure (Tichý et al. 2010) to determine the optimum number of clusters (Appendix 1) in both the B. hyrcana and T. baccata datasets. Specifically, we searched for the partition with the largest number of faithful species across all clusters. Faithful species were determined based on the p-value of Fisher’s exact test as a measure of fidelity (Tichý et al. 2010). OptimClass was run on 40 classification algorithms, combining five distance measures (i.e., Euclidean, relative Euclidean, correlation, chi-square, and relative Sørensen) with eight methods of group linkage (i.e., flexible beta, McQuitty’s, Ward’s, centroid, group average, median, nearest neighbor, and farthest neighbor) based on square-root transformed cover percentages. Across all classification methods, the relationship between the number of faithful species and the number of clusters (groups) revealed that partitions with 18 groups in B. hyrcana forests and 17 groups in T. baccata forests yielded the largest number of faithful species; these were thereafter considered the optimal numbers of groups. For the evaluation of predictive performance, we took the partitions obtained by TWINSPAN, PAM, and k-means with 18 and 17 groups in the B. hyrcana and T. baccata datasets, respectively, as the initial classifications to be recovered. For the evaluation of the quality of classification after reassignment, however, we kept the partitions generated by TWINSPAN, PAM, and k-means for 2, 3, …, 18 groups in the B. hyrcana dataset and 2, 3, …, 17 groups in the T. baccata dataset as initial classifications to be refined.

2.4 Assignment rules

The \(TFVI_{ig}\) value for a given relevé i and vegetation group g is defined as:

$$TFVI_{ig} = \sum\limits_{j = 1}^{s_{t}} C_{ij} \times FV_{gj}$$
(1)

where \(FV_{gj}\) is the fidelity (indicator) value of species j in group g and \(C_{ij}\) is the cover value of species j in relevé i. Once the \(TFVI_{ig}\) value is calculated for all vegetation units, the assignment rule consists of assigning relevé i to the group corresponding to the highest \(TFVI_{ig}\) value.
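The assignment rule of Eq. 1 can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function names, the dictionary layout, and the example values are hypothetical, and the `positive_only` switch reflects the restriction to positive fidelity values that the text applies to correlation indices.

```python
def tfvi(cover, fidelity, positive_only=False):
    """Sum over species of cover in the relevé times fidelity to one group.

    cover:    dict species -> cover value in the relevé (C_ij)
    fidelity: dict species -> fidelity value for the group (FV_gj)
    """
    total = 0.0
    for species, c in cover.items():
        fv = fidelity.get(species, 0.0)
        if positive_only and fv < 0:
            fv = 0.0  # ignore negative fidelity (used for correlation indices)
        total += c * fv
    return total


def assign_releve(cover, fidelity_by_group, positive_only=False):
    """Return the group with the highest TFVI and all group scores."""
    scores = {
        group: tfvi(cover, fv, positive_only)
        for group, fv in fidelity_by_group.items()
    }
    best_group = max(scores, key=scores.get)
    return best_group, scores


# Hypothetical relevé and fidelity values for two groups.
cover = {"A": 62.5, "B": 8.75}
fidelity_by_group = {
    "g1": {"A": 0.8, "B": 0.1},
    "g2": {"A": 0.2, "B": 0.9},
}
best_group, scores = assign_releve(cover, fidelity_by_group)
```

In this example the relevé goes to g1, because the dominant species A has a high fidelity to that group and its large cover value outweighs the contribution of species B to g2.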

There are many alternative indices to define the fidelity value of species (Tichý and Chytrý 2006; De Cáceres and Legendre 2009). To determine how important the choice of a fidelity index was for assignments in the TFVI framework, we compared eight different alternatives, differing in the general approach (i.e., correlation indices vs. indicator value, or IndVal, indices), in the way differences in group size are dealt with (i.e., non-equalized vs. group-equalized), and in whether species abundance values are taken into account (Table 1, taken from De Cáceres and Legendre 2009). For correlation indices, we only used species with positive fidelity values in the calculation of TFVI. Calculations were performed using the R statistical language and the “indicspecies” package, version 1.7.5 (De Cáceres and Legendre 2009).

Because it is frequently used for relevé assignments, we also considered FPFI (Tichý 2005) as an additional assignment rule to be evaluated. FPFI is a combination of the frequency index (FQI) and the positive fidelity index (PFDI). The \(FQI_{ig}\) (Eq. 2), \(PFDI_{ig}\) (Eq. 3), and \(FPFI_{ig}\) (Eq. 4) values for a given relevé i and vegetation group g are defined as:

$$FQI_{ig} = 100 \times \left( \sum\limits_{j \in R} FQ_{gj} \Big/ \sum\limits_{j \in C} FQ_{gj} \right)$$
(2)
$$PFDI_{ig} = 100 \times \left( \sum\limits_{j \in R} FD_{gj} \Big/ \sum\limits_{j \in C} FD_{gj} \right)$$
(3)
$$FPFI_{ig} = 100 \times \left( FQI_{ig} + PFDI_{ig} \right) / 2$$
(4)

where \(FQ_{gj}\) is the frequency (constancy) value of species j in group g. Species present in the relevé are indicated as \(j \in R\) and species present in the constancy column as \(j \in C\). \(FD_{gj}\) is the positive fidelity value (phi coefficient) of species j for vegetation unit g. All assignment rules were applied to the results of each of the three initial classification methods, separately for the B. hyrcana and T. baccata datasets.

Table 1 Non-equalized and group-equalized versions of the correlation indices (r) and the indicator value indices (IndVal)
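Equations 2 to 4 can be illustrated with a short sketch. It follows the equations as printed, including the factor of 100 in Eq. 4 (a constant factor does not change which group scores highest). The function names and example values are hypothetical; the constancy column is assumed to be given as a dictionary of the species recorded in the group.

```python
def ratio_index(releve_species, group_values):
    """100 x (sum over species in the relevé) / (sum over the whole column).

    group_values: dict species -> value in the constancy column of one group
                  (frequency FQ_gj for FQI, positive phi FD_gj for PFDI).
    """
    column_total = sum(group_values.values())
    if column_total == 0:
        return 0.0
    in_releve = sum(v for sp, v in group_values.items() if sp in releve_species)
    return 100.0 * in_releve / column_total


def fpfi(releve_species, frequency, positive_fidelity):
    """Combine FQI (Eq. 2) and PFDI (Eq. 3) as printed in Eq. 4."""
    fqi = ratio_index(releve_species, frequency)
    pfdi = ratio_index(releve_species, positive_fidelity)
    return 100.0 * (fqi + pfdi) / 2.0


# Hypothetical constancy column for one group and one relevé.
frequency = {"A": 80.0, "B": 40.0, "C": 20.0}       # FQ_gj, percent constancy
positive_fidelity = {"A": 0.6, "B": 0.0, "C": 0.3}  # FD_gj, negative phi set to 0
releve_species = {"A", "B"}

fqi_value = ratio_index(releve_species, frequency)
pfdi_value = ratio_index(releve_species, positive_fidelity)
fpfi_value = fpfi(releve_species, frequency, positive_fidelity)
```

As with TFVI, the relevé would be assigned to the group with the highest FPFI value after repeating the calculation for every group.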

2.5 Evaluation of predictive performance

Before using it for the assignment of new relevés, it is important to evaluate the predictive performance of any given assignment rule (i.e., to evaluate to which degree a “known” classification can be reproduced). We used each of the three 18-group initial classifications (produced by modified TWINSPAN, k-means, and PAM) for the reassignment of the 484 B. hyrcana relevés according to TFVI and FPFI, and proceeded in the same way with the three 17-group classifications of the 408 T. baccata relevés. Reassignment of each target relevé was done after excluding it from the calculation of fidelity values. Note that this does not make the assignment rule completely independent of the relevé to be assigned, since the relevé still influences how the original classification was obtained. Nevertheless, this effect is very difficult to remove and we think it may be small in most cases. In total, 27 new classifications were obtained in each dataset, resulting from reassigning the relevés of each initial classification using each of the nine assignment rules (i.e., TFVI with eight fidelity measures plus FPFI).

Assuming that the initial classification was the “known” classification solution, we calculated the adjusted Rand index (ARI) (Hubert and Arabie 1985) to assess the degree of agreement between the initial classification and the classification obtained after reassignment. ARI has a maximum value of 1, meaning perfect agreement between the initial classification and the reassigned classification, whereas a value of 0 means that agreement was not any better than what would be obtained by chance (negative values can occur when agreement is lower than expected by chance). ARI was calculated using the R statistical language and the “mclust” package, version 5.1 (Fraley et al. 2012). Ranks of ARI values were used to compare the performance of the different assignment rules across the three classifications in each dataset.
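For readers who want to see what the ARI measures, the following is a minimal pure-Python sketch of the Hubert and Arabie (1985) formula based on the contingency table of two partitions. It is an illustration only, not the “mclust” implementation used in the study; the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same relevés."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: partitions trivially agree

    # Pairs of relevés placed together in both, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2.0
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single group
    return (sum_cells - expected) / (maximum - expected)


same = adjusted_rand_index([1, 1, 2, 2], [5, 5, 9, 9])     # identical partitions
crossed = adjusted_rand_index([1, 1, 2, 2], [1, 2, 1, 2])  # worse than chance
```

Group labels themselves do not matter, only which relevés are grouped together: the first example gives 1, while the second gives a negative value.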

2.5.1 Evaluation of the quality of vegetation classifications

For the evaluation of the quality of vegetation units, we tested the nine assignment rules on the partitions generated by the three classification methods for each number of groups between 2 and 18 in the B. hyrcana dataset and between 2 and 17 in the T. baccata dataset. This resulted in 459 new classifications (3 classification methods × 9 assignment rules × 17 clustering levels) in the B. hyrcana dataset and 432 new classifications (3 classification methods × 9 assignment rules × 16 clustering levels) in the T. baccata dataset. In some cases, reassignments led to a reduction in the number of members of some vegetation units. The minimum group size for accepting a group as ‘statistically valid’ was conventionally set to three relevés, and assignment rules leading to a reduction in the number of valid vegetation units were penalized (see below). The 459 classifications for B. hyrcana and the 432 classifications for T. baccata were evaluated with eight internal classification evaluators, most of which have been tested and reviewed in the literature (Aho, Roberts and Weaver 2008; Roberts 2015). The eight evaluators consist of five geometric and three non-geometric measures, and are summarized in Table 2 (see discussion about the circularity in the choice of evaluators in Sect. 4). Among non-geometric evaluators, we applied (1) Morisita’s index of niche overlap (Horn 1966), which evaluates classification effectiveness concerning species distributions; (2) ISAMIC, indicator species analysis to minimize intermediate constancy (Roberts 2010), which measures the constancy (either high or low) of species within groups irrespective of how many groups a species occurs in (Roberts 2015); and (3) ISA, indicator species analysis (Aho et al. 2008), which is derived from the indicator value (IndVal) of species (Dufrêne and Legendre 1997).
IndVal has long been the most popular measure to assess species importance in community classifications (Podani and Csanyi 2010), and it is one of the most widely used goodness-of-clustering indices (Roberts 2015). High ISA values indicate high fidelity and abundance of species within groups (Aho et al. 2008). ISA can be reported in two ways, as the average p value or as the number of significant indicators (α = 0.05); we used the latter. p values for ISA were calculated with Monte Carlo procedures.

Table 2 Summary of classification solution evaluators in this paper

For the geometric evaluators, we considered five indices: (4) C-index (Hubert and Levin 1976), (5) PBC, point biserial correlation (Brogden 1949), (6) PARTANA ratio (Roberts 2005), (7) ASW, average silhouette width (Rousseeuw 1987), and (8) ANOSIM, analysis of similarity (Clarke 1993). Geometric indices evaluate classification effectiveness based on the relationship of pairwise dissimilarities within and between groups. For the calculation of geometric evaluators, Euclidean distance was used to create the required distance matrices based on presence-absence and abundance data (after Hellinger’s transformation) as two types of species data. We used both presence-absence and abundance data to avoid biasing our evaluation towards reassignments made with either incidence-based or abundance-based fidelity measures. In total, this yielded thirteen classification evaluators (the five geometric evaluators computed on each of the two data types, plus the three non-geometric evaluators). All evaluators were run in R with the “plant.ecol” package, version 0.4-1 (Aho, K. 2017. https://sites.google.com/a/isu.edu/aho) for Morisita, C-index, and PBC; “labdsv” (Roberts 2016. https://cran.r-project.org/web/packages/optpart/index.html) for ISAMIC; “optpart” (Roberts 2010. https://cran.r-project.org/web/packages/optpart/index.html) for PARTANA; “cluster” (Maechler et al. 2013. https://cran.r-project.org/web/packages/cluster/index.html) for Silhouette; “indicspecies” (De Cáceres et al. 2016. https://cran.r-project.org/web/packages/indicspecies/index.html) for ISA; and “vegan” (Oksanen et al. 2013. https://cran.r-project.org/web/packages/vegan/index.html) for ANOSIM.

A numerical value was obtained for each evaluator and each classification solution. A reduction in the number of groups can lead to an inflation of the evaluator statistic (because the vegetation units that lose relevé members during reassignment have lower quality). Therefore, we penalized the fact that some assignment rules led to some vegetation units having fewer than three relevés. In these cases, we decreased the value of the evaluator by multiplying it by the ratio between the number of “statistically valid” groups and the number of initial groups (this proportion is hereafter called the “correction ratio”). After this possible correction of the evaluator values, we averaged the values of each evaluator across the partitions of 2 to 18 groups in the B. hyrcana dataset and 2 to 17 groups in the T. baccata dataset. For each classification method, the ten evaluator values, corresponding to the initial classification and the nine assignment rules, were ranked from best (1) to worst (10). To combine the results of the different evaluators, we calculated median ranks (across the thirteen evaluator statistics) for each classification. Ranks of these medians were used to compare the different assignment rules globally. The whole workflow is summarized in a flowchart (Fig. 2). All analyses were done separately for the two datasets.
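The correction ratio described above can be sketched as follows. This is a minimal illustration under the assumption that the evaluator is expressed so that larger values mean better classifications (so that multiplying by a ratio below 1 acts as a penalty); the function names and example labels are hypothetical.

```python
from collections import Counter

MIN_GROUP_SIZE = 3  # minimum number of relevés for a "statistically valid" group


def correction_ratio(labels_after, n_initial_groups, min_size=MIN_GROUP_SIZE):
    """Share of the initial groups that are still valid after reassignment."""
    group_sizes = Counter(labels_after)
    n_valid = sum(1 for size in group_sizes.values() if size >= min_size)
    return n_valid / n_initial_groups


def penalized_value(evaluator_value, labels_after, n_initial_groups):
    """Evaluator value multiplied by the correction ratio.

    Assumption: larger evaluator values indicate better classifications.
    """
    return evaluator_value * correction_ratio(labels_after, n_initial_groups)


# Hypothetical reassignment: group "c" shrinks to two relevés and becomes invalid.
labels_after = ["a"] * 5 + ["b"] * 4 + ["c"] * 2
ratio = correction_ratio(labels_after, n_initial_groups=3)
corrected = penalized_value(0.6, labels_after, n_initial_groups=3)
```

Here two of the three initial groups remain valid, so an evaluator value of 0.6 is reduced to 0.4 before averaging and ranking.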

Fig. 2

Flowchart of the complete analysis workflow

3 Results

3.1 Evaluation of predictive performance

All assignment rules produced some distortion of the three initial classifications. The predictive performance of the TFVI framework varied strongly depending on the fidelity measure, whereas that of the FPFI assignment rule was rather low, especially in the T. baccata dataset. Indval obtained the best rank in reproducing all three initial classifications in both the B. hyrcana and T. baccata datasets and, hence, the best median rank. In the B. hyrcana dataset, the order of the remaining assignment rules in terms of performance was IndVal, Indval, r, r, r, r, FPFI, and IndVal (Table 3, section A). In the T. baccata dataset, the order of the remaining assignment rules differed slightly: r, r, IndVal, IndVal, Indval, r, r, and FPFI (Table 3, section B).

Table 3 Adjusted Rand index (ARI) values for all cluster solutions in B. hyrcana (section A) and T. baccata (section B) datasets. Ranks (1 to 9) are indicated in parentheses

3.2 Quality of the classification after reassignment

The average values of the quality evaluators (across 2 to 18 groups in the B. hyrcana dataset and 2 to 17 groups in the T. baccata dataset) are shown in Table 4 (averages after correcting for the decrease in the proportion of valid groups are underlined). They are shown in two sections, A and B, corresponding to the B. hyrcana and T. baccata datasets, respectively. Among the eight fidelity measures, only the reassignment with Indval did not change the number of ‘statistically valid’ groups for partitions generated by any of the three classification methods in the B. hyrcana dataset. Using other indices in the TFVI framework or using the FPFI assignment rule in the B. hyrcana dataset led to a decrease in the number of valid groups for at least one initial classification. In the T. baccata dataset, the changes in the number of statistically valid groups produced by the nine assignment rules were relatively small, and reassignment with Indval as well as r did not change the number of groups for any of the three initial classification methods.

Table 4 Average values of each evaluator of the initial classification and that obtained with each assignment rule. Averages were calculated across the different numbers of groups, 2 to 18 in the B. hyrcana (section A) and 2 to 17 in the T. baccata (section B) datasets, before (not underlined) and after (underlined) correcting for the ratio of “statistically valid” groups (Appendix 3). Ranks (1 to 10) are calculated from the corrected averages (i.e., from italicized values). Incidence-based evaluators are specified in the table with braces

Boxplots of the ranks of the thirteen evaluators across the ten clustering solutions for the B. hyrcana dataset indicate that, for every evaluator except ISA, at least one of the nine reassignment methods improved the value of the evaluator compared to the initial classification (Fig. 3). In this dataset, from the point of view of abundance-based evaluators except for 1-Morisita and ISAMIC, TFVI based on Indval was the best solution. The TFVI based on r and r assignment rules obtained the first rank for the Morisita and ISAMIC indices, respectively. However, in terms of incidence-based evaluators (i.e., 1-C.index, PARTANA, and ANOSIM), TFVI based on r was ranked first, followed by FPFI (i.e., Sil and PBC). The initial solution was ranked first for the ISA evaluator only. Similarly, boxplots of the ranks of all evaluators across the ten clustering solutions for the T. baccata dataset indicate that at least one of the nine reassignment methods improved the value of the evaluators compared to the initial classification (Fig. 4). From the point of view of abundance-based evaluators in the T. baccata dataset, except for Morisita, PBC, and ISAMIC, TFVI based on Indval was the best solution. The TFVI based on r, IndVal, and r assignment rules obtained the first rank for the Morisita, PBC, and ISAMIC indices, respectively. However, in terms of incidence-based evaluators (i.e., PARTANA and ANOSIM), TFVI based on r was ranked first, followed by IndVal (i.e., 1-C.index and PBC), while TFVI based on Indval obtained the first rank in terms of the incidence-based Sil evaluator. In the T. baccata dataset, the initial solution was not ranked first for any evaluator.

Fig. 3
Boxplots of the evaluator ranks across ten cluster solutions in the B. hyrcana dataset. Boxplots were drawn based on 51 values per box (3 initial classifications × 17 clustering levels). Solutions: 1, initial algorithm; 2, FPFI; 3 to 10, TFVI models based on IndVal (3), Indval (4), IndVal (5), Indval (6), r (7), r (8), r (9), and r (10). The solutions are partitioned into three parts: (a) initial and FPFI solutions, (b) IndVal-based solutions, and (c) phi-based solutions. The box extends from the first quartile to the third quartile. The crossed line in the center of the box is the median. Each boxplot is based on three ranks, corresponding to the three initial classifications. The best solution (lowest average rank) is indicated using a checkmark symbol

Fig. 4
Boxplots of the evaluator ranks across ten cluster solutions in the T. baccata dataset. Solutions: 1, initial algorithm; 2, FPFI; 3 to 10, TFVI models based on IndVal (3), Indval (4), IndVal (5), Indval (6), r (7), r (8), r (9), and r (10). The solutions are partitioned into three parts: (a) initial and FPFI solutions, (b) IndVal-based solutions, and (c) phi-based solutions. The box extends from the first quartile to the third quartile. The crossed line in the center of the box is the median. Each boxplot is based on three ranks, corresponding to the three initial classifications. The best solution (lowest average rank) is indicated using a checkmark symbol

The median ranks revealed that assignment rules sometimes led to classifications of better overall quality than the initial classification in both datasets (Table 5). In the B. hyrcana dataset, incidence-based evaluators indicated an improvement in classification quality for TFVI in combination with r and for FPFI, whereas for abundance-based evaluators only reassignments using TFVI based on Indval led to classifications of better quality than the initial classification. In the T. baccata dataset, the TFVI algorithm based on r likewise led to the best refinement of the initial classification. In terms of abundance-based evaluators, Indval, IndVal, and FPFI were ranked first, whereas r was ranked second.

Table 5 Median ranks for incidence-based evaluators, abundance-based evaluators, and all evaluators in the B. hyrcana (section A) and T. baccata (section B) datasets. Ranks of median ranks are shown in parentheses

In the overall comparison of assignment rules (including both incidence-based and abundance-based evaluators) in both vegetation datasets, we found that TFVI based on Indval, TFVI based on r indices, and FPFI performed best in terms of evaluation statistics, followed by the initial classifications and finally the other TFVI algorithms (Table 5). From the point of view of incidence-based evaluators, TFVI based on r had the highest quality, whereas Indval, an abundance-based indicator value, had the best rank for abundance-based evaluators. The results also indicated that FPFI was a reliable assignment index: in the B. hyrcana dataset it achieved the second rank after TFVI based on r when considering incidence-based evaluators, and a rank similar to that of TFVI based on r when considering all evaluators. In the T. baccata dataset, FPFI obtained the third rank after Indval and r when considering all evaluators.

4 Discussion

Finding efficient, simple and precise rules for the assignment of new or misclassified relevés to existing vegetation units is an important topic in vegetation science (Bruelheide 1997; Černá and Chytrý 2005; Tichý 2005; Dai et al. 2006; van Tongeren et al. 2008). In parallel, vegetation scientists often recommend refining the results produced by unsupervised classification methods before accepting vegetation units (Wiser and De Cáceres 2013; Tichý et al. 2014). Both tasks can be conducted using species fidelity data. Tichý (2005) compared several similarity indices for the assignment of relevés to vegetation units using simulated data, and recommended FPFI (which combines frequency information with fidelity values) for the assignment of relevés to preexisting vegetation units. Following a similar approach, Dai et al. (2006) and Esmailzadeh and Asadi (2014) developed TIVI and TPFI, respectively, based on fidelity and the cover percentage of each species. Esmailzadeh and Asadi (2014) concluded that TPFI using a group-equalized phi fidelity index could be used to improve TWINSPAN results. In this paper we generalized TIVI and TPFI into a single framework, which we called TFVI, for the assignment of relevés to existing vegetation units based on fidelity values and the cover percentage of each species. We sought to determine the most suitable fidelity measure to be used in the TFVI framework. For this purpose, we took the vegetation units derived from the B. hyrcana and T. baccata datasets and tested the performance of the FPFI assignment rule and of the TFVI framework using eight different fidelity measures. Despite testing assignment rules on only two (real) datasets, we obtained some interesting findings, which we describe in the following paragraphs.

We found that assignments using the TFVI framework often had higher predictive performance than assignments using FPFI. The reason for this result may be related to the usage of cover percentage (i.e., species abundance data) as a weighting criterion instead of species frequency (i.e., presence/absence data). Assuming that a high percentage cover of a species implies more favorable environmental conditions than its frequency, the weighting of fidelity values by cover percentage causes the influence of each species to be related to the availability of favorable conditions. In the TFVI framework, species with high cover percentage as well as high fidelity for the target unit will be more influential in assignments.
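The weighting idea described above can be illustrated with a minimal sketch. It assumes that the score of a relevé for a vegetation unit is a plain sum of species fidelity values weighted by percentage cover; the normalization used in the actual TFVI formula is not reproduced here, and the function names and toy values are hypothetical.

```python
def tfvi_scores(cover, fidelity):
    """Cover-weighted fidelity score of one releve for each vegetation unit.

    cover:    percentage cover of each species in the releve
    fidelity: fidelity[i][g] = fidelity of species i to unit g
    Assumption: a plain weighted sum, without any normalization.
    """
    n_units = len(fidelity[0])
    return [sum(c * row[g] for c, row in zip(cover, fidelity))
            for g in range(n_units)]


def assign(cover, fidelity):
    """Index of the unit with the highest cover-weighted fidelity score."""
    scores = tfvi_scores(cover, fidelity)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy data: three species, two vegetation units.
fidelity = [[0.8, 0.1],
            [0.2, 0.7],
            [0.1, 0.1]]
cover = [5.0, 60.0, 10.0]
print(assign(cover, fidelity))
```

In this toy case the second species has both high cover and high fidelity to the second unit, so it dominates the assignment, whereas a frequency-based weighting would treat all three species present as equally informative.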

We also found that using group-equalized fidelity indices led to better results in the TFVI framework, both in terms of predictive performance and quality of the resulting classification, compared to the use of non-equalized fidelity indices. Diagnostic values analysis using non-equalized indices is biased towards common species (e.g., Chytrý et al. 2002; Tichý and Chytrý 2006). Group equalization allows assessing diagnostic value independently of the size of the data set and of the size of the target site group, resulting in a better treatment of species rarity in fidelity calculations (Tichý and Chytrý 2006). In the case of the phi coefficient, another advantage of group equalization is that for each species, the order of its relative frequencies within different vegetation units is the same as the order of its fidelities to those vegetation units (Tichý and Chytrý 2006). Our results emphasize the importance of group size equalization not only for diagnostic value calculations but also for assignments of relevés based on fidelity values.
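The effect of group equalization on the phi coefficient can be sketched as follows. The sketch assumes the usual 2 × 2 formulation of phi and an equalization in which every group is rescaled to the same size N/K while relative frequencies within groups are kept; function and argument names are illustrative only.

```python
from math import sqrt


def phi(N, N_p, n, n_p):
    """Phi coefficient between a species and a target group.

    N: total sites, N_p: sites in the target group,
    n: species occurrences overall, n_p: occurrences in the target group.
    """
    return (N * n_p - n * N_p) / sqrt(n * N_p * (N - n) * (N - N_p))


def phi_group_equalized(group_sizes, occurrences, target):
    """Phi after rescaling all groups to the same size N/K (assumed scheme)."""
    N = sum(group_sizes)
    K = len(group_sizes)
    rel_freq = [o / s for o, s in zip(occurrences, group_sizes)]
    N_p = N / K                      # equalized size of the target group
    n_p = N_p * rel_freq[target]     # equalized occurrences in the target group
    n = N_p * sum(rel_freq)          # equalized occurrences over all groups
    return phi(N, N_p, n, n_p)


# Hypothetical species occurring in 5/10, 10/40 and 30/50 sites of three groups.
sizes, occ = [10, 40, 50], [5, 10, 30]
equalized = [phi_group_equalized(sizes, occ, g) for g in range(3)]
print(equalized)
```

Because the equalized total and group size are the same for every target group, the equalized phi increases with the relative frequency of the species in the target group, which is the ordering property mentioned above.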

TFVI assignments based on Indval preserved the initial number of groups regardless of the method used to produce the initial classification (i.e., modified TWINSPAN, k-means or PAM). Hence, TFVI based on Indval can be considered superior to other assignment rules in the sense that it does not produce strong alterations of the vegetation concepts in the original (unsupervised) classification. In addition, classifications obtained using assignments based on Indval resulted in the highest predictive performance. If assigning new relevés to existing vegetation units is the only usage of the assignment rule, Indval should be recommended because of its higher predictive power. However, if the assignment rule is applied for the refinement of a classification, other fidelity measures may also be suitable, because in this case having a good predictive performance may not be as important as improving the quality of the classification.

The choice of an evaluator index often implies a bias in the evaluation towards classification procedures that better match the concepts considered important in the conception of the evaluator index. Our quality evaluation results also showed that the ranking of evaluators is influenced by the type of vegetation data (i.e., incidence-based or abundance-based). In our analysis, non-geometric evaluators (i.e., ISAMIC and Morisita) as well as incidence-based evaluators indicated that classifications obtained using correlation measures, followed by FPFI, were better than IndVal-based reassignments. Since the calculation of non-geometric and incidence-based evaluators is also based on frequency values, the results of these evaluators can be said to be biased towards classification solutions obtained using phi coefficients, such as TFVI based on r and FPFI. A similar bias could be argued towards abundance-based site-group association measures when using abundance-based evaluators. We found Indval to perform very well with abundance-based evaluators, but this was not the case for other abundance-based measures such as IndVal.

In terms of the overall quality of the resulting classification, our results indicate that the TFVI framework works better when the chosen fidelity measure is either Indval or r. Indeed, comparisons based on all evaluators indicated that TFVI based on \({Indval}_{ind}^{g}\) was the first option, followed by r and FPFI. There are two differences between Indval and r. The first is that Indval uses abundance data for the calculation of fidelity values, whereas r does not (note, however, that the TFVI framework uses species abundance values for assignments regardless of the fidelity measure). The second is that Indval does not take into account species absences outside the target site group. The fact that absences outside the target site group contribute to the strength of association in \(r_{\varphi }^{g}\) suggests a potential overestimation of the fidelity value (De Cáceres et al. 2008). In this sense, De Cáceres and Legendre (2009) mentioned that indicator value indices have the advantage, compared to correlation indices, of being less dependent on the context of fidelity determination.

5 Conclusion

While the results of our analysis, based on two vegetation datasets from the Hyrcanian forests, point towards a slight preference for Indval over r for relevé assignments, we acknowledge that additional studies, using both simulated and real datasets, are necessary before more conclusive recommendations can be made in favor of one or the other. Given its good performance in the context of the TFVI framework, one could ask whether Indval, being based on species abundances, should also be preferred over phi fidelity indices in the FPFI framework. Additional work is needed to test this hypothesis. It may well be the case that the preference for one site-group association measure or another depends on both the intended usage of the assignment rule (i.e., for assigning new relevés to an existing classification vs. refining an initial classification) and on the kind of vegetation considered (e.g., forests vs. grasslands, or species-poor vs. species-rich vegetation).

Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

FPFI: Frequency-positive fidelity index

TFVI: Total fidelity value index

TPFI: Total phi fidelity index

TIVI: Total indicator value index

TWINSPAN: Two-way indicator species analysis

PAM: Partitioning around medoids

r: Correlation indices of fidelity

IndVal: Indicator value indices

ARI: Adjusted Rand index

PARTANA: Partition analysis

PBC: Point biserial correlation

\(IndVal_{ind}\): Indicator value indices for species abundance

\(IndVal_{pa}\): Indicator value indices for presence-absence data

\(r_{\varphi}\): Correlation indices based on presence-absence data

\(r_{ind}\): Correlation indices based on abundance data

Indices with superscript g are group-equalized; indices without g are non-equalized

References

  • Aho K (2017) Plant Ecology: Computational tools for plant ecology. R package version 0.5-1. https://sites.google.com/a/isu.edu/aho/

  • Aho K, Roberts DW, Weaver T (2008) Using geometric and non-geometric internal evaluators to compare eight vegetation classification methods. J Veg Sci 19(4):549–562

  • Asadi H, Hosseini SM, Esmailzadeh O, Ahmadi A (2011) Flora, life form and chorological study of Box tree (Buxus hyrcana Pojark.) sites in Khybus protected forest, Mazandaran. J Plant Biol 3:27–40 ([In Persian])

  • Boublík K (2010) Formalized classification of the vegetation of Abies alba-dominated forests in the Czech Republic. Biologia 65:822–831

  • Boublík K, Petrik P, Sadlo J, Hedl R, Willner W, Cerny T, Kolbek J (2007) Calcicolous beech forests and related vegetation in the Czech Republic: a comparison of formalized classifications. Preslia 79(2):141–161

  • Brogden HE (1949) A new coefficient: application to biserial correlation and to estimation of selective efficiency. Psychometrika 14:169–182

  • Bruelheide H (1997) Using formal logic to classify vegetation. Folia Geobotanica et Phytotaxonomica 32:41–46

  • Černá L, Chytrý M (2005) Supervised classification of plant communities with artificial neural networks. J Veg Sci 16:407–414

  • Chytrý M, Tichý L (2003) Diagnostic, constant and dominant species of vegetation classes and alliances of the Czech Republic: a statistical revision. Masaryk University, Brno, CZ

  • Chytrý M, Tichý L (2018) National vegetation classification of the Czech Republic: a summary of the approach. Phytocoenologia 48:121–131

  • Chytrý M, Tichý L, Holt J, Botta-Dukát Z (2002) Determination of diagnostic species with statistical fidelity measures. J Veg Sci 13:79–90

  • Clarke KR (1993) Non-parametric multivariate analyses of changes in community structure. Aust J Ecol 18:117–143

  • Dai X, Page B, Duffy KJ (2006) Indicator value analysis as a group prediction technique in community classification. S Afr J Bot 72:589–596

  • De Cáceres M (2016) How to use the “indicspecies” package. R package (ver. 17.1)

  • De Cáceres M, Font X, Oliva F (2008) Assessing species diagnostic value in large data sets: a comparison between phi-coefficient and Ochiai index. J Veg Sci 19:779–788

  • De Cáceres M, Legendre P, Wiser SK, Brotons L (2012) Using species combinations in indicator value analyses. Methods Ecol Evol 3:973–982

  • De Cáceres M, Legendre P (2009) Associations between species and groups of sites: indices and statistical inference. Ecology 90:3566–3574

  • Dengler J, Chytrý M, Ewald J (2008) Phytosociology. In: Jørgensen SE, Fath BD (eds) Encyclopedia of ecology. Elsevier, Amsterdam, The Netherlands, pp 2767–2779

  • Douda J (2008) Formalized classification of the vegetation of alder carr and floodplain forests in the Czech Republic. Preslia 80:199–224

  • Dufrêne M, Legendre P (1997) Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecol Monogr 67:345–366

  • Esmailzadeh O, Asadi H (2014) Total phi fidelity index (TPFI) as a new algorithm in plant communities analysis. Iran J Forest 6:215–232 ([In Persian])

  • Esmailzadeh O, Asadi H, Ahmadi A (2014) Phytosociology of Khybus Protected Area. J Wood Forest Sci Technol 19:1–20 ([In Persian])

  • Fraley C, Raftery AE, Murphy TB, Scrucca L (2012) mclust Version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report No. 597, Department of Statistics, University of Washington.

  • Gégout JC, Coudun C (2012) The right relevé in the right vegetation unit: a new typicality index to reproduce expert judgement with an automatic classification programme. J Veg Sci 23(1):24–32

  • Guerra L, Robles V, Bielza C, Larrañaga P (2012) A comparison of clustering quality indices using outliers and noise. Intell Data Anal 16(4):703–715

  • Habibi KH (1992) Fundaments of forest soil science. Tehran University Press, Tehran, IR ([In Persian])

  • Hill MO (1979) TWINSPAN: a FORTRAN program for arranging multivariate data in an ordered two-way table by classification of the individuals and attributes. Cornell University, Ithaca, NY, Section of Ecology and Systematics

  • Hill MO (1989) Computerized matching of relevés and association tables, with an application to the British National Vegetation Classification. Vegetatio 83:187–194

  • Horn HS (1966) Measurement of ‘overlap’ in comparative ecological studies. Am Nat 100:419–424

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218

  • Hubert LJ, Levin JR (1976) A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin 83:1072–1080

  • Janišová M, Uhliarová E, Ružičková H, Jandt U, Becker T, Dengler J (2010) Expert system-based classification of semi-natural grasslands in submontane and montane regions of central Slovakia. Tuexenia 30:375–422

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York, US

  • Landucci F, Gigante D, Venanzoni R, Chytrý M (2013) Wetland vegetation of the class Phragmito-Magno-Caricetea in central Italy. Phytocoenologia 43:67–102

  • Legendre P, Gallagher ED (2001) Ecologically meaningful transformations for ordination of species data. Oecologia 129:271–280

  • Legendre P, Legendre L (2012) Numerical ecology, 3rd edn. Elsevier, Amsterdam, NL

  • Luther-Mosebach J, Dengler J, Schmiedel U, Röwer IU, Labitzky T, Gröngröft A (2012) A first formal classification of the Hardeveld vegetation in Namaqualand, South Africa. Appl Veg Sci 15:401–431

  • Maciejewski L, Pinto PE, Wurpillot S, Drapier J, Cadet S, Muller S, Agou P, Renaux B, Gégout JC (2020) Vegetation unit assignments: phytosociology experts and classification programs show similar performance but low convergence. Appl Veg Sci. https://doi.org/10.1111/avsc.12516

  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, US, pp 281–297

  • Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K, Studer M (2013) Package ‘cluster’

  • Marvie Mohadjer MR (2005) Silviculture. University of Tehran Press, Tehran, IR ([In Persian])

  • Mueller-Dombois D, Ellenberg H (1974) Aims and methods of vegetation ecology. John Wiley, New York, US

  • Podani J, Csányi B (2010) Detecting indicator species: some extensions of the IndVal measure. Ecol Indic 10(6):1119–1124

  • Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Oksanen MJ (2013) Package ‘vegan’. Community ecology package, version 2(9), 1–295

  • Roberts D (2005) Vegetation classification in R, for labdsv ver. 1.1–1, vegetation ecology package. http://www.cran.r-project.org

  • Roberts DW (2010) Optpart: optimal partitioning of similarity relations. R package version, 2-0

  • Roberts DW (2015) Vegetation classification by two new iterative reallocation optimization algorithms. Plant Ecol 216(5):741–758

  • Roberts DW, Roberts MDW (2016) Package ‘labdsv’. Ordination and Multivariate, 775

  • Rodríguez-Rojo MP, Fernández-González F, Tichý L, Chytrý M (2014) Vegetation diversity of mesic grasslands (Arrhenatheretalia) in the Iberian Peninsula. Appl Veg Sci 17:780–796

  • Roleček J, Tichý L, Zelený D, Chytrý M (2009) Modified TWINSPAN classification in which the hierarchy respects cluster heterogeneity. J Veg Sci 20:596–602

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  • Sagheb-Talebi K, Sajedi T, Pourhashemi M (2014) Forests of Iran: A Treasure from the Past, a Hope for the Future. Springer, Berlin, GE

  • Soleymanipour SS, Esmailzadeh O (2015) Flora, life form and chorology of Box trees (Buxus hyrcana) habitats in forests of the Farim area of Sari. Taxonomy and Biosystematics 7:39–56 ([In Persian])

  • Svitková I, Šibík J (2013) An expert-based classification of high-altitude arctic-alpine vegetation of the class Carici rupestris-Kobresietea Ohba 1974: achievements and obstacles. Plant Biosystems 147:315–327

  • Tichý L (2002) JUICE, software for vegetation classification. J Veg Sci 13:451–453

  • Tichý L (2005) New similarity indices for the assignment of relevés to the vegetation units of an existing phytosociological classification. Plant Ecol 179:67–72

  • Tichý L, Chytrý M (2006) Statistical determination of diagnostic species for site groups of unequal size. J Veg Sci 17:809–818

  • Tichý L, Chytrý M, Botta-Dukát Z (2014) Semi-supervised classification of vegetation: preserving the good old units and searching for new ones. J Veg Sci 25:1504–1512

  • Tichý L, Chytrý M, Hájek M, Talbot SS, Botta-Dukát Z (2010) OptimClass: using species-to-cluster fidelity to determine the optimal partition in classification of ecological communities. J Veg Sci 21(2):287–299

  • Van der Maarel E (1979) Multivariate methods in phytosociology, with reference to the Netherlands. Junk, The Hague, NL

  • van Tongeren O, Gremmen N, Hennekens S (2008) Assignment of relevés to pre-defined classes by supervised clustering of plant communities using a new composite index. J Veg Sci 19:525–536

  • Westhoff V, van der Maarel E (1973) The Braun-Blanquet approach. In: Whittaker RH (ed) Ordination and classification of plant communities. W. Junk, The Hague, NL, pp 617–737

  • Whittaker RH (1962) Classification of natural communities. The Botanical Review 28:1–239

  • Wiser SK, De Cáceres M (2013) Updating vegetation classifications: an example with New Zealand’s woody vegetation. J Veg Sci 24:80–93


Acknowledgments

We are grateful to Erwin Dreyer, Laurent Bergès, and two anonymous reviewers for their valuable comments on the former version of the manuscript, and to Saleh Sekhavat and Abbas Ahmadi for their tireless help with field sampling of B. hyrcana forests and laboratory work. We would also like to thank Pari Karami, Meysam Soufi, and Mohammad-Reza Abbaszadeh for providing the T. baccata dataset.

Funding

This study was financially supported by Tarbiat Modares University.

Author information

Corresponding author

Correspondence to Omid Esmailzadeh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

The authors declare that they obtained the approval of the Research Council of the Natural Resources Faculty at Tarbiat Modares University, as well as permission from the ethics committee of the Iranian Forest, Range and Watershed Organization, to conduct the study in the Hyrcanian forests.

Additional information

Handling Editor: Laurent Bergès

Contribution of the co-authors: HA and OE originally formulated the idea; HA, OE and MD developed the methodology; HA and OE conducted fieldwork; HA, OE and MD performed statistical analyses; and HA, OE, MD and SYD wrote the manuscript.

Appendices

Appendix 1

Results of the OptimClass 1 analysis applied to the 40 clustering solutions, using Fisher’s exact test with a threshold of \(10^{-8}\) for species to be considered faithful. Each curve represents the number of faithful species from one cluster analysis defined by a unique combination of cover transformation, distance measure and group linkage. Panels (a) and (b) relate to the B. hyrcana and T. baccata datasets, respectively.

Appendix 2

2.1 Equations and descriptions for all evaluators

Morisita’s index of niche overlap (Aho et al. 2008), which describes the floristic niche overlap of a particular classification solution, was used to evaluate the separability of vegetation units. It is calculated from the proportional occurrence of each species in a group with respect to all species in that group (Eq. 5). CH scores were calculated for all pairwise comparisons between groups; for g groups, the number of comparisons is (g² − g)/2. The mean of these pairwise values was taken as a measure of cluster overlap for a particular classification solution. Minimal niche overlap indicates optimal solutions.

$$C_{H} = \frac{{2\sum\limits_{i = 1}^{n} {p_{ij} \times p_{ik} } }}{{\sum\limits_{i = 1}^{n} {p_{ij}^{2} + \sum\limits_{i = 1}^{n} {p_{ik}^{2} } } }}$$
(5)

\(C_{H}\) = adapted Morisita index of niche overlap; \(p_{ij}\) = proportional occurrence of species i with respect to all species in group j; \(p_{ik}\) = proportional occurrence of species i with respect to all species in group k; n = total number of species.
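As an illustration of Eq. 5, the following sketch computes the pairwise overlap and its mean over all (g² − g)/2 group pairs. The input layout (one list of proportional occurrences per group) and the function names are assumptions made for the example.

```python
from itertools import combinations


def overlap_ch(p_j, p_k):
    """Eq. 5 for two groups, given proportional occurrences per species."""
    shared = 2 * sum(a * b for a, b in zip(p_j, p_k))
    return shared / (sum(a * a for a in p_j) + sum(b * b for b in p_k))


def mean_pairwise_overlap(groups):
    """Mean of Eq. 5 over all (g^2 - g)/2 pairs of groups."""
    values = [overlap_ch(groups[a], groups[b])
              for a, b in combinations(range(len(groups)), 2)]
    return sum(values) / len(values)


# Hypothetical proportions: groups 0 and 1 are identical, group 2 is disjoint.
groups = [[0.5, 0.5, 0.0],
          [0.5, 0.5, 0.0],
          [0.0, 0.0, 1.0]]
print(mean_pairwise_overlap(groups))
```

Identical groups give an overlap of 1 and groups without shared species give 0, so lower mean values correspond to better separated vegetation units.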

ISAMIC (indicator species analysis to minimize intermediate constancy; Roberts 2010) measures the constancy (either high or low) of species within clusters, irrespective of how many clusters a species occurs in (Roberts 2015). It calculates the degree to which species are either always present or always absent within clusters (Roberts 2010). The index is calculated as in Eq. 6.

$$IS = \frac{{\sum\limits_{j = 1}^{n} {\left( {\left( {2\sum\limits_{g = 1}^{k} {\left| {C_{jg} - 0.5} \right|} } \right)/k} \right)} }}{n}$$
(6)

IS = ISAMIC score; n = number of species; k = number of groups; \(C_{jg}\) = constancy of species j in group g. The statistic is bounded between 0 and 1, and higher values are better.
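A minimal sketch of Eq. 6 follows, assuming that constancy is supplied as a species-by-group table of values between 0 and 1; the function name and toy values are illustrative.

```python
def isamic(constancy):
    """Eq. 6: constancy[j][g] is the constancy of species j in group g."""
    n_species = len(constancy)
    total = 0.0
    for row in constancy:
        k = len(row)
        # Distance of each constancy from 0.5, rescaled to the 0-1 range.
        total += 2 * sum(abs(c - 0.5) for c in row) / k
    return total / n_species


# Hypothetical table: species 0 is always present or always absent within
# groups, species 1 has intermediate constancy everywhere.
print(isamic([[1.0, 0.0], [0.5, 0.5]]))
```

Species with constancy 0 or 1 in every group contribute 1 to the average, and species with constancy 0.5 contribute 0, which is why higher scores are preferred.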

ISA (indicator species analysis; Aho et al. 2008) is calculated using the IndVal index (Dufrêne and Legendre 1997). The IndVal index combines species fidelity (the proportion of sites within group g in which species j is present) with species specificity to groups (the ratio of the mean abundance of species j in group g to the sum of the mean abundances of the same species over all groups) into a single index (Podani and Csányi 2010). Fidelity is calculated as relative frequency (Eq. 7), and its range is 0–1: the minimum is obtained when the species is absent from group g, and the maximum when the species occurs in every site of that group. Specificity is obtained as in Eq. 8. These two terms are multiplied and then scaled to 100 to express the indicator value of species j in group g as a percentage (Eq. 9). In this way, each species is assigned an indicator value for every group, and the largest value is tested for significance with a Monte Carlo procedure, resulting in p values. Finally, the number of significant indicators at α = 0.05 was derived from this test and expressed as the ISA of each clustering solution.

$$B_{jg} = \frac{np_{jg}}{n_{g}}$$
(7)
$$A_{jg} = \frac{\sum\limits_{i = 1}^{n_{g}} a_{jig}/n_{g}}{\sum\limits_{h = 1}^{k} \sum\limits_{i = 1}^{n_{h}} a_{jih}/n_{h}}$$
(8)
$$IndVal_{jg} = 100 \times A_{jg} \times B_{jg}$$
(9)

\(B_{jg}\) = fidelity of species j to group g; \(A_{jg}\) = specificity of species j to group g; \(np_{jg}\) = number of occurrences of species j in group g; \(a_{jig}\) = abundance of species j in sample unit i of group g; \(n_{g}\) = number of sample units in group g; k = number of groups.
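The calculation described for Eqs. 7 to 9 can be sketched as below, taking fidelity as the relative frequency of the species within the group and specificity as the ratio of group mean abundances, as stated in the text. The Monte Carlo significance test is omitted, and the data layout and function name are assumptions.

```python
def indval(abund, groups):
    """Indicator values per group and species.

    abund[i][j]: abundance of species j in sample unit i
    groups[i]:   group label of sample unit i
    Returns {group: [IndVal of each species, in percent]}.
    """
    labels = sorted(set(groups))
    n_species = len(abund[0])
    mean_ab, rel_freq = {}, {}
    for g in labels:
        rows = [row for row, lab in zip(abund, groups) if lab == g]
        n_g = len(rows)
        mean_ab[g] = [sum(r[j] for r in rows) / n_g for j in range(n_species)]
        rel_freq[g] = [sum(r[j] > 0 for r in rows) / n_g
                       for j in range(n_species)]
    result = {}
    for g in labels:
        values = []
        for j in range(n_species):
            total = sum(mean_ab[h][j] for h in labels)
            specificity = mean_ab[g][j] / total if total > 0 else 0.0
            values.append(100 * specificity * rel_freq[g][j])
        result[g] = values
    return result


# Hypothetical data: four sample units, two species, two groups.
abund = [[2, 0], [4, 0], [0, 1], [0, 0]]
groups = [0, 0, 1, 1]
print(indval(abund, groups))
```

In this toy table the first species is restricted to group 0 and present in all of its sample units, giving the maximum value of 100, whereas the second species is restricted to group 1 but occurs in only half of its sample units.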

C-index (Hubert and Levin 1976) is defined as in Eq. 10, where dw is the sum of the within-cluster distances over all clusters. If p is the number of pairs of relevés in which both relevés belong to the same vegetation unit, max(dw) and min(dw) are the sums of the p largest and the p smallest pairwise distances, respectively. This index is confined to the interval 0–1, and minimum C-index scores were considered to indicate optimal clustering solutions of TFVI models (Aho et al. 2008).

$$C_{index} = \left( {\frac{{d_{w} - \min (d_{w} )}}{{\max (d_{w} ) - \min (d_{w} )}}} \right)$$
(10)
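Eq. 10 can be illustrated with a short sketch that takes a full symmetric distance matrix and a list of unit labels; both the input layout and the function name are assumptions.

```python
from itertools import combinations


def c_index(D, labels):
    """Eq. 10 from a symmetric distance matrix D and unit labels."""
    pairs = list(combinations(range(len(labels)), 2))
    all_d = sorted(D[i][j] for i, j in pairs)
    within = [D[i][j] for i, j in pairs if labels[i] == labels[j]]
    p = len(within)                        # number of within-unit pairs
    d_w = sum(within)
    d_min = sum(all_d[:p])                 # sum of the p smallest distances
    d_max = sum(all_d[len(all_d) - p:])    # sum of the p largest distances
    return (d_w - d_min) / (d_max - d_min)


# Hypothetical releves placed on a line, so distances are easy to check.
x = [0, 1, 10, 11]
D = [[abs(a - b) for b in x] for a in x]
print(c_index(D, [0, 0, 1, 1]), c_index(D, [0, 1, 0, 1]))
```

The compact partition reaches the minimum of 0 because its within-unit pairs are exactly the closest pairs, while the partition that splits neighbouring relevés scores close to 1.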

The PARTANA ratio (Eq. 11) is the ratio of the mean within-cluster similarity to the mean among-cluster similarity (Roberts 2015); in this research it was used as a measure of the goodness of the vegetation units classified by TFVI models. A high PARTANA value implies low dissimilarity within groups and high dissimilarity between relevés within groups and relevés outside those groups (Aho et al. 2008), so higher PARTANA values were considered to indicate optimal clustering solutions of TFVI models.

$$P = \frac{\left. \sum\limits_{z = 1}^{C} \sum\limits_{\substack{i = 1 \\ i \in z}}^{N - 1} \sum\limits_{\substack{j = i + 1 \\ j \in z}}^{N} S_{ij} \right/ \sum\limits_{z = 1}^{C} \left( n_{z}^{2} - n_{z} \right)/2}{\left. \sum\limits_{i = 1}^{N - 1} \sum\limits_{\substack{j = i + 1 \\ \omega_{i} \ne \omega_{j}}}^{N} S_{ij} \right/ \sum\limits_{z = 1}^{C - 1} \sum\limits_{k = z + 1}^{C} n_{z} n_{k}}$$
(11)

where i and j are relevés, C is the number of vegetation units, N is the total number of relevés, \(n_{z}\) is the number of relevés in unit z (z = 1, 2, …, C), \(n_{k}\) is the number of relevés in unit k, i ∈ z indicates that relevé i is a member of unit z, ω is membership, and \(\omega_{i} \ne \omega_{j}\) indicates that relevés i and j are not members of the same unit. \(S_{ij}\) is the similarity of relevés i and j: \(S_{ij} = (1 - d_{ij})/\max d_{ij}\), where \(d_{ij}\) is the Hellinger (Euclidean) distance between relevés i and j and \(\max d_{ij}\) is the maximum of all possible pairwise distances.
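Because the two denominators in Eq. 11 are the numbers of within-unit and among-unit pairs, the ratio reduces to a quotient of two mean similarities. The sketch below takes a ready-made similarity matrix; the function name and toy values are hypothetical.

```python
from itertools import combinations


def partana_ratio(S, labels):
    """Eq. 11: mean within-unit similarity over mean among-unit similarity."""
    pairs = list(combinations(range(len(labels)), 2))
    within = [S[i][j] for i, j in pairs if labels[i] == labels[j]]
    among = [S[i][j] for i, j in pairs if labels[i] != labels[j]]
    return (sum(within) / len(within)) / (sum(among) / len(among))


# Hypothetical similarities: 0.9 within units, 0.3 between units.
labels = [0, 0, 1, 1]
S = [[1.0 if i == j else (0.9 if labels[i] == labels[j] else 0.3)
      for j in range(4)] for i in range(4)]
print(partana_ratio(S, labels))
```

A ratio well above 1 indicates that relevés are, on average, much more similar to members of their own unit than to relevés of other units.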

Point biserial correlation (PBC) is a correlation measure between a continuous variable (here, a distance measure) and a binary variable (i.e., a variable whose values are 0 or 1). PBC was defined as the Pearson correlation of the matrices D and B (Eq. 12). D is the matrix of all possible pairwise Hellinger (Euclidean) distances, and B is a symmetric matrix of ones and zeroes with the same dimensions as D, in which 0 denotes relevés in the same vegetation unit and 1 denotes relevés in different vegetation units. Higher PBC values were considered to indicate optimal clustering solutions of TFVI models.

$$PBC = corr(D,B)$$
(12)
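A sketch of Eq. 12 follows. It assumes that the correlation is taken over the unique pairs of relevés (the upper triangle of D and B), which is one common way of correlating two symmetric matrices; function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(D, labels):
    """Eq. 12 over unique pairs: 0 = same unit, 1 = different units."""
    pairs = list(combinations(range(len(labels)), 2))
    d = [D[i][j] for i, j in pairs]
    b = [0 if labels[i] == labels[j] else 1 for i, j in pairs]
    return pearson(d, b)


# Hypothetical releves placed on a line.
x = [0, 1, 10, 11]
D = [[abs(a - b) for b in x] for a in x]
print(point_biserial(D, [0, 0, 1, 1]), point_biserial(D, [0, 1, 0, 1]))
```

The partition that keeps neighbouring relevés together yields a correlation close to 1, while the partition that separates them yields a negative value.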

The silhouette coefficient (SW) combines the ideas of cohesion (how closely related the objects in a cluster are) and separation (how distinct or well separated a cluster is from other clusters), for individual objects as well as for clusters. Cohesion is the within-cluster mean distance a(i), i.e., the average dissimilarity between object i and all other objects in the cluster k to which i belongs (Eq. 13). Separation is the between-cluster minimum distance b(i), i.e., the minimum, over all clusters other than k, of the average dissimilarity between object i and the objects of that cluster (Eq. 14). The silhouette coefficient is calculated for each object i from its cohesion and separation values as in Eq. 15. The average of the values of all objects is called the average silhouette width (ASW), which is the final result and lies in the range [−1, 1]. A high value indicates good quality clusters (Guerra et al. 2012).

$$a(i) = \frac{1}{n_{k} - 1}\sum\limits_{\substack{i^{\prime} \in k \\ i^{\prime} \ne i}} d(p_{i}, p_{i^{\prime}})$$
(13)
$$b(i) = \min\limits_{k^{\prime} \ne k} \left[ \frac{1}{n_{k^{\prime}}}\sum\limits_{j \in k^{\prime}} d(p_{i}, p_{j}) \right]$$
(14)
$$s(i) = \frac{b(i) - a(i)}{{\max (b(i),a(i))}}$$
(15)
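
Eqs. 13 to 15 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: it takes a precomputed symmetric distance matrix, requires at least two clusters, and assigns s(i) = 0 to objects in singleton clusters (a common convention that the text does not specify). The function names and the toy data are hypothetical.

```python
def silhouette_widths(dist, labels):
    """Return s(i) for every object, given a full distance matrix and labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # assumption: singleton clusters get s(i) = 0
            continue
        # Eq. 13: mean distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # Eq. 14: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        # Eq. 15, guarding against a zero denominator
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette_width(dist, labels):
    """ASW: the mean of s(i) over all objects."""
    s = silhouette_widths(dist, labels)
    return sum(s) / len(s)


# Hypothetical toy data: four objects on a line, forming two tight pairs.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]
print(average_silhouette_width(dist, labels))
```

For the first object, a = 1 and b = 10.5, so s = 9.5/10.5; averaging over the four objects gives an ASW of about 0.90, consistent with two well-separated clusters.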

Analysis of similarities (ANOSIM) is a test of the significance of groups that have been defined a priori (Clarke 1993). It provides a way to test statistically whether there is a significant difference between two or more groups of sampling units (Oksanen 2013). The method works on a dissimilarity matrix derived with a suitable distance index, but uses only the rank order of the dissimilarity values. If two groups of sampling units really differ in their species composition, then compositional dissimilarities between the groups ought to be greater than those within the groups. The ANOSIM statistic (R) is based on the difference of mean ranks between groups (rB) and within groups (rW), as shown in Eq. 16. R is bounded in the [− 1, 1] range: R = 1 indicates that all the most similar samples are within the same groups, R = 0 occurs if the high and low similarities are perfectly mixed and bear no relationship to the groups, and a value of − 1 indicates that the most similar samples are all outside of the groups. The null hypothesis is therefore that there are no differences between the members of the various groups. Statistical significance is assessed with a Monte Carlo permutation test. If the value of R is significant, one can conclude that there is evidence that the samples within groups are more similar than would be expected by random chance.

$$R = \frac{{\overline{r}_{B} - \overline{r}_{W} }}{n(n - 1)/4}$$
(16)

\(\overline{r}_{B}\) and \(\overline{r}_{W}\) are the means of the ranked dissimilarities between groups and within groups, respectively, and n is the total number of samples (objects).
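
The R statistic of Eq. 16 and its permutation test can be sketched in plain Python as below. This is a simplified illustration and not the routine used in the study: it assumes that tied dissimilarities receive average ranks, that group labels are permuted at random with a fixed seed, and that the p-value is computed as (hits + 1)/(permutations + 1). The function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def average_ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def anosim_r(dist, groups):
    """Eq. 16: (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    r_b = sum(between) / len(between)
    r_w = sum(within) / len(within)
    return (r_b - r_w) / (n * (n - 1) / 4)


def anosim_test(dist, groups, permutations=999, seed=0):
    """Return (observed R, permutation p-value) by shuffling group labels."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy data: four objects on a line, forming two tight pairs.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
groups = ["a", "a", "b", "b"]
print(anosim_test(dist, groups))
```

In this toy case the two within-group dissimilarities hold the lowest ranks (mean 1.5) and the four between-group dissimilarities the highest (mean 4.5), so R = (4.5 − 1.5)/3 = 1. With only four objects, however, few distinct label arrangements exist, so the permutation p-value cannot become small.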

Appendix 3

The average correction ratios of group’s penalty for the evaluator’s values

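| Classification solutions | Modified TWINSPAN | K-means | PAM |
| --- | --- | --- | --- |
| B. hyrcana dataset | | | |
| FPFI | 0.96 | 1.00 | 1.00 |
| IndVal | 0.68 | 0.91 | 0.92 |
| Indval | 1.00 | 1.00 | 1.00 |
| IndVal | 0.40 | 0.59 | 0.69 |
| Indval | 0.87 | 0.85 | 0.92 |
| r | 0.95 | 1.00 | 1.00 |
| r | 0.91 | 1.00 | 1.00 |
| r | 0.92 | 1.00 | 1.00 |
| r | 0.87 | 1.00 | 1.00 |
| T. baccata dataset | | | |
| FPFI | 1.00 | 1.00 | 1.00 |
| IndVal | 0.94 | 0.94 | 0.94 |
| Indval | 1.00 | 1.00 | 1.00 |
| IndVal | 1.00 | 1.00 | 1.00 |
| Indval | 0.94 | 0.82 | 0.88 |
| r | 0.94 | 1.00 | 0.94 |
| r | 0.94 | 1.00 | 0.88 |
| r | 1.00 | 1.00 | 1.00 |
| r | 1.00 | 1.00 | 1.00 |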

About this article

Cite this article

Asadi, H., Esmailzadeh, O., De Cáceres, M. et al. The assignment of relevés to pre-existing vegetation units: a comparison of approaches using species fidelity. Annals of Forest Science 78, 13 (2021). https://doi.org/10.1007/s13595-020-01017-0

  • DOI: https://doi.org/10.1007/s13595-020-01017-0

Keywords