Skip to main content

A genomic dataset of single‐nucleotide polymorphisms generated by ddRAD tag sequencing in Q. petraea (Matt.) Liebl. populations from Central-Eastern Europe and Balkan Peninsula

A Correction to this article was published on 28 June 2021

This article has been updated

Abstract

Key message

This genomic dataset provides highly variable SNP markers from georeferenced natural Quercus petraea (Matt.) Liebl. populations collected in Bulgaria, Hungary, Romania, Serbia, Bosnia and Herzegovina, Kosovo* and Albania. These SNP loci can be used to assess genetic diversity, differentiation, and population structure, and can also be used to detect signatures of selection and local adaptation. The dataset can be accessed at https://doi.org/10.5281/zenodo.3908963/ (Tóth et al. 2020 ). Associated metadata available at https://metadata-afs.nancy.inra.fr/geonetwork/srv/fre/catalog.search#/metadata/b6fee4fa-01e9-44d0-92f5-ad19379f9693 .

1 Background

In the last decade, restriction site–associated DNA (RAD-seq) sequencing is increasingly used to identify and genotype large numbers of single-nucleotide polymorphisms (SNPs) in various organisms (Allendorf et al. 2010; Davey and Blaxter 2010). Due to the advancements of high-throughput sequencing, this approach allows to produce extremely large collections of genomic data that is fairly evenly distributed across the genome providing hundreds of thousands of polymorphic loci (Allendorf et al. 2010; Seeb et al. 2011). A variation of the approach, known as ddRAD-seq (double-digest RAD-seq), has been widely applied for population genomic analyses in non-model organisms, such as forest tree species, lacking available reference genomes (Konar et al. 2017). ddRAD-seq is suitable for genotyping a large number of loci to determine the extent of genetic diversity, to identify outlier loci being under selection (Schwartz et al. 2010; Burak et al. 2018) and to infer eco-evolutionary population dynamics (Legrand et al. 2017; Hendry 2019).

While forest tree populations are well-studied in Europe, the Central-Eastern European region, including the Balkan Peninsula is less investigated with high-resolution and genome-wide genetic markers. In fact, this region has high topographic and climatic heterogeneity, and long-term environmental stability that significantly contributed to long-term persistence of genetic diversity (Tzedakis 2004; Feliner 2011; Fassou et al. 2020). In addition, the Balkan Peninsula has been postulated to be the primary source of post-glacial expansion while functioning as a Pleistocene glacial refugia (Zhelev 2017; Gömöry et al. 2020). Therefore, this area is considered an important reservoir of genetic resources of European forest tree taxa (Gömöry et al. 2020).

Sessile oak (Quercus petraea (Matt.) Liebl.) is a widely distributed late‐successional forest tree species of the white oak group (sect. Lepidobalanus) with great ecological and silvicultural importance in Europe (Mölder et al. 2019; Blanc-Jolivet et al. 2020). However, the species’ distribution towards the southern limit on the Balkan Peninsula tends to be fragmented (Bruschi et al. 2003; Eaton et al. 2016), and being under intensive forest management and severe ecosystem disturbances such as drought (Milad et al. 2011). Similarly, species’ intraspecific genetic differentiation increases towards the southeast due to its complex evolutionary history. Since current studies mostly focusing on western European populations, the southeastern European distribution range is poorly sampled and investigated (Lang et al. 2020; Leroy et al. 2017, 2020; Kremer and Hipp 2020). This inequality might potentially lead to underestimation of regional genetic diversity. For these reasons, we generated a novel resource of genome-wide SNPs, using ddRAD-seq, in carefully selected natural populations.

2 Methods

2.1 Sampling strategy

The sampling of plant material was designed to cover the southeastern natural distribution range of Q. petraea. Sites were selected based on literature data and personal communication with local forestry experts (contributing authors). For literature data, available information on neutral genetic patterns in European white oaks was reviewed (Dumolin-Lapegue et al. 1997; Gömöry et al. 2001, 2020; Bordács et al. 2002; Csaikl et al. 2002; Petit et al. 2002a, b; Slade et al. 2008; Dering et al. 2008), and the geographic regions corresponding to the major genetic clusters were included in a candidate list of locations. Existing databases of forest genetic resources also contributed in sampling site selection. The European Information System on Forest Genetic Resources (EUFGIS, 2019; http://portal.eufgis.org/), the georeferenced database of forest conservation units, and the European Forest Reproductive Material Information System (FOREMATIS, 2019; https://ec.europa.eu/forematis) were searched and filtered, and the records of “source-identified” and “autochthonous/indigenous” locations were taken into account. Altogether, 180 individuals from 18 autochthonous populations were selected from Central-Eastern Europe and from the Balkan Peninsula (Table 1, Fig. 1). Out of these, four populations situated in Bulgaria (western Balkan, Rila, Rhodope, and Strandzha mountains), three in Hungary (Kőszeg, Bakony, and Mecsek mountains), Romania (Gurghiu, northern and southern foothills of Fagaras mountains), Serbia (Fruska Gora, Rudnik, and Stovoli mountains), Bosnia and Herzegovina (Kozara, Javorova, and Maglic mountains), and one in Kosovo* (Blinaja) and Albania (Djeravica Mt.), respectively (Tóth et al. 2020). Our sampling sites are found in natural sites either as part of nature reserves or without non-natural disturbance and human influence. Accordingly, sessile oak populations were regenerated naturally. Fresh leaves were collected from even-aged mature trees (> 70‐year‐old) by considering at least a 30-m isolation distance between trees to limit sampling-related individuals. Spatial coordinates (GPS) of each tree in each population were recorded. Leaf samples were both dried using silica gel and stored frozen (− 20 °C) in plastic bags before DNA extraction.

Table 1 Summary information on sampling locations of Q. petraea populations
Fig. 1
figure 1

Distribution of Q. petraea sampling locations across Europe (orange triangles). Light green areas represent the natural distribution of the species according to EUFORGEN database (http://www.euforgen.org/)

2.2 Library preparation and RAD-tag sequencing

Total genomic DNA was extracted from seven leaf disks (5-mm diam.) per each individual using the method of Dumolin et al. (1995).

DNA of each Q. petraea sample was quantified by using the Qubit dsDNA BR Assay Kit and Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). A total of 50 ng of DNA per sample was double-digested with 0.1–0.1 μl of PstI and MspI enzymes at 37 °C for 2 h (FastDigest restriction enzymes; Thermo Fisher Scientific, Waltham, MA, USA). The enzyme combination was selected based on the study of Cumer et al. (2018) and a preliminary restriction site analysis carried out in CLC Genomic Workbench version 12.0 (QIAGEN Bioinformatics, Hilden, Germany). Fragments were double-sided size selected using KAPA PureBeads, with a 0.55–0.80X solution/bead ratio (Roche, Basel, Switzerland) in order to isolate fragments in the range of 300–600 bp. After quantification, inserts (3 ng) were ligated to adapters (Table 2) with using T4 DNA Ligase according to the manufacturers’ protocol (Thermo Fisher Scientific, Waltham, MA, USA). Ligation products were purified using 0.8 vol KAPA PureBeads (Roche, Basel, Switzerland) and amplified by PCR with NEBNext Multiplex Oligos for Illumina (Dual Index Set 1; New England Biolabs, Ipswich, MA, USA) and KAPA HiFi Hotstart Ready Mix (Roche, Basel, Switzerland). The amount of 0.5–0.5 μl of i5 and i7 indexed primers was used per reaction. Thermal cycling conditions were as follows: a 3-min initial denaturation at 95 °C; 17 cycles of 30 s of denaturation at 95 °C, 30 s of annealing at 55 °C, and 30 s extension at 72 °C; and a final 5-min extension at 72 °C. The quality and quantity of the amplicon library were determined by using High Sensitivity DNA1000 ScreenTape system with 2200 Tapestation (Agilent Technologies, Santa Clara, CA, USA) and dsDNA HS Assay Kit with Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA), respectively. Equimolarily pooled libraries were diluted to 10 pM for 2 × 301-bp paired-end sequencing with 600-cycle sequencing kit v3.1 (Illumina, San Diego, CA, USA). Nucleotide sequences of the libraries were determined on a MiSeq Sequencing System (Illumina, San Diego, CA, USA) according to the manufacturer’s protocol. Raw data have been deposited in the NCBI Sequence Read Archive (SRA); BioProject ID: PRJNA699096.

Table 2 Oligonucleotides sequences of adapters used in ddRAD-Seq

2.3 De novo assembly and SNP calling

All bioinformatics processing was carried out on a Silicon Computers (SGI) HPC server, allocating 40 cores (80 threads) and 38 GB RAM, located at the University of Sopron (UOS), Sopron, Hungary.

Raw short-read sequences (77,101,088) were demultiplexed and adapter-trimmed by using the MiSeq Control Software (Illumina, San Diego, CA, USA). In total, 69,993,001 sequences contained adapter sequences kept for further processing. Next, the implemented FastQ Toolkit was applied to trim bases at the 3′ and the 5′ end with a quality score lower than 30. Reads having a mean quality score less than 30 and shorter than 200 bp were filtered. Computational processing of short-read data was carried out with Stacks 2.0 (Catchen et al. 2013; Rochette et al. 2019). Whole reads were quality-filtered using a sliding-window method (15% of read length) implemented in “process_radtags.” Reads having a quality score below 90% (raw phred score of 10) were discarded (Catchen et al. 2011). In “process_radtags,” reads were truncated to 200 bp as a prerequisite of further processing and to avoid the lower-quality bases to present at the end of the reads (Catchen et al. 2011). During this filtering step, 42,273 sequences were discarded. In addition, one individual was removed from the dataset (BU2-10) since an insufficient number of high-quality reads remained after filtering. RAD loci were reconstructed following the de novo pipeline implemented in “denovo_map.pl” command, in which “ustacks” builds loci and calls SNPs in each sample, then “cstacks” creates a catalogue of all loci for all the samples and “sstacks” matches the loci of the samples against the catalogue (Catchen et al. 2011; Rochette et al. 2019). To execute “denovo_map.pl,” it is essential to optimize the parameters of the command, and define (m) the minimum number of reads required to form an allele, (M) the maximum number of mismatches allowed between two alleles, and (n) the maximum number of mismatches allowed between two individual loci to consider them as homologous (Mastretta-Yanes et al. 2015; Paris et al. 2017). We optimized these parameters throughout the “r80” method, which can effectively maximize the number of SNPs found in 80% of the individuals (Paris et al. 2017). We checked the iterative values (ranging from 2 to 10) of M and n how they affect the gained SNPs (Fig. 2a). After, based on this, we chose the threshold of M = 3, n = 3, applying the M = n rule for the final run (Paris et al. 2017). For the stack-depth parameter, we selected the default m = 3 value (three identical reads). The program “gstacks” was used to assemble paired-end contigs and re-call SNPs using the population-wide data (Rochette et al. 2019).

Fig. 2
figure 2

a Plot of the number of new polymorphic loci (r80 loci) gained for each iteration of M (the distance between stacks) and n (the maximum number of mismatches allowed between loci) of the preliminary test run of the “denovo_map.pl” pipeline in Stacks (Catchen et al. 2013; Rochette et al. 2019). The orange dot corresponds to the selected value. b Box plot of mean coverage of reads for each population (including all individuals)

The “denovo_map.pl” pipeline aligned paired-end reads for a total of 112,750 RAD loci in which 1490 loci had paired-end reads that could not be assembled (1.3%). The remaining 111,260 loci (98.7%) had an average contig size of 361.8 bp. The effective per-sample coverage resulted in 14.3 × (mean) with a 2.2 × standard deviation (min = 10.1 × , max = 27.0 ×) (Fig. 2b). Population-specific statistics are presented in Table 3.

Table 3 Detected polymorphisms and genomic coverage of Q. petraea populations

The set of SNPs were called and further processed with the “populations” program (Catchen et al. 2011), in which markers with a minor allele frequency < 0.05, missing individual rate > 0.8, and significant deviation from Hardy–Weinberg equilibrium (HWE, P < 0.001) were filtered out (Catchen et al. 2011). However, the “minimum number of populations” value was set to 1 (SNPs were kept when they were present in at least one population), allowing to be later filtered by the user for the specific research question. Out of the 111,260 RAD loci, 105,326 loci did not pass the missing rate of sample/population and were below the MAF threshold; also, 1099 loci were blacklisted and discarded having variants with significant HWE exception. The final dataset consisted of 5934 loci, composed of 2,443,502 bases, of which 21,951 are highly polymorphic variant sites (SNPs).

3 Access to data and metadata description

The SNP dataset is available from the ZENODO repository; https://doi.org/10.5281/zenodo.3908963 (Tóth et al. 2020). The data records are described in the metadata description files, available at https://metadata-afs.nancy.inra.fr/geonetwork/srv/fre/catalog.search#/metadata/b6fee4fa-01e9-44d0-92f5-ad19379f9693.

The data file presented in this paper is a PLINK-formatted UTF-8 encoded, white-space (tab)-delimited, set of ped and map files. PLINK is a publicly available and widely used software for genomic data manipulation and analysis, capable of converting to other file formats and can perform several population-based tests (Purcell et al. 2007). Within the “.ped” file, the first column corresponds to the population code (Pop_ID: 1–18), the second to the individual identifier. From column three to six, values are not provided (father ID, mother ID, sex, and phenotype are not defined). From column seven, two columns code an observed allele at each SNP position. Missing data are coded as “0 0” as it is defined in the standard PLINK data format (PLINK 1.07; http://zzz.bwh.harvard.edu/plink/). The “.map” file contains four columns, the first is the chromosome number (un: unknown due to de novo assembly), the second is the SNP ID, the third is the SNP genetic position (0: not defined), and the fourth column contains the physical position (bp) within the RAD loci. Users do not need to combine or merge the files; it is ready to be loaded and analyzed. Population IDs (Pop_ID), individual IDs (Ind_ID), and the corresponding GPS coordinates (GPS_1: latitude and GPS_2: longitude) are located in the Appendix Table 5. Further information can be requested from the corresponding author.

4 Technical validation

Validation of the dataset was performed with manual, visual, and numerical tests. The quality of raw short-read data, of all individuals (179), was assessed using FastQC v0.11.9 (Andrews 2010) by investigating per-base quality and per-sequence quality at three steps: at raw read state directly after sequencing, after sequence quality trimming (3′ and 5′ end), and following sequence processing (whole read filtering). Results were consistent with a standard Illumina run, and therefore considered to be of sufficiently high quality for further analysis (Kircher et al. 2011).

5 Reuse potential and limits

Data papers on genome-wide polymorphisms are extremely scarce especially on geographic regions that are sources of biodiversity and in case of forest trees lacking a reference genome. Our data can significantly contribute to the growing number of RAD-seq studies by providing genotype information on species’ rarely investigated southeastern distribution range.

To evaluate the potential of the data, at the population level, we calculated genetic diversity indices including the expected heterozygosity (He), observed heterozygosity (Ho), and the inbreeding coefficient (FIS) with R (R Core Team 2013) using the “adegenet” (Jombart and Ahmed 2011) and “hierfstat” (Goudet 2005) packages (Table 4). Trends of our resulting indices were compared to Blanc-Jolivet et al. (2020), since up until now, no SNP dataset (i.e., SNP genotype collection) from genomic resources has so far been published for Q. petraea. Since our estimators are in accordance with the Blanc-Jolivet et al. (2020) data, as well as with other publications, we can confirm that our panel of markers is suitable for population genomic studies.

Table 4 Diversity statistics calculated for each Q. petraea population during the data validation process

It is important to note, however, that polymorphisms from reduced representation methods, such as RAD-seq, might introduce biases in population genetics estimates (Arnold et al. 2013; Andrews et al. 2016), and phylogenetic reconstructions and genomic mappings might require even larger genomic coverage as well as higher SNP density (Sims et al. 2014; Eaton et al. 2017). Despite of these, our generated high-quality SNP resource consisting of 21,951 SNPs provides a valuable tool for future population genetics and genomics studies allowing to carry out wide variety of reference-genome-free investigations (Peterson et al. 2012). The available spatial information (GPS coordinates; Table S1) allows the analyses of spatial genetic pattern, structure (including admixture), and gene flow and hybridization. Due to the relatively large-scale genome representation, fine-scale determination of relationship among individuals and populations is also possible. For example, in the context of phylogeography, our genotype data may supplement studies on evolutionary history, migration from glacial refugia, and past hybridization events increasingly studied nowadays (Leroy et al. 2017, 2020). Since ddRAD-seq has the potential to generate strongly differentiating markers, our data could be used to investigate patterns of divergent selection processes, thus can infer population-level adaptation.

Overall, this dataset provides a unique opportunity to study the genetic background of natural populations and reveals important ecological and evolutionary insights into the present and future distribution of sessile oak.

Data availability

The datasets generated during and analyzed during the current study are available in the ZENODO repository, https://doi.org/10.5281/zenodo.3908963/.

Change history

References

Download references

Acknowledgements

The authors are grateful to Ivan Iliev (University of Forestry, Bulgaria), Lazar Kesic and Velisav Karaklić (University of Novi Sad, Serbia), Zoltán Barina (Hungarian Natural History Museum, Hungary), Ibrahim Muja (Ministry of Agriculture Forestry and Rural Development (MAFRD), Kosovo*), Naser Krasniqi (Kosovo* Forest Agency, Kosovo*), Vlado Kaurin, Alen Gačić, Aleksandar Jorgić, Stojan Božić, Mijović Goran, and Spomenko Stojanović; the PFE “ŠumeRS” a.d. Sokolac (Bosnia and Herzegovina); the Institute of Lowland Forestry and Environment (project number: 451-03-68/2020-14/200197); and the Romanian forest authorities from Sibiu and Arges counties for the assistance they offered with the collection of plant material and for the valuable discussions in the course of which they shared their insights. We also thank Zoltán Bihari (Xenovea Ltd.) for providing sequencing resources and useful comments on data analysis.

Funding

This study was supported by the Ministry of Agriculture of Hungary (Kaán Károly project, no.: EVgF/549/2018, EGF/178/2019), by the Ministry of Education, Science and Technological Development of the Republic of Serbia (project no.: 451–03-68/2020–14/200197), and by the National Science Fund of Bulgaria (research grant no.: KP-06-H26/4).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Endre Gy. Tóth.

Additional information

Handling Editor: Marianne Peiffer

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contribution of the co-authors

EGT, ZAK, KC, ABE, and ABO conceived the study. EGT, ZAK, KC, ABE, JDK, RT, VT, PÁ, SS, EV, MM, VD, ET, PZ, SO, and ABE contributed in sampling of plant materials. KC performed laboratory analysis. EGT and ZAK processed genomic data and performed statistical analyses and simulations. EGT wrote the manuscript. All co-authors provided feedback on the manuscript drafts.

This designation is without prejudice to positions on status, and is in line with UNSCR 1244 (1999) and the ICJ Opinion’s on the Kosovo* Declaration of Independence.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 30 KB)

Appendix

Appendix

Table 5 Identifiers of individuals (Ind_ID) and populations (Pop_ID), and their corresponding GPS coordinates (GPS_1: latitude and GPS_2: longitude)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Tóth, E.G., Köbölkuti, Z.A., Cseke, K. et al. A genomic dataset of single‐nucleotide polymorphisms generated by ddRAD tag sequencing in Q. petraea (Matt.) Liebl. populations from Central-Eastern Europe and Balkan Peninsula. Annals of Forest Science 78, 43 (2021). https://doi.org/10.1007/s13595-021-01051-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13595-021-01051-6

Keywords