Genome complexity reduction achieved with a double restriction digestion
The genome complexity reduction consisted of digestions using a combination between three rare cutter enzymes (PstI, SphI y EcoRI) and two frequent cutter enzymes (MboI y MspI) and by comparing the results with a single restriction reaction (with ApeKI). According to in silico simulation of restriction digestion of the complete peach genome, the PstI/MboI combination would produce the highest number of loci (63,730 loci) within the selected size values (300–400 bp) in relation to the other enzyme pair combinations (Supplementary Table S2). Moreover, PstI/MboI performed better than the ApeKI restriction digestion, within a range between 300 and 800 bp (Supplementary Table S2). That range corresponds to the one obtained in the regular GBS protocols, which do not perform a direct size selection, as ddRADseq do. Therefore, an indirect size selection using short PCR amplification cycles added to purification with low concentrations of Ampure Beads XP could be considered23,36.
In silico simulations were in accordance with the in vitro digestions, where PstI/MboI digestion retained the most abundant fragment population in the 300–400 bp region (Fig. 1). A preliminary estimation of the methodology by sequencing libraries from two parental of our breeding program, Dixiland and Summerprince, yielded 780,647 and 829,004 pair-end reads (2 × 250 pb), respectively (Experiment 1). An initial analysis identified 1437, 225 and 149 polymorphic and segregant SNP, InDel and SSR, respectively; which covered the 8 chromosomes of peach (data not showed). The comparisons of our results with data from previous studies was not possible, since the number of polymorphic markers between two genotypes depends on the analyzed genotypes as well as on the power of the platform. Nevertheless, the number of SNP were in the same order of the previous work that used GBS for genotyping a F2 population19. Therefore we considered that the experimental conditions were suitable and scaled up the protocol for the genotyping the whole germplasm collection (Experiment 2).
Analysis of genome coverage of the platform
In experiment 2, a plum cultivar and 191 accessions (189 peaches and 3 rootstocks) from the EEA San Pedro germplasm active collection were analyzed (Supplementary Table S1). The libraries construction in batch of 24 samples, performed as described previously23, resulted in 8 pools. The DNA content of each pool was combined and normalized according to the DNA quantity for subsequent sequencing.
The sequencing retrieved 200,759,000 of paired-end (2 × 250 bp) reads and after quality filtering, 98.3% of these reads were retained. Two samples: ‘Suncrest’ and ‘Flordaglobe’ retained hardly any reads (296 and 37,469, respectively) and, therefore, were discarded for further analyses. In average, the analysis of each sample retrieved 1.04 × 106 paired-end reads (2 × 250 pb) with a variation coefficient (VC) of 28.14%. The reads obtained per sample were between 393,149 to 2,168,460, thus accumulating around 1 M reads (Supplementary Fig. S2). No significant differences were observed between total reads obtained from DNA extracted by CTAB method or commercial DNA extraction kits (data not showed). As expected, the higher the read number, the higher the breadth and depth coverage will result (Fig. 2). At around 1.5 × 106 reads, the breadth coverage seems to reach a steady state of 5% with a minimum depth of 15×. The alignment of all data merged like a single individual gave a total coverage of the peach genome of 25%, with a mean depth of 15× (ranging from 7× to 27×).
The peach genome v2.04 was separated in bins of 1000 bp and the reads obtained from each sample were mapped into bins to analyze the overlapping coverage. Although is possible that two or more reads (of 250 bp) found in a bin (of 1000 bp) could not actually overlap, we consider that bin size to simplify computation requirements. Therefore, the results of the analysis were taken as an estimation of actual common coverage between samples. As a result, the reads were evenly distributed around the eight chromosomes with the exception of the regions predicted to harbor the centromeres (Fig. 3). Most of the bins had less than 300 reads in average for each genotype. Surprisingly, a bin on chromosome 1 at positions 14,777,945–14,778,945 (Pp01-14,777,945–14,778,945) accumulated 10× more reads (2698) than the average (Supplementary Fig. S3). A blast analysis of the sequence of peach genome at that region showed high homology with mitochondrion sequences.
The overlapping of genome coverage between samples was analyzed by inspecting the correlations of the reads number mapped in bins among the sample set. A high correlation between two libraries indicates that in average reads fall in the same bins and therefore a high proportion of overlapping genome regions are covered in that two samples. The heatmap revealed four blocks (I, II, III, IV) of highly correlated samples (Fig. 4a). Block I consisted of the two samples used for the fine tuning of the platform (experiment 1). For experiment 2, 8 pools of 24 libraries each were prepared but different grade of similarities between the pools were revealed by the correlation analysis (Fig. 4a). Pools 1–5 formed block II, which indicated similar coverage of genome. Similarly, Pools 6–7 and most of the samples of Pool 8 showed a good coverage between them (Block III). The last 10 samples of pool 8 formed a separate block (IV). As expected, a failing sample (‘Suncrest’) showed a very low correlation (revealed with dark blue in the heatmap) with the rest of the samples. It is important to mention that 296 reads of this sample had been retained after quality check. The plum sample (in Pool 8) also presented low correlations with all peach samples, thus reflecting the genome differences of the two species. Pools 1 and 6 showed the lowest correlation between pools.
The correlations obtained for all pairs of peach samples, except the two failed samples (‘Suncrest’ and ‘Flordaglobe’), ranged from 0.3805 to 0.9798, accumulating around 0.93 (Supplementary Fig. S4). The high correlation within samples of a block translates to a high number of common sites sequenced in all the member of a block. The common sites varied between 2,725,815 for block IV to 5,881,715 for block I (Fig. 4). The failed samples (‘Suncrest’ and ‘Flordaglobe’) and the plum were not considered for common sites determinations.
To assess how the experimental conditions improve the common coverage between samples, we analyzed the number of common sites per pool against 8 artificial pool created in silico by mixing samples from different experimental pools (Supplementary Table S1). The experimental pools showed more common sites than the artificial pools (Fig. 4b), in average 10% more (3,376,341 vs 3,068,163, respectively, α < 0.01, n = 8). Taking into account both experiments, 2,026,509 common sites between the 191 peach samples were scrutinized with the platform described here.
To get further understanding of how experimental conditions affect the overlapping coverage between samples, a PCA was performed with the reads mapped on bins without taking into account the plum and the two failed samples (Fig. 5). A wide proportion of the variance (82.87%) is explained by PC1, which separated samples without an obvious trend (i.e. not according the extraction method or batch of analysis: experiments/pools). The dispersion of samples along the PC1 correlated with the number of reads obtained for each sample (Supplementary Fig. S5). Samples from Blocks I and II are separated from samples from Blocks III and IV according to PC2, which explains 8.03% of the variation. The samples within Block IV are separated by PC5, which accounted for 1.04% of the variance (Supplementary Fig. S6).
Variant identification and genotyping
To assess the overall power of the platform developed here, the number of variants detected was analyzed. The sequences obtained for all the peach accessions (including rootstocks) comprising 191 genotypes (two from experiment 1 and 189 from experiment 2, in which the plum and the 2 failed libraries were discarded) were analyzed together. The analysis retrieved 113,411 SNP, 13,661 InDel and 2133 SSR polymorphic variants against the peach genome v2.0 in the whole sample set.
A common drawback of the RADseq platforms is the proportion of missing data between samples17. For this reason, we analyzed the number of variants in common along the sample set to assess the platform in this regard (Fig. 6). Taking into account the variants present in only one sample, we identified 3674 SNP, 4318 InDel and 362 SSR. On the other hand, 6028 SNP, 600 InDel and 191 SSR are genotyped in the whole sample set (191 peach accessions). The distribution of the variants followed similar trends, as SNP, InDel and SSR accumulated in few samples or in almost the complete set (191 samples). In the case of SNP, the variants in common dropped slowly until 40 samples and increased sharply from 181 samples to reach a higher number of variants in common in the complete set (6028 SNP). For InDel and SSR, the variants in common dropped sharply at 5 samples and then increased from 189 samples, thus reaching a lower number of variants in common than in that found in one sample. In summary, 50% of the SNP (55,719/113,411) are present in 25 or less samples, whereas, for InDel, the 50% of the variant (6800/13,661) are present in 4 samples or less. For SSR, 4 samples or less account for 25% of the total number of markers identified. On the other hand, 6028 SNP, 600 InDel and 191 SSR are genotyped in the whole sample set (191 peach accessions), thus, identifying one variant for each 297 sites strutted (2,026,509 common site/6819 variant identified).
Several criteria, regarding missing data and minor allele frequency (MAF) accepted, could be taken according the downstream analysis to be conducted37. Supplementary Table S3 displays the data sets obtained (for the 191 peach accessions) according to different criteria. In this section, we will restrict our analysis to the data set obtained according to a < 5% of missing data and a minor (MAF) equal to or greater than 1%. This dataset contains 9325 variants, which comprise 7967 SNPs, 980 InDels and 378 SSR. The SNPs are biallelic with 1,521,697 data points (191 × 7967), of which 1.33% are missing data and 14.54% heterozygous positions. The Ts/Tv ratio reached is 1.33, with 225,942 transitions (Ts) and 169,551 transversions (Tv). Regarding the 980 InDels, 91 are triallelic and the remaining 889 are biallelic, with 1071 alternative alleles to the reference genome. The Deletion/Insertion ratio against the peach genome is 0.93, with 515 deletions and 556 insertions. The allele length difference (between the alternative and the reference allele obtained from the peach genome v2.0) was from 1 to 31 bp, with a mode of 1 bp (Supplementary Fig. S7). Almost half of the InDels have a length difference of 1–2 bp (45.75%), 42.67% have a length difference of 3–10 bp, and 11.58% have a difference length of 11–31 bp. With 187,180 (191 × 980) data points for InDel, 1.09% correspond to missing data and 19.23% to heterozygous positions.
In the 378 SSR found in the analysis, 152 are biallelic, 216 triallelic and 10 tetraallelic. The motif length of SSR was from 1 to 6 nt, with 71 (18.78%) mononucleotide, 272 (71.96%) dinucleotide, 22 (5.82%) trinucleotide, 8 (2.12%) tetranucleotide, 3 pentanucleotide (0.79%) and 2 (0.53%) hexanucleotide (Supplementary Fig. S8). Only 3 mononucleotide motifs were found in the analysis: T (39), A (31), and C (1). Regarding the dinucleotide motifs AT (95), AG (92) and CT (63) were the most abundant, whereas GT (13) and AC (9) were the less frequent. For the rest of the motifs, different combinations of nucleotides occurred in low proportion (Supplementary Fig. S8). In addition, following Webber’s criterion38, 12 of the SSR are imperfect and the remaining 366 are perfect. In the 72,198 (191 × 387) data points for SSR, 1.25% has missing data and 30.38% heterozygous positions.
The potential of the platform was assessed for functional variant identification by analyzing the predicted effect of the markers on the peach genome. The 9325 identified variants may cause 42,509 putative effects, according to the analysis. The high number of predicted effects could be due to the presence of multiple transcripts for a gene and to the fact that the analysis takes into account the effect of each one. Moreover, some genes overlap, so a single variant could affect multiple transcripts on multiple genes, with different effects35. Our study identified 89 genes of high impact, 1532 of moderate impact, 2341 of low impact, and 38,553 modifiers (Supplementary Fig. S9). Regarding the affected genomic region, the most affected areas are the downstream (up to 5 kb downstream of polyA addition site), upstream (up to 5 kb upstream of the transcription start site) and intronic region, with an impact of 32%, 26% and 19%, respectively (Supplementary Fig. S10). Importantly, although in lower proportion, many areas of interest were affected. For example, 3510 (8.25%), 1471 (3.46%) and 996 effects (2.34%) take place in the exonic region, 3′ UTR and 5′ UTR, respectively (Supplementary Fig. S10). From these data, we determined 1946 synonymous substitutions and 1497 nonsynonymous substitutions.
The distribution of variants along the peach genome was analyzed and compared to the 9K SNP Infinium II (Fig. 7 and Supplementary Fig. S11). The SNP covered all the 8 peach chromosomes, with the exception of the region near to the centromeres, as the case of the SNP array. The platforms shared only 133 SNPs in common. In our platform, most of the 1 kb-bin covered has 1 SNP, although the platform allowed the discovery of some hot spot of density, for example at the top of Chromosome 2 (Pp02) and the bottom of Chromosome 4 (Pp04) (Fig. 7). Despite covering less proportion of the genome, the InDel and SSR were detected in all chromosomes (Supplementary Fig. S11).