# Aquila enables reference-assisted diploid personal genome assembly and comprehensive variant detection based on linked reads

Latest Research

Aquila is organized in four conceptual modules that correspond to the following python steps (Fig. 1): Aquila_step1.py: Haplotyping module + contiguity module (first part: partitioning). Aquila_step2.py: Local assembly module + contiguity module (second part: concatenation). Variation Module: Aquila_assembly_based_variants_call.py and Aquila_phasing_all_variants.py.

### Pruning unreliable variants (haplotyping module)

Accurate haplotyping requires filtering out incorrectly genotyped variants and false positives due to sequencing error. We performed an empirical analysis for 10X data to investigate the alternate allele frequency (Ralt/ref) and coverage per variant (dvar) that could be used as metrics to find erroneous calls. Allele frequency (Ralt/ref ≥ 0.25) was used for a cutoff, and a 2-tailed percentile cutoff was used for coverage per variant (10%*avg_cov ≤ dvar ≤ 90%*avg_cov, avg_cov: average read coverage per variant). The haplotyping algorithm was further improved by sacrificing a small amount of low-confidence heterozygous variants. SNP quality (13 by default) is the final free parameter used to prune variants.

### Inference and phasing of original long-fragments (haplotyping module)

All DNA fragments are first reconstructed by aligning short reads to the human genome reference (Hg38) by a barcode-aware alignment strategy (Longranger align, https://support.10xgenomics.com/genome-exome/software/pipelines/latest/installation). Aquila then sorts all reads by barcodes and positions, and collects the reads with the same barcode to reconstruct each fragment. There is a threshold to differentiate two molecules with the same barcode when the distance between two successive reads with the same barcode is larger than 50 kb (50 kb by default, free parameter). After reconstructing all fragments, Aquila assigns the alleles of heterozygous SNPs to each fragment by scanning the reads belonging to each fragment and comparing to a VCF file generated by FreeBayes. At a heterozygous locus “0” is the reference allele and “1” is the alternate allele.

For each pair of heterozygous variants, if the even parity was correct where one haplotype supported “00”, and the other haplotype supported “11”, the odd parity must then have been caused by a sequence error with some fragments supporting “01”, and other fragments supporting “10”, and vice versa. In rare scenarios, the fragment could have two sequencing errors when the even/odd parity was correct, but the fragment supported the complementary haplotype (e.g., the haplotype is “00”, and the fragment supports “11”). For each fragment with at least two heterozygous SNPs, Aquila records all neighboring pairs of heterozygous variants. It then applies a Bayesian model (see below) to evaluate if even or odd parity is correct, and the clusters with the parity caused by sequencing error are excluded from further steps. Importantly, the excluded clusters are due to the variants caused by sequencing errors, not the molecules themselves, which means if these molecules still contain other pairs of heterozygous variants with consistent haplotype with the correct parity, they are still used for haplotyping.

Aquila then performs a recursive clustering algorithm in two haplotypes to aggregate bigger clusters/phase blocks. Two clusters are merged if the number of molecules in both of them supporting the same haplotype exceeds a threshold. This threshold is set to 3 by default, which corresponds to a merging error percentage ≤ ((1−p1)(1−p2))3, for each pair of variants, if each variant matched the true variant with probability p1 and p2, respectively. Aquila sorts all pairs of clusters by the positions of reads of all molecules in each cluster. When two locally successive clusters are merged into one single cluster, the corresponding clusters of the other haplotype are merged too. The resulting pairs of clusters are sorted again for the next iteration. The sorting algorithm complexity is ~O(Nvar log Nvar), where Nvar is the total number of heterozygous variants. Aquila performs clustering recursively until no more clusters can be merged based on the supporting threshold.

The result of this step are pairs of clusters where each pair corresponds to one diploid phase block. For each phase block, Aquila then performs haplotype construction and extension when each heterozygous variant is supported by all the molecules that cover it. When there are multiple molecules supporting inconsistent genotypes for a variant, that variant is excluded from further steps. To further extend the phase blocks, Aquila similarly performs recursive clustering when two phase blocks have a number of overlapping variants greater than a certain threshold. The threshold is set to 5 by default so that the merging error due to sequencing error p is ≤p5. When no more phase blocks can be merged the process has converged.

In the final step, all the molecules/fragments with at least two heterozygous variants are assigned to a phase block based on the variants in the final phase block, and a maximum-likelihood estimation is applied. Given a haplotype H and a molecule M, we apply the theta function (theta (H_i,M_i) = 1) if Hi = Mi and 0 otherwise. Given pi, the probability the allele call at variant i in molecule M is correct, the likelihood of observing molecule M is

$$p(M|p,H) = {Pi}theta (H_i,M_i)p_i + (1 – theta (H_i,M_i))(1 – p_i)$$

(1)

Similarly, given the complementary haplotype Hc and molecule M, a theta function gets the value (theta (H_{{mathrm{{c}}}i},M_i) = 1) if Hci = Mi and 0 otherwise. The likelihood of observing molecule M is: (p(M|p,H_{mathrm{{c}}}) = {Pi}theta (H_{{mathrm{{c}}}i},M_i)(1 – p_i) + (1 – theta (H_{{mathrm{{c}}}i},M_i))p_i). To assign the final phase block to each molecule M, Aquila needs to find the haplotype j in the final phase blocks that meets ({mathrm{argmax}}_jleft( {pleft( {M{mathrm{|}}p,H^j} right) – pleft( {M{mathrm{|}}p,H_{mathrm{{c}}}^j} right)} right)).

### Probability model to determine correct joint key (haplotyping module)

At each position, the genome has two complementary keys matching the true two haplotypes: 00, 11 (even parity) or 01, 10 (odd parity). Each variant matches the true variant with probability p1 and p2, respectively. The probability that a sequence key will have the correct parity is pc =  p1p2+(1−p1)(1−p2), since both variants could match the true variants or both variants are called wrong. Let N be the number of sequences observed that have both variants, and k be number of sequences with key of even parity, and (N−k) be the number of sequences with key of odd parity. Using Bayes’ theorem to test P(B|A):

Where A: k out of N molecules have keys with even parity, B: true key is even parity.

$$Pleft( {B{mathrm{|}}A} right) = frac{{Pleft( {A{mathrm{|}}B} right)P(B)}}{{P(A)}}$$

(2)

Aquila will accept even parity is correct if P(B|A) exceeds a significance level (e.g. >0.99), where:

P(B|A) is the probability of the true key being even parity given k out of N molecules have keys with even parity.

P(A|B) is the probability of k out of N molecules having keys with even parity given the true key is even parity, which is (Pleft( {A{mathrm{|}}B} right) = left( {begin{array}{*{20}{c}} N \ k end{array}} right) cdot p_{mathrm{{c}}}^k(1 – p_{mathrm{{c}}})^{N – k}).

P(A) is the probability of k out of N molecules have keys with even parity, which is (Pleft( A right) = left( {begin{array}{*{20}{c}} N \ k end{array}} right)/left( {left( {begin{array}{*{20}{c}} N \ 1 end{array}} right) + left( {begin{array}{*{20}{c}} N \ 2 end{array}} right) + ldots + left( {begin{array}{*{20}{c}} N \ k end{array}} right) + ldots + left( {begin{array}{*{20}{c}} N \ N end{array}} right)} right) = left( {begin{array}{*{20}{c}} N \ k end{array}} right)/2^N).

P(B) is the probability of even parity being correct, P(B) = 1.

### High-confidence partitioning point profile generation (assembly module)

Large phase blocks (free parameter:– block_threshold = 200 kb by default) are cut into multiple small chunks of a specific length (free parameter: –block_len_use = 100 kb by default) based on a high-confidence partitioning point profile. This is done to make assembly faster and more tractable, by avoiding too many reads being given to the assembler. This profile is generated based on three criteria: 1. Expected reads coverage (C), 2. Expected physical coverage (CF), 3. 100-mer uniqueness. Read depth and fragment/physical coverage for each position is calculated after reconstructing all the fragments. The 100-mer uniqueness file34 for hg38 processed and included in Aquila, was downloaded from http://hgdownload.soe.ucsc.edu/gbdb/hg38/hoffmanMappability/k100.Unique.Mappability.bb

Each locus/position is defined to be a high-confidence partitioning point if C(partitioning point) > C(average)*0.8, CF(partitioning point) > CF (average)*0.8, and locus ϵ 100-mer uniqueness. Aquila uses these high-confidence partitioning points in the profile as reference points to partition reads before assembly (see next section) and later to reconnect the resulting mini-contigs into contigs (Supplementary Fig. 1).

### Local assembly within small diploid chunks and stitching contigs (assembly and contiguity modules)

For each phase block (Supplementary Fig. 1), Aquila records its original long molecules and their corresponding short reads. Large phase blocks are cut into small chunks (see previous step) to perform local assembly with SPAdes35 within each phase block for both haplotypes separately. (SPAdes is included in the Aquila package.) Those resulting minicontigs from neighboring small chunks that are bounded by the same partitioning point are concatenated. 99% of partitioning points met this criterion. For each concatenating iteration, the previous concatenated contig is used for the next iteration of concatenation. At the end of this step, Aquila has generated contigs for both haplotypes in each original phase block. The algorithm complexity is ~NchunksO(Tonechunk), where Nchunks is the number of small chunks and Tonechunk is the time for finishing assembly of one small chunk.

### Variation detection for assembled contigs (variation module)

To generate SNPs, small Indel, and SV calls from the de novo assemblies, Aquila uses the contig file of the haplotype 1 and haplotype 2 of each phase block. Minimap236 and paftools (https://github.com/lh3/minimap2/tree/master/misc) are integrated and applied to call variants from each (haploid) contig (“-cx asm5 –cs” is applied for minimap2, and “-l 1 -L 1 -q 20” is applied to paftools). For contig alignments and variant discovery, contigs ≤ 1 kb are filtered out and mapping quality ≥ 20 is chosen to produce variant candidates. Finally, to generate SNPs, if both haplotypes cover the alternate allele, it is defined as homozygous; if one haplotype covers the reference allele and the other haplotype covers the alternate allele, it is defined as heterozygous.

To generate small Indels and SVs, variant candidates from each haplotype are compared against each other to infer zygosity. To achieve that, heterozygous variants are defined if one haploid assembly contains alternate allele(s) and the other haploid assembly contains reference allele(s). Homozygous variants are defined if both haploid assemblies contain alternate allele(s). For compound indel/SV, we split them into two heterozygous variants. Check

“–all_regions_flag = 1” for “Aquila_assembly_based_variants_call.py” in GitHub to perform these analyses.

### Phasing inference (variation module)

The initial phased SNPs from the Haplotyping module provide the scaffold on which all other heterozygous variants that are discovered by the Variation module are phased (Supplementary Fig. 6). For example, consider the case of one assembled SNP in a phase block, “G|A” or “1|0” (where “A” is the reference allele, and “G” is the alternate allele), and the other neighboring assembled SNP in the same phase block, “C|T” or “0|1” (where “C” is the reference allele, and “T” is the alternate allele): these two phased SNPs have the same genotype and phase in a phase block from the haplotyping module. Therefore, Aquila places the SV into the haplotype that is in the same phase. This is done for all heterozygous variants discovered by the Variation module.

### Contig quality assessment

We used QUAST37 to generate various assembly metrics such as N50 and NA50. –extensive-mis-size 1000 was applied as the lower threshold of the relocation size.

### Aquila computation cost

Aquila’s four modules vary in their memory requirements and are flexible with respect to computing architecture, allowing jobs to be run in parallel on a cluster or serially in a large-memory machine. Generally the run time is considerably longer on the large memory machine. On a cluster, jobs are parallelized by chromosome or a combination of chromosomes that minimizes the number of nodes required. Modules 1–3 use 23, and module 4 uses 10 h of wall clock time, respectively, on a current-generation standard compute cluster (23 nodes with 128 Gb RAM each, 2 CPUs each, 10-cores per CPU, i.e. 240 cores). Check tables of computation cost in GitHub for more details.

### Aquila output

Aquila outputs an overall contig file “Aquila_contig.fasta”, and separately for each chromosome “Aquila_Contig_chr*.fasta”. After performing assembly-based variant calling in the Variation Module, one contig file for each haplotype is generated: “Aquila_Contig_chr*_hp1.fasta” and “Aquila_Contig_chr*_hp2.fasta”. For each contig, the header (for example “>36_PS39049620:39149620_hp1”) follows the format contig number (“36”), phase block start coordinate (“PS39049620”), phase block end coordinate (“:39149620”), and haplotype number (“hp1”). Within the same phase block, the haplotype number “hp1” and “hp2” are arbitrary for maternal and paternal haplotypes. For some contigs from large phase blocks, the headers are much longer and complex, for an instance, >56432_PS176969599:181582362_hp1_ merge177969599:178064599_hp1-177869599:177969599_hp1”. “56” denotes contig number, “176969599” denotes the start coordinate of the final big phase block, “181582362” denotes the end coordinate of the final big phase block, and “hp1” denotes the haplotype “1”. “177969599:178064599_hp1” and “177869599:177969599_hp1” mean that this contig is concatenated from minicontigs in small chunk (start coordinate: 177969599, end coordinate: 178064599, and haplotype: “1”) and small chunk (start coordinate: 177869599, end coordinate: 177969599, and haplotype: “1”).

Aquila outputs all raw contigs, even those <1 kb. For downstream analyses (e.g. to apply Merqury24 or to discover variants), it is necessary to filter out small contigs.

Each library of 10X linked-read sequencing data uses the same set of barcodes, which makes the combination of multiple libraries at the stage of raw fastq reads difficult. To combine multiple libraries, Aquila reconstructs the original fragments with their corresponding reads of each library separately, after which barcodes are not required any more. All the DNA fragments from multiple libraries, and their corresponding reads, are then combined to perform haplotyping and all subsequent steps. The molecule haplotyping algorithm is efficient and makes haplotyping fragments for multiple libraries fast. The algorithm complexity is linear with the number of libraries, ~NlibsO(Tonelib), where Nlibs = number of libraries and Tonelib = time for finishing assembly of one library.

### Validation and evaluation methods for variations

To validate the SNP calls from the assembled contigs, we used the GiaB benchmark call sets for both NA12878 and NA24385. They included 3,084,732 SNPs (10,210,585 homozygous and 1,874,147 heterozygous) for NA12878, and 3,076,552 SNPs (1,180,678 homozygous and 1,895,874 heterozygous) for NA24385. The benchmark allowed us to calculate both the sensitivity and the genotype accuracy of each SNP called from assemblies. We performed the same analysis on the small indel calls with the small indel callset 0.6 from GiaB (NA12878 ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/, NA24385 ftp://ftp trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/latest/ GRCh38/). For SV validation we applied svviz2 (https://github.com/nspies/svviz2) with PacBio reads (NA12878: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/ and NA24385: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/).

We classified SVs into three categories: Alu, Tandem Repeats, and Other. We selected SVs from 250 to 350 bp for classification as Alu elements by alignment to the AluY consensus sequence, and only SVs that matched at 80% or more were labeled as Alus. We used tandem repeats finder (trf38) to label repeats/non-repeats for all non-Alu-SVs, and the repeat percentage was calculated. Sequences that did not meet the Alu alignment criterion and did not contain a tandem repeat were labeled “Other”.

For assemblies of different libraries, the different assemblies could identify different SV candidates. To analyze the overlapping and unique SV calls of three libraries of the same individual, we also merged each SV from different libraries. We applied the same merging criteria when we called homozygous/heterozygous SV from two haplotypes. If the coordinates of two deletions overlapped we defined them to be the same deletion, and if the breakpoints of two insertions were within 20 bp, we defined them to be same insertion.

### Ancestral analysis

For each called SV, we extracted the left and right flanking 500 bp, and aligned them to the Orangutan and Chimpanzee references. For deletion calls, if the end coordinate of the left flanking sequence was within 2 bp of the start coordinate of the right flanking sequence, this SV was defined as “Ins_Ref” (insertion on reference—an actual insertion). If the distance between the end coordinate of the left flanking sequence and the start coordinate of the right flanking sequence was the approximate length of the called deletion (0.9*SV_size−1.1* SV_size), this SV was defined as “Del_Tar” (deletion on target—an actual deletion). For insertion calls, if the end coordinate of the left flanking sequence was within 2 bp of the start coordinate of the right flanking sequence, this SV was defined as “Ins_Tar” (insertion on target—an actual insertion). If the distance between the end coordinate of the left flanking sequence and the start coordinate of the right flanking sequence was the approximate length of the called deletion (0.9*SV_size−1.1* SV_size), this SV was defined as “Del_Ref” (deletion on reference—an actual deletion).

We performed multiple sequence alignments (for each SV and its aforementioned flanking regions annotated as “Ins_Ref”, “Del_Tar”, “Ins_Tar” and “Del_Ref”) with Muscle39.

### Comparison of genotype frequencies

We asked whether segregation of the SV alleles behaves as expected. For each locus, we have two chromosomes each from the sequenced individuals and one from the reference genome, for a total of 5, and therefore 18 possible combinations of ancestral and derived alleles among them (Supplementary Fig. S7a). We cannot observe loci in which all chromosomes are identical, leaving 16 patterns that contain one to four derived alleles. Population genetic principles state that derived alleles have systematically lower allele frequencies than ancestral ones. Indeed, for both insertions and deletions, the SV loci that have one derived allele (and four ancestral ones) are much more common than those that have two or more (Supplementary Fig. S7b). In 12 of the segregation patterns, genotypes of NA12878 and NA24385 differ. These can be arranged in six pairs where the individuals’ genotypes are equivalent, for example, one heterozygote and one ancestral homozygote, with the reference carrying a derived allele (rightmost green box in Supplementary Fig. S7a). Distinguishing insertions and deletions, this gives 12 classes (bottom of Fig. 3a), each of which has two SV counts that should be very similar to each other assuming that the two individuals do not come from very different populations. Indeed, the correlation between these equivalence classes is very high (R2 = 0.92, Supplementary Fig. S7c, d).

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.