# Aptardi predicts polyadenylation sites in sample-specific transcriptomes using high-throughput RNA sequencing and DNA sequence

Latest Research

### Aptardi design

The overall goal of aptardi is to accurately identify the polyA sites of expressed transcripts in a given biological sample. Specifically, aptardi analyzes the modified 3′-terminal exon (see the Transcript processing section below for details on 3′-terminal exon modification) of previously annotated transcripts and, using relevant RNA-Seq data and DNA sequence in a machine learning environment, identifies locations of expressed polyA sites in the region. Aptardi then annotates the 3′-termini to match these locations and outputs these transcript structures to the transcriptome (in GTF format) that can be easily incorporated into downstream analyses. Note that aptardi does not evaluate the intron chain structure of transcripts, i.e., it only examines the modified 3′ terminal exon of each transcript structure and alters the 3′ terminus location(s) accordingly. Also, note that aptardi outputs all original transcript structures from the original transcriptome in addition to transcripts identified through its analysis, i.e., the program only adds transcripts.

### Data sets

A total of five unique data sets, hereafter referred to as HBR, 2nd HBR, UHR, BNLx, and SHR, were subjected to aptardi’s machine learning pipeline. In addition to RNA-Seq measurements, each data set required DNA sequence, a transcriptome, and—since each was used to build a machine learning model—a “gold standard” data source providing locations of expressed, i.e., “true” polyA sites. HBR, 2nd HBR, and UHR are well-established RNA reference samples from the MAQC/SEQC consortium54 (see RNA-sequencing data sets for more details). BNLx and SHR represent two inbred rat strains: the congenic Brown Norway strain with polydactyly-luxate syndrome (BNLx/Cub) and the spontaneous hypertensive rat strain (SHR/OlaIpcv), respectively.

### DNA sequence data sets

For BNLx and SHR, strain-specific genomes were generated from the rn6/Rnor_6.0 version of the rat genome69 and are publicly available on the PhenoGen website. The human reference genome (hg38/GRCh38), accessed via the UCSC Genome Browser70, was utilized for the HBR, 2nd HBR, and UHR DNA sequence data sets.

### RNA-sequencing data sets

The HBR and 2nd HBR RNA-Seq data sets were derived from the Human Brain Reference (multiple brain regions of 12 donors, Ambion, p/n AM6050), and the UHR RNA-Seq data set was derived from the Universal Human Reference (10 pooled cancer lines, Stratagene, p/n 740000). Each of these data sets were accessed from the Sequence Read Archive (SRA) using the SRA Toolkit (v.2.8.2) as publicly available data (HBR55: Accession: PRJNA510978, SRA runs: SRR8360036-37; 2nd HBR and UHR56: Accession: PRJNA362835, 2nd HBR SRA runs: SRR5236425-30, UHR SRA runs: SRR5236455-60; BNLx and SHR71: Accession: GSE166117, BNLx: GSM5061950-52, SHR: GSM5061947-49). In brief, all libraries were generated with the TruSeq stranded (HBR, 2nd HBR, UHR) or unstranded (BNLx and SHR) mRNA sample preparation kit (Illumina), sequenced on a HiSeq2500 Instrument (Illumina), and sequencing results processed to FASTQ files. The HBR RNA-Seq data set originated from 1 µg RNA starting material, whereas 100 ng input was used for 2nd HBR and UHR RNA-Seq data sets. For more detailed descriptions on these publicly available data, see Palomares et al.55 (HBR) and Schuierer et al.56 (2nd HBR and UHR). For BNLx and SHR, RNA‐seq libraries prepared from the polyA+ fraction were constructed using the Illumina TruSeq RNA Sample Preparation kit from 1 μg of brain RNA in accordance with the manufacturer’s instructions. Four μL of a 1:100 dilution of either ERCC Spike-In Mix 1 or Mix 2 (ThermoFisher Scientific) were added to each extracted RNA sample. An Agilent Technologies Bioanalyzer 2100 (Agilent Technologies) was utilized to assess sequencing library quality. RNA samples from three biological replicates per strain were processed and sequenced71. All reads were paired-end but differed in read length (HBR, BNLx, and SHR: 2 × 100, 2nd HBR and UHR: 2 × 75). Individual FASTQ files were assessed for quality using FastQC (v.0.11.4, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and, if necessary, reads were trimmed with cutadapt72 (v.1.9.1). For the purpose of read coverage used by the aptardi algorithm, reads from technical replicates (HBR = 2, 2nd HBR = 3, UHR = 3) or biological replicates (BNLx = 3, SHR = 3) were concatenated and aligned to their respective genomes (see DNA sequence data sets section for more details) using HISAT273 (v.2.1.0) with the–rna-strandness (when appropriate) and–dta options specified as recommended for transcriptome assembly with StringTie22 (see Transcriptome data sets below for more details) and otherwise default arguments (see Supplementary Table 5 for alignment results). After alignment, SAMtools74 (v.1.9) was used to remove unmapped reads and convert the output to a sorted Binary Alignment Map (BAM) file required as input by aptardi.

### True polyadenylation sites data sets

True polyA sites (i.e., labels for machine learning) were taken from Derti et al.46 for all data sets. Namely, total RNA from the same UHR and HBR RNA reference samples, as well as brain total RNA from the Sprague Dawley rat (Zyagen, p/n RR-201), were subjected to PolyA-Seq analysis to identify the genomic locations of expressed polyA sites in each sample (for more information, see Derti et al.46). High-quality filtered polyA sites from each RNA sample were accessed using the UCSC Table Browser75, and liftOver76 (from the UCSC Genome Browser Group) was used to convert the genomic coordinates to the most recent human genome assembly (hg38/GRCh38) for the HBR and UHR samples, or rat genome assembly (rn6/Rnor_6.0) for the rat brain sample. PolyA sites identified in the HBR and UHR RNA reference samples were used for the corresponding HBR, 2nd HBR and UHR data sets, and those identified in the rat brain were used for both BNLx and SHR. Derti et al.46 uploaded technical replicates to UCSC Table Browser for each RNA sample; however, as polyA sites within 30 bases were clustered into the single site with greatest expression, and since this was done separately for each data set, we utilized only a single data set for each sample, i.e., technical replicates were not combined.

### Original transcriptome generation

StringTie22 (v.1.3.5) was used to reconstruct the transcriptome expressed in each data set from their RNA-Seq data, hereafter referred to as the original transcriptome. Ensembl59 (v.99) reference annotation from the respective species was provided to guide the StringTie reconstruction. We note that a user can simply use reference annotation directly, i.e., Ensembl annotation, in lieu of performing transcriptome assembly that takes into account expression, i.e., StringTie. If the RNA-Seq data were stranded, the read orientation was specified as an argument to StringTie. Transcript structures from scaffold chromosomes and unstranded contigs, if present, were removed.

### The data processing pipeline

#### Transcript processing

Using the original transcriptome, the 3′ terminal exons of transcripts were isolated. Each transcript’s 3′ terminal exon was extended 10,000 bases plus two times the bin size (i.e., 10,200 bases for the default 100 base bin size) similar to what has been done previously43,44. Extensions overlapping any neighboring transcripts (on the same strand) were shortened to remove the overlap. RNA-Seq coverage at single-nucleotide resolution was obtained via bedtools genomecov (BEDtools77; v2.29.2) and, similar to the criteria employed by Ye et al.44 and Miura et al.78, used to refine each of these 3′ terminal exons, hereafter referred to as modified 3′ terminal exons (see Supplementary Methods for more details.) The refinement step either shortened the extended 3′ terminal exon or kept it the same length to give the modified 3′ terminal exon.

#### Feature extraction

Features were engineered in 100 base increments along the modified 3′ terminal exon, referred to hereafter as bins. For each bin, a total of 27 features were engineered and can be broadly classified as being derived from DNA sequence or RNA-Seq data. In both cases, information from the local environment, i.e., the 100 bases upstream and downstream the bin, as well as the bin itself, (300 bases total) was used.

#### DNA sequence features

The choice of DNA sequence features was made through a combination of an exhaustive literature review25,50,79,80,81,82,83,84,85,86 and evaluation of other algorithms that use DNA sequence to predict polyA sites52,53,87,88. Perhaps the most well-known indicator of polyadenylation is the polyadenylation signal (PAS), a conserved hexamer located ~10–35 nucleotides upstream the polyA site. Overrepresented sequences of DNA, or DNA sequence elements, also influence polyadenylation, and the location of these sequences are often described relative to the PAS. As such, DNA sequence features were engineered by first identifying the presence of several known PAS’s. Specifically, for each bin, a six base sliding window scanned a predefined region to detect the presence or absence (binary indicator of 1/−1) of (1) the canonical PAS (AATAAA), (2) its major variant (ATTAAA), (3) a second common variant (AGTAAA), and (4) any one of nine other minor variants (AAGAAA, AAAAAG, AATACA, TATAAA, GATAAA, AATATA, CATAAA, AATAGA25,50,79,80) for four total PAS features. Subsequently, regions relative to the PAS (if present, otherwise predefined regions relative to the current bin) were likewise scanned using a sliding window approach to determine frequency of the following known DNA sequence elements: (1) a G-rich region downstream the PAS, (2) a downstream region near the PAS enriched in TTT, (3) a downstream region near the PAS enriched in GT/TG, and (4) a downstream region near the PAS enriched in GTGT/TGTG, (5) a T-rich region immediately downstream of the PAS, (6) a T-rich region upstream the PAS, (7) a TGTA/TATA-rich region upstream the PAS, and (8) a AT-rich region upstream and downstream the PAS25,50,81,82,83,84,85,86 for an additional eight features (12 DNA sequence features total). If the frequency of the given DNA sequence element was above an enrichment threshold, the feature was encoded 1, otherwise −1. (See Supplementary Methods for more details.)

#### RNA-sequencing features

From the RNA-Seq data, coverage at single-nucleotide resolution was determined using BEDtools77 (v2.29.2). The approach for designing RNA-Seq features was to exploit localized fluctuations in RNA read coverage similar to that implemented by tools designed for APA-specific analysis from RNA-Seq data43,44,45. Intuitively, upstream but in close proximity to the end of a transcript, coverage is expected to begin to decrease gradually until its end. As a result, changes in expression were utilized when designing RNA-Seq features in two scenarios: (1) intra- and (2) inter-bin. In both cases, three regions were defined: an upstream region, a middle region, and a downstream region (Supplementary Fig. 4). Changes in expression between these regions were quantified using various mathematical combinations of coverage values in each region to generate 14 unique features (see Supplementary Methods for more details). To account for local variability in RNA-Seq coverage, median coverage values in each region were used.

A final feature was derived from the original transcriptome. If the 3′ base of any annotated transcript from the original reconstruction was located within a bin, this feature was encoded 1, otherwise −1. Supplementary Fig. 5 summarizes the data processing pipeline prior to machine learning.

### Building aptardi

The machine learning task is two class classification (polyA site or no polyA site) of each 100 base bin. Supervised learning was used where labels for training were provided from the polyA sites data sets (see PolyA sites data sets for more details). A bidirectional long short term memory recurrent neural network (biLSTM)89,90,91,92 was implemented using the Keras (v.2.3.1) wrapper for TensorFlow (v.2.0.0). This machine learning paradigm was chosen because of its design to analyze sequential data, i.e., it takes into account all the 100 base bins of a given transcript when learning model parameters for each individual bin. Each direction of the biLSTM consisted of 20 nodes (40 total), and this layer was followed by a fully connected dense layer with a sigmoid activation function that outputs a probability value. Of note, the biLSTM outperformed traditional classifiers such as Random Forest (RF) and Support Vector Machine (SVM) models (biLSTM: AP = 0.58, F-measure = 0.52; RF: AP = 0.42, F-measure = 0.39; SVM: AP = 0.37, F-measure = 0.33; numbers shown are on the testing set using the HBR data set).

### Training aptardi

To prevent duplicate bins, overlapping modified 3′ terminal exons were merged prior to training. In addition, all merged modified 3′ terminal exons were masked to a length of 300 bins (30,000 bases total) to generate equal lengths, which is required for the sequential model. A total of 778,166 bins were present, of which ﻿42,977 possessed a polyA site out of 94,322 polyA sites annotated by the PolyA-Seq data (for the HBR data set). Merged modified 3′ terminal exons were split into 60/20/20 training, validation, testing sets, respectively. Quantitative measures were standardized using the training set, and the training set was used to build the model in 25 epochs. Owing to the high imbalance of the data, class weights were used during training. Model weights were optimized using a binary cross entropy loss function and Adam93 optimizer. Precision and recall metrics on the training and validation sets were monitored during training to prevent overfitting, and the model that produced the minimum loss was kept. For evaluation purposes (see Results), individual prediction models were generated from each of the five data sets.

### Evaluating aptardi

Precision, recall, and F-measure at the default probability threshold (0.5) were used to evaluate model performance defined as follows:

$$P = frac{{T_p}}{{T_p + F_p}}$$

(1)

$$R = frac{{T_p}}{{T_p + F_n}}$$

(2)

$$F = 2 ast frac{{(P ,*, R)}}{{(P + R)}}$$

(3)

where Tp = true positive, Fp = false positive, Fn = false negative, P = precision, R = recall, and F = F-measure

To generalize model performance over the range of probability thresholds, AP was used in place of the receiver operating curve owing to the highly imbalanced nature of the data94 (far fewer bins with polyA sites than bins without a polyA site):

$${mathrm{AP}} = mathop {sum }limits_n (R_n – R_{n – 1})P_n$$

(4)

where AP = average precision and (R_n) and (P_n) are the precision and recall at the nth threshold, respectively.

### Integrating aptardi results with the original transcriptome

For 100 base bins where a polyA site is predicted, transcript structures are annotated to the 3′ most base position unless either (1) the input transcript’s stop site is already in the region or (2) the 3′ most base position is within 100 bases of the input transcript’s stop site. Any aptardi transcript structures were added to the original transcriptome, and this aptardi modified transcriptome was outputted as a GTF file.

### Choice of bin size

To assess the impact of bin size, 25, 50, and 150 base bins were used to train the aptardi prediction model on the HBR data set in an otherwise identical manner to the original 100 base bin. The AP on the testing set was 0.41, 0.51, and 0.61 for the 25, 50, and 150 base bins, respectively, compared with 0.58 for the original 100 base bin. In addition, the F-measure on the testing set was 0.35, 0.46, and 0.52, respectively, compared with 0.52 for the for the original 100 base bin. Based on these results, 100 base bins were utilized to build the pre-existing aptardi prediction model. Higher resolution bin sizes may be appropriate depending on the data set (e.g., RNA-Seq library preparation), species, etc. and bin size can be specified as a parameter when using the aptardi pipeline.

### Software

A user has the option of using the pre-existing aptardi prediction model or building a prediction model (if a reliable true polyA sites data set is available). The pre-built model, i.e., the aptardi prediction model, provided on the aptardi GitHub repository (https://github.com/luskry/aptardi) was generated from the HBR data set55. Several other algorithm options are available.

### TAPAS analysis

TAPAS (Tool for Alternative Polyadenylation site AnalysiS) predicts the locations of polyA sites from RNA-Seq and reference annotation (genome or transcriptome)45. Its performance was evaluated on HBR, specifically using the HBR RNA-Seq data and StringTie/Ensembl original transcriptome. As TAPAS makes predictions on noncoding sequence coordinates of transcript models (i.e., 3′-untranslated regions), the 3′ terminal exon of each transcript was provided as the noncoding region, and default arguments were used. The TAPAS polyA site prediction program (APA_sites_detection) was used here and accessed via its GitHub repository (https://github.com/arefeen/TAPAS).

### APARENT analysis

APARENT (APA REgression NeT) predicts the locations of polyA sites from DNA sequence60. Its performance was evaluated using the human reference genome (hg38/GRCh38) and transcript structures generated from the HBR StringTie/Ensembl original transcriptome. Specifically, similar to that for aptardi, the 3′ terminal exon of each of the 113,923 transcript models in the original transcriptome were extracted. Since APARENT requires DNA sequences to be ≥205 nucleotides and ≤10,000 nucleotides, 20,658 3′ terminal exons were removed from the analysis. For the remaining 93,265 3′ terminal exons, the DNA sequence was extracted using BEDtools (v2.29.2). For 3′ terminal exons on the negative strand, the reverse complement sequence was used. The APARENT model (aparent_large_lessdropout_all_libs_no_sampleweights.h5) from its GitHub repository (https://github.com/johli/aparent) was used to predict the locations of polyA sites using default parameters.

### CFIm25 knockdown analysis

RNA-Seq from HeLa cells and RNA-Seq after RNA interference on HeLa cells was used to generate the control RNA-Seq data set and the treatment CFIm25 knockdown RNA-Seq data set that induces APA switching, respectively. The RNA-Seq data sets were accessed from SRA as publicly available data (Accession: PRJNA182153, control SRA run: SRR1238549, CFIm25 knockdown SRA run: SRR1238551). The RNA-Seq library preparation was unstranded, and 100 base paired-end reads were sequenced on an Illumina HiSeq 2000 instrument (see Masamha et al.61 for more details). Reads were processed in a manner identical to all other data sets to produce a sorted BAM file (see Supplementary Table 6 for alignment results). An original transcriptome using StringTie/Ensembl (without the read orientation argument) and an aptardi modified transcriptome were generated for each RNA-Seq data set (four total). The aptardi prediction model produced using the HBR data set was used to generate the aptardi modified transcriptomes.

### Mouse tissue analysis

RNA-Seq from mouse brain and liver tissues were taken from Li et al.64 and accessed from SRA as publicly available data (Accession: PRJNA375882, brain SRA runs: SRR5273637 and SRR5273673, liver SRA runs: SRR5273636 and SRR5273672). The two SRA runs per tissue represent technical replicates from one biological sample—namely 6-week old female C57BL/6JJcl mice. In brief, libraries were constructed using polyA selection and the Illumina TruSeq RNA-Seq library protocol and sequenced using an Illumina HiScan platform to generate 100 base, paired end, and unstranded reads. Reads were trimmed for quality and adapter content with Trimmomatic95 (v.0.39), and the two technical replicates per tissue were aligned together to the mm10/GRCm38 genome using HISAT2 (v.2.1.0; see Supplementary Table 3 for alignment results). An original transcriptome was generated using StringTie (v.1.3.5) and the mouse Ensembl (v.102) annotation as a guide and otherwise default settings. BALB/c mouse brain and liver PolyA-Seq data from Derti et al. were used to define the tissue-specific true polyA sites. Namely, high-quality filtered true polyA sites were accessed using the UCSC Table Browser, and liftOver (from the UCSC Genome Browser Group) was used to convert the genomic coordinates to the most recent mouse assembly. Tissue-specific polyA sites in brain and liver were defined as polyA sites in the given tissue PolyA-Seq data set not within 100 bases of any polyA site in the other tissue PolyA-Seq data set. The mouse reference genome (mm10/GRCm38), accessed via the UCSC Genome Browser was used for DNA sequence for both tissues. Comparisons between the polyA sites from the given PolyA-Seq data and transcriptome were done by defining that a transcript annotated a polyA site if its 3′ terminus was within 100 bases of the site.

### Rat differential expression analysis

To generate a single transcriptome representing both rat strains, their genome-aligned RNA-Seq data were merged using SAMtools74 (v.1.9) followed by the production of an original transcriptome using the merged RNA-Seq data set and StringTie/Ensembl (without the read orientation argument). The aptardi modified transcriptome was produced using the original transcriptome as input, along with the merged RNA-Seq data, the rn6/Rnor_6 DNA sequence (accessed via the UCSC Genome Browser70), and the aptardi prediction model built from HBR. RSEM96 (v.1.2.31) was used to estimate the abundances of the isoforms identified within each transcriptome (the original transcriptome and aptardi modified transcriptome). Prior to quantitation, transcripts from scaffold chromosomes and unstranded contigs were removed from both transcriptomes. Isoform level expression estimates were determined for each biological sample (BNLx = 3, SHR = 3). Isoforms without at least 50 counts in two of the three biological replicates for at least one strain were removed, and differential expression between the two strains (with BNLx as reference) were evaluated using DESeq297 (v.1.28.0) for the remaining set of isoforms in each transcriptome (Supplementary Fig. 6 summarizes these analysis steps). A significance threshold of 0.001 was applied to the unadjusted p values to allow for comparisons across the two data sets (original transcriptome and aptardi modified transcriptome) that differ in the number of transcripts tested.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.