Prediction of PCR amplification from primer and template sequences using recurrent neural network

PCR is one of the basic technologies commonly used for genetic and pathogen-detection testing25,26. Because of its declining cost, sequencing of DNA or RNA amplified by PCR has increased considerably27,28,29. Furthermore, with the development of applied technologies such as real-time PCR and droplet PCR, the application range of PCR has expanded even further30,31,32,33. When PCR is used to detect pathogens, specific detection is required. However, specificity can be compromised by contaminating base sequences in processed samples, and such cases are likely to increase unless the problem is addressed.

One of the strengths of PCR is that once a DNA sequence is known, a highly sensitive test or method34 can be developed. This can be applied to various test targets within a very short period, so highly sensitive detection becomes possible in less time than is needed to develop an immunological test. The main disadvantage of PCR is that when the sample contains sequences similar to the target, non-specific bands may be generated35. This can happen, for example, when bacteria are distinguished by targeting a specific sequence in their ribosomal RNA. In this case, it is difficult to design primers that enable specific detection because 16S ribosomal RNAs have base sequences similar to each other36,37,38,39. Thus, a test must work in the presence of similar nucleotide sequences, for example when a specific pathogen is to be detected in a sample in which many other species are mixed.

The major parts of PCR primer design technology were largely complete by the 1990s40. Primer design is based on the stability of the hydrogen bonds between the primer and the template, which depends on the nucleotide sequence, and on PCR experiments conducted to examine this stability. The hydrogen-bond stability can be predicted from the free energy calculated from enthalpy, entropy, and absolute temperature41. Early basic experiments42 showed that a single base at the 3′ end greatly affects the PCR reaction, and primers are therefore designed with particular attention to the several bases at the 3′ end. Software was also developed for checking the suitability of primers and for designing primers by extracting suitable base sequences from the target sequence5,6,43. Among these tools, Primer3 in particular has a very long track record. Primers designed with Primer3 can amplify target DNA with a success rate of 80% to 90%. However, even if a conventional primer design algorithm can design the primer most likely to amplify the target template, it does not predict amplification of template DNA other than the intended one in the sample. In our preliminary experiment, several Primer3-designed primer pairs amplified all 16S rRNA templates regardless of the target DNA for which the primers were designed. Therefore, to design a primer pair that amplifies only the target template in the presence of similar sequences, a method different from conventional optimum design is needed.
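The free-energy relation mentioned in this paragraph is the standard thermodynamic identity ΔG = ΔH − TΔS. The following is a minimal sketch of that calculation only. The function name, the unit conventions, and the example values are assumptions made for illustration and are not taken from the paper.

```python
def gibbs_free_energy(delta_h, delta_s, temp_celsius=37.0):
    """Return dG in kcal/mol.

    delta_h: enthalpy change in kcal/mol
    delta_s: entropy change in cal/(mol*K)
    temp_celsius: temperature in degrees Celsius
    """
    temp_kelvin = temp_celsius + 273.15
    # Convert the entropy term from cal to kcal before subtracting
    return delta_h - temp_kelvin * delta_s / 1000.0


# Hypothetical duplex values, chosen only to show the arithmetic
dg = gibbs_free_energy(delta_h=-100.0, delta_s=-270.0, temp_celsius=37.0)
print(round(dg, 2))  # about -16.26 kcal/mol
```

A more negative ΔG corresponds to a more stable primer-template duplex, which is why the value changes with the annealing temperature.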

In PCR primer design, it is difficult to compare primer sets with each other when the method only selects the optimum set. When the optimum primer is selected with Primer3 and similar tools, 30 or more indexes are calculated, but no formula is provided that uniformly relates those indexes to the actual PCR result7. The various DNAs in the PCR tube, including templates and primers, and the PCR reaction conditions are expected to contribute to the PCR result in different proportions under each condition. For example, the 3′ end of the primer is known to have a very large effect on PCR with just a few bases. DOP-PCR and similar arbitrary-primer methods are likewise known to amplify a wide range of DNA nonspecifically by matching only several bases44. Experiments on artificial gene synthesis from oligomers have also suggested that primers are easily elongated when they form dimers21. Not only the binding position of the primer but also the base sequence of the PCR target region may have an influence, depending on the annealing temperature. The base sequences of the primers and templates themselves also affect the results through factors other than mere stability. Thus, to design a primer that amplifies only a specific template and not similar template sequences, it is necessary to consider an unknown number of factors without any information about their contributions.

In recent years, supervised machine learning14 has been developed as a method of making predictions without determining the number or combination of factors that contribute to the results. In this method, after data with correct answers are prepared, a large number of perceptrons are connected into a network, and the connections are adjusted iteratively to obtain the network with the highest accuracy rate14,15,16,17. Since the substance of the prediction is the set of coefficients of the perceptrons and their network, it is not necessary to analyze the contributing factors in order to increase the accuracy rate. Conversely, analyzing a trained model often does not reveal those factors. Based on the discussion in the previous paragraph, supervised machine learning was expected to be suitable for predicting the success or failure of PCR, because it does not require the number or combination of contributing factors to be specified.

In this study, PCR results were predicted from the base sequences of primers and templates using natural language processing, which examines trends in text. The PCR reaction is affected not by the base sequence of the primer or the template alone, but by the combination of bases when they form a complementary strand. Therefore, we decided to generate a code from the combinations of bases in the complementary strands formed by the PCR primer pair and the template. The generated code was split into words so that a sentence was formed from each set of primers and template. Since a sentence can be created for each primer set and template, the PCR experimental result, where available, can be linked to each sentence as its correct answer. In natural language processing, a machine-learning network learns sentences whose evaluation is known, and the trained network then predicts the evaluation of unseen sentences. In a typical application of RNNs to natural language processing, an RNN is trained on movie reviews with positive and negative evaluations and then predicts the evaluation of unseen reviews6,45,46. By generating pseudo-sentences with the primer set and template as a unit, as proposed in this paper, PCR results can be associated with each pseudo-sentence in the same way as positive/negative labels in movie reviews. Because pseudo-words generated from the complementary set alone could not reflect that complementarity at the 3′ end matters more than at the 5′ end, this was emphasized by repeating words. For the learning of pseudo-sentences in this study, the same type of RNN as that used for evaluating movie reviews was therefore used. This is the first paper to use a neural network to design primers and predict PCR results. Supervised machine learning was used to learn the PCR results.
Since we created pseudo-words and pseudo-sentences as input, we selected RNNs to learn the relationship between primer and template sequences and PCR results. RNNs can interpret sentences while analyzing the context of the words in them. In this study, in a test experiment in which new primers were actually designed, predictions were made with an accuracy of 70% or more (Table 5). These results suggest that primer-template interactions can be interpreted effectively when the RNN's interpreted data are returned to the previous layer and used for further interpretation. They also suggest that the effect of primer-template interaction on PCR resembles the effect of word placement on semantic interpretation in natural language. The LSTM used the context of each word in the sentence to adjust how long that word’s effect was retained and to make a comprehensive judgment47,48.
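The idea of turning a primer-template alignment into a labeled pseudo-sentence can be sketched as follows. This is a simplified illustration, not the coding scheme of the paper: the four letters, the overlapping 5-letter windows, and the repeat count for the 3′-end word are all assumptions, and the real letter assignment (Fig. 2) is not reproduced here.

```python
# Hypothetical letter assignment for primer/template base pairs
STRONG = {("G", "C"), ("C", "G")}   # G-C Watson-Crick pairs
WEAK = {("A", "T"), ("T", "A")}     # A-T Watson-Crick pairs
WOBBLE = {("G", "T"), ("T", "G")}   # G-T wobble pairs


def pair_letter(primer_base, facing_base):
    """Map one primer base and the template base facing it to a letter."""
    pair = (primer_base, facing_base)
    if pair in STRONG:
        return "a"
    if pair in WEAK:
        return "b"
    if pair in WOBBLE:
        return "c"
    return "d"  # any other combination is treated as a mismatch


def pseudo_sentence(primer, facing, word_length=5, three_prime_repeats=2):
    """Build a list of pseudo-words from an aligned primer (5'->3') and
    the template bases facing it, repeating the 3'-end word for emphasis."""
    code = "".join(pair_letter(p, t) for p, t in zip(primer, facing))
    words = [code[i:i + word_length]
             for i in range(len(code) - word_length + 1)]
    if not words:
        return []
    return words + [words[-1]] * three_prime_repeats


# Each sentence is paired with its PCR result (1 = band observed, 0 = none);
# the labels here are made up for the example
dataset = [
    (pseudo_sentence("ACGTGA", "TGCACT"), 1),  # fully complementary
    (pseudo_sentence("ACGTGA", "TGCACC"), 0),  # mismatch at the 3' end
]
print(dataset[0][0])  # ['baaba', 'aabab', 'aabab', 'aabab']
```

Labeled sentences of this form are the kind of input a sentiment-style RNN classifier is trained on, with the PCR result playing the role of the positive or negative review label.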

We created our own pseudo-words for RNN analysis in this study (Fig. 2). All of the letters that make up a word were determined from the primer-template interactions shown to be important in previous studies (Fig. 2E). Natural language processing by RNN uses all the words of a specific language, so the vocabulary is about 30,000 to 100,000 words (RNN literature). In this study, the data set was as small as 2,000 entries, so a small vocabulary was necessary. Therefore, the original 16 base combinations were grouped into 5 categories based on their effect on DNA synthesis by Taq polymerase. However, considering that the primers face each other in opposite directions during PCR, the direction of homology was reflected in the letters. In addition, different character sets were prepared for dimers and hairpins. Uppercase and lowercase letters were also used to distinguish the position evaluated as the starting point of PCR from other parts. As a result, the vocabulary of the 5-letter pseudo-word (pentacode) code was 5^5 × 5 × 2 = 31,250. In the RNN, the characteristics of each sentence are expressed by the counts of the words used (word vector), with the vocabulary size as the number of dimensions. If the vocabulary is large, each word occurs infrequently, so the word vector becomes sparse and may not sufficiently show the characteristics of the sentence. On the other hand, when the vocabulary is small, detailed features may not be expressed, which suggests that the prediction accuracy is limited. In the method of this study, the number of characters in a word was shortened to 5 as another way to reduce the size of the vocabulary. Extending this to 6 or 7 bases would increase the vocabulary and may enable more accurate predictions. In the future, this code setting method can probably be improved by accumulating more data.
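The vocabulary arithmetic and the sparsity argument in this paragraph can be checked with a short sketch. The factors 5^5, 5, and 2 come from the text; reading the last two as "character sets" and "letter case" is our assumption, as are the letters `abcde` and the example sentence.

```python
from collections import Counter
from itertools import product


def vocabulary_size(n_letters=5, word_length=5, n_character_sets=5, n_cases=2):
    """Number of distinct pseudo-words: n_letters**word_length x sets x cases."""
    return n_letters ** word_length * n_character_sets * n_cases


def word_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence)
    return [counts[word] for word in vocabulary]


print(vocabulary_size())               # 31250
print(vocabulary_size(word_length=6))  # 156250, for 6-letter words

# One character set and one case only: 5**5 = 3125 possible words
vocab = ["".join(p) for p in product("abcde", repeat=5)]
vec = word_vector(["baaba", "aabab", "aabab"], vocab)
nonzero = sum(1 for count in vec if count)
print(len(vec), nonzero)               # 3125 2
```

Even in this reduced vocabulary, a short sentence fills only 2 of 3,125 dimensions, which illustrates why a larger vocabulary needs more data before the word vectors become informative.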

In this study, pseudo-words were created based on primer hairpins, dimers, primer-template homology, and primer-PCR product homology. Among these, predicting the priming position is expected to be particularly important, because PCR depends on the elongation of DNA from the priming position (Fig. 1). When an optimum primer is designed in the conventional way, the binding position of the primer has a longer complementary region and higher stability than other positions. However, when a primer is compared with a template for which it was not originally designed, the priming position must be determined from a large number of candidates with similar complementary-strand length and stability. The effect of the priming position was conveyed by expressing the priming position in capital letters. The accuracy of this priming position affects the accuracy of the overall prediction. The priming position depends not only on the complementarity between the primer and the template but also on its relationship with the amplified sequence and with the other primer of the set (the reverse primer for the forward primer, and the forward primer for the reverse primer). Therefore, it would ideally be desirable to learn and predict this priming position by artificial intelligence. However, since the basic data were not available in this study, the stability of the complementary strand was predicted by the nearest neighbor method, and the priming position that maximizes stability was chosen from the set of candidate priming positions. For the prediction of free energy by the nearest neighbor method, values extrapolated from the reported values were used in addition to the reported values themselves. Since some of these numbers are simple extrapolations from the reported numbers, their accuracy is not yet guaranteed, and future improvements are still needed.
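Choosing the priming position that maximizes stability can be sketched as a sliding-window search. This is a minimal illustration under stated assumptions: the stacking energies are placeholders rather than published nearest neighbor parameters, only ungapped alignments are scanned, and the template is given as the strand with the same sense as the primer, so a position counts as paired when the letters are identical.

```python
def placeholder_stack_energy(base1, base2):
    """Hypothetical energy (kcal/mol) for two adjacent paired positions.
    Placeholder values only: more G/C content gives a more negative energy."""
    gc_count = sum(base in "GC" for base in (base1, base2))
    return -1.0 - 0.5 * gc_count


def window_energy(primer, window):
    """Sum stacking energies over adjacent positions that are both paired."""
    total = 0.0
    for i in range(len(primer) - 1):
        if primer[i] == window[i] and primer[i + 1] == window[i + 1]:
            total += placeholder_stack_energy(primer[i], primer[i + 1])
    return total


def best_priming_position(primer, template):
    """Return (position, energy) of the most stable ungapped alignment,
    or (None, 0.0) if the template is shorter than the primer."""
    best_pos, best_dg = None, 0.0
    for pos in range(len(template) - len(primer) + 1):
        dg = window_energy(primer, template[pos:pos + len(primer)])
        if best_pos is None or dg < best_dg:
            best_pos, best_dg = pos, dg
    return best_pos, best_dg


# The template holds a perfect site at index 2 and a partial site at index 8
print(best_priming_position("ACGT", "TTACGTAAACGA"))  # (2, -5.0)
```

With real nearest neighbor parameters the same search applies, but the ranking of candidate positions can change when several sites have similar stability, which is the difficulty described above.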

The prediction accuracy of the RNN improved as epochs were repeated (Fig. 4). When all the data were used, the prediction stabilized at about 25 epochs, with no significant change thereafter. When the numbers of PCR-positive and PCR-negative data were matched by undersampling, the error remained large up to about 75 epochs, and the prediction accuracy stabilized only after a transition period lasting up to 100 epochs. This indicates that the structure of the data affects the learning steps of the RNN. When the amount or composition of the data is changed in the future, we propose first investigating how the prediction accuracy changes with the number of epochs.
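Matching the numbers of PCR-positive and PCR-negative data by undersampling can be sketched as below. The function name, the `(sentence, label)` tuple layout, and the fixed random seed are assumptions for illustration; the paper does not state how its sampling was implemented.

```python
import random


def undersample(samples, seed=0):
    """Balance a list of (sentence, label) pairs, where label 1 means
    PCR positive and 0 means PCR negative, by randomly discarding
    entries from the larger class."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)  # avoid presenting one class first during training
    return balanced


# Toy data: 3 positive and 7 negative entries
toy = [(["word%d" % i], 1) for i in range(3)] + \
      [(["word%d" % i], 0) for i in range(3, 10)]
balanced = undersample(toy)
print(len(balanced))  # 6
```

Because undersampling shrinks and reshapes the training set, the number of epochs needed for convergence should be re-examined whenever it is applied.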

The PCR results used in this study include those that were greatly influenced by the primer pair (Table 4). For 12 of the 72 primer pairs, PCR was observed in 20 or more of the 31 templates. For 22 primer pairs, PCR was observed in only 1 template. No PCR was observed for 10 primer pairs. Across these primer pair-template combinations, the predictions were relatively accurate when only one of the templates was amplified or when none of the templates was amplified (Table 7A). This suggests that prediction was successful for primers with high specificity and that the RNN also made highly accurate predictions for primer sets that rarely gave PCR products. On the other hand, for the primer pairs with which PCR was observed in a large number of templates, the predictions were relatively inaccurate, suggesting that it was difficult for the RNN to predict results for such primers, with which false positives frequently appear. The relationship between primer binding to the template and prediction is shown in a scatter plot of the Gibbs energy at the optimal binding position of each primer (Fig. 5). In this scatter plot, the primer and template sets for which the primers were specifically designed appear in the lower left area, and the primer pair and template sets for which PCR was not intended appear in the upper right region (Fig. 5A). Surprisingly, correct predictions were not confined to the lower left region but occurred to the same extent in the upper right (Fig. 5B,C). This tendency was the same with undersampling, suggesting that the improvement in prediction accuracy for PCR-positive results was driven by improvement in the accuracy rate in the upper right region. For PCR-negative predictions, it is noteworthy that the RNN predicted correctly for multiple PCR-negative sets in the lower left region of the scatter plot created from the predictions with all the data.
These results show that the RNN described in this study does not have high accuracy at present, but the prediction accuracy is expected to improve as the amount of data is increased and the pseudo-words are revised in the future.

With RNNs, it is difficult to identify which pseudo-words and repetitions have a large influence on the outcome of supervised machine learning. A correct prediction does not guarantee that settings such as the pseudo-words are correct. Nevertheless, researchers may find this paper useful for reconstructing the prediction method. Pseudo-word generation and pseudo-sentence prediction do not provide a theoretical justification of the algorithm based on a unified theory, but they can provide the user with data-based reproducibility.

In conclusion, the results indicate that PCR design by a natural language processing system using an RNN can be used to design primers that detect a specific template in the presence of multiple templates. The accuracy of the method is improved by learning the base sequences of the primer pair and the template together with the PCR result. The design can be further improved by using negative data that would otherwise be discarded.
