Dihydrouridine (D) is one of the most abundant post-transcriptional uridine modifications. It is found in tRNA, mRNA and snoRNA and is closely associated with disease pathogenesis and various biological processes in eukaryotes. Identifying D modification sites is important for understanding the modification mechanisms and/or epigenetic regulation. However, experimental detection of D sites is time-consuming and expensive, so computational methods have been developed to identify D sites in genome-wide datasets. Existing methods still have limitations, and their prediction performance needs to be improved. In this work, we developed Stack-DHUpred, a new computational predictor for accurately identifying D sites. Briefly, we trained 66 baseline models by combining six machine learning classifiers with eleven feature encoding methods, each emphasizing distinct sequence attributes. The optimal baseline models were then selected to construct the final stacked model. Stack-DHUpred outperformed the existing predictors not only on the training dataset but also on an independent dataset, indicating that the stacking approach significantly improved the prediction performance. Stack-DHUpred is publicly available as a web server (http://kurata35.bio.kyutech.ac.jp/Stack-DHUpred) and as a standalone program (https://github.com/kuratahiroyuki/Stack-DHUpred). We believe that Stack-DHUpred will be a valuable tool for accelerating the discovery of D modifications and understanding their role in post-transcriptional regulation.
The proposed method and strategy may also be helpful for predicting other DNA/RNA modifications from sequence and may provide a foundation for disease control and drug design research targeting D modification sites.
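The stacking strategy summarized above can be illustrated with a minimal, standard-library-only sketch. It is not the Stack-DHUpred implementation: the two encodings (`one_hot`, `composition`), the toy learner `CentroidScorer`, the two-fold split, the placeholder classifier and encoding names, and the synthetic sequences are all assumptions made for illustration. Only the counts (six classifiers, eleven encodings, 66 baseline models) come from the text.

```python
from itertools import product

# Placeholder names (hypothetical): the text gives only the counts, 6 and 11.
CLASSIFIERS = [f"classifier_{i}" for i in range(1, 7)]
ENCODINGS = [f"encoding_{j}" for j in range(1, 12)]
baseline_grid = list(product(CLASSIFIERS, ENCODINGS))  # one baseline model per pair

NUCLEOTIDES = "ACGU"

def one_hot(seq):
    """Illustrative encoding: four indicator values per sequence position."""
    return [1.0 if base == n else 0.0 for base in seq for n in NUCLEOTIDES]

def composition(seq):
    """Illustrative encoding: relative frequency of A, C, G and U."""
    return [seq.count(n) / len(seq) for n in NUCLEOTIDES]

class CentroidScorer:
    """Toy learner (assumption): scores a vector by relative distance to class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[0]))
        d1 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[1]))
        # Close to the positive centroid (small d1) gives a score near 1.
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

def out_of_fold_scores(encode, seqs, labels, n_folds=2):
    """Score each sequence with a model that never saw it during training."""
    X = [encode(s) for s in seqs]
    scores = [0.0] * len(seqs)
    for fold in range(n_folds):
        train = [i for i in range(len(seqs)) if i % n_folds != fold]
        held_out = [i for i in range(len(seqs)) if i % n_folds == fold]
        model = CentroidScorer().fit([X[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            scores[i] = model.predict_proba(X[i])
    return scores

# Synthetic 9-nt windows centred on U; labels carry no biological meaning.
seqs = ["GGCAUGGCA", "GGGAUGCCA", "AAUAUUAUA", "UAAUUAAUA",
        "GCGGUGGCC", "CGGCUCGGA", "AUUAUAAUU", "UUAAUAUAA"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# Level 0: one baseline model per encoding (the paper trains 66; here only 2).
encoders = [one_hot, composition]
oof = [out_of_fold_scores(enc, seqs, labels) for enc in encoders]

# Level 1: out-of-fold scores become the features of the meta-model.
meta_X = [list(row) for row in zip(*oof)]
meta_model = CentroidScorer().fit(meta_X, labels)

# Refit the baseline models on all data for scoring new sequences.
base_models = [CentroidScorer().fit([enc(s) for s in seqs], labels) for enc in encoders]

def predict(seq):
    """Stacked score in [0, 1] for a new sequence window."""
    meta_x = [m.predict_proba(enc(seq)) for enc, m in zip(encoders, base_models)]
    return meta_model.predict_proba(meta_x)
```

Calling `predict("GGCAUGCCA")` returns a stacked score between 0 and 1. The key design point is that the meta-model is trained on out-of-fold scores, so it does not learn from predictions that the baseline models made on their own training samples.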