With the exponential growth of biomedical literature, leveraging Large Language Models (LLMs) for automated medical knowledge understanding has become increasingly critical for advancing precision medicine. However, current approaches face significant challenges in reliability, verifiability, and scalability when extracting complex biological relationships from scientific literature using LLMs. To overcome these obstacles to LLM-based biomedical literature understanding, we propose LORE, a novel unsupervised two-stage reading methodology that uses an LLM to model literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. LORE captured essential gene pathogenicity information when applied to PubMed abstracts for large-scale understanding of disease–gene relationships. We demonstrated that modeling a latent pathogenic flow in the semantic embedding with supervision from the ClinVar database led to a 90% mean average precision in identifying relevant genes across 2097 diseases. This work provides a scalable and reproducible approach for leveraging LLMs in biomedical literature analysis, offering new opportunities for researchers to identify therapeutic targets efficiently.
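The mean average precision reported above can be illustrated with a minimal sketch. It assumes each disease has a ranked list of candidate genes and a set of known relevant genes, and it normalizes average precision by the number of relevant genes, which is one common convention and not necessarily the one used by the authors. The function names and the toy identifiers (`disease_1`, `GENE_A`, and so on) are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked gene list against a set of relevant genes."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, gene in enumerate(ranked, start=1):
        if gene in relevant:
            hits += 1
            # Precision at the rank where a relevant gene is retrieved.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], truth: Dict[str, Set[str]]
) -> float:
    """Mean of per-disease average precision, skipping diseases with no known genes."""
    scores = [
        average_precision(ranked, truth[disease])
        for disease, ranked in rankings.items()
        if truth.get(disease)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical identifiers, for illustration only).
toy_rankings = {
    "disease_1": ["GENE_A", "GENE_B", "GENE_C"],
    "disease_2": ["GENE_D", "GENE_E"],
}
toy_truth = {
    "disease_1": {"GENE_A", "GENE_C"},
    "disease_2": {"GENE_E"},
}
toy_map = mean_average_precision(toy_rankings, toy_truth)
```

Here `disease_1` scores (1/1 + 2/3) / 2 and `disease_2` scores 1/2, so the toy mean is their average.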
Current genome-wide association studies (GWAS) for kidney function lack ancestral diversity, limiting their applicability to broader populations. The East Asian population is especially under-represented, despite having the highest global burden of end-stage kidney disease. We conducted a meta-analysis of multiple GWASs (n = 244,952) on estimated glomerular filtration rate and a replication dataset (n = 27,058) from Taiwan and Japan. This study identified 111 lead SNPs in 97 genomic risk loci. Functional enrichment analyses revealed that variants associated with the F12 gene and a missense mutation in ABCG2 may contribute to chronic kidney disease (CKD) by influencing inflammation, coagulation, and urate metabolism pathways. In independent cohorts from Taiwan (n = 25,345) and the United Kingdom (n = 260,245), polygenic risk scores (PRSs) for CKD significantly stratified the risk of CKD (p < 0.0001). Further research is required to evaluate the clinical effectiveness of the CKD PRS in the early prevention of kidney disease.
Hung-Lin Chen, Hsiu-Yin Chiang, David Ray Chang, Chi-Fung Cheng, Charles C. N. Wang, Tzu-Pin Lu, Chien-Yueh Lee, Amrita Chattopadhyay, Yu-Ting Lin, Che-Chen Lin, Pei-Tzu Yu, Chien-Fong Huang, Chieh-Hua Lin, Hung-Chieh Yeh, I-Wen Ting, Huai-Kuang Tsai, Eric Y. Chuang, Adrienne Tin, Fuu-Jen Tsai, Chin-Chi Kuo
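As a rough illustration of the polygenic risk scores mentioned in the kidney function abstract, a PRS is commonly computed as a weighted sum of risk-allele dosages, with weights taken from GWAS effect sizes. The sketch below assumes that standard additive form; the abstract does not state which PRS method the study used, and the SNP identifiers, weights, and function name are hypothetical.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Additive PRS: sum of effect-size weight times allele dosage (0, 1, or 2).

    SNPs that have a weight but no genotype for this individual are skipped.
    """
    return sum(
        weight * dosages[snp] for snp, weight in weights.items() if snp in dosages
    )


# Hypothetical effect sizes and one individual's genotypes (illustration only).
toy_weights = {"snp_a": 0.5, "snp_b": 0.25, "snp_c": -0.125}
toy_dosages = {"snp_a": 2.0, "snp_b": 1.0}  # snp_c not genotyped
toy_prs = polygenic_risk_score(toy_dosages, toy_weights)
```

In practice, scores like this are usually standardized within a cohort before individuals are grouped into risk strata.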
Alternative splicing is a pivotal mechanism of post-transcriptional modification that contributes to the transcriptome plasticity and proteome diversity in metazoan cells. Although many splicing regulations around the exon/intron regions are known, the relationship between promoter-bound transcription factors and the downstream alternative splicing largely remains unexplored. In this study, we present computational approaches to unravel the regulatory relationship between promoter-bound transcription factor binding sites (TFBSs) and the splicing patterns. We curated a fine dataset that includes DNase I hypersensitive site sequencing and transcriptomes across fifteen human tissues from ENCODE. Specifically, we proposed different representations of TF binding context and splicing patterns to examine the associations between the promoter and downstream splicing events. While machine learning models demonstrated potential in predicting splicing patterns based on TFBS occupancies, careful examination using different cross-validation methods revealed limitations in generalizing predictions of the splicing forms of singleton genes across diverse tissues. We further investigated the association between alterations in individual TFBSs at promoters and shifts in exon splicing efficiency. Our results demonstrate that convolutional neural network (CNN) models, trained on TF binding changes in the promoters, can predict the changes in splicing patterns. Furthermore, a systematic in silico substitution analysis on the CNN models highlighted several potential splicing regulators. Notably, through empirical validation using K562 CTCFL shRNA knock-down data, we showed the significant role of CTCFL in splicing regulation. In conclusion, our findings highlight the potential role of promoter-bound TFBSs in influencing the regulation of downstream splicing patterns and provide insights for discovering alternative splicing regulations.
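The idea behind an in silico substitution analysis can be sketched as follows: each input feature is replaced in turn with a reference value, and the resulting change in the model's prediction is recorded as that feature's effect. This is a minimal illustration under stated assumptions, not the authors' procedure. The toy linear model stands in for the trained CNN, and the function name, feature order, and coefficients are all hypothetical.

```python
from typing import Callable, List, Sequence


def substitution_effects(
    predict: Callable[[Sequence[float]], float],
    features: Sequence[float],
    substitute: float = 0.0,
) -> List[float]:
    """Change in prediction when each feature is replaced, one at a time."""
    baseline = predict(features)
    effects = []
    for index in range(len(features)):
        perturbed = list(features)
        perturbed[index] = substitute  # substitute a single feature value
        effects.append(predict(perturbed) - baseline)
    return effects


def toy_model(values: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return 2.0 * values[0] - 1.0 * values[1] + 0.0 * values[2]


# Three hypothetical TFBS binding-change features, all set to 1.0.
toy_effects = substitution_effects(toy_model, [1.0, 1.0, 1.0])
```

Features whose substitution shifts the prediction most would be the candidates to examine further as potential regulators.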