Bio-IT-Station


HK Tsai Lab of Bioinformatics

Institute of Information Science, Academia Sinica

News

Spring Outing: Strawberry Picking

After a heated vote, we decided to pick strawberries at Dong Lin strawberry farm, located at Bishan in Neihu. We enjoyed nature and some relaxing time, making jam with strawberries. Cozy hiking

Lab Lunch & Trip: Riddle City

We had a wonderful lab gathering with intern presentations, a farewell & welcome lunch! A puzzle game “Riddle City - 捷運踩地

Summer: Welcome New Friends

Welcoming new interns and friends to our lab—let’s explore, learn, and grow together!

Projects

TFAS

A bioinformatics-based exploration of promoter occupancy and alternative splicing in the human genome

LncRNA

Reveal the function of lncRNA in transcriptional and epigenetic regulation

Lab Members

Principal Investigator

Huai-Kuang Tsai

Research Fellow/Professor

Evolutionary Algorithm, Bioinformatics, Regulatory Mechanism, Metagenomics, Computational Biology

Researchers

Yu-Hsuan Huang

Postdoctoral Researcher

Machine Learning, Genomics, Virology

Bing-Shiun Tsai

Research Assistant

Machine Learning, Bioinformatics

Kai-Ze Zhu

Research Assistant

Statistical Computation, Machine Learning, Variable Selection in High-Dimensional Data, Genomics

Shu-Qi Yu

Research Assistant

Bioinformatics, Network, Graph Theory, Algorithm

Ting-Yu Yeh

Research Assistant

Machine Learning, Network Biology, Genomics

Grad Students

Ru-Yin Jian

Doctoral Student

Machine Learning, Bioinformatics, Cancer

Shang-Kok NG

Doctoral Student

Bioinformatics, Cancer

Administration

Visiting Scholars

Jia-Hsin Huang

Assistant Professor

Insect Physiology, Bioinformatics, Genomics

Wong Jin Yung

Assistant Professor

Evolution, Genomics, Machine Learning, Biomechanics

Alumni

Recent Publications

Complete end-to-end learning from protein feature representation to protein interactome inference

Background: Co-fractionation coupled with mass spectrometry (CF-MS) is a powerful strategy for mapping protein–protein interactions (PPIs) under near-physiological conditions. Despite recent progress, existing analysis pipelines remain constrained by reliance on handcrafted features, sensitivity to experimental noise, and an inherent focus on pairwise interactions, which limit their scalability and generalizability. To address these difficulties, we introduce FREEPII (Feature Representation Enhancement End-to-End Protein Interaction Inference), a unified deep learning framework that integrates CF-MS data with sequence-derived features to learn biologically meaningful protein-level representations for accurate and efficient inference of PPIs and protein complexes.

Results: FREEPII employs a convolutional neural network architecture to learn protein-level representations directly from raw data, enabling feature sharing across interaction pairs and reducing computational complexity. To enhance robustness against CF-MS noise, protein sequences are introduced as auxiliary input to enrich the feature space with complementary biological cues. The supervised protein embeddings further encode network-level context derived from complex annotations, allowing the model to capture higher-order interactions and enhance the expressive power of protein representations. Extensive benchmarking demonstrates that FREEPII consistently outperforms state-of-the-art CF-MS analysis tools, capturing more biologically coherent and discriminative protein features. Cross-dataset evaluations further reveal that integrating multimodal data from diverse experimental contexts substantially improves the generalization and sensitivity of data-driven models, offering a scalable, cross-species strategy for reliable protein interaction inference.

Conclusions: FREEPII provides a unified computational framework that integrates CF-MS data and sequence-derived features to learn discriminative and biologically consistent protein representations. By leveraging multimodal inputs through a coherent deep learning architecture, the model achieves accurate and scalable inference of PPIs and protein complexes across species. Its modality-aware design and supervised protein embeddings capture higher-order interaction contexts, ensuring robust generalization and reliable discovery of novel interactions. Overall, FREEPII offers a flexible and extensible foundation for data-driven exploration of protein interaction networks.

A large language model framework for literature-based disease–gene association prediction

With the exponential growth of biomedical literature, leveraging Large Language Models (LLMs) for automated medical knowledge understanding has become increasingly critical for advancing precision medicine. However, current approaches face significant challenges in reliability, verifiability, and scalability when extracting complex biological relationships from scientific literature using LLMs. To overcome the obstacles of LLM development in biomedical literature understanding, we propose LORE, a novel unsupervised two-stage reading methodology with an LLM that models literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. LORE captured essential gene pathogenicity information when applied to PubMed abstracts for large-scale understanding of disease–gene relationships. We demonstrated that modeling a latent pathogenic flow in the semantic embedding with supervision from the ClinVar database led to a 90% mean average precision in identifying relevant genes across 2097 diseases. This work provides a scalable and reproducible approach for leveraging LLMs in biomedical literature analysis, offering new opportunities for researchers to identify therapeutic targets efficiently.
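For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of how mean average precision is commonly computed over per-disease gene rankings. It uses the standard textbook definition; the exact evaluation protocol used for LORE is not described here, and the disease names, gene names, and function names below are hypothetical toy values chosen only for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one ranked list.

    Sums precision@k at every rank k that holds a relevant item,
    then divides by the number of relevant items.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, gene in enumerate(ranked, start=1):
        if gene in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], truth: Dict[str, Set[str]]
) -> float:
    """Mean of the per-disease average precision values."""
    scores = [average_precision(rankings[d], truth[d]) for d in rankings]
    return sum(scores) / len(scores)


# Toy data (hypothetical): predicted gene rankings and known relevant genes.
rankings = {
    "disease_A": ["G1", "G2", "G3", "G4"],
    "disease_B": ["G5", "G6", "G7"],
}
truth = {
    "disease_A": {"G1", "G3"},
    "disease_B": {"G6"},
}

map_score = mean_average_precision(rankings, truth)
print(f"MAP = {map_score:.3f}")
```

In this toy case disease_A scores (1/1 + 2/3) / 2 and disease_B scores 1/2, so the mean is 2/3. A value near 0.9 would mean that relevant genes sit close to the top of most rankings.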

Discovery and prioritization of genetic determinants of kidney function in 297,355 individuals from Taiwan and Japan

Current genome-wide association studies (GWAS) for kidney function lack ancestral diversity, limiting the applicability to broader populations. The East-Asian population is especially under-represented, despite having the highest global burden of end-stage kidney disease. We conducted a meta-analysis of multiple GWASs (n = 244,952) on estimated glomerular filtration rate and a replication dataset (n = 27,058) from Taiwan and Japan. This study identified 111 lead SNPs in 97 genomic risk loci. Functional enrichment analyses revealed that variants associated with the F12 gene and a missense mutation in ABCG2 may contribute to chronic kidney disease (CKD) through influencing inflammation, coagulation, and urate metabolism pathways. In independent cohorts from Taiwan (n = 25,345) and the United Kingdom (n = 260,245), polygenic risk scores (PRSs) for CKD significantly stratified the risk of CKD (p < 0.0001). Further research is required to evaluate the clinical effectiveness of the PRS for CKD in the early prevention of kidney disease.
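As background for the polygenic risk scores mentioned in the abstract above, here is a minimal sketch of the basic additive form of a PRS: a weighted sum of risk-allele dosages, with per-SNP weights typically taken from GWAS effect sizes. The abstract does not state how the CKD scores were constructed, so this is not the study's method. The SNP identifiers, weights, and function name are hypothetical, and treating a missing genotype as dosage 0 is a simplification made only for this example.

```python
from typing import Dict


def polygenic_risk_score(
    dosages: Dict[str, float], weights: Dict[str, float]
) -> float:
    """Additive PRS: sum over SNPs of (effect weight x risk-allele dosage).

    Dosage is the number of risk alleles carried (0, 1, or 2).
    SNPs missing from `dosages` are counted as 0 (illustrative assumption).
    """
    return sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())


# Hypothetical per-SNP effect weights (not taken from the study).
weights = {"snp1": 0.10, "snp2": -0.05, "snp3": 0.20}

# Hypothetical genotype of one individual.
person = {"snp1": 2, "snp2": 1, "snp3": 0}

score = polygenic_risk_score(person, weights)
print(f"PRS = {score:.2f}")
```

Here the score is 0.10 x 2 - 0.05 x 1 + 0.20 x 0 = 0.15. In practice such scores are usually standardized within a cohort before individuals are grouped into risk strata.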

Contact