My research focuses on gene regulatory mechanisms and transcriptomics. I apply computational techniques and statistical methods to data from high-throughput biological experiments. Transcription of genes is mainly controlled by the interaction between transcription factors (TFs) and their binding sites (TFBSs). I am particularly interested in inferring the functions of TFs and the TFBSs they recognize. In conjunction with studies on TFs and TFBSs, I also investigate other factors that play prominent roles in regulatory mechanisms, including chromosome accessibility, noncoding RNAs, and DNA structure. I mostly study the model organisms Saccharomyces cerevisiae and Homo sapiens.

My second research focus is metagenomics, which analyzes samples taken directly from the environment. Using data produced by high-throughput sequencing technology, I improve the processing methods for noisy metagenomic data and determine the systemic properties of microbial communities. My research highlights are summarized below.
Three major research interests of our lab are described below:

Investigations on Transcription Factor Binding Sites (TFBSs)

Building on my previous contributions to the identification of TFBSs, I have expanded my work to investigate the characteristics of TFBSs and other factors associated with TF binding events. The abundant and diverse genomic data now available provide a rich platform for studying regulatory mechanisms based on TFs and their associated factors.

We began by incorporating other factors into the study of TF regulation. Nucleosomes have been found to have a profound impact on gene regulation, but their contribution to transcriptional evolution is relatively unexplored. We therefore studied the role of nucleosome positioning in the evolution of TFBSs. We compared TFBS frequency and TFBS nucleotide variants between nucleosome-occupied and nucleosome-depleted regions in the promoters of old genes (orthologs among Saccharomycetaceae) and young genes (Saccharomyces specific), and in duplicated gene pairs. Our results show that nucleosome-occupied regions accommodate greater binding site variation and a higher evolutionary rate than nucleosome-depleted regions. We further conducted site-directed mutagenesis and showed that binding site gain or loss in nucleosome-depleted regions may cause a larger expression difference than gain or loss in nucleosome-occupied regions.

Expanding on the nucleotide variants of TFBSs, we further explored the association between the variable positions in the TFBSs of one TF and the TFBSs of other TFs that consistently co-occur with it. Two TFs are considered a co-occurring TF pair if their TFBSs co-occur in a set of promoters. We considered both low- and high-affinity TFBSs in a genome-wide analysis. We found that variable positions are generally conserved and are significantly associated with other co-occurring TFs. Most of these associations also show significant functional enrichment and synergistic effects on their target genes.
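As a rough illustration of how co-occurrence in a set of promoters might be scored, the sketch below computes an upper-tail hypergeometric p-value for the overlap between the promoters carrying TFBSs of two TFs. The choice of test, the function name `cooccurrence_pvalue`, and the toy promoter sets are assumptions made for illustration; the text above does not state which statistic was used.

```python
from math import comb


def cooccurrence_pvalue(n_promoters, n_a, n_b, n_both):
    """Upper-tail hypergeometric probability P(X >= n_both).

    n_promoters: total number of promoters examined
    n_a, n_b:    promoters containing a TFBS of TF A / TF B
    n_both:      promoters containing TFBSs of both TFs
    (Hypothetical helper; the test is an assumed choice.)
    """
    upper = min(n_a, n_b)
    total = comb(n_promoters, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_promoters - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total


# Toy data (invented): 10 promoters, both TFs have sites in the same three.
all_promoters = {f"p{i}" for i in range(1, 11)}
promoters_a = {"p1", "p2", "p3"}
promoters_b = {"p1", "p2", "p3"}

p_value = cooccurrence_pvalue(
    len(all_promoters),
    len(promoters_a),
    len(promoters_b),
    len(promoters_a & promoters_b),
)
print(f"co-occurrence p-value: {p_value:.4f}")
```

In a genome-wide setting the p-values for all TF pairs would also need a multiple-testing correction, which is omitted here.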

Our exploration of the synergistic effects of TFs eventually developed into a separate study. Two TFs are said to be synergistic if genes regulated by both TFs show a stronger co-expression pattern than genes regulated by either TF alone. We developed a likelihood-based method to identify interacting TF pairs and their TFBSs simultaneously. The method identifies well-conserved and over-represented motifs that are enriched in the promoters of a set of genes. It is therefore likely that these TF pairs co-regulate their target genes, and that the inferred motifs are the respective binding motifs of the TFs.
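The definition of synergy given above can be made concrete with a small sketch. It is not the likelihood-based method itself; it only illustrates the definition, under the assumptions that co-expression is measured as the mean pairwise Pearson correlation and that "either TF alone" means genes targeted by exactly one of the two TFs. All names and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_coexpression(expr, genes):
    """Mean pairwise correlation within a gene set (needs >= 2 genes)."""
    pairs = list(combinations(sorted(genes), 2))
    return sum(pearson(expr[g], expr[h]) for g, h in pairs) / len(pairs)


def is_synergistic(expr, targets_a, targets_b):
    """True if shared targets are more co-expressed than exclusive targets."""
    both = targets_a & targets_b
    only_a = targets_a - targets_b
    only_b = targets_b - targets_a
    return mean_coexpression(expr, both) > max(
        mean_coexpression(expr, only_a), mean_coexpression(expr, only_b)
    )


# Invented expression profiles across four conditions.
expr = {
    "g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5],   # bound by both TFs
    "g3": [1, 3, 2, 4], "g4": [4, 1, 3, 2],     # bound by TF A only
    "g5": [2, 1, 4, 3], "g6": [3, 4, 1, 2],     # bound by TF B only
}
targets_a = {"g1", "g2", "g3", "g4"}
targets_b = {"g1", "g2", "g5", "g6"}
synergistic = is_synergistic(expr, targets_a, targets_b)
```

A real analysis would replace the simple comparison with a statistical test, since a difference in mean correlation on small gene sets can arise by chance.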

The synergistic effects of TFs may result from two interacting TFs regulating the same gene, as TF regulation often involves cooperativity among a set of TFs. A TF-TF interaction is inferred if the influence of one TF on its target genes depends on another TF. We developed a novel method, simTFBS, that discovers TF-TF interactions on a gene. SimTFBS incorporates de novo motif discovery as a fundamental step when detecting the shared targets of TFs from ChIP-chip data. By requiring the presence of a de novo identified motif, simTFBS recruits more genes with low binding affinity from the ChIP-chip data. It outperforms naïve methods and has advantages over two other advanced methods. By comparing simTFBS with predictions based on a set of available annotated yeast TF binding motifs, we showed that incorporating de novo motif discovery indeed improves the accuracy of inferring TF-TF interactions.


Regulatory Mechanisms

Since TFs and TFBSs provide a wealth of insights into gene transcription, we enlarged our scope and studied the transcriptional process, building on our knowledge of and experience with TFs and TFBSs. We started by studying divergent gene pairs. A divergent (head-to-head or bidirectional) gene pair comprises two adjacent genes whose transcription start sites are located on opposite strands of DNA with adjacent 5’ ends. Divergent gene pairs are abundant throughout the genome; in particular, they constitute as much as 10% of the human genome. Since the genes in a divergent pair share the same promoter region, it has been suggested that they could be co-regulated by the same set of regulatory elements. To identify the regulatory mechanism, we integrated TFBSs and multiple microarray expression datasets to infer the cis-regulatory modules in Saccharomyces cerevisiae. We first examined the expression profiles from microarray knockout experiments of 263 TFs, then conducted a comprehensive study across the yeast genome. Our results show that only a limited number of divergent gene pairs share TFBSs, but the genes in a divergent gene pair tend to be co-regulated in at least one condition.
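To make the definition of a divergent gene pair concrete, the following sketch scans a gene annotation for adjacent genes whose 5’ ends face each other. The `Gene` record, the coordinates, and in particular the `max_gap` threshold on the distance between the two transcription start sites are assumptions for illustration, not values taken from the study.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def tss(gene):
    """Transcription start site: left end on '+', right end on '-'."""
    return gene.start if gene.strand == "+" else gene.end


def divergent_pairs(genes, max_gap=1000):
    """Adjacent (minus-strand, plus-strand) genes with nearby, non-overlapping TSSs.

    max_gap is a hypothetical cutoff on the TSS-to-TSS distance.
    """
    by_chrom = {}
    for g in genes:
        by_chrom.setdefault(g.chrom, []).append(g)

    pairs = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g.start)
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            # Head-to-head: left gene runs leftward, right gene runs rightward.
            if left.strand == "-" and right.strand == "+":
                gap = tss(right) - tss(left)
                if 0 < gap <= max_gap:
                    pairs.append((left.name, right.name))
    return pairs


# Invented annotation: only A/B are head-to-head.
genes = [
    Gene("A", "chrI", 100, 500, "-"),
    Gene("B", "chrI", 700, 1200, "+"),
    Gene("C", "chrI", 1500, 2000, "+"),
    Gene("D", "chrI", 2100, 2600, "-"),
]
pairs = divergent_pairs(genes)
```

Here B/C are tandem and C/D are convergent (tail-to-tail), so neither is reported.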

We then shifted our focus to how regulatory elements arise as a DNA sequence evolves into a gene. Studying the state of regulatory elements at gene origination helps to decipher the most fundamental elements necessary for gene transcription. We studied the cis-regulatory elements of de novo genes, which are new genes that originated from non-coding sequences, in comparison with those of duplicated new genes. We found that the number of TFBSs in de novo genes increased rapidly and soon became comparable to the number in well-established genes. We conjectured that three characteristics of de novo genes contribute to this rapid increase: relatively frequent gain of TFBSs, a high number of pre-existing TFBSs, and low selection pressure in the promoter regions. Based on promoter architecture and functional analysis, we further showed that de novo genes and duplicated new genes may employ different regulatory strategies, and hence that these two types of genes might play different roles in evolution.

Expanding on our investigations of regulatory evolution, we studied TFs and microRNAs (miRNAs), another crucial class of regulators, as indicators of protein evolutionary rates. We provided a comprehensive study incorporating ten different indicators of protein evolutionary rates across three metazoans, namely human, mouse, and fruit fly. Our results show that metazoans exhibit a negative correlation between the number of transcriptional regulators and evolutionary rate, even when other indicators are controlled for. Further analysis shows that miRNA is generally the most essential of the examined indicators, and that the combination of TF and miRNA has a significant dependent effect on protein evolutionary rates. We also observed that the contribution of the number of transcriptional regulators is higher in vertebrates than in invertebrates.
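One common way to ask whether a correlation persists "when other indicators are controlled" is a partial correlation. The sketch below shows the first-order case with a single control variable, using Pearson correlation. Both choices are assumptions for illustration; the text does not say which correlation measure or control procedure was used, and the numbers are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def partial_from_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def partial_correlation(x, y, z):
    """Correlation between x and y after controlling for z."""
    return partial_from_r(pearson(x, y), pearson(x, z), pearson(y, z))


# Invented per-gene values: regulator count, evolutionary rate, and one
# control indicator (for example, expression level).
n_regulators = [1, 3, 5, 7, 9]
evol_rate = [0.9, 0.7, 0.6, 0.3, 0.2]
expression = [2, 1, 4, 3, 5]

r_partial = partial_correlation(n_regulators, evol_rate, expression)
print(f"partial correlation: {r_partial:.3f}")
```

Controlling for several indicators at once would require higher-order partial correlations or a regression model, which this sketch does not cover.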

We have also investigated other mechanisms that take place simultaneously with transcription. It has become widely accepted that alternative splicing is coupled with transcription. Additionally, previous studies have found that the kinetic activity of RNA polymerase II affects the splicing outcome, and that RNA polymerase II kinetics may be affected by the presence on the DNA of G-quadruplexes, four-stranded non-B DNA structures characterized by multiple Gs and short loops. We therefore conducted a genome-wide, cross-species investigation of whether G-quadruplexes and four other non-B DNA structures are associated with exon skipping. Our results indicate a statistically significant correlation between each non-B DNA structure and exon skipping in both human and mouse. The correlations and contributions are also affected by the relative strand and relative position at which the non-B DNA structure occurs. We thus show that, in addition to the well-known effects of RNA and protein structures, structures in the DNA itself may also affect exon skipping.
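Putative G-quadruplex motifs are often located with a simple sequence pattern of four G-runs separated by short loops. The sketch below uses one widely used form of that pattern (runs of at least three Gs, loops of one to seven bases) and reports hits on both strands, since strand matters in the analysis above. The exact pattern and the function names are assumptions; the text does not state how the structures were annotated.

```python
import re

# Assumed motif: G{3+} N{1-7} G{3+} N{1-7} G{3+} N{1-7} G{3+}
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_g4(seq):
    """Return (start, end, strand) for non-overlapping motif hits.

    Coordinates are 0-based, half-open, and always given on the input
    ('+') strand, including for hits found on the '-' strand.
    """
    seq = seq.upper()
    n = len(seq)
    hits = [(m.start(), m.end(), "+") for m in G4_PATTERN.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [
        (n - m.end(), n - m.start(), "-") for m in G4_PATTERN.finditer(rc)
    ]
    return sorted(hits)


# Invented example sequences.
plus_seq = "TT" + "GGGAGGGAGGGAGGG" + "TT"
minus_seq = reverse_complement(plus_seq)

plus_hits = find_g4(plus_seq)
minus_hits = find_g4(minus_seq)
```

A pattern match only marks sequence with the potential to fold; it does not show that a G-quadruplex actually forms in vivo.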

Transcription factor binding site identification based on chromosome accessibility

    One of the central questions in molecular genetics concerns the mechanisms of transcriptional regulation, particularly how TFs interact with their intended targets. Although current computational approaches for detecting TFBSs are well established, they are plagued by high false-positive rates. Recent studies have shown that two features of chromosome accessibility, namely chromatin states and DNA structural properties, affect TF binding events. Each of these features has since been used independently to predict TFBSs, but the results are unsatisfactory. We therefore systematically explore TF binding events by incorporating chromatin states and DNA structural properties simultaneously. We develop a machine learning-based TFBS prediction method that integrates the conventional sequence motif feature with these two features. The model demonstrates improved performance. Moreover, the model reveals that three specific properties of chromosome accessibility are sufficient to predict TFBSs accurately. We further generate a TF accessibility profile to aid the recognition of functional TFBSs. Since chromosome accessibility is intrinsic to the DNA and can be derived from the DNA sequence alone, our method is capable of predicting functional TFBSs in any sequenced genome.

Analyzing combinatorial regulations on nucleosome dynamics in yeast

    The nucleosome is the fundamental repeating unit of DNA packaging in eukaryotic chromatin, and most of the genome appears to be wrapped in nucleosomes. Nucleosome positions can limit the accessibility of DNA to proteins or block TFBSs; they are therefore associated with chromatin accessibility and variation in gene expression. With the maturing of high-throughput DNA sequencing technology, many high-resolution maps of genome-wide nucleosome positions have been constructed in recent years. These data show that nucleosomes are not uniformly distributed and that the pattern of nucleosome positions is related to gene regulation and transcription level. Nucleosome positioning is therefore critical for transcription and most DNA-related activities. It is influenced by both intrinsic and extrinsic factors. Intrinsic factors, such as DNA sequence and DNA structure, are strong determinants of where nucleosomes form and where their formation is inhibited. Extrinsic factors, such as histone variants, histone modifications, chromatin remodeler interactions, and transcription factor binding sites, can affect and regulate nucleosome positions. Consequently, the dynamics of nucleosome positioning are determined by both kinds of factors, and determining the relationship between the factors and their relative priorities is critical to understanding nucleosome dynamics and regulation. By analyzing the nucleosome-related profile of each factor at a genome-wide level and integrating that information using statistical networks, we will provide a better understanding of the mechanism of nucleosome dynamics and a more comprehensive and quantitative mapping of the regulatory mechanisms from an epigenetic perspective.

Identification and Functional Analysis of Enhancer RNAs

    After the discovery that more than 80% of the human genome has biochemical function, interest in non-protein-coding transcribed regions has increased. Enhancers are sections of DNA known to enhance the transcription of their specific target genes through looping interactions. RNAs transcribed from enhancer regions are an increasingly popular area of study because current experiments suggest that they play a role in enhancer function and activity, although the evidence on the mechanism by which they act is mixed and unclear. Although experiments have been performed, to date there have been few comprehensive or bioinformatics-based studies that identify these enhancer RNAs (eRNAs) and examine their functional roles. The focus of this study is therefore to use RNA-seq data from the UCSC ENCODE database to identify potential eRNAs transcribed from the enhancers identified by Shen et al. By examining the correlation between the expression of predicted enhancers and that of their target genes, with which they are thought to have looping interactions, we identify potential enhancers with eRNAs, and we annotate the functions of those transcripts across eleven different mouse tissue types using the Rfam database of annotated RNAs. Finally, the eRNAs common across cell lines are identified and their cellular functions are examined for more insight into the functional role of eRNAs.

Investigation of the evolution trend of human long intergenic non-coding RNA

    By applying high-throughput sequencing to human transcriptome analysis, researchers found that only a minor proportion of the human transcriptome consists of mRNAs; the majority is composed of transcripts that cannot be translated into proteins. These transcripts, known as non-coding RNAs, are believed to play important roles in gene regulation or to have other novel functions, although their functions remain largely unclear. The emergence of non-coding RNAs through evolution may hold the promise of better explaining the complex transcriptomes of different species. We are particularly interested in a subclass of non-coding RNAs called long intergenic non-coding RNAs (lincRNAs), which are longer than 200 nucleotides and do not overlap with the exons of any protein-coding gene. LincRNAs are important because they are involved in the regulation of chromatin structure and gene expression and in the development of the nervous system of vertebrates. According to existing studies, some lincRNAs have evolved only in primates. However, we still lack a clear picture of the evolutionary origin of human lincRNAs. In this study, we propose to tackle this problem by using comparative genomics to compare human lincRNAs with the genomes of a large number of other species. Specifically, we would like to answer the following research questions. First, do human lincRNAs exist in the genomes of other species, such that we can attempt to trace their evolutionary origins? Second, if human lincRNAs do not exist in other species, how have they evolved from their origins in other eutherians and vertebrates? This study will expand our understanding of the characteristics of lincRNAs.
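The two criteria in the lincRNA definition above (longer than 200 nucleotides, no overlap with protein-coding exons) can be expressed as a simple filter. The sketch below assumes 0-based, half-open exon coordinates, ignores strand, and takes for granted that the transcript has already been classified as non-coding; the function names and coordinates are hypothetical.

```python
def transcript_length(exons):
    """Spliced length: the sum of exon lengths (half-open coordinates)."""
    return sum(end - start for start, end in exons)


def is_lincrna_candidate(chrom, exons, coding_exons, min_length=200):
    """Apply the length and exon-overlap criteria to one non-coding transcript.

    chrom:        chromosome of the transcript
    exons:        list of (start, end) for the transcript
    coding_exons: list of (chrom, start, end) for protein-coding exons
    """
    if transcript_length(exons) <= min_length:
        return False
    for start, end in exons:
        for c_chrom, c_start, c_end in coding_exons:
            # Two half-open intervals overlap if each starts before the other ends.
            if c_chrom == chrom and start < c_end and c_start < end:
                return False
    return True


# Invented annotation with a single protein-coding exon.
coding_exons = [("chr1", 1000, 1200)]

keep = is_lincrna_candidate("chr1", [(100, 250), (400, 520)], coding_exons)
overlapping = is_lincrna_candidate("chr1", [(900, 1100), (1300, 1500)], coding_exons)
too_short = is_lincrna_candidate("chr1", [(100, 250)], coding_exons)
```

The nested loop is adequate for a toy example; a genome-scale annotation would call for an interval index or a sorted sweep instead.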