Here I outline the research experience in Bioinformatics and Computational
Biology that I gained during my PhD. I also present the research interests
and ideas I would like to pursue in the future.
My PhD research has mainly been in two areas and their integration: cDNA
microarray data analysis, and automated mining of the literature for
functional information about gene co-expression clusters and protein families.
Gene Expression Analysis. cDNA microarrays have become a cornerstone technology in Functional Genomics. For example, microarrays have been used to identify gene expression fingerprints for diseases such as cancer and to study the genome-wide dynamics of gene expression during cellular processes such as the cell cycle or viral infection (I worked on data of the latter two kinds myself [1]). My work comprised a comparison of different multivariate methods for gene expression analysis and the development of a new technique that projects gene expression vectors into two-dimensional subspaces to identify significantly expressed genes. So far, Singular Value Decomposition (SVD) [7] has been used to identify such subspaces. The results obtained with this method have been compared with results from commonly used clustering techniques and have been validated biologically. The method has proven very powerful for capturing and visualizing important features of the data, such as clusters of genes in expression space and the successive activation of genes in temporal expression data (see footnote 1).
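The projection idea can be illustrated with a minimal sketch. It is not the implementation used in the work described above. It assumes a genes-by-conditions matrix, centers each condition, projects every gene onto the plane spanned by two SVD modes, and evaluates a simple Gaussian kernel density estimate at each projected gene. The function names, the toy data, and the bandwidth are hypothetical. The sketch only returns coordinates and densities, because how densities are turned into a significance call is not specified here.

```python
import numpy as np

def project_onto_modes(expr, i=0, j=1):
    """Project gene expression vectors (rows of expr) onto SVD modes i and j."""
    # Assumption: each condition (column) is centered before the decomposition.
    centered = expr - expr.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of every gene in the plane spanned by right singular vectors i and j.
    return centered @ vt[[i, j]].T

def gaussian_kde_at_points(points, bandwidth=1.0):
    """Gaussian kernel density estimate evaluated at each of the 2-D points."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    norm = 2.0 * np.pi * bandwidth ** 2 * len(points)
    return kernel.sum(axis=1) / norm

# Toy data (random, for illustration only): 200 genes measured in 12 conditions.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 12))

coords = project_onto_modes(expr, 0, 1)                 # shape (200, 2)
density = gaussian_kde_at_points(coords, bandwidth=0.5)  # shape (200,)
```

Genes lying in sparsely populated regions of the plane receive low density values, which is one plausible starting point for singling out genes that stand apart from the bulk.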
There are several promising ways to extend and improve the developed algorithm. The algorithm is currently limited to two-dimensional projections because it requires density estimates of the projected genes, and higher-dimensional density estimates have proven unstable for the number of data points currently available. To explore higher-dimensional spaces, I am interested in implementing an iterative algorithm that uses different combinations of SVD modes as the two-dimensional subspaces for the projections. Further, I would like to explore methods that use criteria other than those of SVD for subspace identification. A method I am specifically interested in is Independent Component Analysis (ICA).
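One way the proposed iteration over mode pairs could look is sketched below. This is an assumption about a possible design, not an existing implementation: the function name, the n_modes parameter, and the toy data are hypothetical, and each pair of modes simply yields one two-dimensional projection that could then be passed to a density estimate.

```python
from itertools import combinations

import numpy as np

def mode_pair_projections(expr, n_modes=4):
    """Yield ((i, j), coords) for every pair of the first n_modes SVD modes."""
    # Assumption: each condition (column) is centered before the decomposition.
    centered = expr - expr.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    n_modes = min(n_modes, vt.shape[0])
    for i, j in combinations(range(n_modes), 2):
        # Each pair of modes defines one two-dimensional subspace.
        yield (i, j), centered @ vt[[i, j]].T

# Toy data (random, for illustration only): 100 genes measured in 10 conditions.
rng = np.random.default_rng(1)
expr = rng.normal(size=(100, 10))

projections = dict(mode_pair_projections(expr, n_modes=4))
```

With four modes this produces six planes, so the number of projections grows quadratically with the number of modes considered while every density estimate stays two-dimensional.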
The gene expression data sets that I would like to work with in the future
are temporal expression data from different biological processes, partly
because they would allow for more work oriented toward gene network analysis,
as discussed below. I find developmental processes of various kinds
particularly interesting.
Information Retrieval for Mining of Functional Information. The amount of information about biological entities and their interactions is huge and growing fast. The biomedical literature database MEDLINE alone indexes over 12 million articles and is growing by more than 2,000 articles per day. In addition, functional information about biological entities and their interactions is very complex. For example, identifying the functional context of a group of genes (e.g. the cellular processes or pathways in which the genes of a co-expression cluster participate) is very difficult. The traditional path of retrieving annotations for the genes from databases and then filtering that information for significant and meaningful relationships is very time-consuming, usually requires biological experts, and is often not very successful. I worked on using Information Retrieval techniques such as Latent Semantic Analysis (LSA) [2] to identify keywords that describe the cellular functions and processes of groups of genes and proteins through their associations in the literature. LSA has been shown to be useful for finding reduced semantic spaces that capture the important knowledge contained in a corpus of documents indexed by keywords, and for discarding obscuring noise in such corpora. Similarly, my work has shown that LSA is able to identify functional spaces (or functional themes associated with these spaces) for groups of genes and proteins from their associations with keywords obtained from the literature. So far, the source of the keywords in our work has mainly been the controlled, hierarchical Medical Subject Headings (MeSH) vocabulary, which is maintained by the National Library of Medicine (NLM). The developed method has been used to identify functional themes for co-expression clusters in herpes virus infected human fibroblast cells (see footnote 2) and to explore the clustering of PFAM protein sequence families in literature space (see footnote 3).
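The core LSA step can be sketched on a toy example. The gene names, keywords, and counts below are invented for illustration and are not data from the work described above, and the helper name lsa is hypothetical. The sketch assumes a genes-by-keywords association matrix, keeps the leading k SVD components as the reduced space, and compares genes by cosine similarity in that space.

```python
import numpy as np

# Hypothetical toy data: how often each gene co-occurs with each keyword.
genes = ["geneA", "geneB", "geneC", "geneD"]
keywords = ["DNA replication", "cell cycle", "immune response", "cytokines"]
counts = np.array([
    [5, 4, 0, 0],
    [3, 6, 0, 1],
    [0, 0, 7, 5],
    [0, 1, 4, 6],
], dtype=float)

def lsa(matrix, k):
    """Return gene coordinates and keyword loadings for the leading k components."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k], vt[:k]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gene_coords, themes = lsa(counts, k=2)

# Keywords with the largest absolute loadings characterize each component.
for theme in themes:
    top = np.argsort(-np.abs(theme))[:2]
    print([keywords[t] for t in top])

sim_ab = cosine(gene_coords[0], gene_coords[1])  # genes sharing keywords
sim_ac = cosine(gene_coords[0], gene_coords[2])  # genes with disjoint keywords
```

In this reduced space, genes that share keywords end up closer together than genes with disjoint keyword profiles, which is the property used to read off functional themes for a group of genes.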
There are several ways in which this technique could be developed
further. I am especially interested in applying literature mining
techniques to gene (and protein) regulation and network analysis
(see also below). Methods of particular interest are network analysis
methods based on weighted graphs, such as proximity networks [6] and
semi-metric analysis of distance networks [5].
Of further interest is the exploration of LSA-type techniques that are
based not on SVD but on other spectral techniques like ICA. Other sources
of keywords besides the MeSH terms, for example keywords extracted directly
from papers or abstracts, should also be explored. Further sources of
keywords could be different biological databases. For example, the curated
protein sequence database SwissProt contains several keyword fields that give
short descriptions of protein function, sub-cellular location, and other
information. Some keywords are even linked to specific positions in the
protein sequence. Such databases are very rich sources of useful biological
information, and using them for mining of functional information and for
network inference seems promising.
From Gene Expression to Gene Regulation and Gene Networks. As
hinted at earlier, a natural extension of my work so far lies in the
identification and modeling of gene regulatory networks.
While I am interested in learning about the different approaches that have
been developed in the field, I believe the methods I have already been
working on can be extended into valuable tools for it. For example,
my work on the above-mentioned projection and density estimation algorithm
for gene expression data showed that the temporal activation of genes'
expression levels is often captured in two-dimensional subspaces (identified
with SVD). Such observations lead naturally to hypotheses about causal
relationships and possible interactions of genes and of the cellular
processes and pathways related to these genes. The literature analysis
method I worked on is able to functionally annotate groups of genes
and their potential relationships, and it provides a second space, the
literature space, in which to explore the functional similarities and
relationships of genes. Combining the two methods might yield a powerful
integrative technique for the analysis of gene regulation and gene networks.
Integration of Data and Methodologies. The integration of different data sources and technologies is of central importance for Functional Genomics and Systems Biology. My work on integrating literature mining for functional information with the analysis of gene expression and protein family classification has shown promising results. A natural next step is to integrate these techniques with other functional genomics data and analysis techniques, such as those from proteomics.