Our Research

The ultimate goal of our research is to find protein biomarkers that allow cancer to be diagnosed at an early stage through non-invasive tests, e.g., blood tests. Currently, the U.S. Food and Drug Administration approves only eight serum protein biomarkers for cancer diagnosis (and one non-serum biomarker). All eight serum biomarkers are secreted, glycosylated proteins (glycoproteins). Half of them are localized in the plasma membrane and the other half are extracellular. Furthermore, half of them are membrane proteins, a class that also accounts for over half of drug targets as reported in the literature. Therefore, we focus on mass spectrometry-based proteomics and metabolomics, post-translational modification, membrane proteomics, and membrane protein structure prediction.

Research Areas

1. Mass Spectrometry-based proteomics

(1) Automated quantitation analysis
Protein identification and quantitation are the two main tasks in mass spectral analysis. Previously, we focused on bioinformatics systems for quantitation analysis and developed three tools: Multi-Q for iTRAQ-labeling quantitation, MaXIC-Q for ICAT- and SILAC-labeling quantitation, and IDEAL-Q for label-free quantitation. Together they cover the most popular experimental approaches to quantitation.
(2) SWATH-MS data for untargeted proteomics analysis
SWATH is a data-independent acquisition method that was developed in recent years, mainly for targeted proteomics analysis, and has attracted much attention. Because the high-throughput data generated by SWATH-MS are intended mainly for targeted analysis, we propose a method, called ProDIA, that generates in silico MS/MS spectra from SWATH-MS datasets so that the generated MS/MS dataset can be searched with a database sequence searching tool, e.g., MASCOT, for protein identification. Combining the search results of spectral data acquired by conventional data-dependent acquisition with those from SWATH-MS can enhance protein identification, i.e., more proteins can be identified from the biological sample.
(3) Raw spectral data converter
Most existing converters that generate peak list files from raw spectral data have limitations: for example, they do not report the charge state of each peak in a spectrum, or the intensity and m/z of some peaks are inconsistent with the raw data. We are developing a method that generates peak list files whose peaks are consistent with the raw data and that provides the charge state of each peak.
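To illustrate what "charge information of each peak" can mean in practice, here is a minimal sketch of one textbook way to infer charge: adjacent isotopic peaks of an ion with charge z are spaced about 1.00335/z apart in m/z. This is not the converter under development. The function names, the tolerance, and the peak values are assumptions made for illustration only.

```python
# Approximate mass difference between 13C and 12C, in daltons.
ISOTOPE_SPACING = 1.00335


def infer_charge(mz_a, mz_b, max_charge=6, tol=0.01):
    """Return the charge implied by the spacing of two isotopic peaks, or None."""
    delta = abs(mz_b - mz_a)
    for z in range(1, max_charge + 1):
        # For charge z, neighbouring isotopic peaks are ISOTOPE_SPACING / z apart.
        if abs(delta - ISOTOPE_SPACING / z) <= tol:
            return z
    return None


def annotate_charges(peaks, max_charge=6, tol=0.01):
    """Attach a charge (or None) to each (m/z, intensity) peak.

    `peaks` must be sorted by m/z. A peak receives charge z when a later
    peak sits ISOTOPE_SPACING / z above it. The m/z and intensity values
    are passed through unchanged.
    """
    annotated = []
    for i, (mz, intensity) in enumerate(peaks):
        charge = None
        for other_mz, _ in peaks[i + 1:]:
            # No isotopic partner can be farther away than the z = 1 spacing.
            if other_mz - mz > ISOTOPE_SPACING + tol:
                break
            charge = infer_charge(mz, other_mz, max_charge, tol)
            if charge is not None:
                break
        annotated.append((mz, intensity, charge))
    return annotated


# Hypothetical peak list: a doubly charged isotope cluster plus an isolated peak.
peaks = [(500.000, 100.0), (500.502, 60.0), (501.003, 20.0), (750.000, 5.0)]
result = annotate_charges(peaks)
```

In this sketch the last peak of a cluster and any isolated peak are left without a charge, because only higher-m/z partners are examined; a real converter would need a more careful treatment of overlapping clusters and noise.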
(4) Spectral library searching and disease-centric spectral libraries

Spectral library searching has recently emerged as an alternative to conventional database sequence searching. Estimating the false discovery rate in spectral library searching requires a decoy spectral library. We have developed a fast, simple method, called PSDG, to generate good decoy libraries. Furthermore, we are constructing disease-centric spectral libraries to facilitate biomedical research.
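The role of a decoy library in false discovery rate (FDR) estimation can be sketched as follows. This is the common target-decoy estimate, not PSDG itself, which generates the decoy spectra. It assumes that the decoy library has the same size as the target library, that higher scores mean better matches, and that the score lists shown are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    Uses the common estimate (number of decoy hits) / (number of target hits).
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest target score whose estimated FDR is <= max_fdr, or None."""
    best = None
    for t in sorted(set(target_scores), reverse=True):
        if estimate_fdr(target_scores, decoy_scores, t) <= max_fdr:
            best = t
    return best


# Hypothetical match scores from searching a target and a decoy library.
target = [0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.40]
decoy = [0.55, 0.45, 0.30]

fdr_at_half = estimate_fdr(target, decoy, 0.5)         # 1 decoy hit / 6 target hits
cutoff = threshold_for_fdr(target, decoy, max_fdr=0.01)
```

The estimate is only as good as the decoys: if decoy spectra are too easy to tell apart from real ones, few decoys score highly and the FDR is underestimated, which is why decoy quality matters.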

2. Mass Spectrometry-based metabolomics

(1) Metabolite quantitation from high-throughput MS1 data
Since only a few tools are available for metabolite quantitation, we have developed an automated metabolite quantitation tool, called Metab-Q, which provides highly accurate quantitation.
(2) Metabolite identification from high-throughput MS1 data

We have proposed a computational method for metabolite identification, called Metab-ID, which uses an effective clustering step to group a metabolite with its fragments and then searches against different metabolite databases. The proposed method can identify metabolites with high sensitivity and accuracy.
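The database-search step can be pictured as matching an observed m/z against candidate entries within a mass tolerance expressed in parts per million (ppm). The sketch below shows only that generic matching idea and is not the Metab-ID algorithm; the clustering step is omitted, and the database entries, names, and tolerance are hypothetical.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error of an observation, in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_candidates(observed_mz, database, tol_ppm=10.0):
    """Return (name, ppm error) pairs within tolerance, closest match first."""
    hits = []
    for name, theoretical_mz in database.items():
        err = ppm_error(observed_mz, theoretical_mz)
        if abs(err) <= tol_ppm:
            hits.append((name, err))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Hypothetical database: placeholder metabolite names mapped to ion m/z values.
database = {
    "metabolite_A": 181.0707,
    "metabolite_B": 181.0720,
    "metabolite_C": 205.0972,
}

hits = match_candidates(181.0709, database, tol_ppm=10.0)
strict_hits = match_candidates(181.0709, database, tol_ppm=5.0)
```

A wide tolerance returns several candidates for one peak, so additional evidence, such as grouping a metabolite with its fragments, is needed to choose among them.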

3. Post-translational modification analysis
We are interested in improving the identification of modified proteins. First, we worked on marker ion detection in high-throughput tandem mass spectra of modified proteins, e.g., nitrosylated proteins. Second, since glycosylation is considered the most important post-translational modification (PTM) and the analysis of MS/MS data acquired from glycoproteomics experiments is challenging, we have recently proposed novel methods for glycoprotein identification and implemented them in an automated tool called MAGIC.

4. Membrane protein structure prediction and signal peptide prediction

We work on structure prediction particularly for transmembrane (TM) proteins, since membrane proteins are prominent drug targets and TM proteins are a major type of membrane protein.

(1) Transmembrane helix and topology prediction
We have proposed a method, called SVMtop, based on machine-learning models in a two-stage hierarchical framework.
(2) Transmembrane helix-helix prediction
After determining membrane protein topology, we developed a method for predicting TM helix-helix interactions and contacts, called TMhit, described in the second paper on this topic to appear in the literature.
(3) Structural database for helix-packing folds in transmembrane proteins
We have closely examined the physical constraints governing helix packing (i.e., crossing angles, closest point of contact, etc.) and constructed a database, called TMPad, of all known helix-helix interactions in currently available structures. TMPad serves as a potential template library for reconstructing a tertiary structure model.
(4) Prediction of solvent and lipid accessibility
From the perspective of structural modeling, the rotational preference of a TM helix is a strong determinant of which of its faces interact with the rest of the protein structure and with the lipids. We presented a method based on support vector machines to classify each residue in a protein as buried or exposed, and a method based on support vector regression to predict solvent and lipid accessibility. Based on the predicted lipid accessibility, we can predict the rotational angle of a TM helix.
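One simple way to see how per-residue lipid accessibility can imply a rotational angle is to treat the helix as ideal (100 degrees of rotation per residue) and take the direction of the accessibility-weighted vector sum, in the spirit of a helical moment. This is an illustrative sketch under that assumption, not the published method; the accessibility values and the function name are hypothetical.

```python
import math

# Ideal alpha-helix: 3.6 residues per turn, i.e. 100 degrees per residue.
DEGREES_PER_RESIDUE = 100.0


def lipid_facing_angle(accessibility):
    """Return (angle in degrees, magnitude) of the accessibility-weighted moment.

    Residue i is placed at angle 100 * i degrees around the helix axis and
    weighted by its predicted lipid accessibility. The direction of the
    summed vector estimates which face of the helix points toward the lipids.
    """
    x = y = 0.0
    for i, value in enumerate(accessibility):
        theta = math.radians(DEGREES_PER_RESIDUE * i)
        x += value * math.cos(theta)
        y += value * math.sin(theta)
    angle = math.degrees(math.atan2(y, x)) % 360.0
    return angle, math.hypot(x, y)


# Hypothetical predictions: residues 0, 4, and 7 lie on roughly the same face.
accessibility = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
angle, magnitude = lipid_facing_angle(accessibility)
```

A small magnitude means the accessible residues are spread evenly around the helix, in which case the estimated angle carries little information.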
(5) Signal peptide prediction

Secreted proteins are an important class of biomarker candidates. A signal peptide is a short sequence at the N-terminus of a protein that affects protein secretion. Although there are several secretion mechanisms and signal peptides are not responsible for all secretion, predicting the signal peptide in a protein is currently the only convenient way to identify secreted proteins. However, signal peptides are highly complex functional sequences that are easily confused with transmembrane domains, and such confusion hampers the discovery of both secreted proteins and transmembrane proteins. Therefore, we developed a generic prediction method based on machine learning, called SVMSignal, which learns the structures of signal peptides directly from features obtained with a novel encoding that captures biochemical profile patterns. SVMSignal achieved good performance in comparison with many other methods.

5. Disease-centric human proteome database

We are currently developing a human proteome database that contains, in particular, comprehensive information on the human membrane proteome. Using this database, we have joined researchers in Taiwan to work on human chromosome 4 in the Chromosome-centric Human Proteome Project (c-HPP), an international project organized by the Human Proteome Organization. At the current stage, the main theme of c-HPP is to detect missing proteins. Drawing on our bioinformatics expertise, we determined a list of missing proteins on chromosome 4 for our collaborators to detect experimentally.

6. Protein function and subcellular localization predictions

Since determining protein subcellular localization (PSL) sites through wet-lab experiments is labor intensive and time consuming, we have developed a computational approach, called UniLoc, that serves as a universal predictor for proteins regardless of organism. UniLoc uses natural language processing techniques to define protein synonyms. A protein synonym is a peptide of n amino acids that indicates a possible sequence variation in the evolution of a protein. UniLoc is built on a proteome-scale database and includes localization sites in prokaryotic and eukaryotic organisms. It can efficiently distinguish between single- and multi-localized proteins and predict localizations with high precision and recall, outperforming most existing predictors. Furthermore, UniLoc can interpret a prediction by reporting the template sequences identified in the database.
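The notion of treating a protein as a sequence of n-residue peptides, much as text is treated as a sequence of words, can be sketched as follows. The code only enumerates overlapping peptides of length n and measures how many a query shares with a template; how UniLoc actually defines and scores synonyms is not described here, so the functions, the choice of n, and the sequences are all assumptions.

```python
def peptide_ngrams(sequence, n=3):
    """Return all overlapping peptides of n amino acids in a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def shared_ngram_fraction(query, template, n=3):
    """Fraction of the query's distinct n-residue peptides found in the template."""
    query_peptides = set(peptide_ngrams(query, n))
    if not query_peptides:
        return 0.0
    template_peptides = set(peptide_ngrams(template, n))
    return len(query_peptides & template_peptides) / len(query_peptides)


# Hypothetical sequences, not taken from any database.
query = "MKTLLV"
template = "AKTLLG"

peptides = peptide_ngrams(query, n=3)
overlap = shared_ngram_fraction(query, template, n=3)
```

Template sequences that share many such peptides with a query are the kind of evidence that can be shown to a user to explain a prediction.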