A machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome

Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes such as cellular communication, ligand recognition, and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated.

We present GlycoMine, a comprehensive bioinformatics tool for the systematic in silico identification of C-, N- and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated on a well-prepared, up-to-date benchmark dataset, curated from multiple public resources, that encompasses all three types of glycosylation sites.

Please input a sequence in the FASTA format (uncommon amino acids including B, J, O, U, X and Z are not acceptable)


You can also run predictions locally with our standalone tool

The local version of GlycoMine is provided here. GlycoMine is written in Java, so please make sure the most up-to-date version of Java is installed on your computer.

Please type "javaws launch.jnlp" in a console (terminal) to run the local version of GlycoMine.

The benchmark datasets for C-, N- and O-linked glycosylation sites are available for download here

We have investigated the C-, N- and O-linked glycosylation sites in the human proteome (84843 proteins) with GlycoMine.

The result files can be downloaded here

It is easy and straightforward to use the GlycoMine server to make a prediction.

1. Fill in the text area with a protein sequence in the FASTA format and select the corresponding glycosylation type model.

Note: Since GlycoMine runs on a shared server, resources for large-scale or batch computations are limited. Therefore, at most 5 sequences are allowed per submission. Contact us if you need large-scale computations and we will be happy to help. Because GlycoMine extracts functional annotations from the UniProt database, a UniProt ID for the query sequence is required. If no UniProt ID is available for your sequence, you can replace the ID with "XXXXXX" or any other six characters. Acceptable examples (with and without a UniProt ID) are shown as follows:

When a placeholder ID is used, functional features and functional annotations are not extracted from UniProt and are not used by the models. Please make sure that the IDs of submitted sequences are not identical, as reusing the same ID across different submitted jobs may confound the prediction outputs.
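The submission rules above can be checked locally before pasting sequences into the server. The following is a minimal sketch, not part of GlycoMine itself: the function names (parse_fasta, check_submission) are hypothetical, the whole header line after ">" is assumed to be the sequence ID, and "acceptable" residues are assumed to be the 20 standard amino acids, which is consistent with B, J, O, U, X and Z being rejected.

```python
MAX_SEQUENCES = 5                        # per-submission limit stated in the note
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY")    # assumption: only the 20 standard residues


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    records = parse_fasta(text)
    if not records:
        problems.append("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        problems.append(f"too many sequences: {len(records)} > {MAX_SEQUENCES}")
    seen = set()
    for header, seq in records:
        # Assumption: the full header line is the ID and must be unique.
        if header in seen:
            problems.append(f"duplicate ID: {header}")
        seen.add(header)
        if not seq:
            problems.append(f"{header}: empty sequence")
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            problems.append(f"{header}: unsupported residues {', '.join(bad)}")
    return problems
```

An empty return value only means the input passed these local checks; the server may apply further rules that are not described on this page.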

2. Please wait patiently for the prediction result to be returned by our server. Each sequence may take 5 or more minutes. The prediction results are shown as follows. The central residue in 'Adjacent residues' is the predicted glycosylation site.

The thresholds for C-linked, N-linked and O-linked glycosylation are 0.555, 0.500 and 0.502 respectively.
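To illustrate how these thresholds could be applied when post-processing downloaded results, here is a small sketch. It assumes that a site is called positive when its score is greater than or equal to the threshold (the page does not say whether the comparison is strict) and that the 'Adjacent residues' window has odd length, so that it has a single central residue. The function names are hypothetical.

```python
# Thresholds quoted above, keyed by glycosylation type.
THRESHOLDS = {"C": 0.555, "N": 0.500, "O": 0.502}


def is_predicted_site(glyco_type, score):
    """Return True if the score reaches the threshold for this type.

    Assumption: the comparison is inclusive (score >= threshold).
    """
    return score >= THRESHOLDS[glyco_type]


def central_residue(window):
    """Return the middle residue of an odd-length 'Adjacent residues' window."""
    if len(window) % 2 == 0:
        raise ValueError("window must have odd length to have a central residue")
    return window[len(window) // 2]
```

For example, under these assumptions an N-linked candidate with a score of 0.50 would be reported as a site, while an O-linked candidate with a score of 0.501 would not.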

Importantly, if you find GlycoMine or the related datasets/resources useful for, or complementary to, your research work, please cite the following GlycoMine paper:

Li F, Li C, Wang M, Webb GI, Zhang Y, Whisstock JC, Song J. GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics, 2015, 31(9):1411-1419. doi:10.1093/bioinformatics/btu852



If you have any questions, please do not hesitate to contact us.

Jiangning Song, Ph.D.
Group leader
Biomedicine Discovery Institute (BDI)
Department of Biochemistry and Molecular Biology
Faculty of Medicine, Nursing and Health Sciences
Monash University, Clayton campus, Melbourne, VIC 3800, Australia
Tel: +61-3-9902 9304

If you are interested in our other works in the fields of bioinformatics and systems biology, please refer to the following websites for more information: