GO analysis


To bring order to the chaos caused by the exponential increase in the volume of functional genomics data, the need arose for an ontology: a structured vocabulary of known biological information at different levels of granularity.

According to Gruber [1], ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing. Ashburner et al. [2] pioneered the creation of the Gene Ontology (GO) project, which aims to capture the increasing knowledge on gene function in a controlled vocabulary applicable to all organisms. Many such structured vocabularies are currently used to represent biological entities and functions (for an analytical list of ontologies, please refer to http://anil.cchmc.org/Bio-Ontologies.html). Each, however, is specialized in a certain field of biological science, e.g. the Protein Ontology, anatomy ontologies, the multiple alignment ontology and others.

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. The use of GO terms by several collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so they can be queried at different levels: for example, using GO to find all the gene products in the mouse genome that are involved in signal transduction, or zooming in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product. (Source: Gene Ontology Consortium general FAQ)
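The idea of querying at different levels can be illustrated with a minimal Python sketch. The term names, parent links and gene names below are simplified, hypothetical stand-ins rather than actual GO content. The point is only that an annotation to a specific term also counts for its more general ancestors.

```python
# Toy hierarchy: each term maps to its more general parent terms.
# Hypothetical, simplified names; not copied from the real GO graph.
PARENTS = {
    "receptor tyrosine kinase signaling": ["signal transduction"],
    "signal transduction": ["biological process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}

# Hypothetical gene products, each annotated at the level that is known.
ANNOTATIONS = {
    "GeneA": ["receptor tyrosine kinase signaling"],
    "GeneB": ["signal transduction"],
    "GeneC": ["metabolic process"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def genes_for(term, annotations, parents):
    """Gene products annotated to the term or to any of its descendants."""
    return sorted(
        gene
        for gene, terms in annotations.items()
        if any(term in ancestors(t, parents) for t in terms)
    )


# Broad query: everything involved in signal transduction.
print(genes_for("signal transduction", ANNOTATIONS, PARENTS))
# Zooming in: only the receptor tyrosine kinase level.
print(genes_for("receptor tyrosine kinase signaling", ANNOTATIONS, PARENTS))
```

The broad query returns GeneA and GeneB, while the narrower one returns only GeneA, which mirrors the "zooming in" described in the FAQ text.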

Tools developed for the analysis of GO term enrichment can be found at http://www.geneontology.org/GO.tools.shtml

In summary, according to a presentation by Emily Dimmer (GOA group, EBI, Cambridge, UK), GO terms can be used to:

  - Access gene product functional information
  - Provide a link between biological knowledge and …
      - gene expression profiles
      - proteomics data
  - Find how much of a proteome is involved in a process/function/component in the cell
      - using a GO-Slim (a slimmed-down version of GO to summarize biological attributes of a proteome)
  - Map GO terms and incorporate manual GOA annotation into own databases
      - to enhance the given dataset
      - or to validate automated ways of deriving information about gene function (text-mining).
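As an illustration of the GO-Slim use case, the following Python sketch maps detailed annotations up to a small set of high-level terms and reports what fraction of a proteome falls under each. All term names, the slim set and the protein identifiers are hypothetical, and a real analysis would rely on an official slim file and dedicated tooling rather than this toy mapping.

```python
from collections import Counter

# Hypothetical parent links and slim set (not actual GO terms).
PARENT_LINKS = {
    "kinase cascade": ["signaling"],
    "glycolysis": ["metabolism"],
    "signaling": [],
    "metabolism": [],
}
SLIM_TERMS = {"signaling", "metabolism"}

# Hypothetical proteome with detailed annotations.
PROTEOME = {
    "P1": ["kinase cascade"],
    "P2": ["glycolysis"],
    "P3": ["kinase cascade", "glycolysis"],
    "P4": ["signaling"],
}


def map_to_slim(term, parents, slim):
    """Walk up from a term and collect every slim term met on the way."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


def slim_fractions(proteome, parents, slim):
    """Fraction of proteins annotated under each slim term."""
    counts = Counter()
    for terms in proteome.values():
        protein_slim = set()
        for term in terms:
            protein_slim |= map_to_slim(term, parents, slim)
        counts.update(protein_slim)  # each protein counted once per slim term
    total = len(proteome)
    return {term: counts[term] / total for term in sorted(slim)}


print(slim_fractions(PROTEOME, PARENT_LINKS, SLIM_TERMS))
```

A protein annotated under several slim terms is counted once for each, so the fractions need not sum to 1.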

Statistics on GO

The most common statistical tests to assess the enrichment of a GO term in the test set are the hypergeometric test (http://en.wikipedia.org/wiki/Hypergeometric), in which sampling occurs without replacement, and the binomial test (http://en.wikipedia.org/wiki/Binomial_distribution), in which sampling occurs with replacement. The hypergeometric test yields an exact P-value, whereas the binomial test provides only an approximate P-value but requires less calculation time.

The basic question answered by these tests is as follows: when sampling X genes (the test set, i.e. the list of significant genes) out of N genes (the reference set, either a graph or an annotation), what is the probability that x or more of these genes belong to a functional category C shared by n of the N genes in the reference set?
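The question above can be written out directly as a tail probability. The sketch below is a minimal Python illustration using only the standard library. The function names are chosen here for illustration, the arguments follow the symbols N, n, X and x from the text, and the example numbers are invented.

```python
from math import comb


def hypergeom_enrichment_p(N, n, X, x):
    """P(at least x of the X sampled genes are in category C),
    sampling without replacement from N genes, n of which are in C."""
    upper = min(n, X)  # cannot draw more category genes than exist or are sampled
    favourable = sum(comb(n, k) * comb(N - n, X - k) for k in range(x, upper + 1))
    return favourable / comb(N, X)


def binomial_enrichment_p(N, n, X, x):
    """Same tail probability under sampling with replacement,
    i.e. the binomial approximation with success probability n / N."""
    p = n / N
    return sum(comb(X, k) * p**k * (1 - p) ** (X - k) for k in range(x, X + 1))


# Invented example: 10 reference genes, 4 in category C,
# 3 genes in the test set, of which 2 are in C.
print(hypergeom_enrichment_p(10, 4, 3, 2))  # exact: 40/120
print(binomial_enrichment_p(10, 4, 3, 2))   # approximation: 0.352
```

In this small example the two values differ noticeably (about 0.333 versus 0.352). The binomial approximation is generally closer when the test set is small relative to the reference set.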

To address the limitations of the hypergeometric test and obtain more stringent confidence limits on top of its results, the bootstrapping statistical method [3] can be applied. This method uses the same population of appearances, grouped as elements, that is also used by the hypergeometric distribution.

Bootstrapping (http://en.wikipedia.org/wiki/Resampling_(statistics)#Bootstrap) is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient. It may also be used for constructing hypothesis tests. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors.
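As a rough illustration of the resampling idea, the following Python sketch computes a percentile bootstrap confidence interval for the proportion of test-set genes that belong to a category. It is a generic textbook-style sketch, not the specific procedure of [3]. The function names, the number of resamples, the fixed seed and the data are all assumptions made for the example.

```python
import random


def bootstrap_ci(sample, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    size = len(sample)
    estimates = sorted(
        # Resample with replacement, same size as the original sample.
        statistic([sample[rng.randrange(size)] for _ in range(size)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def proportion(values):
    """Fraction of entries equal to 1 in a 0/1 list."""
    return sum(values) / len(values)


# Invented test set: 40 genes, 12 of them annotated to category C (1 = in C).
in_category = [1] * 12 + [0] * 28
low, high = bootstrap_ci(in_category, proportion)
print(f"observed proportion: {proportion(in_category):.2f}")
print(f"95% bootstrap interval: [{low:.3f}, {high:.3f}]")
```

The percentile interval is the simplest bootstrap variant. Other constructions exist and may behave better for small samples or for proportions near 0 or 1.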



References:

  1. Gruber, T. R. (1993) A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199–220.
  2. Ashburner, M. et al. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
  3. Storey, J. D. (2002) A direct approach to false discovery rates. J. Roy. Statist. Soc. B, 64, 479–498.