Gclust databases

Brief summary of the Gclust software
The Gclust software was developed to cluster all predicted protein sequences encoded by a selected set of genomes. The clusters are homolog groups, not ortholog clusters (see below for the distinction), and therefore contain all homologous sequences encoded by the selected genomes. An ortholog cluster, such as those in KEGG Orthology or COG at NCBI, contains only a single sequence for each genome, and this single representative is usually selected by a criterion called "bi-directional best hit". By contrast, a homolog group contains all reliable homologs and thus represents a gene family. However, several techniques (see the section for computational biologists below) are needed to obtain good homolog groups, because simply adding every similar sequence to a group can produce a very large group of unrelated sequences. Gclust arranges the BLASTP E-value and the overlap score (the proportion of homologous regions shared by two sequences) as a two-dimensional matrix and selects the proper E-value and overlap score for each homolog group, so that not too many homologs are put into the group. The number of organisms covered is also considered in this selection. A detailed explanation of the algorithm was presented in the GIW2005 paper.
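
The "bi-directional best hit" criterion mentioned above can be illustrated with a minimal Python sketch. This is not code from Gclust, KEGG or COG; the function names and the (query, target, E-value) tuple layout are assumptions made for illustration only.

```python
def best_hits(hits):
    """Map each query to its best target (lowest E-value).

    hits: iterable of (query, target, evalue) tuples (hypothetical layout).
    """
    best = {}
    for query, target, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(hits_ab, hits_ba):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(hits_ab)   # genome A proteins searched against genome B
    reverse = best_hits(hits_ba)   # genome B proteins searched against genome A
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy data: a1 and a2 are paralogs in genome A, both similar to b1.
hits_ab = [("a1", "b1", 1e-50), ("a1", "b2", 1e-10), ("a2", "b1", 1e-30)]
hits_ba = [("b1", "a1", 1e-50), ("b2", "a1", 1e-10)]
pairs = bidirectional_best_hits(hits_ab, hits_ba)
```

In this toy example only the pair (a1, b1) is reciprocal, so a2 and b2 would be left out of an ortholog cluster, whereas a homolog group in the sense used here would keep all four proteins.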

Gclust databases
The Gclust software can be used with any set of genomes. We have been working on photosynthetic organisms and are interested in finding proteins conserved in prokaryotic and eukaryotic photosynthetic organisms. Therefore, the initial datasets included mainly photosynthetic organisms, with some non-photosynthetic organisms for comparison. The datasets CZ16Y, CZ20x, CZ30 and CZ35 are of this kind and differ in the number of genomes included. The results of the computation are now available to the public through a web interface. The dataset Bact129 includes 132 species of bacteria. The dataset ALL145 includes animals and plants (including algae) as well as many bacteria and Archaea. Organellar genomes are also included, but selection is possible only by organism (that is, the nuclear and mitochondrial genomes, and the chloroplast genome if present, are selected together). For organellar studies, datasets including all available chloroplast genomes (plus cyanobacterial genomes) or many mitochondrial genomes of photosynthetic organisms are also provided on this web site. They are named CPBACT8x and Mt23, respectively.

Uniqueness of the Gclust software
Many researchers use BLASTP to search for homologous sequences in the non-redundant databases. However, the results are usually difficult to interpret, because many similar sequences rank high and an identical sequence is sometimes present as duplicated entries. The Gclust databases are pre-calculated similarity matrices that show all homologs in the selected dataset. Users do not need to perform an iterated BLASTP search.

If you are not satisfied with the provided clusters, ...
it is time to make clusters by yourself. Use the Gclust software to make clusters from a genome set containing your favorite organisms. Detailed usage instructions for the Gclust software and its associated software are currently being prepared.

Use of the Gclust software (for computational biologists)
The Gclust software is written in C and runs on any common UNIX platform, including Mac OS X. The memory requirement depends on the input data and the mode of operation. When all 102,513 predicted proteins encoded in four eukaryotic (including organelles) and 13 prokaryotic genomes were clustered, about 9 GB of memory was used on an SGI Onyx3400. The computation of the ALL145 dataset required more than 2 weeks on the supercomputer system of the Human Genome Center at the University of Tokyo.
    In a typical flow of database construction (see the figure below), a set of genomes is defined as a dataset. The GenBank format files (gbk files) for the selected genomes are retrieved. Then, Perl scripts are used to prepare a protein sequence file and an annotation table for the entire dataset. All sequence manipulations by the scripts internally invoke the SISEQ commands (Sato, N. 2000. Bioinformatics 16: 180-181). The protein sequence file is used for all-against-all BLASTP (version 2.2.12) analysis. The results are parsed by a script that extracts significant homology using an E-value cutoff of 1e-3 (a3 file). The a3 file, the annotation table and a definition of organisms are used as inputs for the Gclust software. Gclust is first run in the 'save' mode to prepare an intermediate file, 'data.out'. A 'tapering' or 'ashikiri' option is provided to remove low-homology data while still keeping low-homology data for short sequences (the cutoff ranges from 1e-6 for >100 aa to 1e-3 for <40 aa). In Gclust, homology data are handled in units called sqlist, each holding one region-to-region similarity, namely, the coordinates of the similar region in both the query and the target protein, and the E-value. Therefore, a pair of proteins may have many sqlist entries, depending on their domain structure and repeats.
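
    The length-dependent cutoff of the tapering option can be sketched as follows. Only the two end points (1e-6 for >100 aa, 1e-3 for <40 aa) come from the description above. The log-linear interpolation between 40 and 100 aa, the use of the shorter of the two sequences, and the function names are assumptions made for illustration; the actual Gclust source may differ.

```python
import math


def taper_cutoff(length, short_len=40, long_len=100,
                 short_cut=1e-3, long_cut=1e-6):
    """Return an E-value cutoff that depends on sequence length.

    End points follow the text; the log-linear interpolation in between
    is an ASSUMPTION, not taken from the Gclust source.
    """
    if length <= short_len:
        return short_cut
    if length >= long_len:
        return long_cut
    frac = (length - short_len) / (long_len - short_len)
    log_short = math.log10(short_cut)
    log_long = math.log10(long_cut)
    return 10 ** (log_short + frac * (log_long - log_short))


def keep_hit(evalue, query_len, target_len):
    """Keep a hit if its E-value passes the cutoff for the shorter sequence
    (choosing the shorter sequence is also an assumption)."""
    return evalue <= taper_cutoff(min(query_len, target_len))
```

    Under these assumptions, a hit with an E-value of 1e-4 is kept when one of the proteins is 30 aa long, but is discarded when both proteins are longer than 100 aa.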

    In the second step, Gclust reads the data.out file and performs clustering according to various options. The most useful option is the -clique option, which produces a good clustering result in a relatively short time (within one day). In the clique mode, the sqlist data are converted to match data, which hold binary (i.e., protein-to-protein) similarity, namely, the best E-value among the sqlist entries, an overlap score giving the total overlap region divided by the total length, and the domain structure estimated from the homology segments. Normally, the clique mode requires org_list data, which list the organisms. For each protein, all match data are tabulated in two dimensions, using E-value and overlap score. Match data are selected one by one starting from the corner with the highest E-value and highest overlap score. Various criteria are applied, but essentially, a clearly defined cluster of match data with respect to E-value and overlap score is selected. In addition, match data are selected to cover as many organisms as possible, but without picking up data of very low similarity. After this purification of match data, an idlist holding the IDs of homologs is made for each protein. The threshold E-value and overlap score are also stored. Then, homolog clusters are formed by merging individual idlists. At this stage, idlists with very different threshold E-values are not merged. After repeated merging and removing, isolated proteins generated by the removal step are incorporated again into the most appropriate cluster. Homolog groups are sorted according to the number of entries. Finally, the homolog groups are printed out to a large file as a concatenated similarity matrix. The matrix may be expressed as 1 (similar) or 0 (dissimilar), as E-value, or as overlap score, depending on the output options 1, r, or s, respectively.
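
    The conversion from region-to-region sqlist data to protein-to-protein match data can be sketched as follows. The Segment class and function names are hypothetical stand-ins, not the Gclust data structures, and the overlap score is computed under the assumption that it is the number of residues covered by homology regions in both proteins divided by the sum of the two sequence lengths.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one sqlist entry (1-based, inclusive)."""
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    evalue: float


def covered_length(intervals):
    """Count residues covered by a set of intervals, counting overlaps once."""
    total = 0
    current_end = 0
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully inside a region already counted
        total += end - max(start, current_end + 1) + 1
        current_end = end
    return total


def to_match(segments, q_len, t_len):
    """Summarise all segments of one protein pair as (best E-value, overlap score)."""
    best_evalue = min(s.evalue for s in segments)
    q_cov = covered_length([(s.q_start, s.q_end) for s in segments])
    t_cov = covered_length([(s.t_start, s.t_end) for s in segments])
    overlap = (q_cov + t_cov) / (q_len + t_len)  # assumed definition
    return best_evalue, overlap


segments = [Segment(1, 50, 1, 50, 1e-20), Segment(40, 80, 60, 100, 1e-5)]
best_evalue, overlap = to_match(segments, q_len=100, t_len=100)
```

    In this toy pair, the two query regions overlap and cover 80 residues, and the two target regions cover 91 residues, so the overlap score is 171/200.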

Using the Perl script homologtableG4.pl, the homology matrix can be transformed into a table showing the members of each homolog group.

Then, the tbsort2 software (written in C) is used to select homolog groups that are conserved in a selected set of organisms. We call this "phylogenetic profiling". It may be useful for extracting conserved proteins of unknown function that might be involved in a pathway or process shared by the set of organisms. We apply this principle to extract "chloroplast proteins of endosymbiont origin", or CPRENDOs, but other uses of phylogenetic profiling are also possible.
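
The idea of phylogenetic profiling can be sketched in a few lines of Python. This is only an illustration of the principle, not a reimplementation of tbsort2: the selection rule (present in every required organism, absent from every excluded organism), the function name and the organism labels are assumptions.

```python
def profile_select(groups, required, excluded=()):
    """Return IDs of homolog groups matching a presence/absence pattern.

    groups:   dict mapping group ID -> iterable of organism names that
              have at least one member in the group (hypothetical layout).
    required: organisms that must all be present.
    excluded: organisms that must all be absent.
    """
    required = set(required)
    excluded = set(excluded)
    selected = []
    for group_id, organisms in groups.items():
        present = set(organisms)
        if required <= present and not (present & excluded):
            selected.append(group_id)
    return selected


# Toy table with hypothetical organism labels.
groups = {
    1: ["cyanobacterium", "plant", "red_alga"],
    2: ["plant", "yeast"],
    3: ["cyanobacterium", "plant", "red_alga", "yeast"],
}
candidates = profile_select(
    groups,
    required=["cyanobacterium", "plant", "red_alga"],
    excluded=["yeast"],
)
```

With this toy table only group 1 is selected, because it is shared by all photosynthetic organisms in the example and absent from the non-photosynthetic one.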

-- additional old explanation --

In the basic mode with the -hom option, single-linkage clustering is performed with an E-value as a threshold. In this case, all homologs that are linked by a chain of homology relationships are placed in a single cluster. Such clusters are used as discrete characters to make a parsimony tree (using the PAUP software) that we call a 'genome tree'. With the -repeat option, iterated clustering is performed by changing the threshold E-value from the lowest (such as 1e-50) to the highest (such as 1e-3). During the iteration, an abrupt increase in the number of members of a cluster is taken as a sign that an unnatural cluster, including distantly related or multidomain proteins, has formed. An additional criterion is the overlap score, which is defined as the sum of the lengths of the homology regions over the entire sequences divided by the sum of the lengths of the two sequences being compared. Another criterion is the complexity of domain structure, which is estimated from the BLASTP data and is used to eliminate multidomain proteins during the initial iteration. By these criteria, concise or natural clusters are extracted and removed from further clustering at higher E-values. In an additional mode with the -homsub option, the final clusters are further sub-clustered to maximize similarity within each subgroup.
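
Single-linkage clustering with a single E-value threshold can be sketched with a union-find structure. This is a generic illustration of the technique, not the Gclust implementation; the function name and the (protein, protein, E-value) input layout are assumptions.

```python
def single_linkage(proteins, hits, threshold):
    """Cluster proteins so that any hit with E-value <= threshold links two proteins.

    proteins: list of protein IDs.
    hits:     iterable of (protein_a, protein_b, evalue) tuples (hypothetical layout).
    Returns a list of sets, largest cluster first.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        if evalue <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=len, reverse=True)


proteins = ["A", "B", "C", "D"]
hits = [("A", "B", 1e-30), ("B", "C", 1e-25), ("C", "D", 1e-5)]
strict = single_linkage(proteins, hits, threshold=1e-20)
loose = single_linkage(proteins, hits, threshold=1e-3)
```

With the strict threshold, A, B and C form one cluster even though A and C have no direct hit, and D remains alone. With the loose threshold, the weak C-D hit pulls everything into one cluster, which is the kind of abrupt growth that the -repeat option is described as watching for.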

Example homolog group

[Image: example homolog group]

Example alignment

Alignment image is about 250 kb.

Attention! Good targets and unsuited proteins

It is essential that you recognize what you are looking for in the Gclust database. The Gclust database consists of clusters of homologous proteins. Some proteins belong to large protein families, while others are orphans. Some proteins are well characterized by experiments, while others are still hypothetical. The author of Gclust originally aimed at recognizing hypothetical proteins that are conserved across different phyla. Therefore, a recommended use of the Gclust database is to find conserved hypothetical proteins. Another, more trivial, use is to collect all possible homologs in order to construct a phylogenetic tree. The database is not intended for finding homologs of transcription factors and kinases. The simple desire to find a homolog of a functionally important molecule may be met by a sophisticated phylogenetic analysis of all possible homologs. In the Gclust databases, some clusters containing large protein families are very large and are not well resolved. The ten largest clusters include DNA-binding proteins, RNA-binding proteins, serine/threonine kinases, histidine kinases, response regulators, and components of ABC transporters. I agree that these are important functional molecules in biological systems, but their functional classification is not easy.
There are several reasons why these proteins are not suited to the Gclust database.
First, structurally similar proteins are clustered in the Gclust software using the results of BLASTP. In large clusters consisting of similar sequences, a more rigorous phylogenetic analysis is necessary to classify homologs correctly. The clusters in the Gclust database may not correctly reflect phylogenetic clusters. Second, various transcription factors and kinases contain additional functional domains. In the Gclust algorithm, multidomain proteins are often separated. However, many biologists want to obtain transcription factors having a similar DNA-binding domain, disregarding additional domains. In this case, Pfam analysis may be more helpful. Finally, sequence similarity and functional relatedness are different things. Proteins with similar sequences may be involved in different cellular functions or pathways. Therefore, a single cluster of ABC transporters contains various transporters that transport different molecules. Many biologists are disappointed to find such a situation. However, transporters are similar to one another even if they transport different molecules. Structural similarity arising from phylogenetic relationship may be more apparent than similarity of substrate-binding sites. In this case, the Gclust clusters do not correspond to a functional classification of transporters.
Please keep these points in mind when using the Gclust database.


N. Sato (2009)
Gclust: trans-kingdom classification of proteins using automatic individual threshold setting.
Bioinformatics (on-line access) doi: 10.1093/bioinformatics/btp047.

N. Sato, M. Ishikawa, M. Fujiwara and K. Sonoike (2005)
Mass identification of chloroplast proteins of endosymbiont origin by phylogenetic profiling based on organism-optimized homologous protein groups.
Genome Informatics 16: 56-68.

N. Sato (2003)
Gclust: genome-wide clustering of protein sequences for identification of photosynthesis-related genes resulting from massive horizontal gene transfer.
Genome Informatics 14: 585-586.

N. Sato (2002)
Comparative analysis of the genomes of cyanobacteria and plants.
Genome Informatics 13: 173-182.

Data sources

GenBank Databases: NCBI
Unfinished Genome Data in JGI: JGI
Cyanidioschyzon merolae Genomic Data: CGP

Distribution of data and software
All downloads should be done from the Gclust Download page.
1. Gclust software is available as the source code for UNIX.
The software is distributed for academic use.
The copyright is kept by N. Sato.
Re-distribution of the software is not allowed without permission.
If you agree, you may download the software from the links below.
If you download the software, you are automatically assumed to agree to these conditions.
SISEQ commands are used in many of the scripts. Install the SISEQ package before using these scripts.
The SISEQ package is available from http://nsato4.c.u-tokyo.ac.jp/old/Siseq.html.
Please read the description in the upper part of this document for the flow of data processing.

2. Example of data processing. Note that this is an old version. See the latest version on the download page.
(1) GenBank file (test.gbk)  --->  test.fa and test.p.table (SISEQ is needed)
    gbk2ptable.pl test.gbk AB0012345 test
(2) ---> test.gfa and test.g.table
    gclustsort4.pl test
(3) ---> test.pin, test.psq, test.phr
    formatdb -i test.gfa -n test
(4) BLASTP (You should know how to use blastall.)
    blastall -FF -i test.gfa -d test -p blastp -e 0.01 | bl2ls3.pl - 1e-3 > testa3
       Alternatively, save the BLASTP output to a file and parse it in a second step:
    blastall -FF -i test.gfa -d test -p blastp -e 0.01 -o test.result
    bl2ls2.pl test.result 1e-3 > testa3
(5) Clustering with Gclust
    gclust testa3 -save -tab=test.g.table -taper
       This creates a file data.out.
    gclust -read=data.out -hom -thr=1e-20 -out=1
       This is a simple clustering using a single cutoff value.
       You can use various options ...
    gclust -read=data.out -hom -clique -org -regroup2 -out=1rs
       This creates three files: testa3.hom.1, testa3.hom.r and testa3.hom.s.
       You also need org_file, which defines the organisms. The names of the organisms must be determined in step (1).
(6) Table
    homologtableG5.pl testa3.hom.1 prefixTEST > test.tbl
    Here, you need a file prefixTEST, which describes the names of organisms.
(7) Phylogenetic profiling
    tbsort2 test.tbl 12345 out grp_def pat_def 1
    'out' is the name of the output file. 12345 is the number of clusters in test.tbl (see its last line). You also need the files grp_def and pat_def.
(8) Phylogenetic tree (SISEQ is needed)
    getgrp.pl testa3.hom.1 list > list.hom
       The list file contains cluster numbers, one per line.
    makefa3b.pl list test.fa
       This creates a directory 'seqs', and multiple FASTA files are created therein.
    Then, you may use any alignment software (such as clustalw or muscle) to create an alignment for each FASTA file.
    Finally, you can construct a phylogenetic tree using any software you like.

Copyright © 2006-8 Sato Lab. All Rights Reserved.
Last update: Jan. 22, 2009.
