Gclust Download

All data and software are available free of charge to academic users.
Please respect the rights of the developers.
Citation:
[1] N. Sato, M. Ishikawa, M. Fujiwara and K. Sonoike (2005)
Mass identification of chloroplast proteins of endosymbiont origin by phylogenetic profiling based on organism-optimized homologous protein groups. Genome Informatics 16: 56-68.
[2] N. Sato (2009)
Gclust: trans-kingdom classification of proteins using automatic individual threshold setting.
Bioinformatics 25: 599-605.

The software and test data for this paper are available. See 'For DIY people' below. (DIY stands for do it yourself)


Dataset source files
These are very large files archived by tar and compressed by gzip.

Name        Size   Description
Gclust2010  674 M  169 species (Plants, algae, cyanobacteria, bacteria, unicellular eukaryotes, animals)
ALL95       349 M  95 species (Plants, cyanobacteria, bacteria, unicellular eukaryotes, human, C. elegans)
NP28        410 M  28 species (19 animals plus representative bacteria and plants)
CZ36B       75 M   36 species (plants, cyanobacteria, bacteria)
COG2006B    43 M   COG protein data analyzed by Gclust
CZ16Y       37 M   18 species (Arabidopsis, Cyanidioschyzon, cyanobacteria, some bacteria)
org_list    17 k   List of organisms


Selected data
Single organism data for the ALL95 dataset in tab-delimited text. This part will be updated shortly.
Organism
Arabidopsis thaliana
Rice
Moss
Synechocystis sp. PCC 6803
Anabaena sp. PCC 7120


For DIY people
If you want to prepare homolog groups yourself with your favorite dataset, you need the following software.
Please download the software (Gclust, Tbsort, SISEQ) and test data to understand how the data are processed.
A detailed instruction manual is being prepared and will be available soon after the publication of the paper.
All processing should be done on a UNIX (or MacOS X) computer by typing commands one by one.
There is no GUI (graphical user interface).

Test Data
Test data of CZ36 is available for download. This includes the complete data that are used to prepare homolog clusters.
The files are tar archives compressed by gzip. They can be unpacked with the 'tar zxvf' command or with other archiver software.
All operations should be done on X term (X window on Linux) or Terminal (MacOS X).
Windows machines can be used if Cygwin is installed.
To process the CZ36 test data, you need more than 2 GB memory and a recent CPU as described for the compilation of Gclust software.
TestData.tgz : complete data including output files.
TestDataCore.tgz : only the source files to test.

Read the 00Readme.txt and procedure.txt files before starting.
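
Unpacking the test data from the command line might look like the sketch below. The helper name unpack_archive is hypothetical (it is not part of the Gclust package), and nothing is assumed about the layout of the unpacked files. The helper first lists the archive to check that the download is intact, then extracts it with the 'tar zxvf' command mentioned above.

```shell
#!/bin/sh
# unpack_archive ARCHIVE
# Hypothetical helper, not part of the Gclust package.
unpack_archive() {
    archive=$1
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    # Listing the contents fails if the download is truncated or corrupt.
    if ! tar tzf "$archive" > /dev/null; then
        echo "archive is damaged: $archive" >&2
        return 1
    fi
    # z = gunzip, x = extract, v = name each file, f = read from this file.
    tar zxvf "$archive"
}

# Unpack the core test data only if it has been downloaded here.
if [ -f TestDataCore.tgz ]; then
    unpack_archive TestDataCore.tgz
fi
```

The same call works for TestData.tgz if you also want the output files for comparison.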

Gclust software
Source code in C is available as an archive.
Current version is 3.5.5z3. Parallel version 3.5.6 will be made available soon.
It can be compiled with GCC on any UNIX platform (or on Mac OS X with Developer Tools).
First, edit the makefile to set the compiler and compiler options. On common Linux machines, use GCC.
The simplest settings are as follows:
CC = gcc
OPTION = -O
Unnecessary lines should be commented out by putting a hash sign (#) at the beginning of the line.
Then, type 'make' to start compiling. Many warnings will appear, but the compilation should complete in a minute or two.
The executable is named 'gclust'.

The Intel compiler (ICC) can now be used as well. In that case, set the correct path for the compiler (in /etc/csh.cshrc) and use the following settings:
CC = icc
OPTIONS = -xT -fast (for x86_64 architecture, such as Xeon or Core2, not Itanium)

For MacOS X (G5, Core or Core2 processors), use gcc (= cc):
CC = cc
OPTIONS = -fast
For G4 processor, OPTIONS = -O

Currently, the software is being parallelized with OpenMP.
The core algorithm of gclust (named EOOC) runs faster with 8 or more processors.
A parallel version (3.5.6) will be available soon.

Tbsort software
Source code in C is available as an archive.
Compile the software just as described above.

Scripts
These are scripts (written in Perl) that are used to prepare data for the Gclust software from GenBank (or RefSeq) files, or for post-processing.
All scripts are run from the command line.
To quickly learn the usage, just type the command without arguments.
In the examples given below, 'test' is the name of a test file. Replace it with your favorite file name if you process your files.

Name               Usage (see below)
gbk2ptable.pl      gbk2ptable.pl test.gbk AB0012345 ORG
gclustsort4.pl     gclustsort4.pl test
homologtableG5.pl  homologtableG5.pl test.m8.hom.1 prefixTEST test
ptable2prefix.pl   ptable2prefix.pl test.p.table
getgrp.pl          getgrp.pl test.m8.hom.1 list > list.hom


Usage
(1) GenBank file (test.gbk)  --->  ORG.fa and ORG.p.table (SISEQ is needed. SISEQ is available from our web site.)
    gbk2ptable.pl test.gbk AB0012345 ORG
        (if AB0012345 is the Locus name)
       ('ORG' should be an ID for the organism)
(2) Concatenate the *.fa and *.p.table files to prepare your dataset
    Use the 'cat' command. The dataset files are test.fa and test.p.table.
(3) ---> test.gfa and test.g.table
    gclustsort4.pl test
(4) ---> test.pin, test.psq, test.phr
    formatdb -i test.gfa -n test
       (You may obtain formatdb and blastall from NCBI)
(5) BLASTP (You should know how to use blastall. You may split your data into small parts or use multiple threads of blastall.)
    blastall -FF -i test.gfa -d test -p blastp -e 0.001 -m8 -o test.m8
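
Steps (2) and (5) above can be sketched in shell as follows. This is an illustration under stated assumptions, not the procedure shipped with the package: join_dataset and split_fasta are hypothetical helpers, the input is assumed to be ordinary FASTA text, and the pieces are cut only at header lines (lines beginning with '>') so that no sequence is divided.

```shell
#!/bin/sh
# join_dataset NAME ORG...
# Step (2): concatenate ORG.fa and ORG.p.table for each organism ID
# into NAME.fa and NAME.p.table. Hypothetical helper.
join_dataset() {
    name=$1
    shift
    : > "$name.fa"
    : > "$name.p.table"
    for org in "$@"; do
        cat "$org.fa" >> "$name.fa" || return 1
        cat "$org.p.table" >> "$name.p.table" || return 1
    done
}

# split_fasta FILE N
# Step (5): write FILE.part0001, FILE.part0002, ... with at most N
# sequences each, so that BLASTP can be run on smaller query files.
# Hypothetical helper; the zero-padded numbers keep the parts in the
# original order when they are listed with a shell wildcard.
split_fasta() {
    awk -v n="$2" -v base="$1" '
        /^>/ {
            if (count % n == 0) {
                if (out != "") close(out)
                part++
                out = sprintf("%s.part%04d", base, part)
            }
            count++
        }
        out != "" { print > out }
    ' "$1"
}
```

For example, 'join_dataset test ORG1 ORG2' builds test.fa and test.p.table from two organisms (ORG1 and ORG2 are placeholder IDs), and 'split_fasta test.gfa 5000' cuts the sorted file into pieces of 5000 sequences (an arbitrary size). Each piece can then be given to blastall with -i in place of test.gfa, and the resulting m8 files joined with 'cat' into test.m8.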

To try the Test Data, you should start from here. See also 'procedure.txt' in the package.
(6) Gclust
    gclust test.m8 -save -tab=test.g.table -taper -m8
This creates a file named data.out. (The gclust software now accepts the m8 table format.)
    gclust -read=data.out -hom -thr=1e-20 -out=1
This is a simple clustering using a single cutoff value.
You can use various options ...
    gclust -read=data.out -hom -clique -org -regroup2 -out=1rs
This creates three files: test.m8.hom.1, test.m8.hom.r and test.m8.hom.s. Intermediate results of regrouping are also produced.
You also need org_file, which describes the definitions of organisms. The names of organisms must be determined in step (1).
The last option may instead be '-out=1'. The option '-regroup' defaults to '-regroup5'. The number after 'regroup' indicates the level and the number of repeats of regrouping, and a corresponding number of output files is produced.
(7) Table
    homologtableG5.pl test.m8.hom.1 prefixTEST test
Here, you need a file prefixTEST, which describes the names of organisms.
This will create three files: test.tbl, test.tblA and test.tblS.
The prefix file can be generated from the *.p.table file:
    ptable2prefix.pl test.p.table
(8) TBSORT
This is a small program written in C. You need to compile it from the source code.
    tbsort2 test.tbl 12345 out grp_def pat_def 1
'out' is the name of the output file. 12345 is the number of clusters in test.tbl (see the last line). You also need the files grp_def and pat_def.
(9) Phylogenetic tree (SISEQ is needed)
    getgrp.pl test.m8.hom.1 list > list.hom
The list file contains cluster numbers, one per line.
    getent test.fa out_dir Group=list.hom
This creates a directory 'out_dir', and multiple FASTA files are created therein. You need the SISEQ package.
Then, you may use any alignment software (such as clustalw, muscle etc.) to create an alignment for each FASTA file.
Finally, you can construct a phylogenetic tree using any software you like.
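
The alignment step can be run over every file in out_dir with a short loop. The sketch below makes two assumptions that should be checked against your own output: that each regular file in out_dir is one FASTA file, and that your aligner is wrapped in a command that takes an input path and an output path as its last two arguments. The name align_all and the '.aln' suffix are hypothetical choices, not part of Gclust or SISEQ.

```shell
#!/bin/sh
# align_all DIR COMMAND [ARGS...]
# Run "COMMAND ARGS... input output" once for every file in DIR,
# writing each result next to its input with the suffix .aln.
# Returns non-zero if any alignment fails. Hypothetical helper.
align_all() {
    dir=$1
    shift
    failed=0
    for fasta in "$dir"/*; do
        [ -f "$fasta" ] || continue        # skip subdirectories or an empty DIR
        case $fasta in
            *.aln) continue ;;             # skip results of earlier runs
        esac
        if ! "$@" "$fasta" "$fasta.aln"; then
            echo "alignment failed: $fasta" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

With a small wrapper function or script of your own that calls the chosen alignment program on its first argument and writes to its second, 'align_all out_dir your_wrapper' produces one alignment per cluster.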


Last update: May 31, 2010.
All rights reserved. Naoki Sato 2008-2010.