(1) GenBank file (test.gbk) ---> ORG.fa and ORG.p.table
(SISEQ is needed. SISEQ is available from our web site.)
    gbk2ptable.pl test.gbk AB0012345 ORG
(if AB0012345 is the Locus name)
('ORG' should be an ID for the organism)
(2) Concatenate the *.fa and *.p.table files to prepare your dataset
Use the 'cat' command. The resulting dataset files are test.fa and test.p.table.
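As a minimal sketch of this step, the shell function below appends the per-organism files into one dataset. The function name build_dataset and the organism IDs in the usage note are hypothetical. It assumes that each organism has both an ID.fa and an ID.p.table in the current directory, and that the dataset name differs from every organism ID.

```shell
# Hypothetical helper: concatenate per-organism files into one dataset.
# usage: build_dataset DATASET ORG1 ORG2 ...
build_dataset() {
    name=$1
    shift
    # Start from empty output files so that reruns do not append twice.
    : > "$name.fa"
    : > "$name.p.table"
    for org in "$@"; do
        cat "$org.fa" >> "$name.fa"
        cat "$org.p.table" >> "$name.p.table"
    done
}
```

For example, 'build_dataset test ORG1 ORG2' would write test.fa and test.p.table from ORG1.* and ORG2.*.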
(3) ---> test.gfa and test.g.table
    gclustsort4.pl test
(4) ---> test.pin, test.psq, test.phr
    formatdb -i test.gfa -n test
(You may obtain formatdb and blastall from NCBI)
(5) BLASTP (You should know how to use blastall. You may split your
data into small parts or use multiple threads of blastall.)
    blastall -FF -i test.gfa -d test -p blastp -e 0.001 -m8 -o test.m8
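One way to split the data into small parts is sketched below. The helper names split_fasta and blast_parts and the part-file naming are hypothetical. The blastall options are the ones used in this step, and the sketch assumes a well-formed FASTA file whose first line is a header. With the zero-padded names used here, up to 999 parts are concatenated in their original order.

```shell
# Hypothetical helper: split a FASTA file into parts of at most N sequences.
# usage: split_fasta INPUT.fa N PREFIX   (writes PREFIX.001.fa, PREFIX.002.fa, ...)
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Open a new part file every n sequences.
            if (count % n == 0) {
                if (out != "") close(out)
                part++
                out = sprintf("%s.%03d.fa", prefix, part)
            }
            count++
        }
        { print > out }
    ' "$1"
}

# Hypothetical helper: run the blastall command of this step on each part,
# then join the per-part tables. Needs the database made in step (4).
# usage: blast_parts PREFIX DB OUTPUT.m8
blast_parts() {
    for part in "$1".*.fa; do
        blastall -FF -i "$part" -d "$2" -p blastp -e 0.001 -m8 -o "$part.m8" || return 1
    done
    cat "$1".*.fa.m8 > "$3"
}
```

For example, 'split_fasta test.gfa 1000 test.part' followed by 'blast_parts test.part test test.m8' would produce the same test.m8 file name as the single command above. The loop runs the parts one after another; they could equally be submitted as separate jobs.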
To try the Test Data, you should start from here. See also 'procedure.txt' in the package.
(6) Gclust
    gclust test.m8 -save -tab=test.g.table -taper -m8
This creates a file data.out. (The gclust software now accepts the m8 table.)
    gclust -read=data.out -hom -thr=1e-20 -out=1
This is a simple clustering using a single cutoff value.
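Before choosing a cutoff, it can help to see how many BLAST hits would pass it. The sketch below only inspects the m8 table and is not what gclust does internally. The helper name count_hits_below is hypothetical, and it assumes that the -thr value is compared against the BLAST E-value.

```shell
# Hypothetical helper: count non-self hits at or below an E-value cutoff.
# In BLAST -m8 tables, column 1 is the query, column 2 the subject,
# and column 11 the E-value.
# usage: count_hits_below FILE.m8 CUTOFF
count_hits_below() {
    awk -v thr="$2" '$1 != $2 && ($11 + 0) <= (thr + 0) { n++ } END { print n + 0 }' "$1"
}
```

For example, 'count_hits_below test.m8 1e-20' prints the number of query-subject pairs that a cutoff of 1e-20 would keep.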
You can use various options ...
    gclust -read=data.out -hom -clique -org -regroup2 -out=1rs
This creates three files: test.m8.hom.1, test.m8.hom.r and test.m8.hom.s.
Intermediate results of regrouping are also produced.
You also need org_file, which defines the organisms. The organism names
must be the ones chosen in step (1).
The last option may be '-out=1'. The option '-regroup' defaults to
'-regroup5'. The number after 'regroup' indicates the level and number of
repeats of regrouping. A corresponding number of output files is produced.
(7) Table
    homologtableG5.pl test.m8.hom.1 prefixTEST test
Here, you need a file prefixTEST, which describes the names of organisms.
This will create three files: test.tbl, test.tblA and test.tblS.
The prefix file can be generated from the *.p.table file:
    ptable2prefix.pl test.p.table
(8) TBSORT
This is a small program written in C. You need to compile it from the
source code.
    tbsort2 test.tbl 12345 out grp_def pat_def 1
'out' is the name of the output file. 12345 is the number of clusters in
test.tbl (see the last line of that file). You also need the files grp_def
and pat_def.
(9) Phylogenetic tree (SISEQ is needed)
    getgrp.pl test.m8.hom.1 list > list.hom
The file 'list' contains cluster numbers, one per line.
    getent test.fa out_dir Group=list.hom
This creates a directory 'out_dir' and writes multiple FASTA files into it.
You need the SISEQ package.
Then, you may use any alignment software (such as clustalw or muscle) to
create an alignment for each FASTA file.
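Aligning each FASTA file can be scripted as a loop over the directory, as in the sketch below. The function names run_aligner and align_all and the '.aln' output suffix are hypothetical. The wrapper assumes MUSCLE 3.x, whose options are -in and -out; other versions and other aligners take different options, so edit run_aligner for the program you use.

```shell
# Hypothetical wrapper around the aligner; edit it for the program you use.
# Assumes MUSCLE 3.x syntax: muscle -in INPUT -out OUTPUT
run_aligner() {
    muscle -in "$1" -out "$2"
}

# Hypothetical helper: align every FASTA file in a directory.
# usage: align_all OUT_DIR   (writes FILE.aln next to each FILE)
align_all() {
    for f in "$1"/*; do
        # Skip alignments from an earlier run and anything that is not a file.
        case "$f" in *.aln) continue ;; esac
        [ -f "$f" ] || continue
        run_aligner "$f" "$f.aln" || echo "alignment failed: $f" >&2
    done
}
```

For example, 'align_all out_dir' would leave one alignment per cluster beside the FASTA files created by getent.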
Finally, you can construct a phylogenetic tree using any software you like.