Sieve - a tool for 3D protein structure description, comparison and classification

Description: Sieve reads through a directory and, for every chain of every protein pdb-file it finds, calculates the average occurrences of patterns of one, two and three crossings of the carbon alpha path.

Download: The Sieve software may be downloaded on these terms.

Compile: Sieve is compiled by entering  >cc Sieve.c -lm -O3 -o Sieve

Run: To run Sieve, enter >Sieve 20 100 25 /path/to/pdb_file_directory/ output.file 0.000000001
The program takes five or six arguments.
The first two arguments are limits for the length of proteins it should process. For example,  20 100 would indicate that the program should only treat proteins for which the number of carbon alphas is between 20 and 100.
The third argument, 25, is a triangulation parameter.
The fourth argument is a directory name (which must end with a "/") in which the program will look for protein data files.
The fifth argument is the name of a file into which output data is to be written. Note that this file is only appended to, never overwritten.
The optional sixth argument, 0.000000001, is the amplitude of random noise to be added to atomic coordinates. If omitted, this amplitude is set to zero.
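
As an illustration of the argument layout described above, here is a minimal sketch of how the five or six arguments could be read. It is not taken from Sieve.c: the struct sieve_args, its field names and the function parse_sieve_args are hypothetical, and the real program may validate its input differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the command-line settings; the names are
   illustrative and are not taken from Sieve.c. */
typedef struct {
    int min_len;        /* lower limit on the number of carbon alphas */
    int max_len;        /* upper limit on the number of carbon alphas */
    int triangulation;  /* triangulation parameter */
    const char *dir;    /* directory with the pdb files, must end with '/' */
    const char *out;    /* output file, only ever appended to */
    double noise;       /* noise amplitude, zero if the sixth argument is omitted */
} sieve_args;

/* Returns 0 on success and -1 if the arguments cannot be used. */
int parse_sieve_args(int argc, char **argv, sieve_args *a)
{
    size_t n;

    assert(argv != NULL && a != NULL);
    if (argc != 6 && argc != 7)          /* program name + five or six arguments */
        return -1;

    a->min_len = atoi(argv[1]);
    a->max_len = atoi(argv[2]);
    a->triangulation = atoi(argv[3]);
    a->dir = argv[4];
    a->out = argv[5];
    a->noise = (argc == 7) ? atof(argv[6]) : 0.0;

    n = strlen(a->dir);
    if (n == 0 || a->dir[n - 1] != '/')  /* directory name must end with '/' */
        return -1;
    return 0;
}
```

Called from main as parse_sieve_args(argc, argv, &args), the sketch leaves args.noise at zero whenever the sixth argument is absent, matching the default stated above.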

The program: Sieve searches the given directory for files ending in ".pdb". For each such file, it reads through the output file (which is never overwritten, only appended to) to see whether there is already an entry for that protein. If so, it moves on to the next file. If not, it tries to compute the measures for the new protein and, if it succeeds, appends a line to the output file.
The output file is open only while it is being read or written, never during a computation. Once a line is appended, the output stream is flushed (any buffered but unwritten data is written). This means that the program can be aborted and restarted without losing more than the computation in progress (i.e. one single protein). It also means that one can first run the program on a set of proteins without any perturbation of the atomic coordinates (i.e. without the sixth argument). It will compute the measures for those proteins it can, but will not produce an output line for those that cause numerical problems. One can then run it again with a small perturbation to treat the remainder.
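
The restart behaviour described above can be sketched as follows. This is an assumption-laden illustration, not the code in Sieve.c: the function names already_done and append_result are hypothetical, and the sketch assumes that an existing entry is recognised by the pdb file name in the first column.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if some line of the output file starts with `name` followed by
   whitespace, 0 otherwise (also when the file does not exist yet).
   Assumes output lines are shorter than the 4096-byte buffer. */
int already_done(const char *out_path, const char *name)
{
    char line[4096];
    size_t n;
    int found = 0;
    FILE *f;

    assert(out_path != NULL && name != NULL);
    n = strlen(name);
    f = fopen(out_path, "r");
    if (f == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, name, n) == 0 && (line[n] == ' ' || line[n] == '\t'))
            found = 1;
    }
    fclose(f);                 /* the file is not held open during computation */
    return found;
}

/* Appends one result line, flushes it, and closes the file again.
   Returns 0 on success and -1 on failure. */
int append_result(const char *out_path, const char *result_line)
{
    FILE *f = fopen(out_path, "a");   /* append mode never truncates the file */

    if (f == NULL)
        return -1;
    if (fprintf(f, "%s\n", result_line) < 0) {
        fclose(f);
        return -1;
    }
    fflush(f);                 /* push buffered data out before closing */
    return fclose(f) == 0 ? 0 : -1;
}
```

Because each result reaches the file before the next computation starts, aborting such a loop costs at most the protein currently being processed.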

Output: The columns of output.file are
pdb.file   chainID   #C-alphas_missing   #C-alphas  and then 29 structural measures, ordered as in Table 3 in our paper below, for example
1cd1C2.pdb C 0 95   -2.2006067934   23.21.....
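
For readers who want to post-process output.file, the following is a minimal sketch of reading one line in the column layout given above. The struct sieve_row and the function parse_sieve_row are hypothetical names, and the sketch assumes that the chainID column is always a single non-blank character.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define N_MEASURES 29   /* number of structural measures per line */

/* Hypothetical record for one output line. */
typedef struct {
    char file[256];              /* pdb.file */
    char chain;                  /* chainID */
    int missing;                 /* #C-alphas_missing */
    int n_ca;                    /* #C-alphas */
    double measure[N_MEASURES];  /* structural measures, in file order */
} sieve_row;

/* Returns the number of measures read (29 for a complete line), or -1 if
   the four leading columns cannot be parsed. */
int parse_sieve_row(const char *line, sieve_row *r)
{
    int used = 0, k = 0;
    const char *p;
    char *end;

    assert(line != NULL && r != NULL);
    if (sscanf(line, "%255s %c %d %d%n",
               r->file, &r->chain, &r->missing, &r->n_ca, &used) != 4)
        return -1;

    p = line + used;             /* continue right after the fourth column */
    while (k < N_MEASURES) {
        double v = strtod(p, &end);
        if (end == p)            /* no further number on this line */
            break;
        r->measure[k++] = v;
        p = end;
    }
    return k;
}
```

A return value below 29 signals a truncated or damaged line, which a caller may prefer to skip.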

Note: We have not considered backbones in which more than 3 C-alpha atoms are missing. This is because Sieve connects the carbon alpha atoms it finds, so big gaps in the backbone may give a "backbone" that is very different from the true backbone. To compute the number #C-alphas_missing, Sieve simply counts the carbon alpha atoms and compares this count with the starting and ending residue numbers. For pdb-files with non-consecutive numbering, this may give strange results.
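
The counting rule in the note can be written out as a small sketch. The function names and the constant MAX_MISSING are hypothetical; the arithmetic assumes that the expected number of carbon alphas is the ending residue number minus the starting residue number plus one, which is one reading of the description above.

```c
#include <assert.h>

#define MAX_MISSING 3   /* backbones with more missing C-alphas are skipped */

/* Number of carbon alphas that appear to be missing, judged only from the
   first and last residue numbers and the number of C-alphas found. */
int count_missing_ca(int first_resnum, int last_resnum, int n_found)
{
    assert(n_found >= 0);
    return (last_resnum - first_resnum + 1) - n_found;
}

/* Returns 1 if the backbone would be considered, 0 if too many C-alphas
   are missing. With non-consecutive numbering the count can be misleading,
   even negative; how Sieve treats such cases is not covered here. */
int backbone_is_usable(int first_resnum, int last_resnum, int n_found)
{
    return count_missing_ca(first_resnum, last_resnum, n_found) <= MAX_MISSING;
}
```

For instance, a chain numbered 5 to 104 with 97 carbon alphas found has 3 missing and is still considered, while the same chain with 96 found is not.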

Citing the use of this resource: P. Røgen & R. Sinclair, Computing a new Family of Shape Descriptors for Protein Structures, J. Chem. Inf. Comput. Sci. 43, 1740-1747, 2003.

Contacting the authors:
Peter Røgen, Peter.Roegen@mat.dtu.dk, and
Robert Sinclair, R.Sinclair@ms.unimelb.edu.au.

Acknowledgment:
Peter Røgen was supported by Carlsbergfondet.

Bibliography
P. Røgen & R. Sinclair, Computing a new Family of Shape Descriptors for Protein Structures, J. Chem. Inf. Comput. Sci. 43, 1740-1747, 2003.