viernes, abril 24, 2009

Algorithms for phylogenetics 0a: Bitfields


Intro

I want to start a series of post about the algorithms used in phylogenetic analyses. I feel a bit disappointed with the entry on computational phylogenetics (and related subjects) in wikipedia, that are somewhat biased towards model-based methods and "bioinformatics", with a poor representation of algorithms for phylogenetics.

There are two outstanding examples of algorithms for phylogenetics on the web. The book of Wheeler et al. [1], and, in a similar vein, a manuscript by DeLaet [2]. But their presentation of the algorithms is somewhat general, that is be nice for educational purpouse, but in some cases, far away from the actual implementation in computer software.

As the presentation of the algorithms in [1] and [2] is excellent, I want to focus on the implementation. The idea is to go from simpler to more complex algorithms (as far as I can comprehend them xD).

I want to acknowledge Pablo Goloboff. He always have the time to discuss with me about algorithms (and any topic on methodological phylogenetics) :D. Of course, any error in my implementations are my own errors!

As this series progress, I will post the source code on google source or sourceforge, if someone wants to check it, requires an knowledge of C programming language. Thanks to Rubén who showme some tips with the HTML :).

Bitfields

Characters are a basic component of phylogenetic analysis. For simplicity assume that we only have a single character in each terminal. Each character can be stored into a variable, assigning to them 0, 1, 2... as indicated by the character state. This idea is implicit in the original description of Wagner's "groundpland divergence" [3], and in Farris' formalization [4, see 5]. Then operations between characters would be operations of ranges. But it is more simpler to think about characters as a set of states [4, 6]. Then operations between characters would be set operations. This set has a peculiar property: they are perfectly boolean, that is, a taxon has or has not a particular state. This can be translated in two positive consequences: (i) character states can be stored as a bitfield: (ii) character operations are bifield operations.

Computers store numbers as binary numbers. A bitfields use this characteristic to represent a set of positioned bits. This example shows it better:
State    binary representation       "Number" (human)
0 0001 1
1 0010 2
2 0100 4
3 1000 8
0, 1 0011 3
1, 3 0101 5
As can be seen, each state has his on bit position, and the states combination is just the union of its states. Many programming languages include operations to work directly on bits (not, and, or and xor), then the translation from set operations to bit operations is direct.

In this series I will use char to store character states. In C, char has 8 bits, then only 8 states can be stored. I define the bitfield with a typedef, so changing the number of bits on the bitfield (and then the number of states) is easy.
#define BITS_ON_STATE 8
#define ALL_BITS_ON 255
typedef char STATES;
This small example reads the caracters of a taxon, from a file (in), and store it into an array of characters (chars). The number of characters is nc, and the functions getc (in stdio.h) reads a single character from the file, SkipSpaces ignores the blank characters, and isdigit (ctype.h) detects if the character is a number.
for (j = 0; j < nc; ++ j) {
error = SkipSpaces (in);
if (error != NO_ERROR) return error;
c = getc (in);
if (isdigit (c))
chars [j] = 1 < < (c - '0');
else if ((c == '?') || (c == '-'))
chars [j] = ALL_BITS_ON;
else return BAD_FORMAT;
}
The operation chars [j] 1 < < (c – '0') substract the ASCII value of '0' (30) from the ASCII value stored in c (a number, so it is form 30 to 39). The resulting value is used to shift the bits of 1. For example if a '0' is readed the operations are:
chars [j] = 1 < < (30 – 30)
chars [j] = 1 < < 0
[0000 0001 < < 0 = 0000 0001]
chars [j] = 1
If a 5 is readed
chars [j] = 1 < < (35 – 30)
chars [j] = 1 < < 5
[0000 0001 < < 5 = 0010 0000]
chars [j] = 32
The 1 is shifted five bits to the left.

Note that inapplicables and unknowns are coded with all bits on.

This work nicely with non additive characters. For additive characters, it is possible to “recode it” as several binary characters [7][8], then it is possible to use the character as independent binary characters.

There are more sophisticated usages of bitfields, packing several characters in a single variable (for example, eight 4-bit characters can be stored into a singel 32-bits variable) [8][9]. This packing allows an speed up during searches, because several characters can be examined simultaneously.

References
[1] Wheeler, W. et al. 2006. Dynamic homology and phylogenetic systematics: a unified approach using POY. American Mus. of Natural History, published in cooperation with NASA. online: http://research.amnh.org/scicomp/pdfs/wheeler/Wheeler_etal2006b.pdf
[2] DeLaet, J. 2005. Pseudocode for some tree search algorithms in phylogenetics. Manuscript online: http://www.plantsystematics.org/publications/jdelaet./algora.pdf
[3] Wagner, W.H. 1961. Problems in the classification of ferns. In: Recent advances in botany. Toronto Univ. Press. pp. 841-844.
[4] Farris, J.S. 1970. Methods for computing Wagner trees. Syst. Zool. 19: 83-92.
[5] Goloboff, P.A. et al. 2006. Continuos characters analyzed as such. Cladistics 22: 589-601. doi: 10.1111/j.1096-0031.2006.00122.x
[6] Fitch, W.M. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20: 406-416.
[7] Farris, J.S. et al. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19: 172-189.
[8] Moilanen, A. 1999. Searching for most parsimonious trees with simulated evolutionary optimization. Cladistics 15: 39-50. doi: 10.1111/j.1096-0031.1999.tb00393.x
[9] Goloboff, P.A. 2002. Optimization of polytomies: state and parallel set operations. Mol Phyl. Evol. 22: 269-275. doi: 10.1006/mpev.2001.1049

2 comentarios:

Mike Keesey dijo...

Interesting post. It occurs to me that a taxonomy could be stored in the same manner.

Salva dijo...

Thanks Mike!

A similar approach (not exactly to taxonomies, it is for tree shapes) was used by Farris [Syst. Zool. 22: 50, 1973] to compare trees.

The problem with this, is that as the tree grows, the number of redundant fields increase (if you have a bit for each mammal family, each other branch of the tree-of-life have this bits always in 0).

A guess that if the meaning of the low-order bits depends on the high-order bits, a bunch of memory can be saved, but surely it require a more complex comparison mechanism.

For example (((a b) (c d))((e f)(g h))) can be stored as
a 110 000
b 110 000
c 101 000
d 101 000
e 000 110
f 000 110
g 000 101
h 000 101
Or taken into account the high order bits, it is reduced to
a 10 10
b 10 10
c 10 01
d 10 01
e 01 10
f 01 10
g 01 01
h 01 01

Thinking about it, this is very taxonomy-like! (such as Brachiopoda: Craniata; Chordata: Craniata)