sábado, mayo 16, 2009

Algorithms for phylogenetics 0b: Trees


Another basic component of phylogenetic analysis' programs are the trees. Trees can be the result of an analysis (as in cladistic analysis' programs) or part of the data used (as in biogeography analysis' programs).

Trees are one of the most well known data structures of computer science. In computers science a tree is a collection of elements (nodes), nodes are related by a "kindship" relation that imposes a hierarchy on the nodes. Kinship is a pairwise relationship: one node is the ancestor and the other the descendant. A node can have an indefinite number of descendants, but, at maximum, only one ancestor.

In general, in phylogenetics, the term node is used only for elements with descendants, and terminal of leafs to elements without descendants (in computer science the used terms are internal node and terminal node, or leaf node. Here I will use the terminology from computer science, and when using node I refer to any element of the tree).

A trees has at maximum a single root node. A root node is an internal node without ancestor. Different from most abstract trees of computer science, the only elements with have the user data are the terminals (from the data matrix). Data stored by internal nodes is inferred through algoritmic means.

There are many ways to represent a tree. For me, the most natural one is using structures and pointers. Here is the basic node
typedef struct tagPHYLONODE PHYLONODE;

struct tagPHYLONODE {
PHYLONODE* anc;
PHYLONODE* left;
PHYLONODE* right;
STATES* recons;
};
This structure is a node from a binary tree, that is, each internal node have two descendants, the pointers left and right, note that they are pointers to PHYLONODE elements. The pointer to ancestor is anc. In this structure, the root node has anc as NULL (i.e. points to no elements), and terminal nodes has left and right as NULL. The array recons store the data (as bitfields).

I usually store the information from a terminal (extracted from data matrix) into an independent structure, for example
typedef struct tagTERMINAL TERMINAL;
struct tagTERMINAL {
char name [32];
STATES* chars;
};
And include a pointer to TERMINAL as part of PHYLONODE
struct tagPHYLONODE {
... //Otros campos aquí
TERMINAL* term;
};
Internal nodes has term as NULL.

A nice property of trees, is that subtrees have the same properties of a tree, then recursive algorithms a very natural way to work with trees. Nevertheless, I always prefer to have an independent structure to store the whole tree
typedef struct tagPHYLOTREE PHYLOTREE;
struct tagPHYLOTREE {
unsigned numNodes;
PHYLONODE* root;
PHYLONODE* nodeArray;
DATAMATRIX* matrix;
};
This structure store a pointer to the root node (root), an array to the nodes of the tree (nodeArray), a pointer to a data matrix (matix) and the number of nodes.

This approach has the advantage of having several independent sets of trees, each tree with his on characterisitcs (id, name, data matrix), that are no properties of any node, but of the whole set of nodes (A typical case: in biogeography, several cladograms form different organism are used).

Different from structures it is possible to store trees as coordinate arrays. This option is used in [1] for the example code, and it is also the way in which TNT macro language uses trees (the language does not support structures). The declarations of arrays is simple
int* anc;
int* left;
int* right;
STATES** chars;
In this case, all arrays are dynamic (they are declared as pointers, but as soon as they are assigned, they can be treated as arrays), and instead of have a pointer, they store the number that is the index of the array. As it is a coordinate array, every element with the same index is the same node. For example anc [17] = 20, assigns the node 20 as the ancestor of node 17. In this case, left [20] or right [20] must be equal to 17. chars [20] has the sates assigned to node 20. If you look at the tree entry in wikipedia (specially the entry on heaps) you will find that is the way used to keep trees.

For me, the bad side of that approximation is that requires an strict bookkeeping (take note that in TNT macro language, TNT automatically updates the information, so that is not a problem!). This is my own experience, I worked with tree reconciliation, and work with someone that does not know working with pointers. So I try a coordinate array approach. But we not know how many nodes would be on the tree after reconciliation, then the code turns messy and extremely difficult to update.

That experience does not show if some alternative is best than the other, just that sometimes the nature of the problem might constraint the way to resolve it, and that some styles of programming are better than others for some programmers ;). In this series I always will use structures.

Just to have some code, and to mark the recursivity of trees, here are a function to store a tree in a parenthetical notation (as in Hennig86, NONA and TNT)
#include < stdio.h >

void PrintTree (FILE* out, PHYLOTREE* tree) {
fprintf (out, "tread \'Simple tree output\'\n");
PrintNode (out, tree - > root)
fprintf (out, ";\np/;");
}

void PrintNode (FILE* out, PHYLONODE* node) {

if (node - > term != NULL) {
fprinf ("%s ", node - > term - > name);
return;
}
fprintf (out, "(");
PrintNode (out, node - > left);
PrintNode (out, node- > right);
fprintf (out, ") ");
}
This functions used standard IO operations. The function PrintTree puts the tree header for Hennig86/NONA/TNT trees. And call PrintNode on the root node. PrintNode checks if the node is a terminal, if this is true, it writes the terminal name, if not, the node is an internal one, so it prints the parenthesis and call recursively PrintNode on each one of its descendants.

I think that some time ago Mike Keesey write a post about trees under a object oriented paradigm. I'm unable to find the post, this one is the most closely alternative :P.

References
[1] Goloboff, P. A. 1997. Self weighted optimization: tree searches and character state reconstruction under implied transformation costs. Cladistics 13: 225-245. doi: 10.1111/j.1096-0031.1997.tb00317.x

Previous post on Algorithms for phylogenetics
0a. Bitfields

2 comentarios:

Unknown dijo...

Hi Man,

Good blog, you're one of good researchers that use the science divulgation! Congratulations.

Try more posts,
Allysson

Salva dijo...

Thanks for the comment! :)