Bayesian Trees

During my master’s in Statistics, I worked on understanding and analyzing the prior on Bayesian trees from Chipman et al. (1998). This tree prior has been used in Bayesian additive regression trees (BART), treed Gaussian processes, and a few other models. My master’s thesis is not very accessible because it is very long and in French, so my main findings are listed here:

This tree prior is constructed using a recursive tree-generating process: we start with an empty tree, and each node has a probability p_split = alpha*(1+depth)^(-beta) of splitting into two child nodes. Once every node has failed to split, the tree has been generated.
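
To make this process concrete, here is a minimal R sketch (my own illustration of the definition above, not code from any existing package) that draws tree sizes from the prior:

```r
# Chipman et al. (1998) splitting probability (assuming alpha < 1, beta >= 0).
p_split <- function(depth, alpha, beta) alpha * (1 + depth)^(-beta)

# Recursively grow a tree from a single root node;
# returns the total number of nodes in the generated tree.
grow_tree <- function(depth = 0, alpha = 0.95, beta = 2) {
  if (runif(1) < p_split(depth, alpha, beta)) {
    # The node splits: grow both children independently, one level deeper.
    1 + grow_tree(depth + 1, alpha, beta) + grow_tree(depth + 1, alpha, beta)
  } else {
    1  # the node fails to split and becomes a terminal node
  }
}

set.seed(1)
sizes <- replicate(10000, grow_tree())
mean(sizes); sd(sizes)  # prior mean and spread of the tree size
```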

1) It was originally thought that the first hyper-parameter (alpha) controls the size of the trees and that the second hyper-parameter (beta) controls their depth, but this is not quite true. Both alpha and beta control the size (increasing alpha or lowering beta leads to bigger trees), so there are infinitely many alpha/beta combinations that lead to the same tree size. The interesting thing is that high alpha/beta values lead to trees with less variance than those from small alpha/beta values! What’s annoying about this is that one cannot simply select one parameter for the size and another for the variance; one must tweak both parameters carefully to get the desired size and variance.
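
As a quick illustration (reusing grow_tree from the sketch above; the hyper-parameter values below are arbitrary picks of mine, so you would need to tune them until the mean sizes actually match), compare a low-alpha/low-beta pair against a high-alpha/high-beta pair:

```r
set.seed(1)
sizes_low  <- replicate(10000, grow_tree(alpha = 0.5,  beta = 0.5))
sizes_high <- replicate(10000, grow_tree(alpha = 0.95, beta = 1.5))
c(mean(sizes_low),  sd(sizes_low))   # roughly similar mean size...
c(mean(sizes_high), sd(sizes_high))  # ...but a smaller spread expected here
```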

2) It was originally thought that the prior did not depend on the dataset, but it does. The smaller the dataset is, and the more categorical variables it contains, the smaller the trees will be. This means that even if you knew that alpha = .5, beta = 2 led to trees with an average of 6 nodes and a standard deviation of 1 on a certain dataset, it might lead to an average of 8 nodes with a standard deviation of 1.5 on another dataset.
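
The mechanism at play is that a node can only split if a valid split rule remains, and small datasets (or categorical variables with few levels) run out of valid splits quickly. Here is a toy version of this under my own simplifying assumptions (a single numeric predictor, with split points chosen uniformly among the remaining distinct values):

```r
# Toy model: a node with fewer than 2 distinct values cannot split,
# so it becomes terminal regardless of p_split.
grow_tree_data <- function(x, depth = 0, alpha = 0.95, beta = 2) {
  u <- sort(unique(x))
  if (length(u) < 2) return(1)                      # no valid split remains
  if (runif(1) >= alpha * (1 + depth)^(-beta)) return(1)
  cut <- u[sample.int(length(u) - 1, 1)]            # random split point
  1 + grow_tree_data(x[x <= cut], depth + 1, alpha, beta) +
      grow_tree_data(x[x >  cut], depth + 1, alpha, beta)
}

set.seed(1)
small <- replicate(5000, grow_tree_data(sample(1:5,  10,  replace = TRUE)))
large <- replicate(5000, grow_tree_data(sample(1:50, 500, replace = TRUE)))
c(mean(small), mean(large))  # fewer possible splits -> smaller trees
```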

3) Another problem arises with this tree prior: alpha must be smaller than 1 (otherwise, for certain nodes, the probability of splitting would be greater than 1), which restricts how small you can make the variance of the tree size. Therefore, you cannot get trees with tight prior intervals. For example, if you suspected that the “real” tree had between 5 and 7 nodes, you might want to put a 95% interval of [5, 7] on the number of nodes, but this is impossible to set with the current prior definition. Furthermore, trees with 0 nodes always have the highest prior probability. In my master’s thesis, I found a simple way to remove this restriction and make the prior distribution less concentrated around 0: instead of using p_split = alpha*(1+depth)^(-beta), you can simply use p_split = min(alpha*(1+depth)^(-beta), 1). This simple solution makes it possible to use alpha bigger than 1 and thus generate trees with any desired size and variance.
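
Here is a sketch of this capped splitting probability (again my own illustration; the hyper-parameter values are arbitrary): with alpha > 1, shallow nodes are forced to split, which removes the prior mass on tiny trees and concentrates the prior on a narrower range of sizes.

```r
# Capped splitting probability: well-defined for any alpha > 0.
p_split_capped <- function(depth, alpha, beta) {
  min(alpha * (1 + depth)^(-beta), 1)
}

grow_tree_capped <- function(depth = 0, alpha = 4, beta = 2) {
  if (runif(1) < p_split_capped(depth, alpha, beta)) {
    1 + grow_tree_capped(depth + 1, alpha, beta) +
        grow_tree_capped(depth + 1, alpha, beta)
  } else {
    1
  }
}

# With alpha = 4, beta = 2, nodes at depths 0 and 1 split with probability 1,
# so every tree has at least 7 nodes: no prior mass near the empty tree.
set.seed(1)
sizes <- replicate(10000, grow_tree_capped())
quantile(sizes, c(0.025, 0.975))  # a much tighter prior interval on size
```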

This shows that this prior for Bayesian trees has important issues with regard to selecting the hyper-parameters and in terms of flexibility. In its current form, this prior is weak, i.e., with enough data the influence of the prior becomes relatively unimportant. In smaller samples, though, the prior can still be impactful, and using a strong yet reasonable prior (which is possible using my alternative definition of p_split) could greatly improve predictions.

An R package is available on CRAN to simulate from the tree prior. Note that if you set alpha > 1 in this package, you will be able to see the kind of distributions that p_split = min(alpha*(1+depth)^(-beta), 1) would provide if it were implemented in other Bayesian tree packages.