# The Dirichlet distribution for cell-cycle flow cytometry statistics

• Category: ScienceI recently had to analyze some cell cycle flow cytometry data in the form:

Condition | G_{0}/G_{1} | S | G_{2}/M |
---|---|---|---|

A | 5432 | 142 | 503 |

B | 2004 | 501 | 142 |

A | 4014 | 234 | 599 |

B | 3010 | 503 | 120 |

... |

Surprisingly, there isn’t a lot of software written to do this correctly, so I wrote my own, which you can install from my github account using the following command:

```
pip install git+https://github.com/ericsuh/dirichlet.git
```

If you want to see if the above data from the two conditions A and B are different from each other, you might think to try is a $\chi^2$ or a G-test, but both of these tests, at least out of the box, test the wrong kind of variation.

To a biologist, the numbers above (if that was all the data there is) would
indicate a sample size of 4, whereas to the $\chi^2$ or G-test, the sample size
is on the order of the sum of the rows. In other words, the biologist is
interested in how large the variation is from experiment to experiment, but the
$\chi^2$ or G-tests would give back how much variation there is *within* a
particular experiment.^{1} If you aren’t careful, you can get ridiculous
*p*-values like $10^{-300}$. In comparison, the number of atoms in the
universe is somewhere near $10^{80}$, so a test giving $p < 10^{-80}$
essentially says that an experiment is *so* unlikely under the null hypothesis
that it would *never* happen to you in the history of the universe. Anyone
who has worked at the lab bench can agree this is monumentally not true.

The alternative to the $\chi^2$ or G-tests is a Dirichlet likelihood ratio
test. The main idea is that when you divide each row of the above
data by its sum to give percentages or fractions, those numbers can be modeled
by a Dirichlet distribution, which is one of the few probability
distributions defined over the numbers ${x_0, \ldots, x_K}$ where for all $i$,
$x_i \in [0,1]$ and $\sum_i x_i = 1$.^{2} This is perfect for data
sets like the fractions of cells in different cell cycle phases.

Unfortunately, there’s no easy solution to the equation to fit a Dirichlet model to data, so the Python package I wrote uses a methods outlined by Thomas Minka in his paper and his MATLAB code. You can then see if the fit with two different Dirichlet distributions is sufficiently better than fitting just one Dirichlet distribution to justify claiming that you have two different data sets.

At some point, I will be making a web interface for the code so that the
non-technical folks in my lab (1) don’t have to learn Python, and (2) even if
they learn Python, don’t have to go through the hell that is configuring and
installing `numpy`

and `scipy`

.