# The Dirichlet distribution for cell-cycle flow cytometry statistics

I recently had to analyze some cell cycle flow cytometry data in the form:

Condition G0/G1 S G2/M
A 5432 142 503
B 2004 501 142
A 4014 234 599
B 3010 503 120
...

Surprisingly, there isn't a lot of software written to do this correctly, so I wrote my own, which you can install from my github account using the following command:

pip install git+https://github.com/ericsuh/dirichlet.git

If you want to see if the above data from the two conditions A and B are different from each other, you might think to try is a χ2 or a G-test, but both of these tests, at least out of the box, test the wrong kind of variation.

To a biologist, the numbers above (if that was all the data there is) would indicate a sample size of 4, whereas to the χ2 or G-test, the sample size is on the order of the sum of the rows. In other words, the biologist is interested in how large the variation is from experiment to experiment, but the χ2 or G-tests would give back how much variation there is within a particular experiment.1 If you aren't careful, you can get ridiculous p-values like 10−300. In comparison, the number of atoms in the universe is somewhere near 1080, so a test giving p < 10−80 essentially says that an experiment is so unlikely under the null hypothesis that it would never happen to you in the history of the universe. Anyone who has worked at the lab bench can agree this is monumentally not true.

The alternative to the χ2 or G-tests is a Dirichlet likelihood ratio test. The main idea is that when you divide each row of the above data by its sum to give percentages or fractions, those numbers can be modeled by a Dirichlet distribution, which is one of the few probability distributions defined over the numbers x0, …, xK where for all i, xi ∈ [0, 1] and ixi = 1.2 This is perfect for data sets like the fractions of cells in different cell cycle phases.

Unfortunately, there's no easy solution to the equation to fit a Dirichlet model to data, so the Python package I wrote uses a methods outlined by Thomas Minka in his paper and his MATLAB code. You can then see if the fit with two different Dirichlet distributions is sufficiently better than fitting just one Dirichlet distribution to justify claiming that you have two different data sets.

At some point, I will be making a web interface for the code so that the non-technical folks in my lab (1) don't have to learn Python, and (2) even if they learn Python, don't have to go through the hell that is configuring and installing numpy and scipy.

1. That is, the χ2 and G-tests measure the "technical" variation.

2. It's a multidimensional form of the Beta distribution, as defined on that simplex.