A paired Dirichlet test

After my previous post about a Dirichlet test for flow cytometry, I realized having a “paired” Dirichlet test would be useful in some real-world situations, but I haven’t quite figured out how to do it yet.

The motivation

A labmate asked me to analyze some flow cytometry data using the test, and his data had systematic trends: some technical error was correlated between his control and experimental groups, and a paired test might help eliminate it. What I wanted was something similar to the paired t-test, but for Dirichlet-distributed probability vectors instead of normally distributed real numbers.

As you might expect from the lack of software or algorithms written to do Dirichlet tests, there’s also not really anything about doing this sort of paired Dirichlet test. I’m not convinced, in this case, that such a thing is possible.

An outline of a plan

A paired t-test basically goes like this: you have pairs of control and experimental data, $(c_i, x_i)$, and since they have additive correlations you’re not interested in, you can simply take the difference $x_i - c_i$ and test whether that is consistently far enough from 0 to declare that the control condition is not the same as the experimental condition.[1]
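As a concrete reminder of what that looks like, here is a minimal sketch of the paired t-statistic computed from the differences $x_i - c_i$, using only the Python standard library. The function name `paired_t_statistic` and the numbers are made up for illustration; turning the statistic into a p-value would still need a t-distribution with the returned degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(control, experimental):
    """Return (t, degrees_of_freedom) for the paired differences x_i - c_i."""
    if len(control) != len(experimental):
        raise ValueError("control and experimental must have the same length")
    diffs = [x - c for c, x in zip(control, experimental)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up measurements, for illustration only.
control = [10.0, 12.0, 9.0, 11.0]
experimental = [11.0, 14.0, 10.0, 13.0]
t, dof = paired_t_statistic(control, experimental)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The point is that everything after the subtraction is an ordinary one-sample test on the differences, which is exactly the step that has no obvious counterpart for probability vectors.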

For probability vectors,[2] the closest analogous linear transform that I can think of is a transition matrix (a.k.a. a stochastic matrix). A stochastic matrix always transforms a probability vector into another probability vector, and each of its columns sums to 1 (i.e. each column is itself a probability vector).
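A quick way to convince yourself of that property is to check it numerically. The sketch below is plain Python with hypothetical helper names (`is_probability_vector`, `is_column_stochastic`, `apply_matrix`) and an arbitrary example matrix; it is only meant to illustrate the definition, not to be a general-purpose implementation.

```python
def is_probability_vector(v, tol=1e-9):
    """True if all entries are non-negative and they sum to 1."""
    return all(p >= -tol for p in v) and abs(sum(v) - 1.0) <= tol


def is_column_stochastic(A, tol=1e-9):
    """True if A is square, non-negative, and every column sums to 1."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False
    nonneg = all(a >= -tol for row in A for a in row)
    cols_ok = all(
        abs(sum(A[i][j] for i in range(n)) - 1.0) <= tol for j in range(n)
    )
    return nonneg and cols_ok


def apply_matrix(A, v):
    """Matrix-vector product A v for a matrix stored as a list of rows."""
    return [sum(a * p for a, p in zip(row, v)) for row in A]


# Arbitrary example: each column of A is a probability vector.
A = [
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
]
c = [0.5, 0.3, 0.2]
x = apply_matrix(A, c)

print(is_column_stochastic(A), is_probability_vector(x), x)
```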

So for a set of paired probability vector observations $\{(\mathbf{c}_i, \mathbf{x}_i)\}$, you would want to compute a corresponding set \[\{\mathbf{A}_i \mid \mathbf{A}_i \mathbf{c}_i = \mathbf{x}_i\}\] and see whether that set of matrices is sufficiently different from the identity matrix.

The problem

There are a few problems here to solve (one of which of course is defining a distribution for stochastic matrices in order to determine what is “sufficiently different from the identity matrix”), but my major roadblock right now is that there just aren’t enough constraints to calculate a unique transition matrix from two probability vectors. After all, each matrix $\mathbf{A}_i$ has $(n-1) \times n$ independent parameters (each of its $n$ columns has $n-1$ free entries), but the equation $\mathbf{A}_i \mathbf{c}_i = \mathbf{x}_i$ only sets $n$ constraints, and only $n-1$ of those are independent, because both sides automatically sum to 1.
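To make the non-uniqueness concrete, here is a small sketch that builds two different column-stochastic matrices sending the same $\mathbf{c}$ to the same $\mathbf{x}$. The construction is just one I picked for illustration, not anything canonical: the first matrix has every column equal to $\mathbf{x}$, and the second is a mixture $t\mathbf{I} + (1-t)\,\mathbf{y}\mathbf{1}^T$ with $\mathbf{y} = (\mathbf{x} - t\mathbf{c})/(1-t)$, which is valid whenever $\mathbf{y}$ has no negative entries. The function names and example vectors are hypothetical.

```python
def apply_matrix(A, v):
    """Matrix-vector product A v for a matrix stored as a list of rows."""
    return [sum(a * p for a, p in zip(row, v)) for row in A]


def rank_one_solution(x):
    """Matrix whose every column equals x; it maps any probability vector to x."""
    n = len(x)
    return [[x[i] for _ in range(n)] for i in range(n)]


def mixed_solution(c, x, t):
    """Return t*I + (1 - t) * (y 1^T), where y = (x - t*c) / (1 - t)."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must satisfy 0 <= t < 1")
    y = [(xi - t * ci) / (1.0 - t) for ci, xi in zip(c, x)]
    if min(y) < -1e-12:
        raise ValueError("t is too large: y would have a negative entry")
    n = len(c)
    return [
        [(t if i == j else 0.0) + (1.0 - t) * y[i] for j in range(n)]
        for i in range(n)
    ]


# Arbitrary example vectors.
c = [0.5, 0.3, 0.2]
x = [0.4, 0.4, 0.2]

A1 = rank_one_solution(x)
A2 = mixed_solution(c, x, t=0.5)

print(apply_matrix(A1, c))
print(apply_matrix(A2, c))
print(A1 == A2)
```

Every admissible value of `t` gives yet another valid matrix, so even this narrow one-parameter family already contains infinitely many solutions.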

Even if we assume that the matrices are all doubly stochastic (that is, both the rows and the columns sum to 1), that still leaves $(n-1) \times (n-1)$ free parameters, and for any $n > 2$ that is more parameters than there are equations to solve for them.
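The $n = 2$ case is the one place where the counting works out, and it is small enough to solve by hand. A 2×2 doubly stochastic matrix must have the form $\begin{pmatrix} a & 1-a \\ 1-a & a \end{pmatrix}$, so there is a single unknown $a$. Here is a rough sketch with hypothetical function names and example numbers. It counts $n-1$ independent constraints from $\mathbf{A}\mathbf{c} = \mathbf{x}$, since one of the $n$ equations is redundant when both vectors sum to 1.

```python
def excess_parameters(n):
    """Free parameters of a doubly stochastic n x n matrix minus the
    n - 1 independent constraints imposed by A c = x."""
    return (n - 1) ** 2 - (n - 1)


def solve_doubly_stochastic_2x2(c, x, tol=1e-12):
    """Solve A c = x for A = [[a, 1 - a], [1 - a, a]]."""
    c1, c2 = c
    x1, _ = x
    if abs(c1 - c2) < tol:
        # A uniform c is mapped to itself by every such A, so a is not identifiable.
        raise ValueError("c is uniform; a cannot be determined")
    a = (x1 - c2) / (c1 - c2)
    if not (-tol <= a <= 1.0 + tol):
        # Happens when x is further from uniform than c.
        raise ValueError("no doubly stochastic solution exists")
    return [[a, 1.0 - a], [1.0 - a, a]]


# Arbitrary example vectors.
c = [0.8, 0.2]
x = [0.6, 0.4]
A = solve_doubly_stochastic_2x2(c, x)

print(A)
print([excess_parameters(n) for n in (2, 3, 4)])
```

Note that even here a solution only exists when $\mathbf{x}$ is at least as close to uniform as $\mathbf{c}$, so the doubly stochastic assumption restricts which pairs can be related at all.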

It’s a problem I’ll have to think a bit about, but as of yet I don’t know if there are any further logical constraints that I can set on the problem to make it feasible.

  1. Biologists do this quite often, actually, but with multiplicative correlation: they use the $\log_2 \left({x_i \over c_i}\right)$ (i.e. the “log fold-change”) as their data point.

  2. That is, $n$-dimensional vectors $\mathbf{v} \in \mathbb{R}^n_{\geq 0}$ where \[\sum_i v_i = 1,\] or in other words, with an $L_1$ norm $\|\mathbf{v}\|_1 = 1$. For my purposes, I’m using these vectors to model events in which one “thing” has different fractions assigned to certain categories.