
I am running standard causal discovery algorithms (PC and FCI) on a dataset of astronomical data. I find that re-running the algorithms on a random subsample of my data changes the results too much for my taste. This is due to conditional independence tests failing (not enough data) but also probably to outliers. Will I find a robust causal discovery algo that makes me feel confident that the DAG I am learning from my data is genuine and does not depend too much on random fluctuations and outliers?
This can either be a new algo (either new to me or to the research community in general) or even just a hack (pooling results, etc.).
For reference see https://arxiv.org/abs/2311.15160
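One way to try the "pooling results" hack is to rerun the search on many random subsamples and keep only the adjacencies that show up most of the time. The sketch below is not PC or FCI: it is a simplified adjacency search that uses Fisher-z partial correlation tests and conditions on all small subsets of the other variables. The names `fisher_z_pvalue`, `skeleton` and `edge_frequencies` are made up for illustration, and the toy chain data stands in for the real dataset.

```python
import math
from itertools import combinations

import numpy as np


def fisher_z_pvalue(data, i, j, cond):
    """Two-sided p-value for zero partial correlation of columns i and j given cond."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)
    # Partial correlation read off the precision matrix.
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)
    n = data.shape[0]
    z = math.atanh(r) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def skeleton(data, alpha=0.05, max_cond=1):
    """Simplified adjacency search: drop i-j if any small conditioning set separates them."""
    d = data.shape[1]
    edges = set()
    for i, j in combinations(range(d), 2):
        others = [k for k in range(d) if k not in (i, j)]
        separated = any(
            fisher_z_pvalue(data, i, j, cond) > alpha
            for size in range(max_cond + 1)
            for cond in combinations(others, size)
        )
        if not separated:
            edges.add((i, j))
    return edges


def edge_frequencies(data, n_runs=100, frac=0.5, alpha=0.05, seed=0):
    """Fraction of random subsamples (without replacement) in which each edge survives."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    counts = {}
    for _ in range(n_runs):
        rows = rng.choice(n, size=int(frac * n), replace=False)
        for edge in skeleton(data[rows], alpha):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_runs for edge, c in counts.items()}


# Toy data (an assumption, not the astronomical dataset): chain X0 -> X1 -> X2.
rng = np.random.default_rng(42)
x0 = rng.normal(size=1000)
x1 = x0 + rng.normal(size=1000)
x2 = x1 + rng.normal(size=1000)
toy = np.column_stack([x0, x1, x2])

for edge, freq in sorted(edge_frequencies(toy).items()):
    print(edge, round(freq, 2))
```

Edges that survive in nearly every subsample are the ones worth trusting; edges that come and go are the ones driven by fluctuations. The same wrapper could in principle be put around a real PC or FCI implementation instead of `skeleton`.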
Subjective, won’t bet.
Well, it turns out that the worst issue is likely not outliers but the small sample size. It is hard to choose alpha (the cutoff for significance) for the independence tests in a principled way. Even if we stick to testing for linear correlation, to avoid having additional hyperparameters to set for the test, we end up learning very different causal structures by changing alpha from, say, 0.01 to 0.05. In principle one could try to balance the probability of getting type I errors against that of getting type II errors (minimizing 1 - (1 - P_I)(1 - P_II)), but estimating P_II (the probability of getting at least one type II error over all the tests) requires assumptions on effect size, and those end up as arbitrary as choosing alpha.
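To make the dependence on effect size concrete, here is a rough sketch of the criterion 1 - (1 - P_I)(1 - P_II) for Fisher-z correlation tests. It assumes the tests are independent, that every true dependence has the same correlation `rho`, and that we know how many tests fall under the null (`m_null`) and under the alternative (`m_alt`). All of those are illustrative assumptions, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_two_error(alpha, rho, n, cond_size=0):
    """Approximate miss probability of one two-sided Fisher-z test when the true correlation is rho."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = math.atanh(rho) * math.sqrt(n - cond_size - 3)
    # Probability that the statistic lands inside the acceptance region.
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


def combined_error(alpha, rho, n, m_null, m_alt):
    """1 - (1 - P_I)(1 - P_II), treating all tests as independent (an assumption)."""
    p_one = 1 - (1 - alpha) ** m_null  # at least one false rejection
    p_two = 1 - (1 - type_two_error(alpha, rho, n)) ** m_alt  # at least one miss
    return 1 - (1 - p_one) * (1 - p_two)


def best_alpha(rho, n, m_null, m_alt, grid=None):
    """Grid search for the alpha that minimizes the combined error."""
    grid = grid or [10 ** (-k / 10) for k in range(10, 41)]  # 0.1 down to 1e-4
    return min(grid, key=lambda a: combined_error(a, rho, n, m_null, m_alt))


# The "optimal" alpha moves with the assumed effect size.
for rho in (0.1, 0.2, 0.3, 0.5):
    alpha = best_alpha(rho, n=200, m_null=20, m_alt=10)
    print(rho, alpha, combined_error(alpha, rho, 200, 20, 10))
```

The selected alpha shifts as the assumed `rho` changes, which is exactly the arbitrariness described above: the choice has been moved from alpha to the effect size.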
Maybe a Bayesian approach, where we start with a prior probability on the space of DAGs and get a posterior, would work better. After all, when the hypothesis testing point of view lands you in trouble, the option to ditch it and go full Bayesian seems only reasonable.
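For a handful of variables the Bayesian idea can be tried by brute force. The sketch below enumerates every DAG on a small number of nodes, scores each one with BIC for a linear Gaussian model (used here as a stand-in for the log marginal likelihood, which is an approximation), puts a uniform prior on DAGs, and reports the posterior probability of each directed edge. The function names and the toy data are assumptions for illustration; exhaustive enumeration stops being feasible beyond roughly five or six variables.

```python
import itertools
import math

import numpy as np


def is_acyclic(adj):
    """Check acyclicity by repeatedly removing nodes with no incoming edges."""
    remaining = set(range(len(adj)))
    while remaining:
        roots = [v for v in remaining if not any(adj[u][v] for u in remaining)]
        if not roots:
            return False
        remaining -= set(roots)
    return True


def all_dags(d):
    """Yield the adjacency matrix of every DAG on d nodes (adj[i, j] = 1 means i -> j)."""
    pairs = [(i, j) for i in range(d) for j in range(d) if i != j]
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        adj = np.zeros((d, d), dtype=int)
        for (i, j), b in zip(pairs, bits):
            adj[i, j] = b
        if is_acyclic(adj):
            yield adj


def bic_score(data, adj):
    """Linear Gaussian BIC: maximized log-likelihood minus a complexity penalty."""
    n, d = data.shape
    score = 0.0
    for j in range(d):
        parents = np.flatnonzero(adj[:, j])
        X = np.column_stack([np.ones(n), data[:, parents]])
        coef, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        resid = data[:, j] - X @ coef
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
        # Parameters per node: intercept, noise variance, one weight per parent.
        score += loglik - 0.5 * (len(parents) + 2) * math.log(n)
    return score


def edge_posteriors(data):
    """Approximate P(i -> j | data) under a uniform prior over DAGs."""
    dags = list(all_dags(data.shape[1]))
    scores = np.array([bic_score(data, g) for g in dags])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return np.tensordot(weights, np.stack(dags), axes=1)


# Toy data (an assumption): chain X0 -> X1 -> X2.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(size=500)
x2 = x1 + rng.normal(size=500)
print(np.round(edge_posteriors(np.column_stack([x0, x1, x2])), 3))
```

Note that Markov-equivalent DAGs get the same BIC, so the posterior mass is shared among them and edge directions stay uncertain even when the adjacencies are well supported. The upside is that this uncertainty is reported explicitly rather than hidden behind a single alpha.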
That's some really cool research, I had no idea that causal inference is being used in astronomy!
I reviewed this paper, and found some potential ideas:
GFCI (Ogarrio et al., 2016), a combination of GES and FCI, uses GES to find a supergraph of the skeleton and FCI to prune that supergraph and find the orientations. GFCI has proved more accurate in many simulations than the original FCI algorithm.
The Two-Step algorithm and the FASK algorithm (Sanchez-Romero et al., 2019) are two examples of procedures that use adjacency searches to provide an initial undirected graph, which the algorithms then prune, refine, or extend.
Section 7 seems relevant as well. Anyway, it's a little bit hard to help because the solution might depend a lot on the characteristics of your dataset. It would help if you could share some info, like which variables are involved and what the sample size is.
@Shump Indeed it’s the first time ever that causal discovery is applied to astronomical data (to the best of my knowledge). I will take a look at the paper you suggest, thank you!
I doubt it will solve your particular problem, but are you aware of this? https://arxiv.org/abs/1803.01422
@mariopasquato Yea that's definitely a possible outcome, which I think should lower your confidence in the point estimate you got on the full dataset? Conversely, if it doesn't happen, it would be a good sign?