MANIFOLD
Will I find a robust causal discovery approach that works with my data?
Resolved Mar 1 as 90%

I am running standard causal discovery algorithms (PC and FCI) on a dataset of astronomical data. I find that re-running the algorithms on a random subsample of my data changes the results too much for my taste. This is due to conditional independence tests failing (not enough data) but also probably to outliers. Will I find a robust causal discovery algo that makes me feel confident that the DAG I am learning from my data is genuine and does not depend too much on random fluctuations and outliers?

This can either be a new algo (either new to me or to the research community in general) or even just a hack (pooling results, etc.).

For reference see https://arxiv.org/abs/2311.15160

Subjective, won’t bet.

DAG-GFN and some bootstrapping plus outlier removal did the trick.

Starting to play with this https://arxiv.org/pdf/2202.13903.pdf

Well, it turns out that the worst issue is likely not outliers but the small sample size. It is hard to choose alpha (the significance cutoff) for the independence tests in a principled way. Even if we stick to testing for linear correlation, to avoid having additional hyperparameters to set for the test, we end up learning very different causal structures when alpha changes from, say, 0.01 to 0.05. In principle one could try to balance the probability of type I errors against that of type II errors (minimizing 1 - (1 - P_I)(1 - P_II)), but estimating P_II (the probability of getting at least one type II error over all the tests) requires assumptions on effect size. Those end up being as arbitrary as choosing alpha.
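To make the alpha problem concrete, here is a minimal sketch using only the Python standard library. It assumes the linear-correlation test is the usual Fisher z-test, and the numbers (r = 0.25, n = 80, 20 null tests, 10 non-null tests) are made up for illustration, not taken from the astronomical dataset. The function names are hypothetical, and treating the tests as independent is a simplifying assumption.

```python
from math import atanh, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def fisher_z_pvalue(r, n, n_cond=0):
    """Two-sided p-value for H0: (partial) correlation = 0, via Fisher's z."""
    stat = sqrt(n - n_cond - 3) * abs(atanh(r))
    return 2.0 * (1.0 - STD_NORMAL.cdf(stat))


def approx_power(rho, n, alpha, n_cond=0):
    """Approximate power of the same test if the true correlation is rho."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(n - n_cond - 3) * abs(atanh(rho))
    return 1.0 - STD_NORMAL.cdf(z_crit - shift)  # far tail ignored


def any_error_probability(alpha, n_null, n_alt, rho, n):
    """1 - (1 - P_I)(1 - P_II), treating all tests as independent (assumption)."""
    p_one = 1.0 - (1.0 - alpha) ** n_null                  # >= 1 type I error
    p_two = 1.0 - approx_power(rho, n, alpha) ** n_alt     # >= 1 type II error
    return 1.0 - (1.0 - p_one) * (1.0 - p_two)


# A weak sample correlation on a small sample: the edge decision flips with alpha.
p_value = fisher_z_pvalue(r=0.25, n=80)
for alpha in (0.01, 0.05):
    print(f"alpha={alpha}: keep edge = {p_value < alpha}")

# The combined error depends strongly on the assumed effect size rho.
for rho in (0.2, 0.4):
    print(rho, any_error_probability(alpha=0.05, n_null=20, n_alt=10, rho=rho, n=80))
```

The second loop is the point of the comment above: the quantity one would want to minimize over alpha moves a lot with the assumed rho, so the arbitrariness just shifts from alpha to the effect size.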

Maybe a Bayesian approach, where we start with a prior probability on the space of DAGs and get a posterior, would work better. After all, when the hypothesis testing point of view lands you in trouble, the option to ditch it and go full Bayesian seems only reasonable.
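For a handful of variables, the "posterior over DAGs" idea can be sketched by brute force. The example below is a toy, not DAG-GFN and not the method from any linked paper: it assumes a uniform prior over DAGs, linear-Gaussian relations, and uses the BIC score as a rough stand-in for the log marginal likelihood. The synthetic chain X -> Y -> Z and all function names are illustrative assumptions.

```python
from itertools import combinations, product

import numpy as np


def is_acyclic(parents):
    """Check acyclicity by repeatedly removing nodes that have no parents left."""
    remaining = {v: set(ps) for v, ps in parents.items()}
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:
            return False  # every remaining node has a parent, so there is a cycle
        for v in roots:
            del remaining[v]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True


def all_dags(p):
    """Yield every DAG on p nodes as a dict {node: set of parents}."""
    pairs = list(combinations(range(p), 2))
    # Each pair is either unconnected (0), i -> j (1), or j -> i (2).
    for states in product((0, 1, 2), repeat=len(pairs)):
        parents = {v: set() for v in range(p)}
        for (i, j), s in zip(pairs, states):
            if s == 1:
                parents[j].add(i)
            elif s == 2:
                parents[i].add(j)
        if is_acyclic(parents):
            yield parents


def bic_score(data, parents):
    """Linear-Gaussian BIC: maximized log-likelihood minus (k / 2) * log(n)."""
    n = data.shape[0]
    score = 0.0
    for v, ps in parents.items():
        design = np.column_stack([np.ones(n)] + [data[:, q] for q in sorted(ps)])
        coef = np.linalg.lstsq(design, data[:, v], rcond=None)[0]
        resid = data[:, v] - design @ coef
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
        score += loglik - 0.5 * (len(ps) + 2) * np.log(n)  # slopes, intercept, variance
    return score


def dag_posterior(data):
    """Approximate posterior over all DAGs under a uniform prior."""
    dags = list(all_dags(data.shape[1]))
    scores = np.array([bic_score(data, g) for g in dags])
    weights = np.exp(scores - scores.max())  # subtract the max for stability
    return dags, weights / weights.sum()


def adjacency_probability(dags, post, i, j):
    """Posterior mass of the DAGs in which i and j are adjacent (either direction)."""
    return float(sum(w for g, w in zip(dags, post) if i in g[j] or j in g[i]))


rng = np.random.default_rng(0)
n = 150
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
data = np.column_stack([x, y, z])

dags, post = dag_posterior(data)
for i, j in combinations(range(3), 2):
    print((i, j), round(adjacency_probability(dags, post, i, j), 3))
```

Two caveats. The number of DAGs grows super-exponentially with the number of variables, so enumeration only works for very small graphs; that is where samplers over DAGs come in. Also, this BIC score is identical for Markov-equivalent DAGs, so the sketch reports adjacency probabilities rather than edge directions.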

That's some really cool research, I had no idea that causal inference is being used in astronomy!

I reviewed this paper, and found some potential ideas:

GFCI (Ogarrio et al., 2016), a combination of GES and FCI, uses GES to find a supergraph of the skeleton and FCI to prune that supergraph and find the orientations. GFCI has proved more accurate in many simulations than the original FCI algorithm.

The Two-Step algorithm and the FASK algorithm (Sanchez-Romero et al., 2019) are two examples of procedures that use adjacency searches to provide an initial undirected graph which the algorithms then prune, refine, or extend.

Section 7 seems relevant as well. Anyway, it's a little bit hard to help because the solution might depend a lot on the characteristics of your dataset. It would help if you could share some info, like which variables are involved and what the sample size is.

@Shump Indeed, it’s the first time ever that causal discovery has been applied to astronomical data (to the best of my knowledge). I will take a look at the paper you suggest, thank you!

I doubt it will solve your particular problem, but are you aware of this? https://arxiv.org/abs/1803.01422

@DavidJohnston I wasn’t aware of this, thanks, I will look into it!

Not sure if this is applicable to your problem, but could you draw bootstrap samples instead of subsampling to do validation?

@Thomas42 I can, in principle.

The problem in general is instability. Let’s say I resample with replacement, so I get the same number of points and the conditional independence tests don’t suffer from not having enough data. I suspect that I will still get very different PDAGs across different runs of the resampling.

@mariopasquato Yeah, that's definitely a possible outcome, which I think should lower your confidence in the point estimate you got on the full dataset. Conversely, if it doesn't happen, it would be a good sign.

@Thomas42 Exactly
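The bootstrap check discussed in this thread can be sketched as follows. This is a toy under stated assumptions: `toy_skeleton` is a hypothetical stand-in that only runs order-0 and order-1 Fisher z tests, not the real PC or FCI, and the data is a synthetic chain X -> Y -> Z rather than anything astronomical. In practice one would swap in the actual discovery algorithm and keep the resampling loop.

```python
from itertools import combinations
from math import atanh, sqrt
from statistics import NormalDist

import numpy as np


def ci_pvalue(r, n, n_cond):
    """Fisher z p-value for a (partial) correlation r given n_cond conditioning variables."""
    r = float(np.clip(r, -0.999999, 0.999999))  # keep atanh finite
    stat = sqrt(n - n_cond - 3) * abs(atanh(r))
    return 2.0 * (1.0 - NormalDist().cdf(stat))


def toy_skeleton(data, alpha=0.05):
    """Keep edge i-j only if dependence survives the marginal test and
    every test conditioning on one other variable (a stand-in for PC)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    adj = np.zeros((p, p), dtype=bool)
    for i, j in combinations(range(p), 2):
        pvals = [ci_pvalue(corr[i, j], n, 0)]
        for k in range(p):
            if k in (i, j):
                continue
            num = corr[i, j] - corr[i, k] * corr[j, k]
            den = sqrt((1.0 - corr[i, k] ** 2) * (1.0 - corr[j, k] ** 2))
            pvals.append(ci_pvalue(num / den, n, 1))
        adj[i, j] = adj[j, i] = (max(pvals) < alpha)
    return adj


def bootstrap_edge_frequency(data, n_boot=200, alpha=0.05, seed=0):
    """Fraction of bootstrap resamples (with replacement) in which each edge appears."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    counts = np.zeros((p, p))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same sample size as the original data
        counts += toy_skeleton(data[idx], alpha)
    return counts / n_boot


rng = np.random.default_rng(1)
n = 150
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
data = np.column_stack([x, y, z])

freq = bootstrap_edge_frequency(data)
print(np.round(freq, 2))
```

Edges with frequency near 1 are stable under resampling, and edges near 0 are stably absent. The ones in between are exactly the ones where the point estimate on the full dataset should not be trusted much. Pooling by keeping only edges above some frequency threshold is one of the "hacks" the question mentions, though the threshold is of course another knob to set.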

Is it right that the timeframe of this question is the year 2024 (I'm assuming from the close date of the market), so that by the beginning of 2025 it will be resolved for sure?