Best explanation of how some probabilities can be more correct than others

Human probability judgements can obviously be bad in the sense of being biased, and not matching the conclusions drawn by an ideal Bayesian reasoner given the same information. This is not what I'm asking about.

Alice and Bob are both ideal Bayesian reasoners considering whether the new iPhone will sell at least 100 million copies. Alice knows nothing about the event, so her estimate is 50%. Bob works for Apple and knows that the new iPhone kind of sucks, so his estimate is 20%. Bob knows strictly more information, and if both of them bet on their credence, Bob will make money from Alice on expectation.

Now Alice learns all the information that Bob had about the iPhone being a poor product, but also learns that, as opposed to what had previously been announced, there's not going to be an Android release that year. Taking all this new information into account, her credence remains 50%. Now Alice knows strictly more information than Bob, and Alice is the one who will make money on expectation.
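To make the "profits on expectation" claim concrete, here is a minimal simulation of the first scenario. The setup is my own illustrative assumption, not part of the question: Bob's better-informed 20% is treated as the long-run frequency of the event over many similar situations, and Alice and Bob trade a $1 contract at the midpoint of their credences, so each side believes the trade favours them.

```python
import random

# Toy model of the first Alice/Bob bet (illustrative assumptions only):
# Bob's better-informed 20% is treated as the long-run frequency, and the
# two trade a $1 YES contract at the midpoint of their credences.
random.seed(0)

alice_credence = 0.50
bob_credence = 0.20
true_frequency = 0.20                          # hypothetical ground truth for the simulation
price = (alice_credence + bob_credence) / 2    # 0.35: Alice buys YES, Bob sells

n_trials = 100_000
alice_profit = 0.0
for _ in range(n_trials):
    event_happens = random.random() < true_frequency
    payout = 1.0 if event_happens else 0.0
    alice_profit += payout - price             # Bob's profit is the negative of this

print(f"Alice's average profit per bet: {alice_profit / n_trials:+.3f}")
print(f"Bob's   average profit per bet: {-alice_profit / n_trials:+.3f}")
# Expected: Alice about -0.15, Bob about +0.15 per $1 contract.
```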

We can imagine a more complex situation where neither of them has strictly more information than the other, and presumably the one that has more total bits of information about the event is the one who will profit on expectation.

This means that when discussing a subjective probability, the probability itself is not the only meaningful number; it also matters how many bits of information were used to derive it. This feels odd.

I have no specific question, this is open-ended. Am I missing anything? Do you have an alternative way of thinking about this that might make more intuitive sense?


Bob works for Apple and knows that the new iPhone kind of sucks, so his estimate is 20%. Bob knows strictly more information, and if both of them bet on their credence, Bob will make money from Alice on expectation.

I don't agree. Bob seems to update too hard based on the inside view. That sort of thing never matters as much as we think it does (see Meehl et al.).

More generally, information is highly overvalued when it comes to accurate forecasting and probability estimations. The best forecasters use very general models and earn much of their money from avoiding overconfidence. Tetlock co-authored a paper on the BIN model, which I tried to summarise here (scroll down a bit to the relevant section): https://two-wrongs.com/improving-forecasting-accuracy-benchmark.html

The gist of it is that the future is sufficiently uncertain that adding more information quickly reaches a point of diminishing returns. However, when humans get more information beyond what improves accuracy, they still become more confident. (Statistical models do this too: if we are not careful in how we evaluate them, they overfit.)

The skill in assigning probabilities often lies in knowing which information to ignore, to avoid adding noise to the judgement -- noise is the real accuracy killer.

Your conclusion is mistaken. It's not bits of information used to derive the probability that we care about in determining credences.

Consider the intuition pump from Dennett regarding the intentional stance. Suppose Alice calls up Bob and tells Bob that she'll see him for lunch the next day. Bob knows Alice to be very punctual: when she makes commitments, she keeps them unless there's a life-threatening emergency. Bob has a >90% credence that Alice will show, with only a few bits of information, albeit with a reasonably robust framework for understanding the world.

Omega does not understand intentionality or emergent behavior at all. He is a Laplacian Demon who simulates every microbanging of particles in the solar system as best he can. Obviously, due to quantum uncertainty, Omega can't be completely sure that any particular outcome takes place. But we're well in the realm of decoherence at this point, even if there's still the measurement problem in a technical sense. Omega has a fairly strong credence that some cluster of particles with some relationships between them (call them A) will occupy a certain set of spacetime points near some other cluster of particles with relationships (call them B). I'm not convinced that this credence is necessarily over 90%, due to the nature of the calculation, and I certainly don't think it's obviously the case that Omega has a substantially higher credence than Bob.

So Bob and Omega have reasonably similar credences in an example where you'd wish to have a high credence, and perhaps Bob's is even stronger. Omega has wildly more information in terms of bits than Bob. So it can't just be bits of information we care about.

There are many interesting responses already; I will try to add yet another point of view.

I think that Alice and Bob should not be using Bayesian statistics to obtain a single number for the probability of the event; rather, they should be modeling a probability distribution over all possible models of the world. How narrow this (multi-dimensional) distribution is gives you an estimate of how certain or uncertain the probability (the mean value calculated from this distribution) is.

In practice this is not doable, but we are talking about ideal agents. Let me make a very simple example to show that this claim can have a technical meaning. Let's say that there is a single true probability p of the event. Then we are in a situation where we derive the posterior from a likelihood that is a binomial distribution. Thanks to conjugate priors, the posterior stays in the form of a Beta distribution over iterated updates. In your example, Alice has the distribution Beta(x, 1, 1), which is the uniform distribution, while Bob has the posterior Beta(x, a, 4a), a class of posteriors with mean value 20%; for simplicity let's say Beta(x, 1, 4). After her next update, Alice has the posterior Beta(x, 4, 4) with mean value 50% again, but if you plot the distribution, it is much narrower, so her probability claims are stronger, because she has restricted the possible values of p further.
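Here is a minimal sketch of that comparison using scipy. "Beta(x, a, b)" in the text is the Beta(a, b) density evaluated at x, so only the shape parameters appear below.

```python
from scipy.stats import beta

# The three posteriors from the example above.
alice_before = beta(1, 1)   # uniform: knows nothing, mean 0.50
bob          = beta(1, 4)   # mean 0.20, after Bob's evidence
alice_after  = beta(4, 4)   # mean 0.50 again, but with more evidence

for name, dist in [("Alice before", alice_before),
                   ("Bob", bob),
                   ("Alice after", alice_after)]:
    print(f"{name:13s} mean={dist.mean():.2f}  sd={dist.std():.2f}")

# Alice's mean returns to 0.50, but her sd drops from ~0.29 to ~0.17:
# the same point estimate, backed by a narrower distribution over p.
```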

In this framework, where we end up with Beta(x, a, b), the number a+b-2 loosely corresponds to the bits of relevant evidence. (It is the number of coin flips we would have had to see if this were a coin of unknown bias.) This is a useful model rather than something rigorous, because in reality the posterior would be wildly multi-dimensional, over all possible models of the whole world. Imagining that there is a "true probability" has its limits, but it also has many practical merits. One can, for example, derive a generalization of the Kelly bet to calculate how much one should be betting. This is (I think) even better than only inputting the assumed true probability, as in the original Kelly formula, because it adds a notion of how sure you are about that probability.
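A rough Monte Carlo sketch of the kind of Kelly generalization mentioned here: choose the fraction of bankroll that maximizes log wealth averaged over the Beta posterior, rather than at a single plugged-in probability. The even-money bet against the event, and the specific numbers, are my own illustrative assumptions; for a one-shot bet like this the optimum happens to coincide with Kelly at the posterior mean, but the posterior also answers questions a point estimate cannot, such as how likely it is that the bet has no edge at all.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Monte Carlo version of a Kelly bet under parameter uncertainty: pick the
# fraction f of bankroll that maximizes log wealth averaged over the posterior
# for p, instead of plugging in a single point estimate. The even-money bet
# *against* the event is an illustrative assumption, not from the comment.
rng = np.random.default_rng(0)
posterior = beta(1, 4)                         # Bob's posterior over p (mean 0.20)
p = posterior.rvs(size=5000, random_state=rng)
q = 1 - p                                      # chance the bet against the event wins

def neg_growth(f):
    # Minus the expected log wealth after one even-money bet of fraction f,
    # averaged over the posterior samples.
    return -np.mean(q * np.log1p(f) + (1 - q) * np.log1p(-f))

best = minimize_scalar(neg_growth, bounds=(0.0, 0.99), method="bounded")
print(f"posterior mean win probability : {q.mean():.2f}")
print(f"optimal bet fraction           : {best.x:.2f}")   # ~ 2*0.80 - 1 = 0.60
print(f"P(p > 0.5), i.e. no edge at all: {posterior.sf(0.5):.3f}")
```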

I think this also covers a deterministic universe; we just need enough evidence to narrow the posterior over possible worlds sufficiently. (Not the model with a single true probability, but the wildly multi-dimensional one.)

On a separate theme from my other comment, it's not that some probabilities are more correct than others -- the idea that there exists a correct probability for any real world event is flawed to begin with. Here's what de Finetti says about it:

Let us assume we have to make a drawing from an urn containing 100 balls. We do not know the respective numbers of white and red balls, but let’s suppose that we attribute equal probabilities to symmetric compositions, and equal probability to each of the 100 balls: the probability of drawing a white ball is therefore 50 %. Someone might say, however, that the true probability is not 50 %, but b/100, where b denotes the (unknown) number of white balls: the true probability is thus unknown, unless one knows how many white balls there are.

Another person might observe, on the other hand, that 1000 drawings have been made from that urn and, happening to know that a white ball has been drawn B times, one could say the true probability is B/1000. A third party might add that both pieces of information are necessary, as the second one could lead one to deviate slightly from attributing equal probabilities to all balls.

A fourth person might say that he would consider the knowledge of the position of each ball in the urn at the time of the drawing as constituting complete information (in order to take into account the habits of the individual doing the drawing; their preference for picking high or low in the urn): alternatively, if there is an automatic device for mixing them up and extracting one, the knowledge of the exact initial positions which would allow one to obtain the result by calculation (emulating Laplace’s demon.)

Only in this case would one arrive, at last, at the true, special partition, which is the one in which the theory of probability is no longer of any use because we have reached a state of certainty. The probability, “true but unknown” of drawing a white ball is either 100 % or 0 %.

Probability is not a way to assign numbers to events, it is a tool to reason about combinations of those numbers. As a silly example, if Bob believes heads has a 53 % probability and tails a 57 % probability, Alice will set up a Dutch book and fleece Bob.
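Spelling out the fleecing in that silly example, as a tiny sketch (the stake sizes are arbitrary):

```python
# Bob prices heads at 0.53 and tails at 0.57, so he will happily pay those
# prices for $1 contracts on each outcome. Alice sells him both.
price_heads, price_tails = 0.53, 0.57
bob_pays = price_heads + price_tails             # 1.10 up front

for outcome in ("heads", "tails"):
    payout = 1.00                                # exactly one contract pays off
    print(f"{outcome}: Bob receives {payout:.2f}, net {payout - bob_pays:+.2f}")
# Bob loses 0.10 no matter what happens: incoherent probabilities are exploitable.
```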

"Yeah but nobody reasonable would believe like Bob." Hah! Of course not in this simple scenario, but when you make the situation more complex people assign incoherent probabilities all the time, because they don't even realise all of the things are related and need to cohere.

Imagine that Bob didn't know heads+tails=1, because to him they look like two completely different coin flips, and God decides for each one how it's going to land, and God does not abide by silly mathematical rules. It is then completely permissible for him to assign 53% and 57%, because under the assumption that the events are unrelated, all assignments are coherent. That's the mistake you and I make in more complicated situations.

Good forecasters, from this point of view, have a knack for assigning coherent probabilities. Not by thinking through every alternative deeply, but because they have trained an intuitive sense for what somewhat coherent probabilities feel like, along with the skill of figuring out which alternatives are the most important ones to examine, even roughly.

Frame challenge: this may look odd because you are oversimplifying the real-world problem in at least two important ways: 1) "ideal Bayesian reasoning" is not always the best way to make judgements for any purpose and 2) probabilities are not derived directly from "information", but are instead produced by a combination of information and a model of the world. Once we accept the greater complexity of the real-world problem, I think the oddity disappears - any probability claim needs to be evaluated with a lot of context in mind and the amount of information used just becomes another piece of that context, accompanying the modelling assumptions made, the purpose we want to use the estimate for etc.

I think point 2 is kind of obvious (I can expand if needed), but let's expand on point 1 a bit, because that appears to be harder to get across. The claim is that purely Bayesian reasoning has problems in some cases. Here's an example I like: assume the typical "biased coin" scenario. You learn that person X reported throwing a coin 10 times, and it landed heads 9 times. You run a Bayesian model and produce some probabilities. You then learn that person X has actually been throwing many different coins 10 times each and would only disclose a result if they got at least 9 heads. It is a theorem of Bayesian statistics that none of your probabilities should change after learning this information. (This is an instance of the likelihood principle: https://en.wikipedia.org/wiki/Likelihood_principle.)
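The theorem being appealed to can be checked numerically. Here is a sketch (the grid and the uniform prior are my own illustrative choices): compute the posterior for the coin's heads probability two ways, naively from "9 heads in 10 flips" and with the "disclose only if at least 9 heads" selection modeled explicitly, assuming the coins really are drawn from the prior. The selection factor cancels and the two posteriors coincide.

```python
import numpy as np
from scipy.stats import binom

# Grid over the coin's heads probability, with a uniform prior that is
# assumed (for illustration) to match the distribution X draws coins from.
p = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(p)

# Naive posterior: just condition on "9 heads in 10 flips".
naive = prior * binom.pmf(9, 10, p)
naive /= naive.sum()

# Posterior that models the selection explicitly:
#   a coin is disclosed with probability P(>= 9 heads | p), and given
#   disclosure the chance the result is exactly 9 is P(9 | p) / P(>= 9 | p).
# The selection factor cancels, so the posterior is unchanged.
disclosed = binom.sf(8, 10, p)                 # P(>= 9 heads | p)
selected = prior * disclosed * (binom.pmf(9, 10, p) / disclosed)
selected /= selected.sum()

print("max difference between posteriors:", np.abs(naive - selected).max())
```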

The problem here is subtle: if you truly believe that your prior distribution on the coin probabilities matches the distribution person X draws their coins from, there is actually no problem, and any betting-like behaviour on your part will be valid (despite this being a bit counterintuitive). But if the prior is mistaken, you'll run into problems (e.g. in the extreme case where X just throws the same coin all the time, 9 or more successes provide no information). OK, sounds like good old model misspecification, what's the deal?

The deal is that no priors real people use are designed to exactly match their beliefs/background information, because this is almost impossible to do. You'd usually just assume some rough approximation and err on the side of wider priors. This mismatch doesn't matter in the typical case as you can argue that the effect of slightly mistaken priors will be very limited for most real situations. However, in the face of selective reporting as described above, a minor prior misspecification may actually not disappear with growing dataset size.

Frequentist reasoning lets you dig yourself out of that hole without needing to model X's distribution explicitly. But frequentism isn't perfect either and has other deep problems that Bayesians have been pointing out for some time and which I presume you are familiar with. I personally like the Bayesian approach (and do research on Bayesian methods), but I am quite sure it is not a panacea. I don't think there is any single approach to probability that works well in all situations. Epistemology is not solved and you need to consider your aims and the type of information you have before you decide how to compute a probability.

Here's an interesting discussion about this that I don't fully understand:

https://www.lesswrong.com/posts/MquvZCGWyYinsN49c/range-and-forecasting-accuracy

Setting aside the thought experiment, there's the question of why the information known is rarely mentioned when talking about subjective probabilities and/or Kelly betting. I think that it isn't important (see Najawin's comment) unless these two factors, or maybe others I haven't thought of, come up:

  1. Adversarial situations where either you are betting against someone (rather than against the universe) who has any information that you don't, or where someone is selectively telling you true information in order to change your probability estimate of something. In both these situations, you may want to bet under-Kelly or even not at all (see the sketch after this list). However, you might also or instead be able to include your knowledge of this situation in your subjective probability.

  2. Situations where you can gather more information before you bet. It's probably more worthwhile to gather information the less of it you have. Exactly how much information you should gather before betting could be worked into an expanded version of the Kelly criterion.
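As referenced in point 1, here is a toy illustration of why betting under-Kelly, or not at all, can make sense when your estimate may be too optimistic relative to what the counterparty knows. The specific numbers are my own assumptions:

```python
import numpy as np

# You estimate a 60% chance of winning an even-money bet, but because the
# counterparty may know more, suppose the real frequency is only 52%.
# Compare long-run growth of full Kelly (sized to your estimate), half Kelly,
# and not betting at all.
p_estimate, p_real = 0.60, 0.52
full_kelly = 2 * p_estimate - 1                  # 0.20 of bankroll
for f in (full_kelly, full_kelly / 2, 0.0):
    growth = p_real * np.log1p(f) + (1 - p_real) * np.log1p(-f)
    print(f"fraction {f:.2f}: expected log growth per bet {growth:+.4f}")
# Full Kelly sized to the wrong estimate shrinks the bankroll fastest; half
# Kelly loses much less, and here not betting at all is actually best.
```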

Thinking further: as the number of bits the agent has access to tends to infinity, the agent's credence must go to 0 or 1. But with finite information, gaining more information can send the agent's credence back towards 50%. So it's not monotonic, and it probably never becomes monotonic above a certain threshold. But what determines whether the next bit will send the probability up or down? It must depend on what part of the universe that bit is about, which depends on the specific information stream the agent has access to. Two different agents with different information streams about the same event must eventually converge to the same extreme, but in the meantime they could end up having wildly different probabilities. If they place bets based on the same number of bits, will they have the same average winnings doing this multiple times, even if they have different probabilities every time? I guess they must.
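One toy construction that shows the non-monotonic convergence (my own setup, not implied by the question): let the event be "a majority of 101 hidden fair coin flips come up heads" and reveal the flips to the agent one bit at a time. The exact credence wanders, sometimes back toward 50%, but is forced to 0 or 1 once all the bits are in.

```python
import random
from scipy.stats import binom

random.seed(3)
N = 101                                   # event: majority of N fair flips are heads
flips = [random.random() < 0.5 for _ in range(N)]
need = N // 2 + 1                         # heads needed for a majority (51)

heads = 0
for k, flip in enumerate(flips, start=1):
    heads += flip
    remaining = N - k
    still_needed = need - heads
    # P(majority | first k flips) = P(Binomial(remaining, 0.5) >= still_needed)
    credence = binom.sf(still_needed - 1, remaining, 0.5) if still_needed > 0 else 1.0
    if k % 10 == 0 or k == N:
        print(f"after {k:3d} bits: credence {credence:.3f}")
# More bits eventually force the credence to 0 or 1, but along the way a new
# bit can move it back toward 0.5 rather than monotonically toward the end point.
```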

How does all this relate to the value of information of each new bit? Not sure.

This feels odd.

I'm not seeing the oddity. Isn't it totally natural that the prediction made with more information is the more reliable one?