The truth is they both make tradeoffs that can appear ridiculous. In fact, the criticisms of confidence intervals and p-values apply almost exactly, in transpose, to credibility intervals and posterior probabilities.
Confidence intervals and p-values are a worst-case technique. The p-value will always control the false positive rate below alpha, even in the worst input. Sometimes you do want this -- e.g. when we say that the worst-case runtime of QuickSort is O(n^2), that's useful, even if we do have a prior distribution over the inputs and could also say that the expected runtime is O(n log n). But the errors are correlated across observations. You can have a valid "95%" confidence interval that always produces total nonsense when the experiment ends up with output x, as long as x happens <5% of the time for all possible inputs.
Credibility intervals and posterior probabilities are an average-case technique, where we integrate over the prior. Even if the prior is correct, the errors are correlated across inputs, which can be a problem. In the cookie-jar example at stackexchange, the 70% credibility interval is "wrong" 80% of the time when the jar is type B. That means if you send out 100 "Bayesian" robots to assess what type of jar you have, each robot sampling one cookie, you will expect 80 of the robots to get the wrong answer, each having >73% posterior probability in that wrong conclusion! That's a problem, especially if you want most of the robots to agree on the right answer. The two methods just make different tradeoffs in the way they quantify uncertainty.
My quibble with the cartoon, though, was that it's not really about the frequentist vs. Bayesian debate. If you want to decide whether to take action (like shuttering a satellite) in response to a "YES" output from the instrument, everybody will agree that you need to calculate {rate of events} * {false negative rate} * {cost of false negative} and compare it with {1 - rate of events} * {false positive rate} * {cost of false positive}.
The frequentist agrees with this math, the Bayesian agrees with this math, and the math doesn't even use Bayes' rule. This is basic actuarial science or decision theory.
The frequentist might do the mechanics in a certain way. They may say they are first going to calculate a p-value, then ask whether the p-value is less than a threshold alpha, where alpha was set based on the costs and rate of events in order to control the "false discovery rate," and then take action only on a "significant" result.
The Bayesian might do the calculation a little differently too; they could say they are first going to use the expected rate of events as a prior, then calculate the conditional probability that there has been an event (given the instrument's reading), and then multiply this posterior probability by the cost of false negative, and its complement by the cost of false positive, to decide which action has lower expected cost.
But both the frequentist and Bayesian will get the same answer and end up with the same result as somebody who evaluates the inequality above directly. I don't think any technique has a monopoly on the correct answer here.
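Here is a minimal sketch of why the posterior route and the direct comparison land on the same decision. It assumes one particular reading of the inequality: for the branch where the instrument says "YES", the event side is weighted by the chance that a real event produces a "YES", and the no-event side by the chance of a false alarm. The function names and all the numbers are hypothetical, chosen only for illustration.

```python
def decide_direct(rate, p_yes_given_event, p_yes_given_no_event, cost_fn, cost_fp):
    """Act on a YES if ignoring real events costs more than acting on false alarms."""
    cost_of_ignoring = rate * p_yes_given_event * cost_fn
    cost_of_acting = (1 - rate) * p_yes_given_no_event * cost_fp
    return cost_of_ignoring > cost_of_acting


def decide_posterior(rate, p_yes_given_event, p_yes_given_no_event, cost_fn, cost_fp):
    """Same decision via Bayes' rule: the rate of events plays the role of the prior."""
    evidence = rate * p_yes_given_event + (1 - rate) * p_yes_given_no_event
    posterior = rate * p_yes_given_event / evidence  # P(event | YES)
    return posterior * cost_fn > (1 - posterior) * cost_fp


# Hypothetical instrument: 1% event rate, 90% detection, 5% false alarms.
print(decide_direct(0.01, 0.9, 0.05, cost_fn=1000, cost_fp=10))
print(decide_posterior(0.01, 0.9, 0.05, cost_fn=1000, cost_fp=10))
```

The second function only divides both sides of the first inequality by the same positive number, P(YES), which is why the two cannot disagree.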
I personally think that both groups are doing it wrong. I do a lot of A/B testing. In A/B testing what you care about is this:
1. Not getting a horribly wrong answer.
2. Getting an answer quickly.
A frequentist can tell me how to avoid getting an answer, but has no way to account for the fact that below some threshold of data any answer is likely to be chance, and the errors that I can make are severe. And conversely, if I have enough data, I'm likely to find real answers, and the mistakes I am likely to make are acceptable.
A Bayesian can tell me - in principle - that there is a threshold below which I should be cautious of making decisions and a threshold above which I can make decisions more easily. But naive priors set those thresholds too low, and I do not have sufficient data to come up with a real prior to use. I could create a conservative prior, but it would be hard to explain to anyone what I am doing.
In practice I've found that it is effective to blend the approaches. I compute frequentist statistics, but based on my past experience and knowledge of the ease of making severe errors, I insist on very high confidence levels for low amounts of data, and much lower for high amounts of data. Based on some numerical simulations, my true error rates seem acceptably low, and experiments run acceptably quickly.
(If I did not care about the speed of testing, then I could just set a rule like, "Go with the first version to get 296 conversions ahead." If the two versions have conversion rates that differ by 1%, then 95% of the time I will get the right answer. If the difference is larger, I will get the answer even more often. If the difference is smaller I will get the wrong answer more often - but the errors that come will be small and on average I'm still making good business decisions. All of the complex stats I actually do are just about getting answers quickly without compromising how often, on average, I make bad business decisions.)
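The 296 figure can be checked with the classic gambler's-ruin formula. This is a sketch under assumptions not spelled out above: traffic is split evenly, "differ by 1%" is read as a relative difference in conversion rate, and the lead in conversions behaves as a random walk that stops at +296 or -296. The function name is hypothetical.

```python
def prob_better_version_wins(lead, relative_lift):
    """Chance the better version is the first to get `lead` conversions ahead.

    Assumes each new conversion comes from the better version with
    probability p = (1 + lift) / (2 + lift), i.e. equal traffic.
    """
    p = (1 + relative_lift) / (2 + relative_lift)
    q = 1 - p
    ratio = q / p  # equals 1 / (1 + relative_lift)
    # Gambler's ruin: P(hit +lead before -lead, starting from 0).
    return 1 / (1 + ratio ** lead)


print(prob_better_version_wins(296, 0.01))  # roughly 0.95
print(prob_better_version_wins(296, 0.02))  # higher, as the gap widens
```

Under those assumptions, a lead of 296 is about the smallest threshold that reaches 95% for a 1% relative difference.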
The posterior distribution is only half the story. A true Bayesian uses a utility function to make decisions. How much you care about the worst case vs the average case is in the utility function. That's exactly the problem with frequentist methods: you're still making assumptions, but they are implicit and hardcoded in your choice of method, instead of explicitly stated and tweakable. With a particular choice of prior and utility function, you can recover many frequentist methods, but in most cases those will not be the prior and utility function you actually want. For example maximum likelihood estimation corresponds to a utility function equal to the likelihood (i.e. the probability mass), which at the very least should strike you as ridiculous for continuous quantities (maximum likelihood can still be useful as an approximation technique if the problem is intractable with your actual utility function). With a frequentist method, you are using a prior, you just don't know which one.
For some problems you might be able to get the correct decision in a very roundabout way by setting your alpha to the right magic value, but (1) it's not clear how to find the right alpha and (2) in general you cannot encode a complete utility function into a single number.
> With a frequentist method, you are using a prior, you just don't know which one.
I don't think so. Look over the cookie-jar example. The confidence interval guarantees worst-case coverage at least equal to its confidence parameter, for all values of the parameter. The credibility interval gives average-case coverage, integrated over the prior.
The confidence interval gives guaranteed coverage for every value of the parameter (conditioned on each possible input value). The credibility interval includes enough mass in the conditional probability function, conditioned on each possible output observable.
These are different mathematical objects and they do different things. The confidence interval doesn't use a prior over input values; it is giving you guaranteed coverage for any input value.
Let me put it this way: if you think the frequentist method is using some prior, what choice of prior will make the 70% credibility intervals in the cookie-jar case be identical to the 70% confidence intervals?
Anybody can think about utility to make decisions; it's not unique to Bayesian methods. Statisticians and engineers have been calculating ROC curves and choosing operating points on the ROC frontier (based on cost/benefit analysis) since World War II.
I meant that in the context of making a decision. The point of statistics is to make decisions. For example you want to know whether a medical treatment works so that you can decide whether or not to give it to people. So you do a hypothesis test to see whether the treatment works better than a placebo, and then if the p-value is small enough you give it to people, and otherwise you don't. Instead of explicitly separating the assumptions (prior & utility) from the logical deduction, the assumptions are embedded in this procedure. Why would the assumptions implicitly made by the choice of procedure be the assumptions you want to make? You take the answer to a question that's irrelevant to the decision, namely "given that the treatment doesn't work, how likely is the data" and try to tweak a decision based on that. There is no principled way to make decisions based on that information.
Credibility intervals are about average-case coverage, but Bayesian statistics as a whole is definitely NOT just about average case. In general the utility `U` is a function of the decision `d`, and of the posterior knowledge you have of the world `P`. In many practical cases the utility might be the expected profit: U(d,P) = integral(profit(x,d)P(x)dx). But it certainly doesn't have to be. If you are risk averse you might choose your utility as U(d,P) = min_x profit(x,d) to ensure that your utility is the minimum profit you make given a decision, rather than the average. Another example is U(x,P) = P(x) which gives maximum likelihood estimation. Making a decision based on a hypothesis test can also be emulated with a utility function. Suppose the hypothesis is H and we make decision d1 if p-value > alpha and d2 if p-value < alpha. We choose a prior P(I) that makes each possible observed data set I equally likely, and we choose the utility function that reverses Bayes' rule to compute P(I|H) to make a decision based on that:
where brackets are indicator notation. Note that P(I) appears to access the data set which the utility function does not have access to, but recall that P(I) is constant regardless of the measured data. Of course not many people have such a prior and utility function...so it doesn't really make sense to hard code them into the method.
In general the process works like this. Given prior P and utility U and measured information I:
1. Compute posterior P' from prior P and information I according to Bayes' rule.
2. Perform decision argmax_d U(d,P').
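The two steps above can be sketched for a discrete toy problem. Everything here is hypothetical (the states, the likelihoods, the profit table, the function names); it only shows that swapping the expected-profit utility for the risk-averse minimum-profit utility can change the decision while the posterior stays the same.

```python
def posterior(prior, likelihood, observation):
    """Step 1: Bayes' rule over a finite set of world states."""
    unnormalized = {x: prior[x] * likelihood[x][observation] for x in prior}
    total = sum(unnormalized.values())
    return {x: w / total for x, w in unnormalized.items()}


def expected_profit(d, post, profit):
    """U(d, P) = sum over x of profit(x, d) * P(x)."""
    return sum(post[x] * profit[(x, d)] for x in post)


def worst_case_profit(d, post, profit):
    """U(d, P) = min over x of profit(x, d), over states with nonzero mass."""
    return min(profit[(x, d)] for x in post if post[x] > 0)


def decide(decisions, post, utility, profit):
    """Step 2: argmax over decisions of U(d, P')."""
    return max(decisions, key=lambda d: utility(d, post, profit))


# Hypothetical treatment example.
prior = {"works": 0.5, "fails": 0.5}
likelihood = {"works": {"improved": 0.8}, "fails": {"improved": 0.3}}
profit = {("works", "treat"): 10, ("fails", "treat"): -20,
          ("works", "skip"): 0, ("fails", "skip"): 0}

post = posterior(prior, likelihood, "improved")
print(decide(["treat", "skip"], post, expected_profit, profit))    # treat
print(decide(["treat", "skip"], post, worst_case_profit, profit))  # skip
```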
Can you point me to a similarly principled approach to decision making based on utility with frequentist methods?
> Can you point me to a similarly principled approach to decision making based on utility with frequentist methods?
Sure, as I said anything involving ROC curves, where we pick an operating point by trading off the cost of false positives vs. false negatives and a design rate of incoming true positives and negatives.
Can you give a mathematical recipe with assumptions and deductions? ROC curves don't cut it, it's just twiddling of a parameter of a classifier. Is it optimal in any sense? How do you know it's a good classifier for making decisions, when it's a classifier based on "given the hypothesis, how likely is the data" and not the other way around? Is it generalizable to other situations?
Yes, given a design rate of true positives or negatives, and a cost for false positives and false negatives, you can pick the optimal operating point. It will be optimal in the sense of minimizing average cost when the incoming rate equals the rate you designed for. You'll get the exact same answer as a "Bayesian" who uses conditional probability to calculate the same thing and whose prior equals the design rate. I gave a worked-out example in my original post ("If you want to decide whether to take action...").
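As a rough sketch of picking the operating point: given a list of (false positive rate, true positive rate) pairs from some classifier's ROC curve, choose the pair with the lowest average cost at the design rate. The ROC points, rates, costs, and function names below are made up for illustration.

```python
def expected_cost(fpr, tpr, rate, cost_fn, cost_fp):
    """Average cost per case: missed events plus false alarms."""
    return rate * (1 - tpr) * cost_fn + (1 - rate) * fpr * cost_fp


def best_operating_point(roc_points, rate, cost_fn, cost_fp):
    """Return the (fpr, tpr) pair minimizing average cost at the design rate."""
    return min(roc_points,
               key=lambda pt: expected_cost(pt[0], pt[1], rate, cost_fn, cost_fp))


# Hypothetical ROC curve, as (fpr, tpr) pairs.
roc = [(0.0, 0.0), (0.05, 0.6), (0.2, 0.9), (0.5, 0.98), (1.0, 1.0)]

print(best_operating_point(roc, rate=0.1, cost_fn=10, cost_fp=1))   # (0.2, 0.9)
print(best_operating_point(roc, rate=0.01, cost_fn=10, cost_fp=1))  # (0.05, 0.6)
```

When events are rarer, the best point moves toward fewer false alarms, which is the dependence on the design rate described above.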
Sure, it is generalizable -- we use ROC curves for radar, medical imaging, almost any diagnostic test...
Sure, the problem is not which point on the ROC curve you pick, the problem is which classifier you use to obtain it in the first place. I can pick a random classifier with a tunable parameter and draw its ROC curve and then pick the "optimal" point, but if the classifier sucks then that's no good. Why would a frequentist classifier based on a hypothesis test be good? A hypothesis test is the answer to the wrong question for the purposes of making a decision.
As I showed above, you can indeed get the same result from Bayesian decision making if you use a weird prior and utility function, which shows that frequentist decision making based on hypothesis tests is a subset (of measure 0) of Bayesian decision making. Again, that just means that you encoded a most likely wrong prior and utility in the choice of method without any justification.
Even more here: http://qr.ae/17BEW