Here is my writeup on the "real" difference between frequentist and Bayesian met...

btilly · on Nov 11, 2012

Upvoted for the "real difference" link.

I personally think that both groups are doing it wrong. I do a lot of A/B testing. In A/B testing what you care about is this:

1. Not getting a horribly wrong answer.

2. Getting an answer quickly.

A frequentist can tell me how to avoid getting a wrong answer, but has no notion that below some threshold of data any answer is likely to be chance, and that the errors I can make there are severe. Conversely, if I have enough data, I'm likely to find real answers, and the mistakes I am likely to make are acceptable.

A Bayesian can tell me - in principle - that there is a threshold below which I should be cautious of making decisions and a threshold above which I can make decisions more easily. But naive priors set those thresholds too low, and I do not have sufficient data to come up with a real prior to use. I could create a conservative prior, but it would be hard to explain to anyone what I am doing.

In practice I've found that it is effective to blend the approaches. I compute frequentist statistics, but based on my past experience and knowledge of the ease of making severe errors, I insist on very high confidence levels for low amounts of data, and much lower for high amounts of data. Based on some numerical simulations, my true error rates seem acceptably low, and experiments run acceptably quickly.

(If I did not care about the speed of testing, then I could just set a rule like, "Go with the first version to get 296 conversions ahead." If the two versions have conversion rates that differ by 1%, then 95% of the time I will get the right answer. If the difference is larger, I will get the answer even more often. If the difference is smaller I will get the wrong answer more often - but the errors that come will be small and on average I'm still making good business decisions. All of the complex stats I actually do are just about getting answers quickly without compromising how often, on average, I make bad business decisions.)
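The 296 figure can be checked with a short calculation. The following is a minimal sketch under assumptions the comment does not state: "differ by 1%" is read as a relative difference, both versions get equal traffic, and each conversion is an independent event. Under those assumptions the lead of the better version is a biased random walk, and the classic gambler's-ruin formula gives the chance that it reaches +296 before -296. The function name `prob_correct` is made up for illustration.

```python
def prob_correct(lead: int, relative_lift: float) -> float:
    """Chance that the better version is the first to get `lead` conversions ahead.

    Assumes equal traffic, so the next conversion comes from the better
    version with probability p = (1 + lift) / (2 + lift) and from the worse
    one with q = 1 / (2 + lift). With symmetric stopping barriers at +lead
    and -lead, gambler's ruin gives 1 / (1 + (q / p) ** lead).
    """
    ratio = 1.0 / (1.0 + relative_lift)  # q / p
    return 1.0 / (1.0 + ratio ** lead)


if __name__ == "__main__":
    for lift in (0.005, 0.01, 0.02):
        print(f"relative lift {lift:.3f}: P(right answer) = {prob_correct(296, lift):.4f}")
```

With a 1% relative lift this comes out at roughly 0.95, which matches the figure quoted above; larger lifts push the probability up and smaller ones pull it toward 0.5.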

jules · on Nov 11, 2012

The posterior distribution is only half the story. A true Bayesian uses a utility function to make decisions. How much you care about the worst case vs the average case is in the utility function. That's exactly the problem with frequentist methods: you're still making assumptions, but they are implicit and hardcoded in your choice of method, instead of explicitly stated and tweakable. With a particular choice of prior and utility function, you can recover many frequentist methods, but in most cases those will not be the prior and utility function you actually want. For example maximum likelihood estimation corresponds to a utility function equal to the likelihood (i.e. the probability mass), which at the very least should strike you as ridiculous for continuous quantities (maximum likelihood can still be useful as an approximation technique if the problem is intractable with your actual utility function). With a frequentist method, you are using a prior, you just don't know which one.

For some problems you might be able to get the correct decision in a very roundabout way by setting your alpha to the right magic value, but (1) it's not clear how to find the right alpha and (2) in general you cannot encode a complete utility function into a single number.

keithwinstein · on Nov 12, 2012

> With a frequentist method, you are using a prior, you just don't know which one.

I don't think so. Look over the cookie-jar example. The confidence interval guarantees worst-case coverage at least equal to its confidence parameter, for all values of the parameter. The credibility interval gives average-case coverage, integrated over the prior.

The confidence interval gives guaranteed coverage for every value of the parameter (conditioned on each possible input value). The credibility interval includes enough mass in the conditional probability function, conditioned on each possible output observable.

These are different mathematical objects and they do different things. The confidence interval doesn't use a prior over input values; it is giving you guaranteed coverage for any input value.
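The distinction between the two kinds of interval can be made concrete with a small exact calculation. This is only a sketch: the likelihood table below is hypothetical (it is not claimed to be the table from the linked cookie-jar writeup), the prior for the credible sets is assumed uniform, and both set constructions are simple greedy ones chosen for illustration.

```python
# Hypothetical table: LIKELIHOOD[theta][x] = P(x | theta) in percent.
# Each row sums to 100. theta is the unknown input, x the observed output.
LIKELIHOOD = {
    "A": [1, 1, 70, 28, 0],
    "B": [12, 19, 24, 20, 25],
    "C": [13, 20, 0, 24, 43],
    "D": [27, 70, 1, 1, 1],
}
LEVEL = 70  # target level in percent
N_OUTCOMES = 5


def confidence_sets(lik, level):
    """For each theta, accept its most likely outcomes until they hold at
    least `level` percent; the set for x is every theta that accepts x."""
    sets = {x: set() for x in range(N_OUTCOMES)}
    for theta, row in lik.items():
        mass = 0
        for x in sorted(range(N_OUTCOMES), key=lambda i: -row[i]):
            if mass >= level:
                break
            sets[x].add(theta)
            mass += row[x]
    return sets


def credible_sets(lik, level):
    """Uniform prior (an assumption): the posterior over theta given x is
    proportional to lik[theta][x]. Take the most probable thetas until
    they hold at least `level` percent of the posterior."""
    sets = {}
    for x in range(N_OUTCOMES):
        total = sum(row[x] for row in lik.values())
        mass, chosen = 0, set()
        for theta in sorted(lik, key=lambda t: -lik[t][x]):
            if mass * 100 >= level * total:
                break
            chosen.add(theta)
            mass += lik[theta][x]
        sets[x] = chosen
    return sets


def coverage(lik, sets):
    """Percent chance, for each true theta, that the reported set contains it."""
    return {theta: sum(p for x, p in enumerate(row) if theta in sets[x])
            for theta, row in lik.items()}


conf_cov = coverage(LIKELIHOOD, confidence_sets(LIKELIHOOD, LEVEL))
cred_cov = coverage(LIKELIHOOD, credible_sets(LIKELIHOOD, LEVEL))
print("confidence-set coverage per theta:", conf_cov)
print("credible-set coverage per theta:  ", cred_cov)
```

On this table every confidence set covers the true value at least 70% of the time whichever theta is true, while the credible sets cover B only 25% of the time even though each of them holds at least 70% of the posterior mass for the outcome it was built from.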

Let me put it this way: if you think the frequentist method is using some prior, what choice of prior will make the 70% credibility intervals in the cookie-jar case be identical to the 70% confidence intervals?

Anybody can think about utility to make decisions; it's not unique to Bayesian methods. Statisticians and engineers have been calculating ROC curves and choosing operating points on the ROC frontier (based on cost/benefit analysis) since World War II.

jules · on Nov 12, 2012

I meant that in the context of making a decision. The point of statistics is to make decisions. For example you want to know whether a medical treatment works so that you can decide whether or not to give it to people. So you do a hypothesis test to see whether the treatment works better than a placebo, and then if the p-value is small enough you give it to people, and otherwise you don't. Instead of explicitly separating the assumptions (prior & utility) from the logical deduction, the assumptions are embedded in this procedure. Why would the assumptions implicitly made by the choice of procedure be the assumptions you want to make? You take the answer to a question that's irrelevant to the decision, namely "given that the treatment doesn't work, how likely is the data" and try to tweak a decision based on that. There is no principled way to make decisions based on that information.

Credibility intervals are about average-case coverage, but Bayesian statistics as a whole is definitely NOT just about average case. In general the utility `U` is a function of the decision `d`, and of the posterior knowledge you have of the world `P`. In many practical cases the utility might be the expected profit: U(d,P) = integral(profit(x,d)P(x)dx). But it certainly doesn't have to be. If you are risk averse you might choose your utility as U(d,P) = min_x profit(x,d) to ensure that your utility is the minimum profit you make given a decision, rather than the average. Another example is U(x,P) = P(x) which gives maximum likelihood estimation. Making a decision based on a hypothesis test can also be emulated with a utility function. Suppose the hypothesis is H and we make decision d1 if p-value > alpha and d2 if p-value < alpha. We choose a prior P(I) that makes each possible observed data set I equally likely, and we choose the utility function that reverses Bayes' rule to compute P(I|H) to make a decision based on that:

    U(d1,P') = [P'(H)*P(I)/P(H) > alpha]
    U(d2,P') = [P'(H)*P(I)/P(H) < alpha]

where brackets are indicator notation. Note that P(I) appears to access the data set which the utility function does not have access to, but recall that P(I) is constant regardless of the measured data. Of course not many people have such a prior and utility function...so it doesn't really make sense to hard code them into the method.

In general the process works like this. Given prior P and utility U and measured information I:

1. Compute posterior P' from prior P and information I according to Bayes' rule.

2. Perform decision argmax_d U(d,P').
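The two-step recipe can be sketched for a discrete world as follows. Everything concrete here is a made-up illustration: the two world states, the prior, the likelihood of the observed data, and the profit table are hypothetical numbers, and the worst-case utility takes the minimum only over states with nonzero posterior probability, which is one possible reading of min_x above.

```python
def posterior(prior, likelihood):
    """Bayes' rule on a discrete set of world states x.

    prior[x] = P(x), likelihood[x] = P(I | x) for the observed data I.
    """
    unnorm = {x: prior[x] * likelihood[x] for x in prior}
    z = sum(unnorm.values())
    return {x: w / z for x, w in unnorm.items()}


def expected_profit(d, post, profit):
    """U(d, P') = sum over x of profit(x, d) * P'(x)."""
    return sum(profit[(x, d)] * p for x, p in post.items())


def worst_case_profit(d, post, profit):
    """Risk-averse U(d, P') = min over still-possible x of profit(x, d)."""
    return min(profit[(x, d)] for x, p in post.items() if p > 0)


def decide(decisions, post, profit, utility):
    """Step 2: argmax_d U(d, P')."""
    return max(decisions, key=lambda d: utility(d, post, profit))


# Hypothetical inputs, chosen only to show the two utilities disagreeing.
prior = {"better": 0.5, "worse": 0.5}
likelihood = {"better": 0.8, "worse": 0.2}  # P(I | x)
profit = {
    ("better", "ship"): 10, ("worse", "ship"): -20,
    ("better", "hold"): 0, ("worse", "hold"): 0,
}
decisions = ["ship", "hold"]

post = posterior(prior, likelihood)  # step 1
print(decide(decisions, post, profit, expected_profit))    # average case
print(decide(decisions, post, profit, worst_case_profit))  # worst case
```

With these numbers the expected-profit utility picks "ship" and the worst-case utility picks "hold", even though both see the same posterior. The difference in attitude toward risk lives entirely in the utility function.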

Can you point me to a similarly principled approach to decision making based on utility with frequentist methods?

keithwinstein · on Nov 12, 2012

> Can you point me to a similarly principled approach to decision making based on utility with frequentist methods?

Sure. As I said, anything involving ROC curves, where we pick an operating point by trading off the cost of false positives against the cost of false negatives, given a design rate of incoming true positives and negatives.

jules · on Nov 12, 2012

Can you give a mathematical recipe with assumptions and deductions? ROC curves don't cut it; that's just twiddling a parameter of a classifier. Is it optimal in any sense? How do you know it's a good classifier for making decisions, when it's a classifier based on "given the hypothesis, how likely is the data" and not the other way around? Is it generalizable to other situations?

keithwinstein · on Nov 12, 2012

Yes, given a design rate of true positives or negatives, and a cost for false positives and false negatives, you can pick the optimal operating point. It will be optimal in the sense of minimizing average cost when the incoming rate equals the rate you designed for. You'll get the exact same answer as a "Bayesian" who uses conditional probability to calculate the same thing and whose prior equals the design rate. I gave a worked-out example in my original post ("If you want to decide whether to take action...").
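The claim that the cost-minimizing operating point coincides with what a Bayesian using the design rate as prior would compute can be checked numerically. This is a minimal sketch under invented assumptions: the detector outputs a score that is Normal(0, 1) for negatives and Normal(mu, 1) for positives, and the design rate and costs below are arbitrary illustrative values. All function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def average_cost(t, rate_pos, cost_fp, cost_fn, mu):
    """Average cost per case when we act on every score above threshold t."""
    fpr = 1.0 - normal_cdf(t)   # negatives scoring above t
    fnr = normal_cdf(t - mu)    # positives scoring below t
    return (1.0 - rate_pos) * cost_fp * fpr + rate_pos * cost_fn * fnr


def roc_operating_point(rate_pos, cost_fp, cost_fn, mu):
    """Sweep thresholds along the ROC curve and keep the cheapest one."""
    thresholds = [i / 1000.0 for i in range(-3000, 5001)]
    return min(thresholds,
               key=lambda t: average_cost(t, rate_pos, cost_fp, cost_fn, mu))


def bayes_threshold(rate_pos, cost_fp, cost_fn, mu):
    """Score at which posterior expected costs of acting and not acting tie.

    With prior P(pos) = rate_pos, act when P(neg | s) * cost_fp is below
    P(pos | s) * cost_fn. For these two Gaussians that reduces to
    s > mu / 2 + ln((1 - rate_pos) * cost_fp / (rate_pos * cost_fn)) / mu.
    """
    return mu / 2.0 + math.log(
        (1.0 - rate_pos) * cost_fp / (rate_pos * cost_fn)) / mu


# Illustrative values only.
params = dict(rate_pos=0.1, cost_fp=1.0, cost_fn=5.0, mu=2.0)
t_roc = roc_operating_point(**params)
t_bayes = bayes_threshold(**params)
print(f"ROC sweep threshold: {t_roc:.3f}, Bayes threshold: {t_bayes:.3f}")
```

The two thresholds agree up to the resolution of the sweep. That agreement says nothing about whether the underlying score is a good one, only that for a fixed score the two ways of picking the cut-off coincide when the prior equals the design rate.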

Sure, it is generalizable -- we use ROC curves for radar, medical imaging, almost any diagnostic test...

jules · on Nov 12, 2012

Sure, the problem is not which point on the ROC curve you pick, the problem is which classifier you use to obtain it in the first place. I can pick a random classifier with a tunable parameter and draw its ROC curve and then pick the "optimal" point, but if the classifier sucks then that's no good. Why would a frequentist classifier based on a hypothesis test be good? A hypothesis test is the answer to the wrong question for the purposes of making a decision.

As I showed above, you can indeed get the same result from Bayesian decision making if you use a weird prior and utility function, which shows that frequentist decision making based on hypothesis tests is a subset (of measure 0) of Bayesian decision making. Again, that just means that you encoded a most likely wrong prior and utility in the choice of method without any justification.