Sunday, December 2, 2007

Why I am a JavaScript master

Here at SkepStat, we try to stay abreast of the important news stories of the day. That's why I'm here, posting fewer than two weeks after my previous post (!), to discuss this amazing news: Study: Initials may impact performance.

After appearing on the wires, this story was picked up and published by any number of news outlets. It's based on a study published in the December issue of Psychological Science by Leif Nelson and Joseph Simmons: Moniker Maladies: When Names Sabotage Success. There's a lot here for a sensationalist to love. The authors' basic claim is that a person's initials directly affect their performance on a variety of physical and cognitive tasks. They attempt to establish this claim with five experiments in three different domains:

  1. Do initials affect performance in major league baseball players?

  2. Do initials affect school grades? (Specifically, are students whose names begin with A more likely to get A's, B initials more likely to get B's, etc.) And does this translate into enrollment in lower-ranked institutions?

  3. Do initials affect performance on a laboratory cognitive task?

Batter up!

I'm hoping to have the time to address all aspects of the paper in the coming days (ok, weeks), but I'm going to start, as is only appropriate, by discussing Study 1: Baseball Performance. As mentioned above, the general hypothesis is that initials affect performance among Major League baseball players. The authors chose to operationalize this hypothesis in a very specific manner: Do players whose first and/or last initial is "K" have a higher career strikeout rate than those who don't have "K" names?

I know, that just sounds stupid. The theory is this: people generally either have an affinity for or lack an aversion to objects / behaviors / events / etc. that involve their initials. (This is an active area of research, and there have been a number of studies that suggest such connections. None of these studies are exactly home runs, though, if you will.) And "K" is the formal scorecard notation in baseball for a strikeout. So, if your initials contain a "K", you're less averse to striking out than if not.

The procedure the authors used to test this hypothesis is quite simple and, since the data are publicly available, can easily be replicated at home. They calculated the career strikeout rate for every player with at least 100 plate appearances during the years 1913 - 2006, and they compared the mean rates between those players with and those without "K" initials using a two-sample t-test.
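As a sketch of what that computation looks like (in Python, with made-up numbers standing in for the real career rates; an actual replication would pull from the public baseball archives), the paper's comparison boils down to a single call:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in data: career strikeout rates (as a proportion of
# plate appearances) for players with and without "K" initials. The real
# analysis covers 1913-2006, minimum 100 plate appearances.
k_rates = rng.beta(18.8, 81.2, size=400)       # mean near 0.188
non_k_rates = rng.beta(17.2, 82.8, size=6000)  # mean near 0.172

# The paper's analysis: a plain pooled-variance two-sample t-test.
t, p = stats.ttest_ind(k_rates, non_k_rates, equal_var=True)
print(f"t = {t:.2f}, p = {p:.4f}")
```

The numbers above are simulated, so the printed t and p will not match the paper's; the point is only the shape of the analysis.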

The results were fairly striking: "K"-initialed players had a career strikeout rate of 18.8% of plate appearances, compared with 17.2% for non-"K" players. This difference was statistically significant (t = 3.08, df = 6395, p = .002).

The analysis and its discontents (e.g., me)

While this analysis has the advantage of simplicity, there are several potential pitfalls. First, the t-test assumes that the strikeout rate has a normal (bell-shaped) distribution. As the figure below demonstrates (non "K" names on top, "K" names on the bottom), this assumption is clearly violated: the strikeout rate is highly skewed to the right (perhaps due to the inclusion of pitchers and of journeyman minor leaguers with few career at bats in the majors - 100 plate appearances is not a high bar). This problem could be greatly mitigated by use of a nonparametric statistical test (the Mann-Whitney U test, say), or by the simple expedient of transforming the data. (A square-root transformation appears to do nicely.)
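Both remedies are one-liners in practice. A hedged sketch, using synthetic right-skewed data in place of the real rates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical right-skewed strikeout rates (the real data are skewed by
# pitchers and short-career players clearing the low 100-PA bar).
k_rates = rng.gamma(2.0, 0.10, size=400)
non_k_rates = rng.gamma(2.0, 0.09, size=6000)

# Remedy 1: a nonparametric test that doesn't assume normality.
u, p_mw = stats.mannwhitneyu(k_rates, non_k_rates, alternative="two-sided")

# Remedy 2: transform toward symmetry, then run the usual t-test.
t, p_t = stats.ttest_ind(np.sqrt(k_rates), np.sqrt(non_k_rates))

print(f"Mann-Whitney p = {p_mw:.3f}, t-test on sqrt scale p = {p_t:.3f}")
```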

Another difficulty is that not all observations of career strikeout rate are created equal. We have much more reliable estimates of career strikeout rate for players with thousands or tens of thousands of plate appearances than for those with a couple hundred. It would be a simple matter to differentially weight each player's contribution to the analysis by his number of plate appearances or by the variance of his estimated strikeout rate. On the other hand, it's not clear exactly what the implications of such weighting would be, since strikeout rate is itself substantially correlated with number of plate appearances. (Naturally, players with low strikeout rates tend to have long careers.) What is clear is that the authors either were unaware of these thorny questions or chose to sweep them under the rug.
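To illustrate the weighting idea (with invented numbers, not the real data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: each player's strikeout rate and plate appearances.
rates = rng.beta(17, 83, size=1000)
pa = rng.integers(100, 10000, size=1000)

# The unweighted mean treats a 100-PA callup the same as a 10,000-PA
# regular; weighting by plate appearances gives the noisier estimates
# proportionally less influence.
unweighted = rates.mean()
weighted = np.average(rates, weights=pa)
print(f"unweighted: {unweighted:.4f}, PA-weighted: {weighted:.4f}")
```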

There are other questions that arise in the analysis (should strikeout per plate appearance be treated as a binary outcome and analyzed via logistic regression, should strikeout rate have been collapsed across years or would it be better to model years separately in a repeated measures model, etc., etc.). But the truth is that none of these issues really matter all that much compared to the major, inexcusable failing of this study: lack of thoroughness and consistency.

Lack of thoroughness and consistency

Let's say, just for the sake of argument, that we believe the authors' theory. Under this theory, there are several other hypotheses that would be at least as true and equally testable as the hypothesis that "K" initials influence strikeout rate. It's actually rather peculiar that the authors wouldn't have thought to perform and report these additional, simple, experiments. But, that's why I'm here! For the sake of simplicity and comparability, we'll use the same (suboptimal) analysis used in the paper.

Hypothesis: Batters with "H" initials have higher hit rates than those without.
Data: "H" initials: 20.9% hit rate. Non-"H": 21.2% hit rate.
Result: False (t = 0.43, df = 6203, p = .67). Effect in wrong direction.

Hypothesis: Batters with "B" initials have higher walk rates than those without. (The scorecard notation for a walk is "BB".)
Data: "B" initials: 7.4% walk rate. Non-"B": 7.4% walk rate.
Result: False (t = -1.03, df = 6203, p = .31). (Note: this negative result despite the heroic contributions of Barry Bonds to the field of walking.)

Hypothesis: Batters with "S" initials have higher strikeout rates than those without. (After all, while "K" is the official scorecard notation, doesn't "strikeout" actually begin with an "s"?)
Data: "S" initials: 16.4% strikeout rate. Non-"S": 16.9% strikeout rate.
Result: False (t = 0.05, df = 6203, p = .96).

Hypothesis: Batters with "W" initials have higher walk rates. (Again, "BB" is the scorecard notation, but "walk" begins with "w".)
Data: "W" initials: 7.2% walk rate. Non-"W": 7.4% walk rate.
Result: False (t = 0.11, df = 6203, p = .91). Effect in wrong direction.
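All four batter checks above, plus the original "K" test, follow one template, so they can be run as a single loop. A sketch over synthetic data (the player initials and rates here are randomly generated, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical per-player data; in a real replication each row would come
# from the public baseball archives. Initials are drawn at random here.
n = 6000
initials = rng.choice(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), size=(n, 2))
outcome_rates = {
    "strikeout": rng.beta(17, 83, size=n),
    "hit": rng.beta(21, 79, size=n),
    "walk": rng.beta(7, 93, size=n),
}

# Under the authors' theory, each of these letter/outcome pairings is an
# equally fair test; run them all with the same (suboptimal) t-test.
for letter, outcome in [("K", "strikeout"), ("S", "strikeout"),
                        ("H", "hit"), ("B", "walk"), ("W", "walk")]:
    has_initial = (initials == letter).any(axis=1)
    rates = outcome_rates[outcome]
    t, p = stats.ttest_ind(rates[has_initial], rates[~has_initial])
    print(f"{letter} -> {outcome}: t = {t:.2f}, p = {p:.3f}")
```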

So, the question is this: why on Earth would "K" batters have more strikeouts, but none of the other hypotheses above would be true? And why wouldn't the authors have thought to test the above hypotheses?

I've received (through backchannels) an interesting comment suggesting that pitchers, rather than batters, are primarily responsible for walks. I'm not sure whether I agree with this, but it's certainly the case that, for each at bat, both pitcher and batter are intimately involved in the outcome, whether it be a walk, strikeout, or hit. Accordingly, here are the analyses for pitchers, corresponding to those above for hitters. I included all pitchers with at least 100 innings pitched; rates are calculated per inning pitched.

Hypothesis: Pitchers with "K" initials have higher strikeout rates than those without.
Data: "K" initials: 0.55 strikeouts/ip. Non-"K": 0.55 strikeouts/ip.
Result: False (t = 0.11, df = 3339, p = .91). Effect in wrong direction.

Hypothesis: Pitchers with "H" initials have higher hit rates than those without.
Data: "H" initials: 1.02 hits/ip. Non-"H": 1.02 hits/ip.
Result: False (t = -0.28, df = 431, p = .78; Satterthwaite correction).

Hypothesis: Pitchers with "B" initials have higher walk rates than those without.
Data: "B" initials: 0.40 walks/ip. Non-"B": 0.40 walks/ip.
Result: False (t = 0.53, df = 3339, p = .60). Effect in wrong direction.

Hypothesis: Pitchers with "S" initials have higher strikeout rates than those without.
Data: "S" initials: 0.55 strikeouts/ip. Non-"S": 0.55 strikeouts/ip.
Result: False (t = -0.61, df = 3339, p = .54).

Hypothesis: Pitchers with "W" initials have higher walk rates than those without.
Data: "W" initials: 0.42 walks/ip. Non-"W": 0.40 walks/ip.
Result: True (t = -2.95, df = 333, p = .004; Satterthwaite correction).

So, interestingly, there is one significant result: pitchers with first or last names beginning with the letter "W" have more walks per inning pitched than others. But, really, this is the exception that proves the rule.

In the case of batters, it was those whose initials coincided with the somewhat arbitrary scorecard notation for strikeout ("K") who had higher strikeout rates than others. Those whose initials coincided with the word "strikeout" ("S") had no significant elevation in strikeout rate.

On the other hand, in the case of pitchers, those whose initials coincided with the scorecard notation ("B") did not have a higher walk rate than the non-"B"s, whereas those whose initials coincided with the word "walk" ("W") did have a higher walk rate than others. This arbitrary pattern of results and total lack of consistency is wholly consistent with chance findings.

Wednesday, November 21, 2007

Skeptics' Circle #74

Regular readers of my quasi-monthly posts may be interested in checking out Skeptics' Circle #74, a well-done turkey ably hosted on this holiday week over at Med Journal Watch.

Listen to the sound of my voice. You are getting verrry pregggnant....

Today's post represents a departure for SkepStat: I'm going to be discussing an article from a respected mainstream journal (Fertility and Sterility, ranked 4th among reproductive biology journals, according to ISI impact factors), rather than from a fish-in-the-barrel clueless CAM journal. The article, by Eliahu Levitas, M.D., et al., of the Soroka University Medical Center, Beer-Sheva, Israel, is entitled Impact of hypnosis during embryo transfer on the outcome of in vitro fertilization–embryo transfer: a case-control study.

Biological plausibility

Another departure is that, while hypnosis isn't exactly on the soundest of evidence bases for medical applications, it's not in the same league of quackery as energy healing or electroacupuncture. However, it's very difficult to imagine any biological plausibility for an effect of hypnosis on the success rate of in vitro fertilization. I'm admittedly straying far from my expertise here, and will welcome correction in the comments, but the only mechanism of action for such a relationship that I can think of would involve psychological factors influencing reproductive hormone release. That is not in itself implausible by any means, but IVF patients receive such high doses of exogenous hormones that it seems unlikely that additional suggestion-induced endogenous hormone release could make any difference.

The authors, interestingly, suggest that the efficacy of hypnosis on IVF success rate would stem from reducing patient anxiety, which they assert can negatively affect success rate: "Patients perceive [embryo transfer] as the culmination of the IVF treatment, and therefore stress is often present. Patient fears are related to a potentially negative treatment outcome as well as to any possible discomfort related to the procedure." In the discussion, they draw a vague link between anxiety and uterine contractions that may disrupt successful embryo transfer. If this were the case, I could suggest about a dozen better-proven treatments for acute anxiety than hypnosis, of course, but I suppose that's beside the point.

When is a case-control study not a case-control study?

Biological plausibility or implausibility aside, we're here to review the evidence and, more specifically, the methods used to acquire and interpret that evidence. Let's start with the study design: The authors describe this, in the title, as a case-control study. It's not a case-control study. A case-control study involves identifying a group of subjects with some health outcome of interest (the "cases"), matching them with another group of subjects without the outcome of interest but otherwise as similar as possible to the cases (the "controls"), and then comparing the two groups on the rate of some exposure of interest (perhaps an environmental or genetic factor, perhaps a therapeutic intervention). If the cases have a statistically significant greater or smaller rate of the exposure than the controls, you would conclude that the exposure is / might be associated with the outcome. Case-control studies are often used in studying relatively rare health outcomes, and are mostly effective in testing for associations between outcomes and relatively common exposures. For instance, the link between lung cancer and smoking was established through a series of very large case-control studies (take a bunch of lung cancer patients, a bunch of otherwise similar people without lung cancer, compare the rates of smoking between the groups).

So, what would a case-control study of the relationship between hypnosis and IVF success look like? You'd start with a group of women with successful IVF pregnancies, you match them to an otherwise similar group of women without successful IVF pregnancies, and then you'd compare the rates of hypnosis administration during embryo transfer (ET) between the two groups. Obviously, such a study couldn't possibly work, because the exposure of interest (hypnosis during ET) is far too rare. Luckily, despite their assertions to the contrary in the title of their paper, the authors didn't actually conduct a case-control study. What they actually did was recruit a bunch of women to receive hypnosis during ET, find an otherwise similar group of women from their IVF practice who didn't receive hypnosis, and then compare success rates between the two groups. This puts the study design in a nebulous category somewhere between a controlled clinical trial (in which patients are assigned to receive an experimental or control therapy, and then outcomes are compared) and a cohort study (in which patients who do and do not receive an exposure of interest are followed over time to compare outcomes).

If pressed, I would describe this study design as a "clinical trial with matched historical controls". It's a suboptimal design in many ways, compared with a proper randomized controlled trial. (And, given the circumstances described in the paper, there's no reason whatsoever that an RCT couldn't have been conducted just as easily as this ad hoc experimental design.)

The first problem is that the controls were subjected to "usual care," rather than a protocolized control treatment, which would have involved deliberately treating the controls as similarly as possible to the experimental subjects (including using the same IVF providers, the same equipment, at the same time of day, etc., etc.).

Second, and most obviously, the treatments weren't randomly assigned. All of the patients who consented for hypnosis were given hypnosis. None of the controls consented for hypnosis - either they refused, or it wasn't offered. There may well be (and apparently are, as we shall see) meaningful differences between women who are and are not willing to undergo hypnosis. Randomization would have meant comparing women who consented to undergo hypnosis and who received hypnosis to women who consented to undergo hypnosis but did not receive hypnosis. Apples to apples, not apples to oranges.

Also, there was no blinding to treatment assignment at any stage of the experiment. This would include during the process by which controls were being matched to hypnosis subjects. This presents a big opportunity for investigator biases to influence (either consciously or subconsciously) the selection of controls to be women generally less likely to have successful IVF. (And the controls did apparently have a lower baseline chance of success, as, again, we shall see.)

Selection bias, table for two (or more ETs)

A huge potential problem with this study (and the one that caught my eye first when reading the abstract) is that the "subjects" in the study (the "independent sampling units," in statistical terms) aren't women or couples, but cycles. Meaning that some women were followed for more than one cycle and each cycle was included in the analysis as if it represented an independent trial. My original concern with the study is that this violates the "statistical independence" assumption that underlies most of the basic statistical techniques (such as were used in the paper, described below). Multiple cycles from the same woman have more in common with each other than with cycles from other women, and failing to take this into account in data analysis can lead to completely invalid results. On closer reading, however, I don't think violation of independence poses such a huge problem here, since there were only a handful of women who contributed repeated cycles....

BUT, all those women came from the hypnosis group (98 cycles from 89 women), not from the control group (96 cycles from 96 women). What does that mean? It means that some of the women from the hypnosis group went through a cycle, failed to conceive, and were given a do-over. Maybe a few women from the hypnosis group were followed until they conceived, maybe the ones with the best prognosis. And this same privilege wasn't extended to any of the women in the control group. This is putting your thumb on the scale, big time. It's like comparing two cold remedies, following most people for one week, but following a handful of members of just one of the two groups until they got better. Obviously the group that gets the do-overs is going to have more successes.

If you don't know what your experiment is, does that mean you don't know how to analyze it?

Not necessarily, but it's certainly not a good sign! (I actually consider the incorrect description of the study, including in the title, a much greater, almost unforgivable, failing on the part of the reviewers and editors than of the authors.) The data analysis is described in the paper in three short paragraphs:

Univariate analysis was performed using χ2, Fisher’s exact test, Wilcoxon matched-pairs signed-ranks test, and one-way analysis of variance test when appropriate.

To evaluate the effect of hypnosis during ET on pregnancy occurrence adjusted to the different confounding factors, logistic regression analysis was performed for the dichotomic dependent variable—pregnancy—with the independent variables found significant in univariate analysis, such as hypnosis during ET.

Statistical analyses were performed using Statistical Programs for the Social Sciences (SPSS, version 11.0, Chicago) software programs. P<.05 was considered statistically significant.

Ok, the first paragraph is a little like describing the operation of a motor vehicle as "application of the gas pedal, the windshield wipers and the horn when appropriate." There's no evidence here that they did anything terribly wrong, but there's no evidence that they did anything correctly either.

Fisher's exact test and Pearson's chi-square (χ2) test are both used to test the statistical significance of associations between categorical variables; Fisher's exact test is generally preferred when the two variables are both dichotomous (i.e., binary or two-valued) and when there are a small number of subjects who meet some combination of the two variables. It's impossible to tell when the authors chose one over the other, since they only report whether each test is significant, and don't include test statistics, degrees of freedom, or p-values. (A big no-no in scientific write-ups.) But, here's the main thing: since the samples were one-to-one matched, case to control, they could have (and should have) used a statistical analysis that accounts for the pairings. This would have been either McNemar's test, or a one-sample binomial test on the paired differences.
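A sketch of that exact matched-pairs analysis. The exact form of McNemar's test is just a binomial test on the discordant pairs; the counts below are invented for illustration, not taken from the paper:

```python
from scipy import stats

# Hypothetical matched-pair counts: of the one-to-one matched cycle pairs,
# b = pairs where only the hypnosis cycle ended in pregnancy,
# c = pairs where only the matched control cycle did.
# Concordant pairs (both pregnant, or neither) carry no information about
# the treatment effect in a matched design.
b, c = 32, 10

# Exact McNemar's test: under the null of no treatment effect, each
# discordant pair is a fair coin flip, so b ~ Binomial(b + c, 0.5).
result = stats.binomtest(b, b + c, 0.5)
print(f"exact McNemar p = {result.pvalue:.4f}")
```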

It's not at all unreasonable to expect them to know this, since they did claim to use a non-parametric matched sample test (the Wilcoxon signed-rank test), sometimes. Again, it's not clear when they used this test (which does not assume that variables are normally distributed) and when they used the parametric one-way ANOVA (which does). And this is another red flag of profound statistical ignorance: when comparing two paired groups on the mean level of a continuous variable (with an underlying normal distribution), the standard approach is a paired t-test. ANOVA is used to compare more than two groups on the mean of a continuous variable. And, while it is the case that an ANOVA conducted with two groups instead of more than two groups is mathematically equivalent to a t-test, there is no paired samples version of one-way ANOVA, and thus no equivalence between one-way ANOVA and a paired t-test. (The previous sentence was edited for general incorrectness; as commenter Efrique helpfully pointed out, there is a correspondence between a special case of two-way ANOVA and a paired t-test.) And no one competent in statistical analysis would describe a two-sample test as an ANOVA anyway, even in the circumstances where that would be technically correct. It's another tell.
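The two-group equivalence (and the absence of a paired analogue) is easy to verify numerically. A small check with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=30)

# With exactly two independent groups, one-way ANOVA and the pooled
# two-sample t-test are the same test: F = t^2.
t, _ = stats.ttest_ind(a, b, equal_var=True)
f, _ = stats.f_oneway(a, b)
print(np.isclose(t**2, f))  # prints True

# A *paired* comparison is a different test entirely; one-way ANOVA has
# no paired version, which is why "paired t-test" is the right name.
t_paired, p_paired = stats.ttest_rel(a, b)
```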

The main analysis described is a logistic regression. This is an appropriate choice, but again, the authors should have and failed to take into account the one-to-one matching in the analysis (by means of a conditional logistic regression, for instance).

(Of course, how they matched 98 hypnosis cycles to 96 control cycles in a one-to-one fashion is anyone's guess, but they do claim to have done so.)

The results (?)

The hypnosis group had more successful pregnancies (53% of cycles) than the non-hypnosis group (30% of cycles). This is described as a statistically significant difference, although such a claim would presuppose an experimental design and analytic plan that allowed for valid statistical significance tests. A dubious presupposition, at best.

But far more interesting (and very telling) are some of the results comparing the baseline characteristics of the hypnosis and control groups (which, remember, were not randomly allocated):
  • The women in the hypnosis group had been infertile for an average of 4.7 years, but the women in the control group had been infertile for an average of 7.4 years (a statistically significant difference).
  • 47% of the women in the hypnosis group had primary infertility, compared with 74.2% of the women in the control group (a statistically significant difference). (Primary infertility means that the couple had never been able to conceive, as opposed to secondary infertility, meaning that the couple already has at least one child.)
  • 18.4% of the women in the hypnosis group had "unexplained" infertility, compared with 10.3% of the women in the control group (not a statistically significant difference).
All of these differences point to the same thing: more severe infertility among the control group than the hypnosis group. Probably more than enough so to explain any observed difference in pregnancy rates following embryo transfer. Some of these factors were included in the logistic regression model to try to control for differences but, as the old statistical maxim goes, "You can't fix by analysis what you screwed up by design."

In short, a fatally flawed paper, and some astonishingly derelict editorship from a legitimate journal.

Friday, September 14, 2007

No thanks, I don't need any treatment. I've got a resonant bond!

Sorry about the long delay since my last post. I've been getting my chakras realigned.

But I'm back, with a treat. Today we'll be enjoying a woo-woo challenge to clinical trials methodology, in the form of Resonance, Placebo Effects, and Type II Errors: Some Implications from Healing Research for Experimental Methods, published in The Journal of Alternative and Complementary Medicine by William F. Bengston (an energy-healing sociologist) and Margaret Moga (an anatomist and moxibustion enthusiast).

Randomized controlled trials

The randomized controlled trial (RCT) is generally considered to be the gold standard experimental design for medical research. RCTs are widely used and there's a huge literature on RCT methodology (including a journal entirely devoted to the subject), but the basic idea is quite simple.

Let's take a relevant example. Suppose we want to examine the efficacy of, say, laying-on-of-hands "energy healing" for shrinking tumors in mice. Well, efficacy is always efficacy relative to something, so we need to choose an appropriate control treatment. In this example, let's say we're interested in comparing "energy healing" with no treatment at all.

Well, we need to start with a sample of mice with tumors. In animal research, when you want to study animals with a certain disease, you usually start with healthy animals and then give them the disease you want to study. (Generally, this practice would be looked at somewhat unfavorably in human subjects research.) So, we'll take a bunch of mice and inject them all with tumor cells. Stop looking at me like that, animal rights activists. Anyway, maybe one of the experimental mice would have gone on to become the mouse equivalent of Hitler. You never know, do you?

Anyway, the next step is key: we randomly assign the cancer mice to receive either the "energy healing" or no treatment at all. The principle of randomization is deceptively simple, but it has profound implications. The most immediate implication is that, because all personal characteristics of the subjects are randomly distributed between the groups, any observed differences in outcome are likely to be due to the treatment assignment rather than to irrelevant factors like age, sex, socioeconomic status, or whatever the mousy analogues of these might be. It's no exaggeration to say that the introduction of randomization by (principally) R. A. Fisher was one of the seminal moments in the history of science.
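The mechanics of randomization amount to nothing more than a shuffle; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

# 30 hypothetical tumor-injected mice; a random permutation makes group
# membership independent of every mouse characteristic, measured or not.
mouse_ids = np.arange(30)
shuffled = rng.permutation(mouse_ids)
treatment, control = shuffled[:15], shuffled[15:]
assert len(set(treatment) & set(control)) == 0  # groups are disjoint
```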

Then we give our treatment group the treatment of interest (i.e., laying-on-of-hands) and our control group the control treatment (i.e., no treatment). After an appropriate length of time, we measure the tumors on all our mice, calculate the remission rate for the treatment group and the control group, and use standard statistical methods (Fisher's exact test, say, although more on this subject another time) to decide whether the observed difference in remission rates is larger than would be expected due to simple chance differences between the groups.
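A sketch of that final comparison step, with invented remission counts:

```python
from scipy import stats

# Hypothetical remission counts for a mouse trial of this shape:
# rows = treatment vs. control, columns = remitted vs. not remitted.
table = [[12, 3],   # "energy healing" group
         [10, 5]]   # no-treatment control group

# Fisher's exact test asks whether the difference in remission rates is
# larger than chance alone would plausibly produce.
odds_ratio, p = stats.fisher_exact(table)
print(f"Fisher's exact p = {p:.3f}")
```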

The inconvenience of negative results

Now suppose you conduct the experiment described above not once, not twice, but four separate times, each time with the same result: many of your "energy healing" treated mice undergo tumor remission, but so do many of your control mice. In fact, in each experiment, the difference between treatment and control groups is smaller than would be expected due to chance ("not statistically significant").

Now, a few possible explanations might spring to mind for these negative results:
  1. The tumor induction was inadequate in both control and treatment mice to result in a sustained cancer. The majority of mice remitted because their disease was self-limiting. The "energy healing" treatment doesn't work better than no treatment.
  2. The tumor induction was adequate but some physiological process or trait separate from the treatment but common to all mice resulted in tumor remission in the majority of mice. The "energy healing" treatment doesn't work better than no treatment.
  3. #1, except the "energy healing" treatment does work better than no treatment, but the effect is small and the sample size was inadequate to determine statistical significance.
  4. #2, except the "energy healing" treatment does work better than no treatment, but the effect is small and the sample size was inadequate to determine statistical significance.
  5. The study protocol was broken in some way due to carelessness or fraud.
  6. The "energy healing" treatment works, and works so spectacularly well that a "resonant bond" formed between the groups, so that the healing delivered to the treatment mice also cured the control mice.
Now put your scientific thinking caps on. Does one of those explanations just jump out at you as the most plausible, rigorous, parsimonious? I added some emphasis to give you a hint. But take your time. Go ahead.

That's right! Number 6 is obviously gold. The really great thing about the hypothesis underlying number 6 is that the more negative your results are, the more amazingly effective your treatment must be! The only way things could look better for the treatment is if more control mice remitted. And not only does the hypothesis explain this experiment, it single-handedly accounts for all placebo effects in all clinical trials, ever. This is paradigm-shattering stuff, truly.

But let's not jump to conclusions. We're scientists here. We like the cut of hypothesis 6's jib, sure, but we're not going to just jump in and publish an article proposing an occult relationship between control and treatment groups in RCTs without first making damn sure we can back it up with science. What we need is an experiment....

How do you test an insane hypothesis?

Why, with an insane experiment of course. SILLY! AHAHAHAHAHAAHAHA.

Ok, so here's what we'll do (and, in case you haven't caught up with my rhetorical stylings so far in this post, "what we'll do" translates to "what Bengston and Moga did").
We'll take 30 mice, inject them with tumor cells, and randomly assign half to "energy healing" and half to no treatment. So far so good - it sounds like a replication of our previous experiments.

But wait, you're saying, what about the resonant bond? If our revolutionary hypothesis is correct, the two groups will be mystically connected, all the tumors will remit, and we'll be back to square one. What we need is a TRUE CONTROL GROUP that won't be bonded with the treatment group, resonantly or otherwise. Then we can see if our two original groups are more similar to each other in course of disease than they are to the third control group.

But how to make that third, TRUE control group? Well, why not take another 25 mice and not inject them with cancer cells at all in the first place? Just 25 mice. No tumors. Then we'll follow all three groups, see if remission rates and biological markers are similar in the treatment and bonded control groups and different in the group that, hey, we never gave cancer in the first place. Why this group should be immune to the mystical resonant bond is anyone's guess, of course, but it's worth a shot, right?

Well, ladies and gentlemen, 55 innocent (or, who knows, maybe not so innocent) mice and several inappropriate statistical analyses later (the rest of this is so much fun, I can't even be bothered to critique their statistics), our long hours of work have paid off. Remission rate in the "energy healing" treatment group? 100%. In the bonded control group? 100%, that's all. Just a paltry 100%. And what about the third control group? The group that NEVER HAD CANCER IN THE FIRST PLACE? 0%. Zero. The big zilch. Nada. Not a single one of the mice that NEVER HAD CANCER IN THE FIRST PLACE remitted from their cancer.

Just in case that's not enough to convince you, Bengston and Moga also measured hemoglobin levels and weighed the spleens of a subset of mice from each group at each follow-up. Guess what? The group of tumor-injected mice that had "no" treatment and the group of tumor-injected mice that had "energy healing" treatment? Pretty much the same. And the group that never had tumor cells injected? Different!

An analogy

The setup: In case all of this talk of mice and tumors has gotten a bit esoteric, here's an analogy for what we've just learned. Let's take three cars: car A, car B and car C. We fill the fuel tank of car A with gasoline, plus a fuel additive. We fill the fuel tank of car B with gasoline only. We leave the fuel tank of car C empty.

The experiment: We line the three cars up and race them, to see which can go farthest in 10 minutes. Cars A and B go 20 miles each. Car C doesn't go anywhere.

The conclusion: The fuel additive is effective, and a resonant bond caused Car B to go as far as Car A. The existence of this resonant bond is proved by the fact that Car C didn't go anywhere.

And that just about settles that. Take a bow, Drs. Bengston and Moga. You've really done something here. Really.

Now those of you still reading shall be rewarded with a collection of choice excerpts.

Questioning the logic of experimental design is the last thing we want to do:
This paper does not question the logic of experimental design. Rather, it suggests that, under some circumstances, for example, illustrated by placebo effects, the presupposition of experimental and control group independence is questionable. It suggests that this violation can occur via the creation of a “resonant bond” between groups. Resonance, in turn, can result in a macroscopic entanglement of experimental subjects, so that a stimulus given to one group also stimulates the other group.

Weeks of practice to master:
As previously reported, the healing-with-intent experimental protocol required that the volunteer healers practice mental and “directed energy” techniques taught to us by an experienced healer formerly based in Great Neck, New York. These techniques did not involve focused visualization, meditation, life changes, or belief of any sort. Although they are straightforward, the mental techniques required weeks of practice to master and involved a series of routine mental tasks that were to be practiced simultaneously while placing hands around the standard plastic mice cages for 1 hour per day.

Our curiosity got the better of us:
Our intent was to keep the control mice separate for the duration of the experiment and to keep them particularly hidden from anyone who knew the healing techniques. Our curiosity got the better of us, however, and, within several weeks of the first experiment, we violated protocol and visited the control mice. In hindsight, this may have proved fortuitous, because it inadvertently opened the door to unexpected phenomena.

Naturally, it's all down to quantum entanglement:
Almost all of the seeming paradoxes of these remissions disappear if we allow for the possibility of “resonant bond formation” and “resonant bond dissolution,” which may serve to entangle or de-entangle subjects. Certainly the notion of “entanglement,” although still quite mysterious, is widely accepted and hailed for its predictive power on a quantum level in conventional physics.

You want an explanation of resonant bonds? Well, how about two explanations, smarty-pants?
Consider two possible hypotheses: (1) shared experiences among experimental subjects can “bond” them together resonantly; and (2) consciousness itself, including that of the experimenter, can delimit the boundaries of experimental subjects, effectively defining those who are “in” and those who are “out.” Those who are “in” form something akin to a larger “collective,” analogous to those formed by colonies of insects, flocks of birds, and schools of fish.

All your failed studies are belong to us:
Researchers are encouraged to reexamine their old data within the framework of resonance to determine whether these phenomena are as extensive as they now appear to be (e.g., placebos). This reexamination needs to broaden the question from the difference between experimental and control subjects to inquire more generally about the difference between experimental subjects and “what ought to have happened.”

Tuesday, July 24, 2007

But can it make your hair grow in three weeks?

Wow. Welcome to the crazy. Today's article is An investigation into the effect of Cupping Therapy as a treatment for Anterior Knee Pain and its potential role in Health Promotion by Ahmed Younis et al. It was published in the estimable Internet Journal of Alternative Medicine, so you know it must be true. At any rate, you know it must be freely accessible online, which is terrific. Such a service to humanity.

If you don't know what cupping is, that's probably because you were born sometime after the dawn of the 20th century. Cupping is... well, let's just ask the authors:
Cupping is an ancient method of treatment that has been used in the treatment and cure of a broad range of conditions; blood diseases such as haemophilia and hypertension, rheumatic conditions ranging from arthritis, sciatica, back pain, migraine, anxiety and general physical and mental well-being. The aim of Cupping is to extract blood that is believed to be harmful from the body which in turn rids the body of potential harm from symptoms leading to a reduction in well-being.

Now, I know what you're thinking*: they're performing bloodletting? Is skepstat frakking kidding me? I am not frakking kidding you, but the authors are when they claim in their conclusion that:
The efficacy of the treatment of Cupping for Anterior Knee Pain, Range of Movement and well being has been researched and results reveal statistically significant differences in support of Cupping Therapy.

The gist: The authors sucked some blood from 15 participants complaining of knee pain and measured their range of motion and obtained self-reports of pain and well-being before the bloodletting, excuse me, cupping therapy, and three weeks after the bloodletting. I'm sorry, cupping therapy.

The results: Improvement in outcomes across the board! Low p-values! Huzzah!

The problems: As you should expect, there are any number of ginormous problems with the study. We'll cover the two worst:

  1. No control group. Let's engage in a thought experiment. Suppose you take 15 random people complaining of acute knee pain (from any source, mind you), and ask them how their knee feels. They'll probably say it hurts. In fact, they already did. Now wave a kosher dill pickle at them and ask them again 3 weeks later. Chances are, they'll say it doesn't hurt quite so much. This could be seen as an example of what statisticians call "regression to the mean," but it's probably more accurately described by what doctors like to call "getting better."

    This is why you always need a control group in treatment evaluation. There's absolutely no way to tell how much of the improvement is due to the experimental therapy if there's nothing to compare it to. In some circumstances you might want a control group that receives absolutely no intervention, in some cases a control group that receives placebo (not the same thing as no intervention, by the way), in some cases a control group that receives a different active intervention. But you always need a control group.

    In other news, non-controlled studies have now shown that cupping therapy makes the sun set at night and then rise again in the morning.

  2. Missing data. This is a problem that haunts legitimate research as well, but it's particularly bad in this case. The authors recruited 26 potential victims, but only 15 completed the trial. What happened to the other 42% of the sample? Well, 4 of them never showed up (maybe their pain got better?) and 7 didn't show up for follow-up (maybe their pain didn't get better? maybe they became afraid, so afraid of the fake doctors with the real razor blades?). Real scientists would have tried to include the 7 follow-up no-shows in their analyses, often by assuming that the missing subjects had no improvement at all (a so-called last-observation-carried-forward or LOCF analysis; LOCF has plenty of problems of its own, but at least here it would err on the conservative side). Luckily for the bloodletters, this isn't a real journal.
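For the simulation-minded, here's a quick sketch of how these two problems conspire. All of the numbers below are invented for illustration (nothing comes from the actual study): we recruit the 26 sorest knees from a larger pool, administer no treatment whatsoever, lose the 11 least-improved subjects to follow-up, and watch an uncontrolled, complete-case analysis declare victory anyway.

```python
# Toy simulation: regression to the mean + informative dropout, no treatment.
# All parameters are made up for illustration.
import random
import statistics

random.seed(1)

POOL = 200      # hypothetical population with knee complaints
RECRUITED = 26  # as in the cupping study
COMPLETED = 15  # as in the cupping study

# Each person has a stable "true" pain level, measured with noise.
true_pain = [random.gauss(3, 1) for _ in range(POOL)]
baseline = [t + random.gauss(0, 1.5) for t in true_pain]

# Enroll the 26 people whose knees hurt the most on the day we asked.
# That selects for positive measurement noise, which won't recur.
enrolled = sorted(range(POOL), key=lambda i: baseline[i], reverse=True)[:RECRUITED]

# Three weeks later, with NO treatment: the noise simply re-rolls.
followup = {i: true_pain[i] + random.gauss(0, 1.5) for i in enrolled}
improvement = {i: baseline[i] - followup[i] for i in enrolled}

# Suppose, plausibly, the dropouts are the ones who didn't improve.
completers = set(
    sorted(enrolled, key=lambda i: improvement[i], reverse=True)[:COMPLETED]
)

# Complete-case analysis: only the 15 who came back.
complete_case = statistics.mean(improvement[i] for i in completers)

# LOCF: carry baseline forward for dropouts, i.e. assume zero improvement.
locf = statistics.mean(
    improvement[i] if i in completers else 0.0 for i in enrolled
)

print(f"complete-case mean improvement: {complete_case:.2f}")
print(f"LOCF mean improvement:          {locf:.2f}")
```

Even with a treatment that does literally nothing, the complete-case number comes out comfortably positive, and the LOCF number is smaller but still positive, which is exactly why you also need a control group: it would show the same spurious "gain."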

Favorite line: Cupping Therapy has no major side effects aside from minimal discomfort due to the method of application of skin cuts to the patient. In cases where the patient's pain threshold is low, a local anaesthetic can be administered.

Least favorite line: Ethical approval was sought from Kings College Research Committee.

You Brits really need to get your ethical act together.

* Rhetorical device. I am not** actually using psi powers to discover what you are thinking.

** Not currently.

Saturday, July 21, 2007

Make a statistician cry

My treasured list of statistical putdowns. Use wisely:

  1. I have no confidence in your intervals.

  2. Your momma's so fat, her posterior is bimodal.

  3. Your chi-square has no degrees of freedom.

  4. All your priors are improper.

  5. All your posteriors are noninformative.

  6. Nothing you've ever accomplished in your entire life would survive a Bonferroni correction.

  7. All your significant results are one-tailed.

  8. There's no Box-Cox transformation that can fix your kind of non-normality.

  9. I have more power in my pinky than you have in your entire grant proposal.

  10. You're so stupid, you think SPSS is hard.

  11. You're so stupid, you think association implies causality.

  12. You're so stupid, you think you proved the null.

  13. Why don't you take a random walk off a Brownian bridge?

  14. You're non-significant at alpha = .10 (one-tailed).

Friday, July 20, 2007

Q: Do you really need informed consent to do research on fake medical education?

In the course of poking around the CAM literature for source material, I found this interesting study in the journal Chiropractic and Osteopathy: Do chiropractic college faculty understand informed consent: a pilot study, by Dana J. Lawrence and Maria A. Hondras.

Short answer to their question: no.

For those who may not know, informed consent is one of the fundamental principles of modern medical ethics, and it's the cornerstone of human subjects research. The need for guidelines requiring informed consent for all human subjects research was first recognized in the Nuremberg Code, in the wake of the horrific medical experiments conducted by the Nazis. The concept has since been refined in subsequent international agreements and in U.S. federal regulations.

When it comes to research on patients, researchers at medical schools are taught (in mandated educational modules everywhere I've heard of) that absolutely any human subjects research has to have prior approval from their Institutional Review Board (IRB). It's the IRB's responsibility to ensure that informed consent is obtained when appropriate, and only the IRB has the power to determine which studies might be exempt from the informed consent requirement. (Exemptions are usually granted in situations where consent is infeasible and risk is minimal, such as anonymous medical record reviews.) In my experience, this is well-understood by all researchers working with patient populations.

The situation becomes murkier to some when the topic is not medical research per se, but medical education research (say, writing down lists of grades on tests to see if medical students are performing better over time). When do you need consent from the medical students? And when do you need IRB approval? In fact, if you want to share the results of any such research, you have to go to the IRB. This is the first rule of research ethics. You don't collect a single piece of research data without clearing it with the IRB first. And the IRB decides whether you need informed consent from your students. (Default position: yes, you do.)

Lawrence and Hondras showed that this principle was very poorly understood at one chiropractic college (Palmer Center for Chiropractic Research in Iowa). Specifically, they found that among faculty survey respondents (and only 55% of faculty responded at all), there was widespread ignorance of policies for medical education research. To cite just a couple of their numbers, 65% of respondents were unsure whether there were any policies in place at all for student consent in education research. Only 27% of respondents correctly noted that students can decline consent for such research!

Unfortunately, I genuinely don't know how much better the situation would be at a real medical school. Points to the back-crackers for taking the time to research this subject.

Wednesday, July 18, 2007

Move aside, the cream and the clear. Here comes the electroacupuncturist!

I thought I'd start off with something nice and easy for my first post, so I looked up the most recent issue of the Journal of Alternative and Complementary Medicine and picked an article more or less at random: Bilateral Effect of Unilateral Electroacupuncture on Muscle Strength by Li-Ping Huang et al. Performance-enhancing woo!

The gist: The authors randomly assigned 30 young men to two groups. Group 1 received electroacupuncture (that is, acupuncture with a weak electrical current delivered in continuous pulses through the needle) in their right leg three times a week for four weeks. Group 2 received no intervention. Dorsiflexion (foot elevation) strength was measured for each subject, in each leg, before and after the four-week period, by means of a homemade device.

The results: Well, how about a mean 21.3% increase in strength in the right legs and a 15.2% increase in strength in the left legs of Group 1, compared to a measly 3.0% right leg and 4.8% left leg increase in Group 2? How's that for impressive, Mr. / Ms. Skeptic? Someone alert Barry Bonds!

The problems: Refreshingly, the statistical analysis was fairly reasonable and appropriate. The authors used a repeated-measures analysis of variance, with timepoint (pre vs. post), leg (right vs. left) and group as factors. There isn't enough detail to fully evaluate their statistics, but the basic idea is correct.

No, the main problem here was blinding. As in, there was no blinding. What does this mean? It means that, at all stages of the experiment, all the participants and all the investigators knew who was getting electroacupuncture and who wasn't. Why is this a problem? Because biases (both conscious and subconscious) on the part of the participants and the investigators can dramatically skew results.

Participants in Group 1 knew they were getting a treatment that was supposed to increase their strength, and they knew that the investigators wanted them to have increased strength at the end of the trial. This is more than enough motivation for most of these participants to try really hard at the post-intervention strength test. Naturally, individual strength performances can be strongly influenced by motivation and willpower. In the extreme, it could even have been enough motivation for some participants to, gasp, exercise in between sessions. Conversely, Group 2 knew that they weren't supposed to get stronger and that the investigators didn't want them to get stronger.

And as for the investigators: they could have been sending subtle (or not so subtle) signals to the participants to try harder (Group 1) or not so hard (Group 2) during the post-intervention strength test. This problem could have been easily mitigated by using blinded, independent assessors for the second strength test. Sadly, this was not the case.
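To see how big a deal this can be, here's a toy simulation (invented effect sizes, not the study's data) in which the needles do absolutely nothing and the entire group difference is manufactured by the motivation bias of an unblinded strength test.

```python
# Toy simulation: an unblinded outcome assessment manufactures an "effect."
# The bias magnitudes are made up for illustration.
import random
import statistics

random.seed(7)

N = 15  # subjects per group, as in the electroacupuncture study

def strength_change(motivation_bias):
    # True training effect is zero; each percent-change score is pure
    # measurement noise plus whatever extra effort the test coaxes out.
    return [random.gauss(0, 3) + motivation_bias for _ in range(N)]

# Unblinded follow-up test: treated subjects (and their assessors) push
# for a good number; controls know they're "not supposed" to improve.
treated = strength_change(motivation_bias=15.0)
control = strength_change(motivation_bias=2.0)

print(f"treated mean change: {statistics.mean(treated):+.1f}%")
print(f"control mean change: {statistics.mean(control):+.1f}%")

# With blinded, independent assessors, the bias term vanishes for both
# groups -- and so, in this toy world, does the entire "effect."
blinded_treated = strength_change(motivation_bias=0.0)
blinded_control = strength_change(motivation_bias=0.0)
blinded_diff = statistics.mean(blinded_treated) - statistics.mean(blinded_control)
print(f"blinded group difference: {blinded_diff:+.1f}%")
```

Note that nothing in the "unblinded" arm of the simulation involves any physiology at all; a difference in effort on test day is enough to reproduce numbers of the magnitude the authors report.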

Oh, and one last smallish problem with the study: utter biological implausibility. I mean, come on, you stick needles in the right leg and the left leg gets stronger? Scientist, please.

Favorite line: We did not collect data with respect to the balance of qi, whereas the subjects in this study were apparently healthy.

Inaugural post

Welcome to SkepStat! My plan for this blog is to critique (and perhaps occasionally praise) the statistical methods used in current research articles. Sort of a post-publication statistical peer-review. Unfortunately, too few journals engage in quality pre-publication statistical review, so there should be no shortage of material. I'm going to be focusing on debunking research that seems implausible, poorly conducted, or just silly, especially research into so-called complementary and alternative medicine (CAM). But anything is fair game!

My hope is that this format will provide some really good case studies for introducing or clarifying statistical concepts and methods. So, we'll have some fun, and maybe we'll learn a lesson or two along the way.