Wednesday, November 21, 2007

Listen to the sound of my voice. You are getting verrry pregggnant....

Today's post represents a departure for SkepStat: I'm going to be discussing an article from a respected mainstream journal (Fertility and Sterility, ranked 4th among reproductive biology journals, according to ISI impact factors), rather than from a fish-in-the-barrel clueless CAM journal. The article, by Eliahu Levitas, M.D., et al., of the Soroka University Medical Center, Beer-Sheva, Israel, is entitled "Impact of hypnosis during embryo transfer on the outcome of in vitro fertilization–embryo transfer: a case-control study."

Biological plausibility

Another departure is that, while hypnosis isn't exactly on the soundest of evidence bases for medical applications, it's not in the same league of quackery as energy healing or electroacupuncture. However, it's very difficult to imagine any biological plausibility for an effect of hypnosis on the success rate of in vitro fertilization. I'm admittedly straying far from my expertise here, and will welcome correction in the comments, but the only mechanism of action for such a relationship that I can think of would involve psychological factors influencing reproductive hormone release. Which is not in itself implausible by any means, but IVF patients receive such high doses of exogenous hormones that it seems unlikely that additional suggestion-induced endogenous hormone release could make any difference.

The authors, interestingly, suggest that any efficacy of hypnosis for IVF success would stem from reducing patient anxiety, which they assert can negatively affect success rate: "Patients perceive [embryo transfer] as the culmination of the IVF treatment, and therefore stress is often present. Patient fears are related to a potentially negative treatment outcome as well as to any possible discomfort related to the procedure." In the discussion, they draw a vague link between anxiety and uterine contractions that may disrupt successful embryo transfer. If this were the case, I could of course suggest about a dozen better-proven treatments for acute anxiety than hypnosis, but I suppose that's beside the point.

When is a case-control study not a case-control study?

Biological plausibility or implausibility aside, we're here to review the evidence and, more specifically, the methods used to acquire and interpret that evidence. Let's start with the study design: The authors describe this, in the title, as a case-control study. It's not a case-control study. A case-control study involves identifying a group of subjects with some health outcome of interest (the "cases"), matching them with another group of subjects without the outcome of interest but otherwise as similar as possible to the cases (the "controls"), and then comparing the two groups on the rate of some exposure of interest (perhaps an environmental or genetic factor, perhaps a therapeutic intervention). If the cases have a significantly greater or smaller rate of the exposure than the controls, you would conclude that the exposure might be associated with the outcome. Case-control studies are often used in studying relatively rare health outcomes, and are mostly effective in testing for associations between outcomes and relatively common exposures. For instance, the link between lung cancer and smoking was established through a series of very large case-control studies (take a bunch of lung cancer patients, a bunch of otherwise similar people without lung cancer, compare the rates of smoking between the groups).
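To make the case-control arithmetic concrete, here's a toy calculation with invented counts (the smoking numbers below are purely illustrative, not data from any real study). The natural measure of association in this design is the odds ratio of exposure between cases and controls:

```python
# Toy case-control arithmetic with invented counts (illustrative only):
# how often were lung-cancer cases exposed (smokers) compared with
# their matched controls?

cases_exposed, cases_unexposed = 80, 20        # 100 cases
controls_exposed, controls_unexposed = 40, 60  # 100 matched controls

odds_cases = cases_exposed / cases_unexposed           # 4.0
odds_controls = controls_exposed / controls_unexposed  # ~0.67

# The odds ratio: how much more likely were cases to carry the
# exposure than controls?
odds_ratio = odds_cases / odds_controls
print(f"odds ratio = {odds_ratio:.1f}")  # odds ratio = 6.0
```

An odds ratio well above 1, as here, is the case-control signature of an exposure-outcome association; the design never needs to measure the outcome rate directly, which is exactly why it works for rare outcomes and common exposures.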

So, what would a case-control study of the relationship between hypnosis and IVF success look like? You'd start with a group of women with successful IVF pregnancies, you match them to an otherwise similar group of women without successful IVF pregnancies, and then you'd compare the rates of hypnosis administration during embryo transfer (ET) between the two groups. Obviously, such a study couldn't possibly work, because the exposure of interest (hypnosis during ET) is far too rare. Luckily, despite their assertions to the contrary in the title of their paper, the authors didn't actually conduct a case-control study. What they actually did was recruit a bunch of women to receive hypnosis during ET, find an otherwise similar group of women from their IVF practice who didn't receive hypnosis, and then compare success rates between the two groups. This puts the study design in a nebulous category somewhere between a controlled clinical trial (in which patients are assigned to receive an experimental or control therapy, and then outcomes are compared) and a cohort study (in which patients who do and do not receive an exposure of interest are followed over time to compare outcomes).

If pressed, I would describe this study design as a "clinical trial with matched historical controls". It's a suboptimal design in many ways, compared with a proper randomized controlled trial. (And, given the circumstances described in the paper, there's no reason whatsoever that an RCT couldn't have been conducted just as easily as this ad hoc experimental design.)

The first problem is that the controls were subjected to "usual care," rather than a protocolized control treatment, which would have involved deliberately treating the controls as similarly as possible to the experimental subjects (including using the same IVF providers, the same equipment, at the same time of day, etc., etc.).

Second, and most obviously, the treatments weren't randomly assigned. All of the patients who consented to hypnosis were given hypnosis. None of the controls consented to hypnosis - either they refused, or it wasn't offered. There may well be (and apparently are, as we shall see) meaningful differences between women who are and are not willing to undergo hypnosis. Randomization would have meant comparing women who consented to undergo hypnosis and who received hypnosis to women who consented to undergo hypnosis but did not receive hypnosis. Apples to apples, not apples to oranges.

Also, there was no blinding to treatment assignment at any stage of the experiment. This would include during the process by which controls were being matched to hypnosis subjects. This presents a big opportunity for investigator biases to influence (either consciously or subconsciously) the selection of controls to be women generally less likely to have successful IVF. (And the controls did apparently have a lower baseline chance of success, as, again, we shall see.)

Selection bias, table for two (or more ETs)

A huge potential problem with this study (and the one that caught my eye first when reading the abstract) is that the "subjects" in the study (the "independent sampling units," in statistical terms) aren't women or couples, but cycles. Meaning that some women were followed for more than one cycle and each cycle was included in the analysis as if it represented an independent trial. My original concern with the study is that this violates the "statistical independence" assumption that underlies most of the basic statistical techniques (such as were used in the paper, described below). Multiple cycles from the same woman have more in common with each other than with cycles from other women, and failing to take this into account in data analysis can lead to completely invalid results. On closer reading, however, I don't think violation of independence poses such a huge problem here, since there were only a handful of women who contributed repeated cycles....
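To see why violating statistical independence matters in general, here's a small simulation (every number in it is invented). Both groups are drawn from the same population, so the true difference is zero and any "significant" result is a false positive; but because each woman's repeated cycles share her underlying prognosis, a chi-square test that treats cycles as independent subjects rejects far more often than its nominal 5%:

```python
import random

random.seed(1)

CRIT = 3.841  # chi-square critical value, 1 df, alpha = 0.05

def chisq_2x2(s1, f1, s2, f2):
    """Pearson chi-square statistic for a 2x2 table of
    (successes, failures) in two groups."""
    n = s1 + f1 + s2 + f2
    row1, row2 = s1 + f1, s2 + f2
    col1, col2 = s1 + s2, f1 + f2
    stat = 0.0
    for obs, r, c in [(s1, row1, col1), (f1, row1, col2),
                      (s2, row2, col1), (f2, row2, col2)]:
        exp = r * c / n
        stat += (obs - exp) ** 2 / exp
    return stat

def group_successes(cycles_per_woman, n_women=100):
    """Each woman has her own latent success probability, so her
    repeated cycles are positively correlated with each other."""
    total = 0
    for _ in range(n_women):
        p = random.betavariate(2, 2)  # woman-level probability
        total += sum(random.random() < p for _ in range(cycles_per_woman))
    return total

def false_positive_rate(cycles_per_woman, reps=1000):
    """Both groups come from the SAME population (true effect = zero),
    but we test cycles as if they were independent subjects."""
    n = 100 * cycles_per_woman  # cycles per group
    hits = 0
    for _ in range(reps):
        s1 = group_successes(cycles_per_woman)
        s2 = group_successes(cycles_per_woman)
        if chisq_2x2(s1, n - s1, s2, n - s2) > CRIT:
            hits += 1
    return hits / reps

print("1 cycle per woman :", false_positive_rate(1))  # near the nominal 0.05
print("5 cycles per woman:", false_positive_rate(5))  # inflated well above 0.05
```

With one cycle per woman the test behaves as advertised; pile five correlated cycles onto each woman and the false-positive rate roughly triples. A handful of repeated cycles, as in this paper, produces a much milder version of the same distortion.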

BUT, all those women came from the hypnosis group (98 cycles from 89 women), not from the control group (96 cycles from 96 women). What does that mean? It means that some of the women from the hypnosis group went through a cycle, failed to conceive, and were given a do-over. Maybe a few women from the hypnosis group were followed until they conceived, maybe the ones with the best prognosis. And this same privilege wasn't extended to any of the women in the control group. This is putting your thumb on the scale, big time. It's like comparing two cold remedies, following most people for one week, but following a handful of members of just one of the two groups until they got better. Obviously the group that gets the do-overs is going to have more successes.
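The cold-remedy analogy is easy to simulate. In this sketch (the recovery probability is invented), the two "remedies" are pharmacologically identical, but a third of one group gets extra weeks of follow-up, mirroring the extra cycles granted only to the hypnosis group:

```python
import random

random.seed(2)

P_RECOVER = 0.3  # same true weekly recovery chance for BOTH remedies
N = 10_000       # patients per remedy

def recovered(max_weeks):
    """Did this patient recover within max_weeks of follow-up?"""
    return any(random.random() < P_RECOVER for _ in range(max_weeks))

# Remedy A: everyone is observed for exactly one week.
rate_a = sum(recovered(1) for _ in range(N)) / N

# Remedy B: identical drug, but a third of patients get up to three
# weeks of follow-up ("do-overs") before being scored.
rate_b = sum(recovered(3 if i % 3 == 0 else 1) for i in range(N)) / N

print(f"remedy A: {rate_a:.2f}")  # ~0.30
print(f"remedy B: {rate_b:.2f}")  # noticeably higher, with zero true effect
```

The do-over group "wins" by a wide margin even though the two treatments are, by construction, exactly equally effective. That's the thumb on the scale.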

If you don't know what your experiment is, does that mean you don't know how to analyze it?

Not necessarily, but it's certainly not a good sign! (I actually consider the incorrect description of the study, including in the title, a much greater - almost unforgivable - failing on the part of the reviewers and editors than of the authors.) The data analysis is described in the paper in three short paragraphs:

Univariate analysis was performed using χ2, Fisher’s exact test, Wilcoxon matched-pairs signed-ranks test, and one-way analysis of variance test when appropriate.

To evaluate the effect of hypnosis during ET on pregnancy occurrence adjusted to the different confounding factors, logistic regression analysis was performed for the dichotomic dependent variable—pregnancy—with the independent variables found significant in univariate analysis, such as hypnosis during ET.

Statistical analyses were performed using Statistical Programs for the Social Sciences (SPSS, version 11.0, Chicago) software programs. P<.05 was considered statistically significant.

Ok, the first paragraph is a little like describing the operation of a motor vehicle as "application of the gas pedal, the windshield wipers and the horn when appropriate." There's no evidence here that they did anything terribly wrong, but there's no evidence that they did anything correctly either.

Fisher's exact test and Pearson's chi-square (χ2) test are both used to test the statistical significance of associations between categorical variables; Fisher's exact test is generally preferred when both variables are dichotomous (i.e., binary or two-valued) and some cells of the resulting 2×2 table have small expected counts. It's impossible to tell when the authors chose one over the other, since they only report whether each test is significant, and don't include test statistics, degrees of freedom, or p-values. (A big no-no in scientific write-ups.) But here's the main thing: since the samples were one-to-one matched, case to control, they could have (and should have) used a statistical analysis that accounts for the pairings. This would have been either McNemar's test, or a one-sample binomial test on the paired differences.
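For the curious, the exact form of McNemar's test fits in a few lines of Python. The pair counts below are hypothetical, since the paper reports only group totals, never the paired outcomes a matched analysis would actually need:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar test on matched pairs. Only the discordant pairs
    matter: b pairs where the treated member succeeded and the matched
    control failed, c pairs the other way around. Under the null
    hypothesis each discordant pair is a fair coin flip, so this is an
    exact two-sided binomial test on the split."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts for 96 matched pairs (NOT the paper's data):
# 30 discordant pairs favoring hypnosis, 12 favoring control.
print(f"p = {mcnemar_exact(30, 12):.4f}")
```

Note that the concordant pairs (both succeeded, or both failed) drop out entirely, which is exactly the information the matching buys you and exactly what an unpaired chi-square test throws away.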

It's not at all unreasonable to expect them to know this, since they did claim to use a non-parametric matched-sample test (the Wilcoxon signed-rank test) at least some of the time. Again, it's not clear when they used this test (which does not assume that variables are normally distributed) and when they used the parametric one-way ANOVA (which does). And this is another red flag of profound statistical ignorance: when comparing two paired groups on the mean level of a continuous variable (with an underlying normal distribution), the standard approach is a paired t-test. ANOVA is used to compare more than two groups on the mean of a continuous variable. And, while it is the case that an ANOVA conducted with two groups instead of more than two groups is mathematically equivalent to a t-test, there is no paired samples version of one-way ANOVA, and thus no equivalence between one-way ANOVA and a paired t-test. (The previous sentence was edited for general incorrectness; as commenter Efrique helpfully pointed out, there is a correspondence between a special case of two-way ANOVA and a paired t-test.) And no one competent in statistical analysis would describe a two-sample test as an ANOVA anyway, even in the circumstances where that would be technically correct. It's another tell.
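The correspondence Efrique points out (see the comments below) is easy to verify numerically: a paired t-test is equivalent to the treatment F-test in a two-way ANOVA on a randomized block design with subjects as blocks, and the F statistic equals the square of the t statistic. Here's a check on made-up measurements:

```python
import math

# Made-up paired measurements: six subjects, two conditions each.
before = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
after  = [5.6, 5.0, 6.3, 5.4, 5.3, 6.1]
n = len(before)

# Paired t-test: a one-sample t-test on the within-pair differences.
d = [a - b for a, b in zip(after, before)]
mean_d = sum(d) / n
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
t = mean_d / math.sqrt(var_d / n)

# Two-way ANOVA on a randomized complete block design (subjects as
# blocks, condition as treatment): F statistic for the treatment effect.
grand = (sum(before) + sum(after)) / (2 * n)
ss_treat = n * sum((sum(g) / n - grand) ** 2 for g in (before, after))
subj_means = [(b + a) / 2 for b, a in zip(before, after)]
ss_block = 2 * sum((m - grand) ** 2 for m in subj_means)
ss_total = sum((x - grand) ** 2 for g in (before, after) for x in g)
ss_error = ss_total - ss_treat - ss_block
F = ss_treat / (ss_error / (n - 1))  # treatment df = 1, error df = n - 1

print(f"t^2 = {t ** 2:.4f}, F = {F:.4f}")  # identical, up to float rounding
```

Removing the block (subject) sum of squares from the error term is precisely what "accounting for the pairing" means in the ANOVA framework; drop the ss_block line and you're back to the (wrong, for paired data) one-way ANOVA.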

The main analysis described is a logistic regression. This is an appropriate choice, but again, the authors should have taken the one-to-one matching into account in the analysis (by means of a conditional logistic regression, for instance), and failed to do so.

(Of course, how they matched 98 hypnosis cycles to 96 control cycles in a one-to-one fashion is anyone's guess, but they do claim to have done so.)

The results (?)

The hypnosis group had more successful pregnancies (53% of cycles) than the non-hypnosis group (30% of cycles). This is described as a statistically significant difference, although such a claim would presuppose an experimental design and analytic plan that allowed for valid statistical significance tests. A dubious presupposition, at best.

But far more interesting (and very telling) are some of the results comparing the baseline characteristics of the hypnosis and control groups (which, remember, were not randomly allocated):
  • The women in the hypnosis group had been infertile for an average of 4.7 years, but the women in the control group had been infertile for an average of 7.4 years (a statistically significant difference).
  • 47% of the women in the hypnosis group had primary infertility, compared with 74.2% of the women in the control group (a statistically significant difference). (Primary infertility means that the couple had never been able to conceive, as opposed to secondary infertility, meaning that the couple already has at least one child.)
  • 18.4% of the women in the hypnosis group had "unexplained" infertility, compared with 10.3% of the women in the control group (not a statistically significant difference).
All of these differences point to the same thing: more severe infertility in the control group than in the hypnosis group. Probably more than enough to explain any observed difference in pregnancy rates following embryo transfer. Some of these factors were included in the logistic regression model to try to control for the differences but, as the old statistical maxim goes, "You can't fix by analysis what you screwed up by design."
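A quick simulation shows how baseline imbalance alone can manufacture a gap of this size. The per-cycle success probabilities below are invented, chosen only to echo the direction of the paper's imbalances; hypnosis does nothing at all in this model:

```python
import random

random.seed(3)

N = 5_000  # cycles per group

# Hypnosis has ZERO effect here; the groups simply start with different
# baseline prognoses (hypothetical numbers, not taken from the paper).
p_better_prognosis = 0.45  # e.g., shorter infertility duration
p_worse_prognosis  = 0.30  # e.g., longer duration, primary infertility

hyp  = sum(random.random() < p_better_prognosis for _ in range(N)) / N
ctrl = sum(random.random() < p_worse_prognosis for _ in range(N)) / N

print(f"hypnosis {hyp:.2f} vs control {ctrl:.2f} -- with zero treatment effect")
```

An unadjusted comparison of these two groups would declare hypnosis a resounding success, which is why non-random allocation plus prognostic imbalance is fatal no matter how the downstream tests are run.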

In short, a fatally flawed paper, and some astonishingly derelict editorship from a legitimate journal.


Anonymous said...

Nice analysis of this study. You should submit it to a journal or paper to get more people to read it and invalidate this study. It sure looks like someone had something to prove rather than do real science.

Damian said...

Yes, excellent dissection. It's good to have a proper statistician onboard!

Efrique said...

I only found your blog today. Interesting!

I hope you don't mind if I play a little Devil's Advocate.

And this is another red flag of profound statistical ignorance: when comparing two paired groups on the mean level of a continuous variable (with an underlying normal distribution), the standard approach is a paired t-test. ANOVA is used to compare more than two groups on the mean of a continuous variable.

I would not call that "profound ignorance" by any stretch of the imagination. It's perfectly reasonable to regard two sample procedures as special cases of multisample procedures. It may be an unusual use of terminology, but that in itself doesn't in any way invalidate the analysis. Ignorance of the usual terminology, at worst, but I wouldn't call it profound.

And, while it is the case that an ANOVA conducted with two groups instead of more than two groups is mathematically equivalent to a t-test, there is no paired samples version of ANOVA, and thus no equivalence between ANOVA and a paired t-test.

I guess this will be very upsetting for all those statisticians who have been using ANOVA to analyze randomized block designs all these years. Apparently they've been doing something that doesn't exist.

[Sure, it's not one-way ANOVA (obviously), but they definitely call it ANOVA. A paired t-test is equivalent to an ANOVA on a randomized (complete) blocks design with two items per block and two treatments.]

And no one competent in statistical analysis would describe a two-sample test as an ANOVA anyway, even in the circumstances where that would be technically correct.

I dispute the assertion in this case. It's perfectly possible to be competent (since that is exactly how I'd describe someone who is technically correct) even while unfamiliar with the common use of terminology. Terminology varies from area to area (sometimes to the extent that the same things are reinvented under different names).

Again, you might describe them as "ignorant" on that basis, I suppose, since what you describe is pretty standard, but not knowing the common terms isn't, of itself, what makes them incompetent.

js said...

Efrique, thanks for your comments. Excellent points. Let's see....

I think I really do consider the ignorance of terminology to be highly indicative of general statistical ignorance here. You're absolutely right that not knowing the standard terminology doesn't necessarily imply not knowing how methods work or how to apply them, but I consider it a very bad sign. If someone told me they'd made an astronomical discovery by looking through "the pointy metal apparatus with the lenses," I'd be highly suspicious of their claims. And I just don't consider it in the range of normal variation (if you will) among persons adequately educated in applied statistics to describe a two-sample t-test as an ANOVA in a context such as this paper. (I tried to make this distinction in the post, by describing the use of the term ANOVA for a two-sample test as a "tell," rather than conclusive proof of incompetence, but I may not have been clear enough about the distinction.)

You may be right, though, that "profound ignorance" was too strong. Perhaps in a different context, I would have taken it as more of a slip up. There were a number of other aspects of the paper that were "suggestive" of the authors just generally not knowing what they were doing.

And thanks especially for the correction about the equivalence between the paired t-test and the special case of two-way ANOVA you describe. Quite right, of course, and what I had written was misleading at best. (I was thinking of one-way ANOVA, as you suggested, but the statement of a lack of equivalence was far too categorical.) I'll edit the post to clear this up.