Wednesday, August 1, 2018

Submission to BJSM on Unreliable Data and Results in IAAF Science on Testosterone

Along with Ross Tucker and Erik Boye, today I submitted our revised paper on erroneous data and unreliable results in Bermon and Garnier (2017), the IAAF study that underpins its new testosterone regulations.

The full paper can be found here in PDF.

For additional background see here, here, here. I am happy to hear your comments or questions, here in the comments or on Twitter.

Friday, July 27, 2018

BJSM Lets Stand a Deeply Flawed Paper, Why?

A few weeks ago the New York Times wrote about a paper we had submitted to the British Journal of Sports Medicine calling for Bermon and Garnier (2017, BG17) to be retracted. You can get the back story at the links in the previous sentence, but two things to understand up front:
  • BG17 is not just any old scientific paper -- it is the only scientific basis for regulations to be implemented by the International Association of Athletics Federations (IAAF) governing naturally occurring testosterone in female athletes.
  • Calling for a retraction of a scientific paper is not something to be done lightly. BG17 is the first paper that I have publicly called to be retracted in 25+ years of publishing, reviewing, serving on editorial boards and studying science in policy. Yes, it is that bad.
Today the editor of BJSM emailed with the following information (quoted in full from his message):
1. The BJSM editorial team has considered the various points raised to us about retracting BG17 (including yours) and stand by our decision that retraction would be inappropriate. 
2. We respect the authors’ decision not to open these data even though we support the general principle of data sharing.
No retraction, no sharing of data.

When should a paper be retracted? Fortunately, the publisher of BJSM has a policy on retraction which states:
Retractions are considered by journal editors in cases of evidence of unreliable data or findings, plagiarism, duplicate publication, and unethical research.
This retraction policy is similar to the recommendation of the Committee on Publication Ethics (COPE), whose guidelines are followed by most scientific publishers (PDF):
Retraction is a mechanism for correcting the literature and alerting readers to publications that contain such seriously flawed or erroneous data that their findings and conclusions cannot be relied upon.
Why have we called for BJSM to retract BG17? Because of seriously flawed and erroneous data such that the paper's conclusions cannot be relied on. This is such a clear case that it is baffling why BJSM has chosen not only to let the paper stand, but to not require the paper's flawed data to be shared openly.

Why is the case so clear?
An editorial board should be so lucky as to have such a clear-cut case. It's a no-brainer. The message to the authors of BG17 should be: Sorry guys but this effort is so flawed that we are going to pull it. End of story.

So why did the BJSM editorial board act as they did? I have no insight on their internal deliberations, but given the retraction policy of the publisher of BJSM and the ethical guidelines suggested by COPE, there logically can be only three possibilities.
  • The BJSM editorial board disagrees with our analysis and the statement of the lead author of BG17 that there are pervasive errors underlying the original analysis. This would be a very odd position to take, as it is contrary to both evidence and the admission of the researchers who wrote BG17 and BHKE18.
  • The BJSM editorial board accepts that there are pervasive errors in BG17 and has decided to let the paper stand regardless. This too would be an odd position to take, as it is unethical and unscientific (according to COPE) and contrary to the retraction policy that BJSM is expected to follow. No scientific publisher worthy of the title would let flawed science stand. 
  • The BJSM editorial board is uncertain about the presence of pervasive errors in BG17 and in the face of this uncertainty has decided to let the paper stand. This would be an exceptionally odd position to take in light of the fact that BJSM has concluded (emphasis added), "We respect the authors’ decision not to open these data even though we support the general principle of data sharing." A really good way to understand the true depth of data errors would be for BJSM to require the authors of BG17 to release fully 100% of their data that has no privacy concerns.
So which is it?

The bottom line here is that BJSM has failed in its core scientific obligations. By all appearances BJSM is acting in the interests of IAAF and protecting IAAF research from normal scientific scrutiny.  I have no idea why this is so, but it is a subject that I'll continue to pursue.

Ross Tucker, Erik Boye and I will be revising our submission to BJSM and will ask to have it reviewed, published and linked to BG17. Obviously more to come, stay tuned.

(Note: This post represents my views only, though everyone is welcome to share them.)

Tuesday, July 17, 2018

A Deeper Dive into the Scientific Basis for IAAF Testosterone Regulations

This post represents some notes, references and quotes that I'd like to have accessible. Perhaps they are useful to others. It is technical and terse, but if you are following along on Bermon and Garnier (2017, BG17), it'll probably make sense. As always here, caveat lector.

The IAAF testosterone regulations focus on what is called "circulating testosterone" or "serum testosterone." These concepts as well as "free testosterone" are defined as follows by Goldman et al. 2017:
Total testosterone refers to the sum of the concentrations of protein-bound and unbound testosterone in circulation. The fraction of circulating testosterone that is unbound to any plasma protein is referred to as the free testosterone fraction.
T and fT are important variables in BG17. For reasons I do not fully understand (I welcome explanations), BG17 chose not to analyze athletes according to T values, but instead divided them into tertiles based on fT levels.
In order to test the influence of serum androgen levels on athletic performance, in each of the 21 female athletic events and 22 male athletic events, athletes were classified in tertiles according to their fT concentration.
There are two methodological issues/questions here:
  1. Why use tertiles rather than look at correlating all data in continuous fashion (discussed here)?
  2. Why use fT at all, as T is measured in the study and fT is calculated (a point raised by Sönksen et al. 2018)?
There is no good answer to #1.

I welcome being educated on #2. 
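
To make question #1 concrete, here is a minimal Python sketch of the two approaches: binning athletes into tertiles and comparing group means, versus correlating all of the data points. The fT values and times below are made-up illustrative numbers, not data from BG17, and the function names are my own. It is a sketch of the general idea, not a reconstruction of the BG17 methods.

```python
from statistics import mean

def tertile_labels(values):
    """Label each value 0 (lowest third), 1, or 2 (highest third) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 3 // len(values)
    return labels

def pearson(x, y):
    """Pearson correlation coefficient using every data point."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical values for illustration only (NOT from BG17 or BHKE18).
ft = [2.1, 3.4, 4.0, 5.2, 6.3, 7.7, 8.1, 9.6, 10.2]
time_s = [52.9, 52.1, 53.0, 52.4, 52.6, 52.2, 52.8, 52.0, 52.5]

# Tertile approach: compare mean times of the lowest and highest fT groups.
labels = tertile_labels(ft)
low_mean = mean(t for t, g in zip(time_s, labels) if g == 0)
high_mean = mean(t for t, g in zip(time_s, labels) if g == 2)
tertile_gap = low_mean - high_mean

# Continuous approach: one correlation across all athletes.
r = pearson(ft, time_s)

print(f"tertile gap (s): {tertile_gap:.3f}, correlation: {r:.3f}")
```

Binning reduces each athlete to a group label, so differences within a tertile are discarded and the middle group drops out of the comparison entirely; the correlation uses every value.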

However, there is an interesting twist here. In my reading and researching this topic, I came across a recent paper by David Handelsman (2017) on the use of "free testosterone" in clinical research. For those scoring at home, Handelsman is the lead author of the other new peer-reviewed paper (in addition to BG17) that is cited in the IAAF T regulations.

Handelsman (2017) says this about "free testosterone":
Despite being extant for decades, the use of FT measurement has barely been stated in a testable form and virtually never directly tested as a refutable hypothesis. Rather by inference and repetition as if it is self-evident, it has become entrenched as an enthusiastically wielded yet largely untested concept that goes from one paper to the next without ever seeming to pass through a questioning mind.
Hmm ... more:
A valid scientific concept requires a sound foundational theory and evidence and being open to testing and refutation not just an unshakeable belief. Introducing the FT into clinical guidelines is particularly hard to understand as using that unproven criterion can only provide false reassurance by merely shifting the uncertainty to an even shakier footing in a subtle bait-and-switch.
OK, so Handelsman is not a fan of "free testosterone" as a meaningful metric to be used in clinical guidelines. Got it.

Yet, fT is the basis for the binning of athletes that underlies the analysis and conclusions of BG17. Even if we were to ignore the clearly fatal data problems that we have identified, it would seem that one expert that the IAAF relies on has eviscerated the very basis for the study performed by the other experts that IAAF relies on. Even with perfect data, BG17 is problematic.

This provides yet another reason why BG17 doesn't form a legitimate basis for any regulatory action. At CAS, lawyers will surely have a great time asking Prof. Handelsman about his views on fT and its role in the methodologies of BG17.

But there is more. I discussed BHKE18 as a do-over, as it re-did the BG17 analysis after dropping some 220 data points. Important side note here: For comparison, BHKE18 says (emphasis added): 
We have excluded 230 observations, corrected some data capture errors and performed the modified analysis on a population of 1102 female athletes.
Below is my tabulation of data points in the new (BHKE18) and original (BG17), and the difference between the two representing what I've called here bad data. It appears that BHKE18 has miscalculated how many bad data points that it identified (220 vs. 230) between the two studies. Sure it may be a typo, but it is exceedingly sloppy to get wrong the number of bad data points you are reporting:

Event          New (BHKE18) n   Original (BG17) n   Bad data
100 m          96               112                 16
100 m H        59               73                  14
200 m          59               71                  12
400 m          62               67                  5
400 m H        52               67                  15
800 m          56               64                  8
1500 m         55               66                  11
3000 m SC      49               56                  7
5000 m         36               40                  4
10 000 m       29               33                  4
Marathon       86               92                  6
Discus         36               48                  12
Hammer Throw   42               54                  12
Shot Put       42               54                  12
Javelin        42               55                  13
Long Jump      50               62                  12
Triple Jump    41               54                  13
High Jump      44               56                  12
Pole Vault     39               48                  9
20 km RW       80               97                  17
Heptathlon     47               53                  6
SUM                             1322                220
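
The totals in the table above are easy to check with a few lines of Python. This is just a sketch of the arithmetic: the counts are copied from the table, and the variable names are mine.

```python
# (event, n in BHKE18, n in BG17), copied from the table above
counts = [
    ("100 m", 96, 112), ("100 m H", 59, 73), ("200 m", 59, 71),
    ("400 m", 62, 67), ("400 m H", 52, 67), ("800 m", 56, 64),
    ("1500 m", 55, 66), ("3000 m SC", 49, 56), ("5000 m", 36, 40),
    ("10 000 m", 29, 33), ("Marathon", 86, 92), ("Discus", 36, 48),
    ("Hammer Throw", 42, 54), ("Shot Put", 42, 54), ("Javelin", 42, 55),
    ("Long Jump", 50, 62), ("Triple Jump", 41, 54), ("High Jump", 44, 56),
    ("Pole Vault", 39, 48), ("20 km RW", 80, 97), ("Heptathlon", 47, 53),
]

new_total = sum(n_new for _, n_new, _ in counts)
orig_total = sum(n_orig for _, _, n_orig in counts)
dropped = orig_total - new_total
share_dropped = dropped / orig_total

print(new_total, orig_total, dropped, round(100 * share_dropped, 1))
# prints: 1102 1322 220 16.6
```

The new total of 1102 matches the population that BHKE18 reports, and the difference from the original 1322 is 220, not the 230 stated in BHKE18.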

In addition to dropping the bad data, BHKE18 re-does the analysis focused on T not fT in response to Sönksen et al. 2018. In what sure looks like some kind of p-hacking, BHKE18 adopts some significant changes to their methodology (quotes from BHKE18 below, followed by my comments in italics):
  • BHKE18: "we used running events from 400 m up to 1 mile, on the basis that that is where T produces its greatest performance-enhancing effects"
    • This of course was a conclusion of BG17 and subsequently the focus of the IAAF regulations. With the bad data of BG17, there is no longer any evidence that these events are where T has the greatest effects. There is no basis here. This is affirming the consequent. 
  • BHKE18: "we have aggregated result from the long sprints (400 m events), then middle-distance runs (800 m and 1500 m), and finally long sprints and middle-distance runs (400 m, 800 m and 1500 m), into one event group for further statistical analysis."
    • No such grouping or aggregation was used in the original study. In fact BG17 said this: "These different athletic events were considered as distinct independent analyses and adjustment for multiple comparisons was not required." So a complete reversal, which is it?
  • BHKE18: "The time results were transformed into an index, that is, percentage of the best performance achieved by each event"
    • The original analysis focused on absolute times, not a percentage index.
  • BHKE18: "We have used the Spearman rankorder correlation coefficient to explore the correlation between competition results and testosterone levels, using a two-sided test at the 0.05 significance level."
    • Here we see a new statistical test (a common one at that), based on ranked ordering of values rather than the actual values themselves.
  • BHKE18: "Finally, we used a serum T threshold concentration of 2 nmol/L to identify a group of female athletes with ‘high T’ levels, for comparison against the results of athletes with T levels of less than 2 nmol/L (‘normal T’ levels)."
    • Where does 2 nmol/L come from? The IAAF regulations identify 5 nmol/L. 
It seems logical that if BHKE18 could have simply applied the same statistical methods of BG17 to the new dataset (minus the bad data) and obtained the same or similar results, they would have. The presence of so many methodological modifications is a big red flag.
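
For readers unfamiliar with the Spearman rank-order correlation mentioned in the list above, here is a minimal Python sketch of what it computes: the ordinary (Pearson) correlation applied to the ranks of the values rather than to the values themselves. The numbers are made up for illustration and are not data from BHKE18, the function names are my own, and the sketch does not reproduce the two-sided significance test that BHKE18 describes.

```python
import math
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values for illustration only (NOT from BHKE18).
t_nmol = [0.4, 0.9, 1.1, 1.6, 2.3, 4.8]          # serum T
perf_index = [97.1, 98.0, 96.5, 98.4, 97.9, 99.0]  # percentage index

rho = spearman(t_nmol, perf_index)
# Any order-preserving transformation of the inputs leaves rho unchanged.
rho_log = spearman([math.log(v) for v in t_nmol], perf_index)

print(round(rho, 3), round(rho_log, 3))
```

Because only the ordering matters, the size of the gap between two athletes' T values or times has no influence on the coefficient, which is one way this test differs from an analysis of the actual values.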

BHKE18 clearly represents a do-over from a flawed study. But the authors double down and characterize it as somehow being an independent verification of the flawed study:
In conclusion, our complementary statistical analysis and sensitivity analysis using a modified analysis population shows consistent and robust results and has strengthened the evidence from this study, where we have shown exploratory evidence that female athletes with the highest T concentration have a significant competitive advantage over those with lower T concentration, in 400 m, 400 m hurdles, 800 m and hammer throw, and that there is a very strong correlation between testosterone levels and best results obtained in the World Championships in those events. A similar trend is also observed for 1500 m and pole vault events.
The results of BHKE18 are not the same as those of BG17, which were the basis for the IAAF regulations. From where I sit this situation looks like this:
  • BG17 relied on flawed data and questionable methods and arrived at a set of results that formed the basis for IAAF regulations;
  • When the data was challenged and errors identified, it seems logical that the methods of BG17 could not reproduce the results of BG17 using the new dataset (without the flawed data);
  • But the IAAF regulations had already been released, focused on four specific events identified based on BG17;
  • Altering the regulations based on errors in BG17 would of course mean admitting that BG17 was flawed in important respects, undercutting the basis for the regulations and IAAF;
  • So it seems that a considerable variety of new methods were introduced in BHKE18 that allowed the reduced-form dataset to plausibly approximate the results of BG17;
  • The regulations thus stand as written;
  • The new conclusions of BHKE18 are characterized as reinforcing BG17, giving at least a surface impression of a greater scientific basis for the regulations. In fact, the opposite has occurred.
This episode helps to illustrate why it is not a good idea to have the organization responsible for implementing desired regulations also be in charge of performing the science that produces the evidence on which those regulations are based. This would seem obvious, but has not really taken hold in the world of sports governance.

There is of course a need for BJSM to require the authors to release all data and code for both papers. One question that independent researchers will want to ask is how the results look when the methods of BG17 are applied to the data of BHKE18. Why were the methodological innovations introduced and what were their quantitative effects? This is the basic sort of independent check that makes science strong.

It is a fascinating case and no doubt has a few more twists and turns to come. It'll make for a great case study when all is said and done.

Thursday, July 12, 2018

A Call for Bermon and Garnier (2017) to be Retracted

The New York Times has a story just out on an analysis we've done on a recent IAAF study. Take a seat, this is a bombshell and these are my individual views on it.

Earlier this year, the IAAF announced new regulations governing natural testosterone levels in female athletes. One of the few academic studies that the regulations refer to is Bermon and Garnier (2017, hereafter BG17), conducted by two IAAF researchers and published in the British Journal of Sports Medicine.

Earlier this year several of us (me, Ross Tucker and Erik Boye) formally asked Drs. Bermon and Garnier to release their data (the part not involving private medical data) for purposes of independent replication. Dr. Bermon shared with us a subset of that data last week.

What the shared data shows is absolutely remarkable and has led to the three of us submitting a "Discussion" (here in PDF, as submitted except for a few typos fixed and page numbers added) to BJSM calling for BG17 to be formally retracted.

Here is what we wrote in that submission:
Due to the pervasiveness of problematic data we are calling for Bermon and Garnier (2017) to be retracted immediately by the authors and by BJSM. If a new analysis is subsequently completed and submitted for publication, we request that it be done so only with a full, independent audit of the underlying data and results by a team committed to keeping private the associated medical data. Further, upon publication, any such analysis should also in parallel publish performance data (i.e. not the medical data with privacy concerns) such that replication of this part of the analysis is possible by any independent scholar.

This case illustrates the importance of data sharing in science as well as the role of independent checks on data with policy or regulatory significance. We encourage BJSM to adopt immediately a more rigorous policy on data availability consistent with best practices among scientific publishers. Mistakes happen. Science is robust because they can be corrected.
We identified 3 types of errors in their data:
  •  Duplicated athletes: more than one time is included for an individual. In each of these instances, more than one time from the 2011 and 2013 World Championships is included for the same athlete, contrary to the paper’s stated methods;
  • Duplicated times: the same time is repeated once or more for an individual athlete, which is clearly a data error;
  • Phantom times: no athlete could be found with the reported time for the event.
We also identified the inclusion of times from Russian athletes who had been disqualified due to doping. The Table below shows a summary of the number of problematic data points we found for four events in the BG17 analysis (400m, 400mH, 800m, 1500m).


We found between 17% and 33% problematic data in the four women's events and suggested that such errors may be present throughout other women's and men's data. This is unacceptable in a peer-reviewed scientific paper. Thus, we have called for retraction, as a matter of basic scientific integrity. It's not a difficult call.

Much to our surprise we subsequently learned that Dr. Bermon and three colleagues had published a new letter at BJSM just days before our submission (which I'll call BHKE18 hereafter). From all indications, BHKE18 represents a "do over" after they realized that they had serious data problems in the original work. 

BHKE18 unambiguously also confirms our identification of bad data. Just compare the number of data points included in BHKE18 versus BG17 shown in the graph below.

There are fully 220 data points eliminated from one analysis to the next, representing ~17% of the total. The elimination of data (which BHKE18 alludes to in passing as some double counting in BG17) clearly supports our critique.

And yet, the elimination of problematic data points still does not reconcile with our re-creation of the BG17 dataset for the four events that we looked at closely.

Data points for four women's events
Event     BG17   BHKE18   PTB18
400 m     67     62       45
400 m H   67     52       48
800 m     64     56       53
1500 m    66     55       51
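
The gaps between the three columns can be worked out in a few lines of Python. This is a sketch that assumes the PTB18 column counts the data points in our re-creation of the dataset, so that BG17 minus PTB18 is the number of problematic points per event; the variable names are mine.

```python
# (event, BG17 n, BHKE18 n, PTB18 n), copied from the table above
rows = [
    ("400 m", 67, 62, 45),
    ("400 m H", 67, 52, 48),
    ("800 m", 64, 56, 53),
    ("1500 m", 66, 55, 51),
]

shares = {}      # problematic share of the BG17 data points, per event
remaining = {}   # BHKE18 data points beyond our re-creation, per event
for event, bg17, bhke18, ptb18 in rows:
    shares[event] = (bg17 - ptb18) / bg17
    remaining[event] = bhke18 - ptb18
    print(f"{event}: {100 * shares[event]:.0f}% of BG17 problematic, "
          f"{remaining[event]} still unreconciled in BHKE18")
```

Under that assumption the problematic share runs from about 17% (800 m) to about 33% (400 m), and every event still has more data points in BHKE18 than in our re-creation.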

It appears that there remains problematic data. Further, the new letter is not peer reviewed, nor are its data publicly available for replication. By not being candid about their data errors in BG17, Bermon and colleagues have added confusion on top of confusion. This is not how science is supposed to work.

Mistakes are made; it is inevitable. What matters is what happens after that.

Here is what my colleague Erik Boye, Oslo University Hospital, says about this episode:
A set of data normally follows publications like BG17. The conclusions are linked to the data and their interpretation and the data must be made available to the general public. That is basic in science. If now the authors have received some help to understand that their data are fraught with errors they should call for a retraction and resubmit a new paper with new data if they so wish. We have pointed out this to the IAAF and to the publisher. None of them appear to handle this well. It is unacceptable that the paper stands and that a few people are informed that there were serious errors attached to the data and that unseen changes have been made to the data set. Furthermore, there is no sign that the new set of data has been subjected to any more of a critical review or that it will be released for external scrutiny.

For this reasons we should insist that scientific standards and rules are followed. In my practice at editorial boards (the EMBO and FEBS publications) I am certain that such a faulty data set would have released a demand for a retraction, with the possibility of a resubmittal.
I agree 100%. There is only one acceptable outcome here. BG17 must be retracted by BJSM. This could be done by a request from the authors or by BJSM itself. You do not get a "do over" in research when such pervasive errors are made. If Bermon et al. wish to submit another analysis for peer review, I'd expect that the data should be provided and a full audit done prior to publication.

By all indications neither the authors of BG17 nor IAAF intend to retract the paper. This says something about conflicts of interest in research, I would think. Thus, the ball is in the BJSM court. This will be a test of scientific integrity standards at BJSM. I hope they pass, for the sake of BJSM and of research integrity.

The IAAF analysis is far too important to be treated in such sloppy fashion. I'll be following up on the significance of the flawed data, IAAF's refusal to retract and what it might mean for the fate of the IAAF T regulations in days to come.

Tuesday, June 5, 2018

IAAF Opens Up on Testosterone: Some Reactions

My experience is that sports organizations rarely like to engage in public. However, this norm seems to be evolving, perhaps motivated both by necessity and by a newer commitment to engagement among forward-thinking sports administrators.

The IAAF, via one of its lawyers, Jonathan Taylor of Bird & Bird, has written a lengthy response to a Sports Integrity Initiative article on proposed new testosterone regulations. That Sports Integrity Initiative commentary can be found here. I am less interested in the back-and-forth than in what the IAAF response says about their proposed approach to testosterone regulation.

In this post I offer a few thoughts on the new IAAF arguments and applaud their commitment to public engagement. In that spirit, if Mr. Taylor or IAAF wish to comment here, I'm happy to host their views. Sport is better through such engagement, even (especially) when there is disagreement that can be clearly articulated.

Can IAAF Regulate Sport According to Athlete Biological Characteristics?

The answer here is clearly "yes."

Sports organizations routinely segregate athletes by biological characteristics, most obviously by weight classes in boxing and wrestling, and of course systematically in the Paralympics.

There are two logical fallacies here that are worth discarding up front, one typically advanced by opponents to T regulations and one advanced by IAAF in support of T regulations. They are:
  • Fallacy #1: Governing bodies do not (generally) regulate other "natural advantages" so IAAF cannot regulate T. 
  • Fallacy #2: Governing bodies do (sometimes) regulate other "natural advantages" so IAAF should regulate T.
The issue here is not going to be settled by invocation of general principles, but rather, the specific question of whether it is appropriate for IAAF to regulate women's athletics based on endogenous levels of T across four events.

What a big-picture view can tell us, however, is that biological regulation of athletes in the disciplines of athletics is exceedingly rare: T would be the only biological characteristic that is regulated in all of the Olympic sport of Athletics. This fact does not determine an outcome, but it should set a high bar for approving any such regulatory action.

Are the Male/Female Classifications in Athletics Regulated According to Biology?

This is tricky. The answer however is clearly "no."

The T regulations are an effort to regulate the male/female classification according to a biological characteristic. But at the moment, and for much of recent years, there has been no such regulation in place. Thus, the male/female classification is not regulated at present according to biological characteristics.
  • Do men and women have different biological characteristics? Of course.
  • Are men typically faster and stronger? Of course.
Male and female are genders with a strong, but not perfect, correlation with the biology of sex. One important reason for this imperfect correlation is that male and female are discrete categorizations in sport competition, whereas the biology of sex is not discrete. 

Taylor observes that the parties to the Chand case (2015, here in PDF) all agreed that it is appropriate to distinguish male and female classifications because males enjoy such a performance advantage that virtually all females would be excluded from elite competition. This issue need not be debated.

But none of this helps in resolving questions about T regulation. The challenge at hand is to determine eligibility for participation in male and female classifications. To state that males and females compete in different classifications is simply to set the stage. We should beware circular reasoning. 

Right now, society outside of sport does all the work of determining who is female and who is male for purposes of elite sports competition. This work embodies complex social processes that integrate considerations of biology, culture, politics, law and more into determining who gets classified as female and who as male. IAAF is not satisfied with how society is doing this work and is seeking to create its own regulations. (As a comparison, not long ago the sport of gymnastics decided that it was unhappy with how society was accounting for the ages of its athletes and so internalized age certification as a regulatory process.)

But make no mistake, at present neither males nor females are classified according to biological characteristics by the IAAF. That is what the proposed regulations are about. 

Does Science Distinguish Between Males and Females?

The short answer is "no." There is simply no single biological characteristic -- chromosomes, hormones, whatever -- that uniquely and unambiguously distinguishes the biological sexes. This point is not particularly controversial, even by IAAF.

However, IAAF appears to be somewhat conflicted on this topic. The issues here are not male vs female, but female vs. female. This is clearly explained in the CAS Chand decision which (at 51) noted: "the Regulations do not police the male/female divide but establish a female/female divide within the female category."

The key issue here is not whether some females have biological characteristics more typically found among males, but whether those specific biological characteristics are clearly associated with a performance difference between females of a magnitude similar to those typically observed between males and females. Please read that sentence again.

In his response, Taylor introduces a biological characteristic into the debate that is not mentioned in the IAAF T regulations: testes. He writes:
the physical advantages enjoyed by male athletes are due to the fact that they have testes that produce testosterone in amounts that circulate in serum in the range 7.7 to 29.4 nmol/L, whereas female athletes have ovaries that produce much lower levels of testosterone, in the range 0.12 to 1.79 nmol/L. (RP: Based on a forthcoming, but not yet available paper by Handelsman et al.)
Males have testes, females have ovaries. OK, got it. Taylor then writes:
Due to conditions referred to as ‘differences in sex development’ (most often, 5-α reductase deficiency, or partial androgen insensitivity), an XY baby’s testes may not descend from the abdomen, so that it presents on birth with female or ambiguous genitalia, and so may be assigned the female sex. At puberty, however, the testes start producing the much larger levels of testosterone mentioned above, which (unless the XY female is completely androgen-insensitive) will have an androgenising effect on her body and will increase her circulating haemoglobin, in the same way as happens to an XY male at puberty.
So we have an individual with a "condition" called DSD ("differences in sex development") who has testes but "may be assigned the female sex." At puberty her body responds the same way as that of an XY male. So is Taylor implying that she is actually a male (i.e., with testes)?

Taylor further writes:
the ‘natural physiology’ of most DSD athletes includes male gonads (testes) that produce levels of circulating testosterone not in the normal female range (0.12 to 1.79 nmol/L in serum) but in the normal male range (7.7 to 29.4 nmol/L), producing (if the athlete is not androgen-insensitive) lean body mass and levels of circulating haemoglobin well above the normal female range and rather in the normal male range.
The language here is important (and confused). "Male gonads" -- can individual body parts have their own genders? Can a woman have male body parts? Can a man have female body parts? If IAAF wants a gonad policy they should call it a gonad policy. The presence of gonads/testes is completely irrelevant in the proposed regulations, as it is focused on testosterone levels.

This issue is important because the proposed IAAF regulations stress that they are not seeking to classify athletes as male or female:
These Regulations exist solely to ensure fair and meaningful competition within the female classification, for the benefit of the broad class of female athletes. In no way are they intended as any kind of judgement on or questioning of the sex or the gender identity of any athlete.
Taylor's introduction of testes would seem to betray this claim. He writes:
If it is not fair and meaningful for a female athlete to have to compete with a male athlete whose gonads produce 10-30 times more testosterone than she does, so too it is not fair and meaningful for that female athlete to have to compete with a DSD athlete whose gonads also produce 10-30 times more testosterone than she does.
The IAAF regulations explain that a female athlete who does not meet the regulatory standard "will not be eligible to compete in the female classification in a Restricted Event at an International Competition" but would be eligible to compete in the male classification.

If this is not sex testing and classification according to physical characteristics, I don't know what is.

What about Performance?

Taylor's response emphasizes biological differences and says very little about performance or how it is related to testosterone (presumably because he was writing a response, so fair enough). However, performance is absolutely essential to the IAAF case.

The CAS ruled against IAAF in the Chand case because the evidence available did not support the claim that high testosterone levels in certain female athletes were associated with a difference in performance between these women and other women that was similar to the difference between male and female.

CAS explained (527):
The Panel considers the lack of evidence regarding the quantitative relationship between enhanced levels of endogenous testosterone and enhanced athletic performance to be an important issue. While a 10% difference in athletic performance certainly justifies having separate male and female categories, a 1% difference may not justify a separation between athletes in the female category, given the many other relevant variables that also legitimately affect athletic performance. The numbers therefore matter. 
Because the performance numbers matter, levels of testosterone (or, unmentioned in the regulations, the presence of testes) are by themselves irrelevant. CAS judged that it is only if high levels of testosterone can be associated with a performance advantage of the order enjoyed by men over women that regulation might make sense.

CAS further explained (528):
However, in order to justify excluding an individual from competing in a particular category on the basis of a naturally occurring characteristic such as endogenous testosterone, it is not enough simply to establish that the characteristic has some performance enhancing effect. Instead, the IAAF needs to establish that the characteristic in question confers such a significant performance advantage over other members of the category that allowing individuals with that characteristic to compete would subvert the very basis for having the separate category and thereby prevent a level playing field. The degree or magnitude of the advantage is therefore critical.
This is where things get a bit sticky.

Upon receiving this judgment IAAF sought to commission research on the relationship of testosterone and performance. Rather than invite independent researchers to conduct such research, IAAF conducted it internally. This approach is clearly problematic because IAAF, as an interested party in the outcome, can hardly be called independent. Thus, IAAF handicapped itself from the outset.

The resulting research (much discussed on this blog) is Bermon and Garnier (2017). Not surprisingly, IAAF claims that its results support further regulation of testosterone. A close look doesn't really support this claim.

The most striking conclusion of this paper -- taking it at face value -- is that the resulting statistics come nowhere close to the 10% difference in athletic performance cited by CAS as an appropriate basis for regulation. In fact, the paper found no performance difference worth regulating in 19 of 23 athletic events in which women compete.

Think about that. After all of the talk of the overwhelming importance of testosterone to athletic performance, an internal IAAF study designed to look for such differences could not justify testosterone regulations for almost all women's events. Clearly, testosterone is not the magical athletic elixir claimed by some.

Of the four events that IAAF decided to regulate (400m, 400mH, 800m and 1500m), the Bermon and Garnier study found performance differences between the highest and lowest tertiles to be, respectively for each event: 1.5%, 3.1%, 1.6% and 0.3% (from Table 6). Only the first 3 were claimed to be statistically significant differences. These differences are similar to those that led CAS to suspend the original IAAF regulations at dispute in the Chand case, and far removed from 10%.
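For a rough sense of scale, here is a minimal Python sketch that sets the four tertile differences quoted above against the 10% figure CAS used as its illustration. The percentages are the ones cited from Table 6; the variable and function names are my own, and the comparison is only illustrative, not a test CAS itself prescribes.

```python
# Tertile performance differences (%) quoted above from Table 6 of BG17.
bg17_diff_pct = {"400m": 1.5, "400mH": 3.1, "800m": 1.6, "1500m": 0.3}

# Illustrative benchmark taken from CAS paragraph 527 (male vs female categories).
CAS_BENCHMARK_PCT = 10.0


def share_of_benchmark(diff_pct, benchmark_pct=CAS_BENCHMARK_PCT):
    """Return a performance difference as a fraction of the benchmark difference."""
    return diff_pct / benchmark_pct


shares = {event: share_of_benchmark(d) for event, d in bg17_diff_pct.items()}
largest_event = max(bg17_diff_pct, key=bg17_diff_pct.get)

for event, d in bg17_diff_pct.items():
    print(f"{event:>6}: {d:.1f}% is {shares[event]:.0%} of the 10% benchmark")
```

Even the largest of the four differences (400mH) is less than a third of the benchmark.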

Given these numbers, it is surprising that IAAF has sought to again implement regulations that were previously unsuccessful at CAS. There seems to be no case here. Perhaps IAAF has some additional science in its back pocket.

Finally, on performance data, a last note. Along with Ross Tucker and Erik Boye, I have requested the underlying performance data of Bermon and Garnier. This is a normal request in research and should be expected of anyone who publishes peer-reviewed research. Thus far IAAF has not released the data. This is deeply troublesome. We have engaged the journal's editor and will push this as far as it takes. As CAS explains, the numbers matter.

Bottom Line

It is very good to see IAAF (or its representatives) engaging in public. This is good for sport governance, for athlete rights and for the effective role of evidence in decision making. In this instance, I applaud Jonathan Taylor for his lengthy defense of the newly proposed IAAF regulations. He provides a further window into their basis and justification. They also raise some important issues worthy of further debate and discussion.

Monday, May 14, 2018

Wisdom on College Hoops from Carlon Brown

This series of comments below from Carlon Brown (@carlonautentico) on Twitter offers a fantastic perspective on what college basketball prepares an athlete for and what it does not. Brown played at the University of Colorado and professionally in the US and overseas.

It'd be great to get him to my class next year. The perspective below is smart; have a read.











Saturday, May 12, 2018

Reverse Engineering Bermon & Garnier (2017)

Last week, Ross Tucker, Erik Boye and I wrote a letter to the British Journal of Sports Medicine calling for the authors of Bermon and Garnier (2017, BG17) and their sponsor IAAF to release the performance data used in their study. You can read our letter here.

Professor Joe Guinness, a statistician and visiting assistant professor at Cornell (@joeguinness) has attempted to reproduce the reported performance results in BG17 for the women's 800m, which we discuss in our letter.
BG17 report an average time of 121.80 seconds with a standard deviation of 5.42 seconds, for 64 times included in the analysis. Prof. Guinness sought to reproduce these numbers by brute force (his code is linked in the Tweet above).

He has found that he can only come close to reproducing the times by removing Caster Semenya's 2011 time plus that of one other athlete. See his results above. He notes in a Tweet: "There are some caveats here, especially how rounding is dealt with, so this shouldn’t be taken as definitive."
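To give a sense of what a brute-force search of this kind involves, here is a minimal Python sketch. It is not Prof. Guinness's code, and the times in it are made-up toy values, not race results. The function name `matching_removals` and the rounding tolerance are assumptions for illustration only.

```python
from itertools import combinations
from statistics import mean, stdev


def matching_removals(times, n_remove, target_mean, target_sd, tol=0.005):
    """Return index tuples whose removal reproduces the reported mean and SD.

    tol=0.005 treats the targets as values reported to two decimal places.
    The sample standard deviation (n - 1 denominator) is an assumption.
    """
    matches = []
    for drop in combinations(range(len(times)), n_remove):
        dropped = set(drop)
        kept = [t for i, t in enumerate(times) if i not in dropped]
        if (abs(mean(kept) - target_mean) <= tol
                and abs(stdev(kept) - target_sd) <= tol):
            matches.append(drop)
    return matches


# HYPOTHETICAL toy data (seconds), not real 800m results.
toy_times = [118.0, 119.5, 120.0, 121.0, 122.5, 124.0, 130.0]

# Hypothetical "reported" statistics for the 5 times that were kept.
found = matching_removals(toy_times, n_remove=2, target_mean=121.40, target_sd=1.85)
print(found)
```

In this toy case only one pair of removals matches both reported statistics. If the real candidate pool held 66 times, removing two to leave 64 would mean checking 2,145 pairs, which is cheap. As the caveat in the Tweet suggests, the result is sensitive to how rounding is handled (the `tol` parameter here).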

If these numbers are correct then it would mean that Caster Semenya's time was removed while 2 times from Mariya Savinova in 2011 and 2013 would remain. Savinova's times have officially been removed from the IAAF database after she was suspended for doping at both the 2011 and 2013 World Championships.

The inclusion of Savinova alone would call into question the meaningfulness of BG17, and the deletion of Semenya's time would be curious. Of course, we cannot be sure about any of this until IAAF and BG17 release their data.

The longer the stonewall, the more questions will be raised as to why they just don't release the data. Are there some things in their work that they are afraid to show?

Wednesday, April 25, 2018

Some Resources on Testosterone Regulation in Elite Athletics

It appears that the IAAF is on the verge of announcing another set of regulations governing allowable natural testosterone levels in women athletes. This is a bad idea. In anticipation of the new regulations I thought I'd post up some resources for those who are interested in the issue.

The regulation of testosterone is only the latest effort by sports administrators to police how women should look. There are countless biological characteristics of humans that in some way contribute to elite athletics performance -- testosterone in women (but not in men) is the only naturally occurring biological characteristic that is regulated.

I take on this issue in some depth in this paper
Pielke Jr, R. (2017). Sugar, spice and everything nice: how to end ‘sex testing’ in international athletics. International Journal of Sport Policy and Politics, 9:649-665. (PDF, free to read)
Remarkably, in 2011 the IAAF listed a set of criteria for how women should appear, lest they be reported to officials for investigation of their testosterone level. These criteria are listed in the slide below (from a talk I give on this subject). Two of the nine criteria have to do with breast size and shape.
More generally, let's say that you accept the argument that testosterone should be regulated. I don't, but let's play along. Even here, the science relied on by the IAAF does not support the case that they are making.

The IAAF bases its case on this paper:
Bermon, S., & Garnier, P. Y. (2017). Serum androgen levels and their relation to performance in track and field: mass spectrometry results from 2127 observations in male and female elite athletes. British Journal of Sports Medicine
That paper purports to show that women in certain events gain a benefit from testosterone levels in the higher end of the range found in female athletes. 

That paper has received a range of criticism as being flawed. Notably:
Franklin S, Ospina Betancurt J, Camporesi S What statistical data of observational performance can tell us and what they cannot: the case of Dutee Chand v. AFI & IAAF Br J Sports Med Published Online First: 23 February 2018. doi: 10.1136/bjsports-2017-098513
That paper concludes:
we believe that it is scientifically incorrect to draw the conclusions in the Bermon and Garnier paper from the statistical results presented. Their paper claims that certain athletes have an advantage in precisely the five events where a significant effect was found: we calculate that a high share of those five significant effects are likely to be false positives.
Andrew Gelman wrote of the paper:
the statistical analysis data processing in this paper is such a mess that I can’t really figure out what data they are working with, what exactly they are doing, or the connection between some of their analyses and their scientific goals.
Gelman was motivated by Simon Franklin, a post-doc at LSE, who emailed him that:
There are more than a few problems with the paper, not least the fact that it makes causal claims from correlations in a highly selective sample, and the bizarre choice of comparing averages within the highest and lowest tertiles of fT levels using a student t-test (without any other statistical tests presented).

But most problematic is the multiple hypothesis testing. The authors test for a correlation between T-levels and performance across a total of over 40 events (men and women) and find a significant correlation in 5 events, at the 5% level. They then conclude:
Female athletes with high fT levels have a significant competitive advantage over those with low fT in 400 m, 400 m hurdles, 800 m, hammer throw, and pole vault.
These are 5 events for which they found significant correlations! And we are lead to believe that there is no such advantage for any of the other events.
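The multiple-testing concern in the email above can be illustrated with a short Python sketch. It rests on simplifying assumptions that are mine, not Franklin's or BG17's: exactly 40 tests (the email says "over 40"), every test independent, and no true effect in any event. Under those assumptions the number of "significant" results follows a binomial distribution.

```python
from math import comb


def prob_at_least(k, n, alpha):
    """P(X >= k) for X ~ Binomial(n, alpha)."""
    below_k = sum(comb(n, i) * alpha**i * (1 - alpha)**(n - i) for i in range(k))
    return 1.0 - below_k


N_TESTS = 40   # assumed round number; the email says "over 40 events"
ALPHA = 0.05   # significance level cited in the email

# Expected count of significant results if no event had any real effect.
expected_false_positives = N_TESTS * ALPHA

p_any = prob_at_least(1, N_TESTS, ALPHA)        # at least one false positive
p_five_plus = prob_at_least(5, N_TESTS, ALPHA)  # five or more false positives

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least 1): {p_any:.3f}")
print(f"P(5 or more):  {p_five_plus:.3f}")
```

With 40 tests at the 5% level, about two significant results would be expected from chance alone, which is why five significant events out of more than 40 cannot simply be read as five real effects. The independence assumption is a strong one, so these numbers are indicative only.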
I also have written two critiques. First, a post-publication peer review:
My bottom line: The paper has some significant methodological issues, most notably the inclusion of female athletes who doped with those with naturally high levels of T. There is some double counting of athletes in 2011 and 2013. There is also speculation that the male findings are contaminated by doping. Methodological issues notwithstanding, the paper nonetheless strongly reinforces the 2015 CAS Chand decision. 

The IAAF data of Bermon and Garnier (2017) don't support the proposed regulations of testosterone in women at distances of 400m to one mile. Consider the figure below:
Let's accept the analysis as valid (maybe not, but let's play along). These IAAF data (pink bar) indicate that over distances of 400m, 800m and 1500m high testosterone women are on average 1.1% faster than their low testosterone counterparts. Unfair, IAAF might scream.

But look at the data that IAAF collected for men at 400m and 1500m (blue bar). These data indicate that high testosterone men are on average 1.1% faster than their low testosterone counterparts. Surely if high T in women in selected events where performance differs is to be regulated, then high T in men in selected events where performance differs is also to be regulated?

If IAAF responds that the T standard applies only to women but not men based on performance data, then this is the very hallmark of sex discrimination. This only scratches the surface of flawed T regulation.

We shall see what IAAF actually presents tomorrow. However, based on the evidence and arguments that IAAF have presented thus far, its T regulations are focused on one athlete (initials CS), discriminatory, sexist and (for those who think analysis of T levels in athlete performance is relevant) resting on a flawed evidence base.

There can be little doubt that this new policy will be challenged at CAS.

Monday, April 16, 2018

Six-Figure Salaries in the US NGBs Reported in 2016 IRS 990s

The figure above was motivated by a column by Sally Jenkins in the Washington Post a few weeks ago in which she reported that the USOC pays 129 staff members more than $100k per year. I was curious how that statistic looks for the 47 Olympic National Governing Bodies.

One way to take a look at that question is to dive into the 2016 (most recently reported) IRS 990 forms required for non-profits and sum up all the highly-paid employees reported on those forms. We identified 184 individuals on the 990s with compensation levels above $100,000. This number is surely an underestimate as not all such salaries are reported on the 990s. In addition, there are many subcontracts and transfers reported on the 990s to other non-profits or businesses for which it is impossible to identify salaries. US Soccer, for instance, transferred some $60+ million and awarded USSF employees unspecified bonuses. Even so, the reported numbers tell us something.

Summary stats:

  • Number of salaries between $100k and $150k = 56
  • $150k - $200k = 46
  • $200k - $250k = 33
  • $250k - $500k = 39
  • >$500k = 10
  • >$1M = 3 (included in the >$500k count)
All told, these salaries for the 184 employees total $44.7 million and represent just about 4% of the total NGB budgets.
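As a quick arithmetic check on the summary stats, here is a small Python sketch. The counts and totals are the ones reported above. The reading that the three salaries above $1M sit inside the ">$500k" band is an assumption, made because the five bands then sum to exactly 184. The average salary and implied total budget are back-of-the-envelope derivations from the rounded figures, not numbers taken from the 990s.

```python
# Salary band counts as listed above.
band_counts = {
    "$100k-$150k": 56,
    "$150k-$200k": 46,
    "$200k-$250k": 33,
    "$250k-$500k": 39,
    ">$500k": 10,  # assumed to include the 3 salaries above $1M
}

TOTAL_PAY_MUSD = 44.7    # total reported compensation, in $ millions
SHARE_OF_BUDGETS = 0.04  # "just about 4%" of total NGB budgets

total_employees = sum(band_counts.values())

# Average compensation across the identified employees, in dollars.
mean_pay_usd = TOTAL_PAY_MUSD * 1e6 / total_employees

# Total NGB budgets implied by the rounded 4% share, in $ millions.
implied_budgets_musd = TOTAL_PAY_MUSD / SHARE_OF_BUDGETS

print(f"employees: {total_employees}")
print(f"average compensation: ${mean_pay_usd:,.0f}")
print(f"implied total NGB budgets: ${implied_budgets_musd:,.1f} million")
```

The bands add up to the 184 individuals reported, with an average of roughly $243k each and implied combined NGB budgets on the order of $1.1 billion.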

Please send along comments, corrections and data requests via Twitter @rogerpielkejr.