No, Scientists Have Not Found the ‘Gay Gene’. The media is hyping a study that doesn’t do what it says it does.
This week, a team from the University of California, Los Angeles claimed to have found several epigenetic marks—chemical modifications of DNA that don’t change the underlying sequence—that are associated with homosexuality in men. Postdoc Tuck Ngun presented the results yesterday at the American Society of Human Genetics 2015 conference. Nature News was among the first to break the story, based on a press release issued by the conference organisers. Others quickly followed suit. “Have They Found The Gay Gene?” asked the front page of Metro, a London paper, on Friday morning.
Meanwhile, the mood at the conference has been decidedly less complimentary, with several geneticists criticizing the methods presented in the talk, the validity of the results, and the coverage in the press.
Ngun’s study was based on 37 pairs of identical male twins who were discordant—that is, one twin in each pair was gay, while the other was straight—and 10 pairs who were both gay. He analysed 140,000 regions in the genomes of the twins and looked for methylation marks—chemical Post-It notes that dictate when and where genes are activated. He whittled these down to around 6,000 regions of interest, and then built a computer model that would use data from these regions to classify people based on their sexual orientation.
The best model used just five of the methylation marks, and correctly classified the twins 67 percent of the time. “To our knowledge, this is the first example of a biomarker-based predictive model for sexual orientation,” Ngun wrote in his abstract.
The problems begin with the size of the study, which is tiny. The field of epigenetics is littered with the corpses of statistically underpowered studies like this one, which simply lack the numbers to produce reliable, reproducible results.
Unfortunately, the problems don’t end there. The team split their group into two: a “training set”, whose data they used to build their algorithm, and a “testing set”, whose data they used to verify it. That’s standard and good practice—exactly what they should have done. But splitting an already small sample means that the study goes from underpowered to severely underpowered.
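To see why the split hurts, here is a minimal sketch of the standard procedure. The pair counts come from the article; everything else—the 70/30 ratio, the seed, the placeholder data—is a hypothetical illustration, not the study’s actual protocol.

```python
import random

# Hypothetical stand-ins for the 47 twin pairs (37 discordant + 10 concordant).
# None of this is the study's real data.
pairs = list(range(47))
random.seed(0)
random.shuffle(pairs)

# An assumed 70/30 split: the training set builds the model; the testing set
# should be touched exactly once, for the final evaluation.
split = int(len(pairs) * 0.7)
training_set = pairs[:split]
testing_set = pairs[split:]

# With so few samples to begin with, each half is left with very little data.
print(len(training_set), len(testing_set))  # 32 15
```

A testing set of roughly 15 pairs means a single misclassified twin swings the reported accuracy by several percentage points.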
There’s also another, larger issue. As far as could be judged from the unpublished results presented in the talk, the team used their training set to build several models for classifying their twins, and eventually chose the one with the greatest accuracy when applied to the testing set. That’s a problem because in research like this, there has to be a strict firewall between the training and testing sets; the team broke that firewall by essentially using the testing set to optimise their algorithms.
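The danger of picking the model that scores best on the testing set can be demonstrated with a toy simulation. This is not the team’s analysis—every “model” below is pure noise with no relationship to the labels—yet best-of-many selection still produces an impressive-looking score.

```python
import random

random.seed(42)
n_test = 15  # a testing set about the size implied by a split of 47 pairs

# Hypothetical binary labels (e.g. gay/straight), generated at random.
labels = [random.randint(0, 1) for _ in range(n_test)]

def accuracy(preds):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / n_test

# Fifty "models" that are nothing but coin flips. Keeping whichever one
# scores highest on the testing set is exactly the firewall violation:
# the testing set is now steering model selection.
best = 0.0
for _ in range(50):
    noise_preds = [random.randint(0, 1) for _ in range(n_test)]
    best = max(best, accuracy(noise_preds))

# The selected "best" model typically scores well above the 50 percent
# expected of a truly uninformative classifier.
print(best)
```

The inflated score says nothing about how the chosen model would perform on genuinely new data—which is the only question that matters.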
With this strategy, you are likely to find a positive result through random chance alone: out of the original 6,000 regions, some combination of methylation marks will almost certainly be significantly linked to sexual orientation in a sample this small, whether or not those marks genuinely affect it. This is a well-known statistical problem that can be at least partly countered by running what’s called a correction for multiple testing. The team didn’t do that.
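The arithmetic behind multiple testing is simple. The 6,000 regions come from the article; the significance threshold of 0.05 and the Bonferroni method are assumptions for illustration—other corrections (such as Benjamini–Hochberg) exist, and the article doesn’t say which, if any, would have been appropriate here.

```python
# Hypothetical illustration of why a correction is needed when testing
# 6,000 regions at once, assuming the conventional 0.05 threshold.
n_tests = 6000
alpha = 0.05

# If no region is truly associated with the trait, the uncorrected
# threshold still flags this many regions by chance on average:
expected_false_positives = n_tests * alpha  # 300.0

# A Bonferroni correction shrinks the per-test threshold so the chance
# of even one false positive across all 6,000 tests stays near alpha:
bonferroni_threshold = alpha / n_tests  # ~8.3e-06

print(expected_false_positives, bonferroni_threshold)
```

In other words, without a correction you would expect hundreds of spurious “hits” from 6,000 tests even if methylation had nothing to do with sexual orientation at all.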