Tuesday 7 June 2011

Kanazawa [updated]

I've been a fan of Satoshi Kanazawa's; I've long regretted that I didn't know he was at Canterbury for the couple of months we overlapped there.

Not entirely surprisingly, he's gone and annoyed some folks. This time it's looking serious. The Add Health data series interviews high schoolers in three waves, making a nice panel data set. One question asks the interviewer to rate the respondent's attractiveness on a five point scale. Satoshi ran some regressions on attractiveness and found racial differences in means after correcting for possible confounds like weight; black women, but not men, were found less attractive in the surveys. He then speculated about whether testosterone levels might account for the result. His blog post, as usual, was pretty blunt about what he'd found; it's mirrored here as it's now been pulled.

The pile-on has been pretty brutal. He's been called a racist for finding data suggesting black women are less attractive than white or Asian women; I'm not sure whether he's a racist also for finding data suggesting that there are no big racial differences in attractiveness among men.

Here's Huffington calling him a racist.

Lindsay Beyerstein is less than charitable in her interpretation of Kanazawa's stats. She gets the last wave of Add Health data and says that the difference disappears by Wave Four, which she takes as raising troubling questions about Kanazawa's bias. I'd say rather more likely, he just had the first three waves' data sitting on his hard drive; getting the fourth wave would have been a pain in the arse for a blog post, so he just used the data at hand.

Hank Campbell is no more generous, with lots of snarky scare quotes about what factor analysis is. Because three interviewers rated respondent attractiveness at different points in time, you need to draw some summary statistic out of the three observations. I'd have just gone with a straight average, maybe weighted towards the latter waves when the respondents were older. Kanazawa ran a factor analysis instead. The difference between the two isn't going to be great - factor analysis will try to extract some common underlying measure from the three observations, making the weights across waves endogenous. But Campbell likes to say 'factor analysis' with the scare quotes to make it seem dodgy.
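For what it's worth, the gap between the two summaries is easy to see on toy data. Here's a minimal sketch - synthetic numbers, not Add Health, and the noise scale and the one-factor shortcut (scoring on the leading eigenvector of the correlation matrix) are my own assumptions - showing that with three roughly parallel ratings, a one-factor score and a straight cross-wave average come out nearly interchangeable:

```python
# Toy illustration (synthetic data, not Add Health): compare a plain
# cross-wave average of three attractiveness ratings with a one-factor
# score extracted from the same three ratings.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
latent = rng.normal(size=n)                      # underlying attractiveness
waves = np.column_stack([latent + rng.normal(scale=0.8, size=n)
                         for _ in range(3)])     # three noisy interviewer ratings

# Summary 1: straight average across waves
avg_score = waves.mean(axis=1)

# Summary 2: quick one-factor extraction -- score on the leading
# eigenvector of the correlation matrix (a principal-factor shortcut)
z = (waves - waves.mean(axis=0)) / waves.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
loadings = eigvecs[:, -1]                        # largest eigenvalue comes last
factor_score = z @ loadings

# The two summaries are nearly interchangeable here
r = abs(np.corrcoef(avg_score, factor_score)[0, 1])
print(round(r, 3))
```

With equal measurement error across waves, the loadings come out roughly equal, which is why the factor score tracks the plain average so closely; real data with noisier early waves would tilt the weights - the endogenous weighting mentioned above.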

The first thing I'd thought when I saw the controversy was that OK Cupid recently put up data noting that black women get far fewer messages from other OK Cupid members than they ought to; this was potentially consistent with Kanazawa's story (or with other equally plausible ones). But Campbell calls Hontas Farmer a racist for citing that data.

Now, Huliq reports Kanazawa's lost his blog slot at Psychology Today (one wonders how long Walter Block will last).

Scientific American wonders whether Add Health should be collecting data on interviewer-rated attractiveness:
I am disturbed by the fact that the Add Health study's adult researchers even answered the question of how attractive they rated these youth.
Never mind that a ton of research on kids' social capital would draw on measured attractiveness as a potential explanatory variable; apparently it's better to make things unknowable than to risk disturbing findings.

The Daily Mail insinuates that Kanazawa's a racist for his prior work suggesting IQ might be responsible for some poor health outcomes in Africa, and cites an LSE colleague calling for his firing.
It is not the first time that Dr Kanazawa, 48, a lecturer within the department of management at the LSE, has been accused of peddling racist theories.
In 2006 he published a paper suggesting the poor health of some sub-Saharan Africans is the result of low IQ, not poverty.
Professor Paul Gilroy, a sociology lecturer at the LSE, said: ‘Kanazawa’s persistent provocations raise the issue of whether he can do his job effectively in a multi-ethnic, diverse and international institution.
‘If he announces that he thinks sub-Saharan Africans are less intelligent than other people, what happens when they arrive in his classroom?’
He added: ‘The LSE risks disrepute if it fails to take a view of these problems.’
Here's Linda Gottfredson on IQ and health; here's Garrett Jones on IQ and economic outcomes.

Britton at Scientific American, linked above, raises a lot of better questions about whether Kanazawa's findings would stand up to more thorough investigation; so does Robert Kurzban. But it was a freaking blog post! Blog posts are where you put up initial data exploration and speculation to bat things around and see whether it's worth more thorough investigation. If you disagree with the analysis on a blog post, you write up your own post on why you think it was wrong or how it could be done better (like Kurzban); calling for Kanazawa's firing borders on a witch hunt.

The LSE beclowns itself if it sanctions Kanazawa for this particular blog post.

I take more seriously Andrew Gelman's critique of Kanazawa's published work. It's fine to be wrong; it's a bit worrying that Kanazawa posted subsequently on that study without noting or addressing the critique (I'll take Gelman's word on this). [See below] As for how much weight I'd put on the soon-to-be-released open letter of a bunch of psychologists castigating Kanazawa, well, I'd want to know what kind of Sneetches they are.

Michael Mills's "Seven Things Satoshi Kanazawa Cannot Blog About" is a must read...

Update: I've read Gelman's critique in more depth. Gelman's an excellent statistician. But some of the criticisms lodged there would apply to a reasonably high proportion of published empirical work. Endogeneity issues are everywhere; damning everyone who's ever had potential endogeneity / reverse causality problems in their published work would be a bit broad. And failing to adjust significance tests for the potential number of comparisons (as a guard against data mining) - I have a hard time thinking of many published pieces that have done that, other than the metastudies that say we can't trust any empirical work.

Gelman's specific (and not at all unreasonable) worry on datamining is that Kanazawa's work on whether more attractive couples have more daughters tests whether the most attractive couples have more daughters than all others; equally plausible would be tests of whether the least attractive couples had the fewest daughters, the top two categories of attractive couples against the rest, and so on. XKCD summarized the problem here. But subsequent work with a different data set found the same result; matching the prior paper's result via datamining would then have taken mining across different datasets until finding the one that gave the best match, and I'm not sure there are all that many datasets that include attractiveness data.
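A quick simulation makes that worry concrete. This is an illustrative sketch only - synthetic data, and the four candidate cut points, the sample size, and the two-proportion z-test are my own choices, not anything from Kanazawa's or Gelman's papers. Under a null where attractiveness has nothing to do with a child's sex, committing to one dichotomization in advance rejects at about the nominal 5%, while keeping the best of several candidate dichotomizations rejects noticeably more often:

```python
# Simulation of the data-mining worry (illustrative only): under a null
# where a five-point attractiveness score is unrelated to a child's sex,
# trying several dichotomizations and keeping the best p-value rejects
# far more often than the nominal 5% rate.
import math
import numpy as np

rng = np.random.default_rng(1)

def ztest_prop(sons_a, n_a, sons_b, n_b):
    """Two-sample proportion z-test; returns a two-sided p-value."""
    p_pool = (sons_a + sons_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (sons_a / n_a - sons_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

n, sims, cuts = 3000, 500, [1, 2, 3, 4]    # cut c: categories above c vs. rest
single_rejects = mined_rejects = 0
for _ in range(sims):
    attract = rng.integers(1, 6, size=n)   # ratings 1..5, independent of sex
    son = rng.random(n) < 0.5122           # population sex ratio at birth
    pvals = []
    for c in cuts:
        hi = attract > c
        pvals.append(ztest_prop(son[hi].sum(), hi.sum(),
                                son[~hi].sum(), (~hi).sum()))
    single_rejects += pvals[-1] < 0.05     # one pre-committed cut (top vs. rest)
    mined_rejects += min(pvals) < 0.05     # best of the four cuts
print(single_rejects / sims, mined_rejects / sims)
```

The inflation is milder than four independent tests would give, since the cuts are correlated, but it's exactly the garden of forking paths Gelman has in mind.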

Finally, Gelman (2007) critiques Kanazawa's earlier (2005) work for missing that there are potential problems in using number of daughters on the right-hand side of a regression equation and number of sons on the left if some couples use a stopping rule that aims at particular ratios. But Kanazawa's 2007 piece recognizes that issue. I'm not sure whether the prior pieces' results were sensitive to this specification issue, but I'm also not sure it's right to say, as Gelman implies, that Kanazawa then went on to do other work without taking due account of critics' views of prior work.
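To see why stopping rules matter, here's a toy simulation of my own construction - not from either paper - with couples who keep having children until a son arrives, capped at four births. Every birth is a fair draw, yet counts of sons and daughters per couple come out strongly negatively correlated, so a regression with daughters on one side and sons on the other picks up family planning rather than biology:

```python
# Illustration of the stopping-rule worry (simulation, not from either
# paper): couples have children until the first son, stopping after at
# most 4 births. Each birth is an independent 50/50 draw, yet the counts
# of sons and daughters per couple are mechanically dependent.
import numpy as np

rng = np.random.default_rng(2)
couples = 50_000
sons = np.zeros(couples, dtype=int)
daughters = np.zeros(couples, dtype=int)
for i in range(couples):
    for _ in range(4):                  # stop at first son or after 4 births
        if rng.random() < 0.5:
            sons[i] += 1
            break
        daughters[i] += 1

# Per-birth sex ratio is still about 0.5 ...
per_birth = sons.sum() / (sons.sum() + daughters.sum())
# ... but per-couple counts of sons and daughters are strongly
# negatively correlated, purely from the stopping rule.
r = np.corrcoef(sons, daughters)[0, 1]
print(round(per_birth, 3), round(r, 3))
```

Nothing biological is going on in that simulation; the correlation is entirely an artifact of how families decide to stop, which is the specification problem Gelman flags.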


  1. I don't quite see how Kanazawa can be called racist for presenting data. Data just is; in its raw form it can't be racist. If it later comes to light that he manipulated the data or its interpretation in such a way as to imply that African American women are considered less attractive, then he could quite rightly be given the racist tag. Until then we have to presume he is simply presenting an unusual finding. Simply because findings such as these are unpalatable to some certainly does not make the researcher inherently racist. There is a long history of scientists being at odds with the establishment (cf. Galileo, Darwin, etc.), and being able to challenge long held beliefs, or social norms, is part of what makes science so valuable.

    However, the assumption that this is caused by elevated testosterone levels in African Americans appears to be untested, and he may have been unwise to raise this as a possible reason. I doubt any woman likes to be referred to as mannish.

  2. I have to confess that Kanazawa's findings on attractiveness mesh *perfectly* with my own assessments of physical beauty among different races. I've always thought black men are quite attractive and black women rather unattractive, precisely because both strike me as masculine. (Of course these assessments are only averages. There are many black women I consider attractive and plenty of black men I consider unattractive.)

    On the other hand, I have the reverse assessment for East Asians: attractive women and comparatively unattractive men, because both strike me as more feminine than average. Kanazawa's findings don't support that.

  3. Well, the attractiveness measure in the original study is, if I remember correctly, a Likert-type scale, which is ordinal according to textbooks and metric according to how it's usually analyzed. When I first read the paper I thought, "oh, I guess the result wasn't significant when you use the measure properly (i.e., without dichotomizing much of the variance out). The author is not being honest with us." I read Gelman's criticisms concerning "multiple comparisons" as the polite way of saying the same thing. I mean, you wouldn't go and complain about the absence of multiple comparisons if Kanazawa had used the original linear variable, would you?

    When authors dichotomize the dependent or main independent variable, I almost always suspect they're trying to cheat a significant result out of the data. The last time this happened was this morning.

  4. @Lemmus Sure. But he found the same thing in two very different datasets running the same dichotomization, if I read things correctly.

  5. Not at all!

    From the original article (p.136):

    "At the conclusion of each in-home interview, the interviewer is asked to rate the respondent's physical attractiveness on a five-point ordinal scale (1 = very unattractive, 2 = unattractive, 3 = about average, 4 = attractive, 5 = very attractive). I use this five-point scale as a measure of respondents' physical attractiveness.

    "It is interesting to note that, across the entire sample, the objective measure of physical attractiveness that I use in the analysis below is very weakly, albeit statistically significantly (due to the large sample size), correlated with the self-rated 4-point scale of physical attractiveness (1 = not at all, 2 = slightly, 3 = moderately, 4 = very) (r = 0.0973, p < 0.0001, n = 14,760). More than a quarter (28.2%) of the respondents rate themselves as "very attractive," while only 11.2% of them are so rated by the interviewer.


    "3.4.1. Bivariate analysis

    "Fig. 1 shows the proportion of sons among the first children of Add Health respondents by their physical attractiveness. It is immediately obvious that the proportion of sons among the four lower classes of physical attractiveness (0.50 for "very unattractive," 0.56 for "unattractive," 0.50 for "about average," and 0.53 for "attractive") stay very close to the population average of 0.5122 (105 boys per 100 girls). The proportion among the "very attractive" respondents (0.44), however, appears substantially lower. A one-way analysis of variance shows that the proportion of sons and physical attractiveness are not statistically independent (F(4, 2965) = 2.55, p < 0.05).

    "If I dichotomize the respondents into those who are rated "very attractive" and everyone else, the difference in the proportion of sons between the two groups (0.52 vs. 0.44) is statistically significant (t = 2.44, p < 0.05). There appears to be something qualitatively different about respondents rated "very attractive"."

    The multivariate analyses are then performed with a dummy that is 1 if the interviewer rated the respondent as "very attractive" and 0 otherwise.

    From the second article (p. 354-55):

    "At age 7, the teacher of each NCDS respondent is asked to describe the child's physical appearance, by choosing up to 3 adjectives from a (highly eclectic) list of 5: "attractive," "unattractive," "looks underfed," "abnormal feature," and "scruffy and dirty." From these 3 responses, I create 2 dummies. Attractive = 1 if the child is described at all as attractive, 0 otherwise. Unattractive = 1 if the child is described at all as unattractive, 0 otherwise. In all, 84.3% of the children are described as attractive, while 11.7% are described as unattractive. Because the 2 dummies Attractive and Unattractive are mutually exclusive and nearly exhaustive, any effect I may find for one is likely a mirror image of that of the other."

    So, in the first study it's the top 11.2% in the "very attractive" category (vs. everyone else) who have more daughters and "there's something qualitatively different" about those people; in the second one it's the top 84.3% ("attractive") (vs. everyone else) who have more daughters. That's not the same thing. You could sort of replicate the second study using the data from the first, creating as your attractiveness measure the top four vs. the "very unattractive" category, but his description of the data strongly suggests that that comparison would yield no significant differences.

    Moreover, it appears that the longitudinal dataset that he uses in the second study contains measures of attractiveness during adolescence (p. 356). If I understand the theory that he tests correctly, these are much more relevant than attractiveness during childhood. This looks very much like another case of pick-and-choose.

    In case anyone else is still reading this thread, I should mention that both full texts are ungated and easily googled.

  6. Aw, shucks, my comment isn't there. I would like to blame Blogger, but it was late and I do remember closing some tabs. So, here's a short version, without quotations.

    It is not the same dichotomization at all. In the 2005 paper, the Likert-type measure is dichotomized; Kanazawa is quite clear about doing the dichotomization on the basis of the bivariate results rather than theory. A little over ten percent are classified as "very attractive", compared with everybody else, and found to have significantly more daughters.

    In the second study, all subjects that were classified as "attractive" by their teachers at age 7 are hypothesized and found to have significantly more daughters (first children only) at age 47.

    So, these are two very different measures. You could sort of replicate the findings from the second study using the first dataset by dichotomizing the attractiveness measure differently. My prediction is that you wouldn't find a significant result.

    Moreover, according to the underlying theory, attractiveness during childhood *per se* is uninteresting; what you really would like to know about is attractiveness at fertile ages (presuming that parents' attractiveness during fertile ages is a better predictor of offspring attractiveness during fertile ages). The second article states that the study used did measure subjects' attractiveness in adolescence, but this data was not used. That looks like another case of pick-and-choose.

  7. I'd have to look at the second paper again; if all the second paper had was a dichotomous measure but the proportion deemed attractive weren't far out from the proportion in the first paper, it's not pick and choose. I don't know though why data from later ages wouldn't have been used; that seems odd unless there's some reason to expect childhood attractiveness correlates more strongly with attractiveness in adulthood than do the gangly adolescent years.

  8. Yes, the second paper's measure pretty much came as dichotomous. By pick-and-choose, I was referring to the use of the age 7 variable over the others.

  9. @Lemmus: Your original comment just turned up in the spam filter.

    I will need to read both papers in more depth once I'm back from North America.

    I'm having a harder time seeing the big difference in findings though. In the first paper, the top tier in attractiveness had more daughters relative to the middle and bottom. In the second, the bottom tier (and a tiny middle tier that's effectively bottom too if 85% of the sample got a tick-mark for attractive) had significantly more sons relative to the upper (and what would have been middle if graded similarly to the first study).

    Was there any good reason given for picking the younger cohort measure?

  10. Well, Blogger moves in mysterious ways.

    As for your last question, not as far as I remember - he discusses these in terms of the validity of the measures he actually uses, saying that they correlate positively and substantively with the later measures. I should add as a caveat that I only gave the paper a quick onscreen read, so I may have missed something.

  11. The best answer he could have given would have been sample drop-off ... if lots of respondents were caught in the first wave but not the second. Will need to check.