Response to critics – Mapping the origin of Indo-European

Below is a copy of an email (14th Feb 2013) we sent to the authors of the GeoCurrents blog, whose webpage and YouTube video about our research contain a number of errors. In the letter we correct the major errors and call for a higher standard of academic debate on the relative merits of different lines of evidence for the origin of Indo-European. The authors of GeoCurrents said they would respond to our points on their website, but did not want to publish our letter to them in full. We therefore provide a copy of our letter below.

Dear Professors Pereltsvaig and Lewis,

We composed this email following your recent posts on your homepage and the talk you gave at Stanford (posted on YouTube), which criticise our Science paper on Indo-European. Our hope was to initiate a more productive debate that will move the field forward. In the last day or so, your email exchanges with Russell Gray suggest that we are already heading in that direction. Nevertheless, we still want to highlight a number of points we think are important.
In your talk, you draw attention to supposed errors or oversights in our methodology that you claim render it invalid. We welcome this level of scrutiny; however, many of the points you make misrepresent our paper and stem from misunderstandings of our data and model. To summarize the most important of these:

1. It is wrong to say that we did not distinguish cognates from borrowings. The lexemes in the dataset were compiled carefully according to the principles of the comparative method, with the source of the cognate judgement recorded in every case. The provenance is supported by a reputable published historical linguistic source wherever possible. Where this was unavailable, it was usually possible to make a judgement supported by evidence from near sister languages. Removing identifiable borrowings is part of this process.

We made use of the Dyen, Kruskal and Black wordlists to get the database started, but these required extensive recoding (about 26% of these lexemes ended up in different sets than those given by Dyen et al. – although it’s hard to count precisely, since splitting a cognate set of n+m into n and m could be counted as n or m changes).

2. It is wrong to say that we do not distinguish innovations from retentions. This is precisely what our model DOES do, and is a major advantage over previous distance-based approaches.

3. It is wrong to say that the paper was not reviewed by linguists or that they were ignored. The paper was reviewed by two linguists (as far as we can tell). Incidentally, they also raised the above two points, which we were able to respond to by inserting additional explanation in the SOM and highlighting previous papers where we address these issues in more detail.

4. Errors in the placement of individual languages suggest our model could do better, but do not undermine our main conclusions. The key question is whether any model assumptions, data considered (or not) or misplacements affect the main claims of our paper.

We hope to respond to the above points in more detail in an appropriate peer-reviewed forum.

A second purpose of this message was to express our concern at the nature of your attack on our paper and our integrity. We feel strongly that the ad hominem attacks on our motivations and integrity as scientists are inappropriate and counter-productive. Accusations like “worse than creationism” and “pathological rationalism” sound like the words of someone on a crusade of their own, and do not reflect the objectivity that we aspire to as scientists.

We are not in any way committed to one particular outcome or other. What we have been trying to do over the years is bring together new data (lexical and structural) and novel methods (not all of which assume a tree) to shed light on important questions in historical linguistics. We don’t expect our paper to be the last word on the question of Indo-European origins. If future modeling work turns out to conclusively support a steppe origin, we’d be as happy as anyone about that, and very keen to understand exactly what it was that made the difference.

To this end, we note that the bulk of your comments highlight areas where additional data or constraints could be imposed on the analysis. Personal attacks aside, this is exactly the kind of reaction we were hoping for and we are currently working to test the effects of these new data and modeling assumptions.
One happy outcome of the often ridiculous publicity our work received was that others are now working on this problem too – we know of several groups around the world. For example, there are new models that can infer cognates directly from lexical data (see this week’s PNAS early edition), there are new models that incorporate borrowing and ancestral states, and a new Sogdian dataset has been made available. We were unable to get Sogdian data at the time we were working on the paper. Subsequent to publication (and directly following the interest that publication aroused), Sogdian experts have made data available to us, and we look forward to running a reanalysis with these data included. Given the robustness of the analysis so far, we think it is unlikely that the main results will change, but we would be happy to be proven wrong. Our lab has also been working on adding other languages and adding complexity to the model, including migration along rivers – an extension which you mention in your talk.

It is clear that, like us, you think the question of Indo-European origins is an important one. And like us, you are passionate about using the data and tools we have available to us to draw inferences about human origins. We clearly have different opinions about how convincing conflicting lines of evidence are (we are not convinced by the linguistic palaeontology argument), but that should not stop us from engaging in constructive debate.

Yours sincerely,

Quentin Atkinson, Michael Dunn, Simon Greenhill and Russell Gray