Thursday, 8 June 2017

Open data - Free the CURFs

Koordinates hosted a fun event yesterday on open data.

Ed Corkery, Koordinates CEO, opened with a talk on the potential of open data in GSS-space, once it's actually opened up in useable form. Too much data winds up being inaccessible or unusable; Koordinates works to try to make that data more easily used.

Statistics NZ's Government Statistician, Liz MacPherson, presented next. She has a great vision for where open data should be. We're not there yet, but I like the kind of future she's describing. In that world, IDI users share all of their code. That means you don't just get replicability and a lot more potential for error-catching; you also get standardised bits of code that can be dropped into projects. So if one person's already written the code that matches, for example, students' NCEA records to later income tax data, somebody else can just grab that bit of code rather than have to re-create it. There's recognition that Stats needs to be careful not to break its current social licence as a trusted repository of data, but also that there are ways of doing that while being far more open than Stats has been.

It's a work in progress, with all kinds of real technical hurdles. Antiquated back-end systems generate reports that turn into the current tables, with dependencies all over the place, making it tough to shift towards the more flexible and dynamic environment that would allow cross-tabs to be generated on the fly. Get far enough into that world, and you don't even need Confidentialised Unit Record Files any longer. Instead you can confidentialise on the fly, scaling the protection to the risk of deanonymisation given the kind of data being extracted.
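For the curious, here's a rough sketch of what confidentialising a cross-tab on the fly might look like. It is purely illustrative and not a description of anything Stats NZ runs: the function names, the suppression threshold of 6 and the rounding base of 3 are all my own assumptions. The idea is that the riskiest cells (very small counts) are suppressed outright, while the rest are randomly rounded so that exact counts are never released.

```python
import random
from collections import Counter


def crosstab(records, row_key, col_key):
    """Count records for each (row value, column value) pair."""
    return Counter((r[row_key], r[col_key]) for r in records)


def random_round(count, base=3, rng=random):
    """Randomly round a count to a neighbouring multiple of `base`."""
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    # Round up with probability remainder/base, so the expected value
    # of the rounded count equals the true count.
    if rng.random() < remainder / base:
        return lower + base
    return lower


def confidentialise(table, min_cell=6, base=3, rng=random):
    """Suppress small cells (as None) and randomly round the others.

    `min_cell` and `base` are hypothetical settings chosen for illustration.
    """
    safe = {}
    for cell, count in table.items():
        if count < min_cell:
            safe[cell] = None  # too few people in the cell: suppress it
        else:
            safe[cell] = random_round(count, base, rng)
    return safe


# Made-up records: seven people in one cell, two in another.
records = [{"region": "A", "band": "low"}] * 7 + [{"region": "B", "band": "high"}] * 2
table = crosstab(records, "region", "band")
print(confidentialise(table))
```

A real system would presumably tune the threshold and rounding to the sensitivity of the variables being crossed, which is where the "scaling to risk" part comes in.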

But things are moving.

One easy thing that Statistics NZ could do, as an interim measure while everything else is going on, would be to open up the CURFs. And that's a good chunk of what my talk focused on. I just can't see any good reason that these things are still locked up behind difficult access barriers when America's PUMS are all available to anyone in the world who has a browser.

Liz described me in her talk as one of Stats NZ's most vigorous critics. But I love Stats NZ. I just get frustrated that the CURFs have been locked up forever, and that what we're able to do here is so far behind what can be done with American data because of the access controls. And that's especially frustrating when so much of the NZ data held in IDI is so much better than what's available in the American data. It'll take time to sort things out on the back-end so that we can get front-end interfaces that match what IPUMS is already doing with American data (and that Berkeley's SDA engine has been doing for ages now), and some of that is unavoidable where there are resource constraints and a pile of old systems that need sorting out.

But why not open up the CURFs in the interim? It would also help signal the change in approach at Stats, in line with the Government Statistician's vision of real open data.

Flip the switch! Free the CURFs! And, just to be on the safe side, put in a big, real penalty for anybody who takes the CURFs and uses them to re-identify individuals.

Update: Oh My.
