Tuesday 12 January 2016

Data integrity and fraud detection

Lindsay Mitchell raises an interesting problem.

In 2011, the Department of Labour matched HLFS employment survey data with benefit data and found:
About 40% of people on work-tested benefits may not be meeting their labour market obligations, as they appear to be either working too much or searching too little.
The backstory Lindsay provides is excellent - it looks like the Ministry tried to bury the paper, and it only came out later, accidentally; Lindsay got it only with the Ombudsman's intervention. Go read her whole post. She also gives one plausible non-evil reason why the Ministry might have wished to bury it:
But imagine a beneficiary reads or hears about how a survey they are being forced to participate in is being checked against their Work and Income records. For the welfare abuser, that would merely tip them off to lie more consistently to government departments.

Data-matching is being used increasingly but its effectiveness lies in keeping the public in the dark. There's an irony at work. Non-transparency is required to improve integrity of systems.

So ultimately that's where I find the most convincing rationale. But that leaves me with a dilemma.

As a long-time critic of the welfare system, I find that these results vindicate or illustrate my concerns about the rampant misuse of the system (which hurts genuine beneficiaries and the taxpayers funding it). Do I want to make a song and dance about these findings, though, if the information acts to assist those with the worst motivations?
So long as no beneficiary was actually punished for truthfully answering an HLFS or HES survey, the odds of contaminating future survey responses are lower.

The most paranoid end of the distribution would expect that the government has been doing this forever and so always would have lied; the least paranoid end would either expect that the government weren't competent to actually match up records, or that Stats NZ wouldn't be lying about the uses to which their data is put. Without actual cases of "I know a guy who told the truth on the HES survey and *bam* lost his benefit", I wouldn't expect huge effects - but I have low confidence in that expectation.

But I think there's a way around it.

First, link up last year's IRD, HES/HLFS, and MSD data through the IDI, along with whatever other administrative data seems useful. Use the IRD and HES/HLFS data to establish true cases of fraud. Use the rest of the data to get the correlates of fraudulent receipt. If the data allows for a reasonable predictive model, great! Save the parameters for next year. If not, abandon the exercise.
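To make that first step concrete, here's a minimal sketch of what the model-building pass could look like. Everything in it is an assumption for illustration: the file names, the person identifier ("snz_uid"), the feature columns, and the rule defining a "true" fraud case are hypothetical placeholders, not the actual IDI schema.

```python
# A minimal sketch of the first step, assuming the linked records have already
# been flattened to one row per work-tested benefit recipient. File names,
# the person identifier ("snz_uid"), column names, and the fraud rule are all
# hypothetical placeholders, not the actual IDI schema.
import pandas as pd
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical last-year extracts, linked on a common person identifier.
ird = pd.read_csv("ird_lastyear.csv")          # earnings reported to IRD
survey = pd.read_csv("hes_hlfs_lastyear.csv")  # survey-reported hours and job search
msd = pd.read_csv("msd_lastyear.csv")          # benefit receipt and obligations

df = msd.merge(ird, on="snz_uid").merge(survey, on="snz_uid")

# "True" fraud label: work-tested recipients whose IRD or survey records are
# inconsistent with their declared status (placeholder rule).
df["fraud"] = (
    (df["work_tested"] == 1)
    & ((df["ird_earnings"] > df["abatement_threshold"]) | (df["survey_hours_worked"] > 30))
).astype(int)

# Correlates of fraudulent receipt drawn from the rest of the administrative data.
features = ["age", "benefit_duration_weeks", "num_past_sanctions"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["fraud"], test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Out-of-sample AUC: {auc:.2f}")

# Keep the parameters for next year only if the model is reasonably predictive;
# otherwise abandon the exercise. The 0.75 cut-off is arbitrary.
if auc > 0.75:
    joblib.dump(model, "fraud_model_lastyear.joblib")
```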

Then, if the predictive model turned out to be decent, use next year's administrative data to forecast which recipients are at higher risk of fraudulent receipt - and have MSD follow up the higher-risk cases. Drop from that sample anybody who was an HES respondent - it'll be a pretty small number anyway. You'll then be pinging recipients who look like last year's fraud cases, but you won't be hitting anybody who was one of the survey respondents. After enough of a lag, bring the prior year's HES respondents back in - their back-end data should have changed sufficiently that it no longer perfectly predicts, so they are not being punished for having answered truthfully. They're being audited only if their characteristics are still very similar to those of high-risk cases.
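And a sketch of the second step, under the same hypothetical schema: score next year's recipients with last year's saved model, drop last year's survey respondents from the audit pool, and hand MSD the highest-risk remainder. The respondent list, the lag, and the size of the follow-up list are all illustrative assumptions.

```python
# A sketch of the second step, using the hypothetical model and schema above.
# Nobody who answered last year's HES/HLFS can be pinged off the back of their
# own answers; after a sufficient lag they simply fall out of the exclusion list.
import pandas as pd
import joblib

model = joblib.load("fraud_model_lastyear.joblib")

admin = pd.read_csv("msd_thisyear.csv")  # next year's administrative data
# Hypothetical list of last year's HES/HLFS respondents (a small number
# relative to the benefit population).
excluded_ids = set(pd.read_csv("hes_respondents_lastyear.csv")["snz_uid"])

features = ["age", "benefit_duration_weeks", "num_past_sanctions"]
admin["fraud_risk"] = model.predict_proba(admin[features])[:, 1]

# Drop last year's survey respondents from this year's audit pool. In later
# years they come back in: their back-end data will have moved on, so they are
# flagged only if they still resemble current high-risk cases.
eligible = admin[~admin["snz_uid"].isin(excluded_ids)]

# Hand MSD the highest-risk cases for follow-up (500 is an arbitrary cap).
follow_up = eligible.sort_values("fraud_risk", ascending=False).head(500)
follow_up[["snz_uid", "fraud_risk"]].to_csv("msd_follow_up_list.csv", index=False)
```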

It won't be perfect - there'll always be some who'll lie on the surveys, just in case. But would there really be many who'd start lying because of this procedure?
