
...

  • Questions for the group were how to pick a threat model, which identifiers to be concerned about, and how to establish a risk threshold for public data release.
  • Apply stratification principles to structured data. If you have unstructured data, structure it first.
  • Identity disclosure, just one type of disclosure but the type most applicable to re-id, occurs when a person's identity is assigned to a record.
  • The goal is to measure the re-identification risk of a dataset.
  • Quasi-identifiers are attributes that an attacker could plausibly know (for example, age, ZIP code, and sex).
  • Delete or encrypt/hash direct identifiers first. What we end up with after that is pseudonymous data.
  • For the purposes of re-id risk, we only care about quasi-identifiers.
  • A meaningful re-id teaches you something new about the person.
  • Attacks run in two directions: population to sample, and sample to population.
  • Risk is measured by group size (a group size of 1 means the record is unique).
  • Assign a risk value to each record in the dataset.
  • To reduce the risk, you can generalize the records and reduce the match rate.
  • You can suppress records, remove records, and add noise to reduce the risk of re-id as well.
  • Generalizing makes group sizes bigger, which reduces risk. Risk metrics: maximum risk (k-anonymity; used for public releases), average risk (for non-public releases), and unicity (the proportion of records that are unique in the population). See the sketch after this list.
  • You don't want to measure the risk in the data set; you want to measure the risk in the population. The data set is just a sample from the population.
  • The group size in the population is the number that's important, but you have to estimate it, since you don't usually have a population registry.
  • Once you can estimate the risk properly, you can manage it in a less conservative way that is still defensible.
  • There's no such thing as a probability of zero.
  • For releasing public data, a threshold in popular use today is 0.09, which gives you higher data quality. For particularly sensitive data sets, you would use the stricter threshold of 0.05.
  • The risk denominator is not the group size in the sample but the group size in the population.
  • Risk thresholds sit along an identifiability spectrum.
  • There is a privacy-utility trade-off.
  • Data transformations: generalization, suppression, addition of noise, and microaggregation.
  • For non-public data, you can add controls (privacy, security, contractual) to deal with residual risk.
  • Motivated intruder attack: an empirical way to evaluate your risk. Commission a white-hat attack.
  • Two approaches for risk assessment: 1) model-based, 2) motivated intruder attack.
  • The motivated intruder attack is useful for public data releases; it helps find quasi-identifiers you didn't consider.
  • For public data releases, it's harder to release complex data sets and still retain utility.
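
As a rough illustration of the model-based approach sketched above, here is a minimal Python example using pandas. The toy data set, the column names, the 10% sampling fraction, and the naive scale-by-sampling-fraction population estimator are all assumptions made for illustration; they are not the speaker's method, and a real analysis would use proper statistical estimators of population group size.

    # Minimal model-based re-id risk sketch. Data, column names, and the
    # population estimator are illustrative assumptions.
    import pandas as pd

    # Toy sample of quasi-identifiers (direct identifiers already removed or hashed).
    sample = pd.DataFrame({
        "age":  [34, 35, 34, 70, 71, 34],
        "zip3": ["303", "303", "303", "100", "100", "303"],
        "sex":  ["M", "M", "M", "F", "F", "M"],
    })
    QIDS = ["age", "zip3", "sex"]
    SAMPLING_FRACTION = 0.10  # assumed: the data set is a 10% sample of the population

    def per_record_risk(df, qids, sampling_fraction):
        # Group size in the sample for each record's quasi-identifier combination.
        sample_group = df.groupby(qids)[qids[0]].transform("size")
        # Crude population estimate: scale the sample group size by the sampling
        # fraction. The risk denominator is the population group size, not the
        # sample group size.
        population_group = sample_group / sampling_fraction
        return 1.0 / population_group  # a population group size of 1 = unique

    risk = per_record_risk(sample, QIDS, SAMPLING_FRACTION)
    print("maximum risk (for public releases):    ", risk.max())
    print("average risk (for non-public releases):", risk.mean())

    # Generalization: coarsen age into 10-year bands so groups get bigger
    # and risk goes down.
    generalized = sample.assign(age=(sample["age"] // 10) * 10)
    risk_gen = per_record_risk(generalized, QIDS, SAMPLING_FRACTION)

    THRESHOLD = 0.09  # common public-release threshold; 0.05 for sensitive data
    print("maximum risk after generalizing age:", risk_gen.max())
    print("meets the public threshold:", risk_gen.max() <= THRESHOLD)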

Two publicly available tools:

...

Discussion

  • Q: Alzheimer's MRI data set. If I heard that an Asian man died of Alzheimer's in Atlanta on a certain date, can I find the person's brain in the data set? What is this called?
  • A: You can estimate all of these things. The methods I described can be used to estimate those values.
  • Unless your data set includes everyone who ever had Alzheimer's, you don't have the full population.
  • There is a worry that we underestimate populations and destroy data more than we need to.
  • A: You have to use your best judgment to come up with these numbers and document the process. In practice, if you go through these methods, it's hard to re-id a person. It's not impossible, but it is very difficult.
  • We don't know who these people are. There is no registry of who has brain tumors in the US. We have death records.
  • A: There's a buffer there. Only in a quarter of cases are you able to validate a match.
  • Data quality issues: data sets that are published have errors in them.
  • When you factor in verification of suspected matches plus data quality issues, the risk goes down quite a bit; in practice, it falls below the threshold (see the worked example after this discussion).
  • Q: How well does it work in practice?
  • A: When motivated intruder attacks were done on properly de-identified data, nothing was found.
  • A: The weight of evidence favors the pragmatic process, which allows us to release useful data. A model that is too conservative doesn't allow you to release useful data; it becomes an exercise in theater. You can make your threshold stricter over time.
  • Q: Are there off-the-shelf tools one can use to do this, i.e., estimating population group sizes and computing risk?
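
As a back-of-the-envelope illustration of the verification point above: a suspected match only becomes a meaningful re-identification if the attacker can verify it and the published record is free of errors. Only the one-in-four verification figure comes from the discussion; the other two probabilities below are invented for illustration.

    # Illustrative arithmetic only; p_match and p_correct are assumed values.
    p_match   = 0.20  # assumed: attacker finds a suspected match for a record
    p_verify  = 0.25  # from the discussion: a match can be validated in ~1 in 4 cases
    p_correct = 0.90  # assumed: the published record has no data-quality errors

    effective_risk = p_match * p_verify * p_correct
    print(effective_risk)          # 0.045
    print(effective_risk <= 0.09)  # True: below the 0.09 public-release threshold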