...
- Questions for the group: how to pick a threat model, which identifiers to be concerned about, and how to establish a risk threshold for public data release.
- Apply stratification principles to structured data. If you have unstructured data, structure it first.
- Identity disclosure, which is just one type of disclosure but the type most applicable to re-id, is when a person's identity is assigned to a record.
- Trying to measure the re-identification risk of a dataset.
- Quasi-identifiers are those known by an attacker.
- Delete or encrypt/hash direct identifiers first. What we end up with after that is pseudonymous data.
- For the purposes of re-id risk, we only care about quasi-identifiers.
- A meaningful re-id teaches you something new about the person.
- Attack in two directions - population to sample, sample to population
- Risk is the inverse of the group size (a group size of 1 means the record is unique).
- Assign a risk value to each record in the dataset.
- To reduce the risk, you can generalize the records and reduce the match rate.
- You can suppress records, remove records, and add noise to reduce the risk of re-id as well.
- Generalize: the group size gets bigger, so the risk goes down. Risk metrics: maximum risk (k-anonymity, for public releases), average risk (for non-public releases), and unicity (the proportion of records that are unique in the population).
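The group-size risk measures above can be sketched as follows (the records and quasi-identifier values are hypothetical, not from the talk):

```python
from collections import Counter

# Hypothetical records, each a tuple of quasi-identifiers (age band, sex, ZIP3).
records = [
    ("30-39", "F", "303"),
    ("30-39", "F", "303"),
    ("30-39", "M", "303"),
    ("40-49", "F", "304"),
]

# Group size k for each quasi-identifier combination.
group_sizes = Counter(records)

# Per-record risk is 1/k; a group size of 1 means the record is sample-unique.
risks = [1 / group_sizes[r] for r in records]

max_risk = max(risks)               # maximum risk (k-anonymity view, public release)
avg_risk = sum(risks) / len(risks)  # average risk (non-public release)
```

Generalizing the quasi-identifiers (e.g. coarser age bands) merges groups, which raises k and lowers both metrics.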
- You don't want to measure the risk in the data set but measure the risk in the population. The data set is just a sample from the population.
- The group size in the population is the number that's important, but you have to estimate it, since you don't usually have a population registry.
- Once you can estimate the risk properly, you can manage risk in a less conservative way that is still defensive.
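A minimal sketch of why population risk matters, assuming the crude simplification that population group size scales with the inverse sampling fraction (the estimators actually used in practice, such as Pitman-style estimators, are more sophisticated):

```python
def population_risk(sample_group_size, sampling_fraction):
    # Crude estimate of the population group size; risk is its reciprocal.
    # Assumes the dataset is a simple random sample of the population.
    est_pop_group_size = sample_group_size / sampling_fraction
    return 1 / est_pop_group_size

# A sample-unique record (k=1) in a 10% sample has an estimated population
# group of about 10, so its estimated risk is roughly 0.1, not 1.0.
```

This illustrates why measuring risk only in the sample is overly conservative: the attacker must match against the whole population, not just the released records.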
- There's no such thing as a probability of zero.
- For releasing public data, a threshold in common use today is 0.09, which gives you higher data quality. For particularly sensitive data sets, use the stricter threshold of 0.05.
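The threshold check itself is simple; a sketch using the two values mentioned above (the function name is illustrative, not from any standard tool):

```python
PUBLIC_THRESHOLD = 0.09     # common threshold for public data releases
SENSITIVE_THRESHOLD = 0.05  # stricter threshold for sensitive data sets

def acceptable_for_release(measured_risk, sensitive=False):
    # Compare the measured (population-level) risk to the applicable threshold.
    threshold = SENSITIVE_THRESHOLD if sensitive else PUBLIC_THRESHOLD
    return measured_risk <= threshold
```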
- The risk denominator is the group size in the population, not in the sample.
- risk threshold in identifiability spectrum
- privacy-utility trade-off
- data transformations - generalization, suppression, addition of noise, microaggregation
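As a sketch, generalization might look like this (the banding scheme is a hypothetical example):

```python
def generalize_age(age):
    # Replace an exact age with a 10-year band, e.g. 34 -> "30-39".
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def generalize_date(iso_date):
    # Keep only the year from an ISO date, e.g. "2019-06-02" -> "2019".
    return iso_date[:4]

generalize_age(34)             # "30-39"
generalize_date("2019-06-02")  # "2019"
```

Suppression, noise addition, and microaggregation follow the same pattern: each trades precision in the quasi-identifiers for larger group sizes.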
- For non-public data, you can add controls (privacy, security, contractual) to deal with residual risk.
- Motivated intruder attack: an empirical way to evaluate your risk. Commission a white-hat attack.
- Two approaches for risk assessment: 1) model-based 2) motivated intruder attack.
- Useful for public data releases. Helps find quasi-identifiers you didn't consider.
- For public data releases, it's harder to release complex data sets and still retain utility.
Two publicly available tools:
...
Discussion
- Q: Alzheimer's MRI data set. If I heard that an Asian man died of Alzheimer's in Atlanta on a certain date, can I find the person's brain in the data set? What is this called?
- A: You can estimate all of these things. The methods I described can be used to estimate those values.
- Unless your data set includes everyone who ever had Alzheimer's, you don't have the full population.
- Worry that we underestimate populations and destroy data more than we need to.
- A: You have to use your best judgement to come up with these numbers and document the process. In practice, if you go through these methods it's hard to re-id a person. It's not impossible but very difficult.
- We don't know who these people are. There is no registry of who has brain tumors in the US. We have death records.
- A: There's a buffer there. Only in a quarter of cases are you able to validate a match.
- Data quality issues
- Data sets that are published have errors in them
- When you factor in verification of suspected matches plus data quality issues, risk goes down quite a bit.
- Risk is below the threshold in practice.
- How well does it work in practice?
- A: When motivated intruder attacks were done on properly de-identified data, nothing was found.
- A: The weight of evidence is on the pragmatic process, which allows us to release useful data. A model that is too conservative doesn't allow you to release useful data, making it an exercise in theater. You can make your threshold more strict over time.
- Are there off-the-shelf tools one can use to do this (estimating population group sizes and computing risk)?