NIH | National Cancer Institute | NCI Wiki  


January 11, 2022 Meeting

Interim Report Best Practices And Recommendations Extract as of 20220107

  • Fred asked if the report should be focused on the US given that the details can differ geographically.
  • Data collected from European persons may not satisfy GDPR.
  • We should highlight when this is true, along with caveats and any possible workarounds.
  • Fred likes the idea of universal guidelines to recommend to the EU.
  • We will share the report with international colleagues once the report is fleshed out.
  • California regulations exclude healthcare data. 
  • Is it fair to focus on ethical and moral concerns as well as the legal concerns? We're trying to reduce the actual re-id risk and harm.
  • So far we're focused on DICOM images.
  • Kathy: Say anything about raw data signals?
  • Wyatt: DICOM SR objects and embedded PDFs? Non-image objects, RT plans.
  • Need a more precise definition for unrecognized. It is the opposite of "what is known to be safe."
  • Specify what constitutes due diligence as you conduct your risk analysis. Can't help the unknown unknowns.
  • Make the definition of collection clear. Collection doesn't communicate "version."
  • "Release" not as good as "collection."
  • "Indirect" and "direct" identifiers, and sensitive information: a disease that could expose someone to discrimination or function as an indirect identifier.
  • Ideally, you'd want to quantify the percentage of data elements you will be retaining.
  • The paper will highlight the uncertainty.
  • Steve: Address optional attributes as well.
  • Calibration information can identify the machine used.
  • Consistency of acquisition protocols.
  • Need to consider and determine which options of the profile are selected.
  • Part 15 and best practices are different.
  • Only got through item 6 in the Summary of Best Practices. Will pick this up at the next meeting. To save time, team members can send David their comments in writing.

Action: Review the Interim Report and email David Clunie your comments.

February 8, 2022 Meeting

Interim Report Best Practices And Recommendations Extract as of 20220208

ITEM 6

  • A paper in BMJ and Trials [Hrynaszkiewicz et al] in which the editor said it's okay to keep three patient characteristics, but more than that requires expert statistical analysis of re-identification risk.
  • For this report, David C. is leaning towards saying that if any characteristics are retained, a statistical analysis should be performed, based on the evaluation in the IOM report Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk, which suggests there is no empirical basis for a rule of two or three quasi-identifiers (Appendix B [El Emam & Malin]).
  • Fred asks what risk threshold we are comfortable with. David says we need the analysis first, then compare that against the risk threshold.
  • David recommends this for the report: Choose a risk threshold, do the analysis, and modify/share your data based upon that analysis.
  • Is this a reasonable recommendation given that none of us are doing the statistical analysis routinely?
  • David Gutman: Studies that leave out age and sex are not interesting.
  • Some TCIA collections are useful even without age and sex.
  • HIPAA Safe Harbor is only useful in the US. We are trying to do more than these 18 elements when it comes to de-id.
  • In radiology we have traditionally just relied on lists.
  • Information can be derived or approximated from images. If pixel data render the data unique, change it to make it less recognizable or delete it. The analysis can lead you to a decision on how to handle this.
  • We would be better off recommending this and then over the next several years, hopefully there will be more research into practical ways of doing this. 
  • Invite selected people from the statistical disclosure community to comment on this and say which tools they use. 
  • In Europe, GDPR has gotten to the point where you can't share any data. Do we want to go that route?
  • Rather than get so extreme, maybe just leave age out of this data, or change it to meet the risk threshold.
  • Is there a scalable way to do this?
  • Radiology has been immature about this and has not considered existing research into approaches.
  • David C. recommends that we read the CAR papers: Canadian Association of Radiologists White Paper on De-Identification of Medical Imaging:
  • David G: I realize this is a nuance, but if we recommend X, and many of the people in this group don't currently do X because it's extremely difficult/nebulous, do we shoot ourselves in the foot? Bone density?
  • Fred: Hospitals in different countries. Multiple ethics review boards in the same country. Need a threshold that is agreed upon if you are going to de-id anything.
  • The risk is finite, so we need to pick a threshold.
  • Threshold: Probability of re-id based on threat model. Pick the most conservative one and compute the probability. 
  • Justin: I'd be very interested to try applying one or two of these automated tools David mentioned against a couple of TCIA datasets to see what happens and help inform the recommendation in the report.
  • David C: Yes, let's try this.
  • People need to balance utility and risk and insure themselves in the meantime.
  • Brian: Utility has zero value to legal people. Risk is always increasing because technology gets better and better.
  • Countermeasures which hopefully will keep everything balanced. With released data, unless you are going to pull it back and not release it, the risk always goes up.
  • Should we get a guest speaker for the next meeting? Group says yes.
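The per-record statistical analysis discussed above can be sketched in a few lines of Python. The records, field names, and values here are invented for illustration; a real analysis of a TCIA collection would use the actual retained characteristics:

```python
from collections import Counter

# Toy records of retained patient characteristics (hypothetical data).
records = [
    {"age": 63, "sex": "F"}, {"age": 63, "sex": "F"},
    {"age": 71, "sex": "M"}, {"age": 54, "sex": "F"},
]

def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in rows]
    return Counter(keys)

sizes = equivalence_class_sizes(records, ["age", "sex"])
# Records whose quasi-identifier combination is unique carry the highest re-id risk.
unique = [combo for combo, n in sizes.items() if n == 1]
print(unique)  # → [(71, 'M'), (54, 'F')]
```

Records falling in a class of size one would then be candidates for generalization, suppression, or perturbation before the analysis is compared against the chosen risk threshold.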

ITEM 8

ITEM 9

  • This is an obvious item that should be stated.
  • David is undecided how to deal with stratification. Do people agree that it's fine to stratify images and not check each one?
  • Brian checks each one because the metadata is not reliable enough to stratify it. Others agree with looking at everything and not stratifying.

Actions:

  • David Clunie will recruit a statistician with the right expertise to speak with us at the March 8 meeting.
  • David will continue refining the Interim Report.
  • All Task Group members are welcome to email David your comments on the report.

March 1, 2022 Meeting

Presentation by Khaled El Emam on Re-identification Risk Measurement - slides, slides with annotations.

  • Questions for group were how to pick a threat model, which identifiers to be concerned about, and how to establish a risk threshold for public data release.
  • Apply stratification principles to structured data. If you have unstructured data, structure it first.
  • Identity disclosure, which is just one type of disclosure but the type most applicable to re-id, is when a person's identity is assigned to a record.
  • Trying to measure the risk of verification for a dataset.
  • Quasi-identifiers are those known by an attacker. 
  • Delete or encrypt/hash direct identifiers first. What we end up with after that is pseudonymous data.
  • For the purposes of re-id risk, we only care about quasi-identifiers.
  • A meaningful re-id teaches you something new about the person.
  • Attack in two directions: population to sample, and sample to population.
  • Risk is measured by group size (a group size of 1 = unique).
  • Assign a risk value to each record in the dataset.
  • To reduce the risk, you can generalize the records and reduce the match rate.
  • You can suppress records, remove records, and add noise to reduce the risk of re-id as well.
  • Generalize → group size gets bigger → risk reduces. Risk metrics: maximum (k-anonymity; for public release), average (for non-public release), unicity (proportion of records that are unique in the population).
  • You don't want to measure the risk in the data set but measure the risk in the population. The data set is just a sample from the population.
  • The group size in the population is the number that's important, but you have to estimate it, since you don't usually have a population registry.
  • Once you can estimate the risk properly, you can manage risk in a less conservative way that is still defensible.
  • There's no such thing as a probability of zero.
  • For releasing public data, a threshold in popular use today is 0.09, which gives you higher data quality. For particularly sensitive data sets, you would use the stricter threshold of 0.05.
  • The risk denominator is not the group size in the sample but in the population.
  • The risk threshold sits on an identifiability spectrum.
  • Privacy-utility trade-off.
  • Data transformations: generalization, suppression, addition of noise, microaggregation.
  • For non-public data, you can add controls (privacy, security, contractual) to deal with residual risk.
  • Motivated intruder attack: an empirical way to evaluate your risk. Commission a white-hat attack.
  • Two approaches for risk assessment: 1) model-based 2) motivated intruder attack.
  • Useful for public data releases. Helps find quasi-identifiers you didn't consider. 
  • For public data releases, it's harder to release complex data sets and still retain utility.
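The group-size risk metrics from the presentation, and the effect of generalization, can be sketched as follows. The sample tuples and the 10-year age binning are invented for illustration; as noted above, a real analysis would estimate group sizes in the population, not just the sample:

```python
from collections import Counter

def risks(rows):
    """Per-record re-id risk = 1 / size of the record's equivalence class."""
    sizes = Counter(rows)
    return [1 / sizes[r] for r in rows]

# Toy quasi-identifier tuples (age, sex): a hypothetical sample, not a population.
sample = [(63, "F"), (67, "F"), (71, "M"), (74, "M")]

max_risk = max(risks(sample))                # k-anonymity-style metric (public release)
avg_risk = sum(risks(sample)) / len(sample)  # average-risk metric (non-public release)

# Generalization: 10-year age bins grow the group sizes and so lower the risk.
generalized = [(age // 10 * 10, sex) for age, sex in sample]

THRESHOLD = 0.09  # public-release threshold cited above (0.05 for sensitive data)
print(max_risk, max(risks(generalized)))  # → 1.0 0.5
print(max(risks(generalized)) <= THRESHOLD)  # → False: still too risky in this toy case
```

If the generalized data still exceeds the threshold, the other transformations (suppression, noise addition, microaggregation) would be applied until it falls below it.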

Two publicly available tools:

Papers:

El Emam background and bibliography

Discussion

  • Q: Alzheimer's MRI data set. If I heard that an Asian man died of Alzheimer's in Atlanta on a certain date, can I find the person's brain in the data set? What is this called?
  • El Emam: You can estimate all of these things. The methods I described can be used to estimate those values.
  • Unless your data set includes everyone who ever had Alzheimer's, you don't have the full population.
  • Worry that we underestimate populations and destroy data more than we need to.
  • El Emam: You have to use your best judgement to come up with these numbers and document the process. In practice, if you go through these methods it's hard to re-id a person. It's not impossible but very difficult.
  • We don't know who these people are. There is no registry of who has brain tumors in the US. We have death records.
  • El Emam: There's a buffer there. Only in a quarter of cases are you able to validate a match.
  • Data quality issues
  • Data sets that are published have errors in them
  • When you factor in verification of suspected matches plus data quality issues, risk goes down quite a bit.
  • Risk is below the threshold in practice.
  • How does it work well in practice?
  • El Emam: When motivated intruder attacks were done on properly de-identified data, nothing was found.
  • El Emam: Weight of evidence is on the pragmatic process. Allows us to release useful data. Model that is too conservative doesn't allow you to release useful data-an exercise in theater. You can make your threshold more strict over time.
  • Are there off-the-shelf tools one can use to do this (estimating population group sizes and computing risk)?
  • How do we determine the real risk? Who would want to find out who an image belongs to in TCIA? How do I define the real risk? 
  • El Emam: It's a legal requirement and there's motivation by academics and the media. Can build a career.
  • All re-id attacks are done by academics and the media.
  • There's a risk being a soft target.
  • When a dataset is incrementally increased, such as 100 new individuals added to a dataset of 10,000, do you estimate re-id risk based on the delta or the whole dataset? It depends on how large the delta is; it can become a statistical/methodology argument for estimating population group sizes.
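One way to frame the incremental-release question above in code: score only the newly added records, but compute group sizes over the combined dataset, since the delta changes the equivalence classes of the whole release. The data and function name are hypothetical:

```python
from collections import Counter

def delta_max_risk(existing, delta):
    """Max re-id risk among newly added records, with group sizes computed
    over the combined (existing + delta) dataset, not the delta alone."""
    sizes = Counter(existing + delta)
    return max(1 / sizes[r] for r in delta)

# Hypothetical quasi-identifier tuples (age, sex).
existing = [(60, "F")] * 50 + [(70, "M")] * 50
delta = [(60, "F"), (80, "F")]   # (80, "F") is unique even in the combined set
print(delta_max_risk(existing, delta))  # → 1.0
```

A new record that blends into a large existing group contributes little risk, while one that is unique in the combined release drives the maximum risk to 1, which is why the size of the delta matters.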