NIH | National Cancer Institute | NCI Wiki  


...

  • This is an obvious item that should be stated.
  • David is undecided how to deal with stratification. Do people agree that it's fine to stratify images and not check each one?
  • Brian checks each one because the metadata is not reliable enough to stratify it. Others agree with looking at everything and not stratifying.

Actions:

  • David Clunie will recruit a statistician with the right expertise to speak with us at the March 8 meeting.
  • David will continue refining the Interim Report.
  • All Task Group members are welcome to email David your comments on the report.

...

WebEx recording of the 03/01/2022 meeting (link opens in new window): https://cbiit.webex.com/cbiit/ldr.php?RCID=dfc6ce5f047a8b1a11d4729304d10ac3

Presentation by Khaled El Emam on Re-identification Risk Measurement - slides, slides with annotations.

  • Questions for the group: how to pick a threat model, which identifiers to be concerned about, and how to establish a risk threshold for public data release.
  • Apply stratification principles to structured data. If you have unstructured data, structure it first.
  • Identity disclosure, which is just one type of disclosure but the type most applicable to re-id, is when a person's identity is assigned to a record.
  • The goal is to measure the re-identification risk for a dataset.
  • Quasi-identifiers are those known by an attacker. 
  • Delete or encrypt/hash direct identifiers first. What we end up with after that is pseudonymous data.
  • For the purposes of re-id risk, we only care about quasi-identifiers.
  • A meaningful re-id teaches you something new about the person.
  • Attacks go in two directions: population to sample, and sample to population.
  • Risk is measured by group size (a group size of 1 means the record is unique).
  • Assign a risk value to each record in the dataset.
  • To reduce the risk, you can generalize the records and reduce the match rate.
  • You can suppress records, remove records, and add noise to reduce the risk of re-id as well.
  • Generalization makes group sizes bigger, which reduces risk. Risk measures: maximum risk (k-anonymity; used for public releases), average risk (non-public releases), and unicity (the proportion of records that are unique in the population).
  • You don't want to measure the risk in the data set but measure the risk in the population. The data set is just a sample from the population.
  • The group size in the population is the number that's important, but you have to estimate it, since you don't usually have a population registry.
  • Once you can estimate the risk properly, you can manage risk in a less conservative way that is still defensible.
  • There's no such thing as a probability of zero.
  • For releasing public data, a threshold in popular use today is 0.09. This will give you higher data quality. For particularly sensitive data sets, you would use the stricter threshold of 0.05.
  • The risk denominator is the group size in the population, not in the sample.
  • The risk threshold sits on an identifiability spectrum.
  • There is a privacy-utility trade-off.
  • Data transformations: generalization, suppression, addition of noise, and microaggregation.
  • For non-public data, you can add controls (privacy, security, contractual) to deal with residual risk.
  • A motivated intruder attack is an empirical way to evaluate your risk: commission a white-hat attack.
  • Two approaches for risk assessment: 1) model-based and 2) motivated intruder attack.
  • The motivated intruder approach is useful for public data releases and helps find quasi-identifiers you didn't consider.
  • For public data releases, it's harder to release complex data sets and still retain utility.
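
The group-size risk measures above (maximum, average, unicity) can be sketched in a few lines of Python. This is an illustrative sketch, not El Emam's implementation: the records and quasi-identifier names are hypothetical, and the metrics are computed on the sample only.

```python
from collections import Counter

def reid_risk(records, quasi_identifiers):
    """Group records by their quasi-identifier values and compute the
    three risk measures mentioned above (all measured on the sample)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    sizes = groups.values()
    n = sum(sizes)
    return {
        "maximum": 1 / min(sizes),            # k-anonymity view, used for public releases
        "average": len(groups) / n,           # mean per-record risk, used for non-public releases
        "unicity": sum(1 for s in sizes if s == 1) / n,  # share of sample-unique records
    }

# Hypothetical example data:
records = [
    {"age_band": "60-69", "zip3": "303", "sex": "M"},
    {"age_band": "60-69", "zip3": "303", "sex": "M"},
    {"age_band": "60-69", "zip3": "303", "sex": "F"},
    {"age_band": "70-79", "zip3": "104", "sex": "F"},
]
risk = reid_risk(records, ["age_band", "zip3", "sex"])
# risk == {"maximum": 1.0, "average": 0.75, "unicity": 0.5}
```

Because population groups are at least as large as sample groups, these sample-based values overestimate the population-level risk that the notes say actually matters.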

Two publicly available tools:

Papers:

El Emam background and bibliography

Discussion

  • Q: Consider an Alzheimer's MRI data set. If I heard that an Asian man died of Alzheimer's in Atlanta on a certain date, can I find the person's brain in the data set? What is this attack called?
  • El Emam: You can estimate all of these things. The methods I described can be used to estimate those values.
  • Unless your data set includes everyone who ever had Alzheimer's, you don't have the full population.
  • There is a worry that we underestimate population sizes and destroy more data than we need to.
  • El Emam: You have to use your best judgment to come up with these numbers and document the process. In practice, if you go through these methods, it's hard to re-id a person. It's not impossible, but it is very difficult.
  • We don't know who these people are. There is no registry of who has brain tumors in the US. We have death records.
  • El Emam: There's a buffer there. Only in a quarter of cases are you able to validate a match.
  • Data quality is also a factor: data sets that are published have errors in them.
  • When you factor in verification of suspected matches plus data quality issues, risk goes down quite a bit, and in practice falls below the threshold.
  • How well does this work in practice?
  • El Emam: When motivated intruder attacks were done on properly de-identified data, nothing was found.
  • El Emam: The weight of evidence is on the pragmatic process, which allows us to release useful data. A model that is too conservative doesn't allow you to release useful data; it becomes an exercise in theater. You can make your threshold stricter over time.
  • Are there off-the-shelf tools one can use to do this (estimating population group sizes and computing risk)?
  • How do we determine the real risk? Who would want to find out who an image in TCIA belongs to?
  • El Emam: It's a legal requirement and there's motivation by academics and the media. Can build a career.
  • All re-id attacks are done by academics and the media.
  • There's a risk being a soft target.
  • When a dataset is incrementally increased, such as 100 new individuals added to a dataset of 10,000, do you estimate re-id risk based on the delta or on the whole? It depends on how large the delta is; it becomes a statistical/methodology question when estimating population group sizes.
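
On the recurring point that the risk denominator should be the population group size, which usually has to be estimated: below is a deliberately naive Python sketch of the idea, assuming the dataset is a known sampling fraction of the population. The function name and numbers are illustrative; real estimators of the kind discussed in the talk model the sampling process statistically rather than scaling linearly.

```python
def estimated_population_risk(sample_group_size, sampling_fraction):
    """Naive illustration: scale the sample group size by the inverse
    sampling fraction to approximate the population group size, then
    take its reciprocal as the risk."""
    est_population_group_size = sample_group_size / sampling_fraction
    return 1 / est_population_group_size

# A sample-unique record (group size 1) in a dataset covering
# roughly 2% of the relevant population:
risk = estimated_population_risk(1, 0.02)
# risk == 0.02, below the 0.09 threshold cited for public releases
```

This illustrates why a sample-unique record is not necessarily population-unique, and why measuring risk only in the sample is overly conservative.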

April 12, 2022 Meeting

WebEx recording of the 04/12/2022 meeting (link opens in new window): https://cbiit.webex.com/cbiit/ldr.php?RCID=252a3cdaab0ac10863bf8babd810e3c8

Interim Report Best Practices And Recommendations Extract as of 20220411

Agenda:

  • Whole-slide images de-identification goals and issues
  • Define project to assess statistical risk of re-identification from images reconstructed as faces
  • Continue review of draft best practices document

...

Discussion


  • Introduction to new potential task group member, David Brundage, professor at Cornell.
  • Whole-slide images (WSI) are not usually in DICOM format and must be converted.
  • Dave Gutman shared slides about protected health information in WSI.
  • Most PHI lives in the slide label. Sometimes only a partial label is scanned, so a human might not realize PHI is there, but a machine can still detect it.
  • The primary image is unlikely to contain PHI. 
  • Luke Geneslaw also shared some slides about detecting label presence in tissue image scans.
  • There is a trade-off between missing tissue that has cancer on it and keeping bigger files that could contain more PHI.
  • Leaving data in the slide label causes problems for de-id. There isn't software out there to redact pixel data from the tissue sample; Dave Gutman is working on an NCI project to develop it.
  • JPEG stores data in 8x8 blocks, so it's possible to remove individual blocks from an image.
  • Metadata extraction is unique to each format.
  • TCIA has a dictionary of private data elements from DICOM.
  • Python package: https://github.com/DigitalSlideArchive/tifftools
  • Dates can be in TIFF tags and other data elements defined in the XML, or included as an annotation.
  • It's our job to identify which areas need to be mitigated.
  • Not all slides are in standard formats; prostate whole mounts are one example.
  • David Clunie would like to create a sub-group of people with a special interest and experience in WSI so that they can create content on that subject for the report. Fred Prior, Dave Gutman, and David Brundage will join.
  • We need a person with significant statistical knowledge who could adapt their knowledge to defacing. Justin suggested someone and will talk to David about it.
  • David shared the new version of the task group's de-id report. Please review the tracked changes offline, and let him know if you have any comments.
  • Common stratification based on type: the task group determined that this is not sufficient and that we should be looking at everything regardless. A sentence was added to the report.
  • We have not yet defined a best practice on how to score risk. Further research is needed.
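
The 8x8-block observation above suggests block-aligned redaction. A minimal Python sketch of that bookkeeping, hypothetical and operating on a plain 2-D pixel array rather than an actual JPEG bitstream:

```python
def redact_blocks(pixels, blocks_to_blank, block=8, fill=0):
    """Blank whole block x block tiles in a 2-D pixel grid.
    JPEG codes images in 8x8 blocks, so redaction can drop exactly
    the blocks that overlap a label without touching the rest."""
    out = [row[:] for row in pixels]  # copy; leave the input intact
    height, width = len(out), len(out[0])
    for by, bx in blocks_to_blank:
        for y in range(by * block, min((by + 1) * block, height)):
            for x in range(bx * block, min((bx + 1) * block, width)):
                out[y][x] = fill
    return out

# Blank the top-left 8x8 block of a 16x16 image of white pixels:
image = [[255] * 16 for _ in range(16)]
redacted = redact_blocks(image, [(0, 0)])
```

A real implementation would drop or zero the corresponding coded blocks in the compressed stream (or re-encode), but the block-aligned arithmetic is the same.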