...

This is an obvious item that should be stated.
David is undecided how to deal with stratification. Do people agree that it's fine to stratify images and not check each one?
Brian checks each one because the metadata is not reliable enough to stratify it. Others agree with looking at everything and not stratifying.

Actions:

David Clunie will recruit a statistician with the right expertise to speak with us at the March 8 meeting.
David will continue refining the Interim Report.
All Task Group members are welcome to email David your comments on the report.

...

WebEx recording of the 03/01/2022 meeting: https://cbiit.webex.com/cbiit/ldr.php?RCID=dfc6ce5f047a8b1a11d4729304d10ac3

Presentation by Khaled El Emam on Re-identification Risk Measurement - slides, slides with annotations.

Questions for group were how to pick a threat model, which identifiers to be concerned about, and how to establish a risk threshold for public data release.
Apply stratification principles to structured data. If you have unstructured data, structure it first.
Identity disclosure, which is just one type of disclosure but the type most applicable to re-id, is when a person's identity is assigned to a record.
Trying to measure the risk of verification for a dataset
Quasi-identifiers are those known by an attacker.
Delete or encrypt/hash direct identifiers first. What we end up with after that is pseudonymous data.
For the purposes of re-id risk, we only care about quasi-identifiers.
A meaningful re-id teaches you something new about the person.
Attack in two directions - population to sample, sample to population
Risk is measured by the group size (a group size of 1 = unique).
Assign a risk value to each record in the dataset.
To reduce the risk, you can generalize the records and reduce the match rate.
You can suppress records, remove records, and add noise to reduce the risk of re-id as well.
Generalizing makes the group size bigger, which reduces the risk. Risk metrics: maximum (k-anonymity; for public data), average (for non-public data), and unicity (proportion of records that are unique in the population).
You don't want to measure the risk in the data set but measure the risk in the population. The data set is just a sample from the population.
The group size in the population is the number that's important, but you have to estimate it, since you don't usually have a population registry.
Once you can estimate the risk properly, you can manage risk in a less conservative way that is still defensible.
There's no such thing as a probability of zero.
For releasing public data, a threshold in popular use today is .09. This will give you higher data quality. For particularly sensitive data sets, you would use the more strict threshold of .05.
risk denominator is not group size in sample but in population
risk threshold in identifiability spectrum
privacy-utility trade-off
data transformations - generalization, suppression, addition of noise, microaggregation
for non-public data, can add controls (privacy, security, contractual) to deal with residual risk.
Motivated intruder attack: an empirical way to evaluate your risk. Commission a white hat attack.
Two approaches for risk assessment: 1) model-based 2) motivated intruder attack.
Useful for public data releases. Helps find quasi-identifiers you didn't consider.
For public data releases, it's harder to release complex data sets and still retain utility.
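The group-size idea in the notes above can be illustrated with a minimal sketch. It assumes each record's risk is 1 divided by the size of its group (records sharing the same quasi-identifier values), and it reports the maximum risk, the average risk, and the fraction of unique records. The field names (`age`, `sex`, `state`), the helper names, and the sample records are hypothetical. Note that this computes group sizes in the sample only; as the presentation stressed, the group size in the population is what matters, and estimating it is not shown here.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """For each record, count the records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def risk_summary(records, quasi_identifiers):
    """Summarize sample-level risk, taking per-record risk as 1 / group size."""
    sizes = group_sizes(records, quasi_identifiers)
    risks = [1.0 / s for s in sizes]
    return {
        "max_risk": max(risks),                    # worst case (k-anonymity view)
        "avg_risk": sum(risks) / len(risks),       # average over all records
        "sample_unique_fraction": sum(1 for s in sizes if s == 1) / len(sizes),
    }


def generalize_age(record, width=10):
    """Generalize an exact age into a band such as '60-69' (hypothetical rule)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records; every one is unique on (age, sex, state) before generalization.
records = [
    {"age": 61, "sex": "M", "state": "GA"},
    {"age": 64, "sex": "M", "state": "GA"},
    {"age": 67, "sex": "M", "state": "GA"},
    {"age": 72, "sex": "F", "state": "GA"},
    {"age": 75, "sex": "F", "state": "GA"},
    {"age": 78, "sex": "F", "state": "GA"},
]
QUASI_IDENTIFIERS = ["age", "sex", "state"]

before = risk_summary(records, QUASI_IDENTIFIERS)
after = risk_summary([generalize_age(r) for r in records], QUASI_IDENTIFIERS)

THRESHOLD = 0.09  # public-release threshold mentioned in the notes
print("before:", before, "meets threshold:", before["max_risk"] <= THRESHOLD)
print("after: ", after, "meets threshold:", after["max_risk"] <= THRESHOLD)
```

In this toy data, generalizing age into ten-year bands grows every group from one record to three, so the maximum risk drops from 1 to 1/3. That is still above the 0.09 threshold, which illustrates why further generalization, suppression, or a population-based estimate would be needed.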

Two publicly available tools:

- SDC Micro (R package) - link - main paper - GUI application paper
- ARX - link - main paper - list of papers

Papers:

motivated intruder attack - Branson et al
confidence instead of known identity (UK) - Tudor et al

El Emam background and bibliography

Discussion

Q: Alzheimer's MRI data set. If I heard that an Asian man died of Alzheimer's in Atlanta on a certain date, can I find the person's brain in the data set? What is this called?
El Emam: You can estimate all of these things. The methods I described can be used to estimate those values.
Unless your data set includes everyone who ever had Alzheimer's, you don't have the full population.
Worry that we underestimate populations and destroy data more than we need to.
El Emam: You have to use your best judgement to come up with these numbers and document the process. In practice, if you go through these methods it's hard to re-id a person. It's not impossible but very difficult.
We don't know who these people are. There is no registry of who has brain tumors in the US. We have death records.
El Emam: There's a buffer there. Only in a quarter of cases are you able to validate a match.
Data quality issues
Data sets that are published have errors in them
When you factor in verification of suspected matches plus data quality issues, risk goes down quite a bit.
Risk is below the threshold in practice.
How does it work well in practice?
El Emam: When motivated intruder attacks were done on properly de-identified data, nothing was found.
El Emam: Weight of evidence is on the pragmatic process. It allows us to release useful data. A model that is too conservative doesn't allow you to release useful data; it is an exercise in theater. You can make your threshold more strict over time.
Are there off-the-shelf tools one can use to do this (estimating population and computing risk)?
How do we determine the real risk? Who would want to find out who an image belongs to in TCIA? How do I define the real risk?
El Emam: It's a legal requirement and there's motivation by academics and the media. Can build a career.
All re-id attacks are done by academics and the media.
There's a risk in being a soft target.
When a dataset is incrementally increased, such as 100 new individuals added to a dataset of 10,000, do you estimate re-id risk based on the delta or the whole population? It depends how large the delta is. It can be a statistical/methodology argument for estimating population group sizes.

April 12, 2022 Meeting

WebEx recording of the 04/12/2022 meeting: https://cbiit.webex.com/cbiit/ldr.php?RCID=252a3cdaab0ac10863bf8babd810e3c8

Interim Report Best Practices And Recommendations Extract as of 20220411

Agenda:

Whole-slide images de-identification goals and issues
Define project to assess statistical risk of re-identification from images reconstructed as faces
Continue review of draft best practices document

...

Discussion

Introduction to new potential task group member, David Brundage, professor at Cornell.
Whole-slide images (WSI) are not usually in DICOM format and must be converted.
Dave Gutman shared slides about protected health information in WSI.
Most PHI lives in the slide label. Sometimes only a partial label is scanned, so a human might not realize PHI is there, but a machine can detect that PHI.
The primary image is unlikely to contain PHI.
Luke Geneslaw also shared some slides about detecting label presence in tissue image scans.
Trade-off between missing tissue that has cancer on it and bigger files that could have more PHI in the data.
Leaving data in the slide label causes problems for de-id. There isn't software out there to redact pixel data from the tissue sample. Dave Gutman is working on an NCI project to develop it.
JPEG stores data in 8x8 blocks, so it's possible to remove individual blocks from an image.
Metadata extraction is unique to format.
TCIA has a dictionary of private data elements from DICOM.
Python package: https://github.com/DigitalSlideArchive/tifftools
Dates can be in TIFF tags and other data elements defined in the XML, or included as an annotation.
It's our job to identify which areas need to be mitigated.
Not all slides are standard formats, like prostate whole mounts.
David Clunie would like to create a sub-group of people with a special interest and experience in WSI so that they can create content on that subject for the report. Fred Prior, Dave Gutman, David Brundage will join.
We need a person with significant statistical knowledge who could adapt their knowledge to defacing. Justin suggested someone and will talk to David about it.
David shared the new version of the task group's de-id report. Please follow the tracked changes offline and if you have any comments on it, please let him know.
Common stratification based on type: the task group determined that this is not sufficient and that we should be looking at everything regardless. A sentence was added to the report.
We have not yet defined a best practice on how to score risk. Further research is needed.
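The point raised in the discussion that JPEG stores data in 8x8 blocks can be illustrated with a small sketch. It only does the geometry: given a pixel rectangle that contains PHI, it finds the blocks that overlap it and the block-aligned rectangle that would have to be removed. The function names and the example rectangle are hypothetical, and no JPEG data is read or modified here. The `block` parameter is an assumption: with chroma subsampling the minimum coded unit can be larger than 8x8 (for example 16x16), so the value would need to match the actual file.

```python
def blocks_covering(x, y, width, height, block=8):
    """Return (col, row) indices of the block-sized tiles overlapping a pixel rectangle."""
    if width <= 0 or height <= 0:
        raise ValueError("rectangle must have positive width and height")
    first_col, last_col = x // block, (x + width - 1) // block
    first_row, last_row = y // block, (y + height - 1) // block
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


def aligned_rectangle(x, y, width, height, block=8):
    """Expand a pixel rectangle outward to the nearest block boundaries."""
    tiles = blocks_covering(x, y, width, height, block)
    cols = [col for col, _ in tiles]
    rows = [row for _, row in tiles]
    left, top = min(cols) * block, min(rows) * block
    right, bottom = (max(cols) + 1) * block, (max(rows) + 1) * block
    return left, top, right - left, bottom - top


# Hypothetical PHI region: 20x6 pixels with its top-left corner at (10, 5).
tiles = blocks_covering(10, 5, 20, 6)
region = aligned_rectangle(10, 5, 20, 6)
print(len(tiles), "blocks overlap the region")
print("block-aligned region (x, y, width, height):", region)
```

The aligned rectangle is always at least as large as the requested one, so removing whole blocks may also remove some neighboring pixels that did not contain PHI.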
