...
- Q: Alzheimer's MRI data set. If I heard that an Asian man died of Alzheimer's in Atlanta on a certain date, can I find the person's brain in the data set? What is this called?
- El Emam: You can estimate all of these things. The methods I described can be used to estimate those values.
- Unless your data set includes everyone who ever had Alzheimer's, you don't have the full population.
- Worry that we underestimate populations and destroy data more than we need to.
- El Emam: You have to use your best judgement to come up with these numbers and document the process. In practice, if you go through these methods it's hard to re-id a person. It's not impossible but very difficult.
- We don't know who these people are. There is no registry of who has brain tumors in the US. We have death records.
- El Emam: There's a buffer there. Only in a quarter of cases are you able to validate a match.
- Data quality issues
- Data sets that are published have errors in them
- When you factor in verification of suspected matches plus data quality issues, risk goes down quite a bit.
- Risk is below the threshold in practice.
- How well does it work in practice?
- El Emam: When motivated intruder attacks were done on properly de-identified data, nothing was found.
- El Emam: Weight of evidence is on the pragmatic process. It allows us to release useful data. A model that is too conservative doesn't allow you to release useful data; it becomes an exercise in theater. You can make your threshold stricter over time.
- Are there off-the-shelf tools one can use to do this (estimating population and computing risk)?
- How do we determine the real risk? Who would want to find out who an image belongs to in TCIA? How do I define the real risk?
- El Emam: It's a legal requirement and there's motivation by academics and the media. Can build a career.
- All re-id attacks are done by academics and the media.
- There's a risk in being a soft target.
- When a dataset is incrementally increased, such as 100 new individuals added to a dataset of 10,000, do you estimate re-id risk based on the delta or the whole population? It depends on how large the delta is. It can be a statistical/methodology argument for estimating population group sizes.
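
The risk estimate discussed above (population group size, then a discount for how often a suspected match can be verified) can be illustrated with a minimal sketch. This is not the estimator El Emam described in the meeting; it assumes a naive scaling of sample group sizes by a hypothetical `sampling_fraction`, and it uses the "quarter of cases" figure from the notes as the default `verification_rate`. The function and field names are illustrative only.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, count how many records share its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def reidentification_risk(records, quasi_identifiers,
                          sampling_fraction, verification_rate=0.25):
    """Rough per-record risk: verification_rate / estimated population group size.

    ASSUMPTION: the population group size is estimated by dividing the sample
    group size by sampling_fraction. Real estimators are more careful.
    """
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    population_sizes = [s / sampling_fraction for s in sizes]
    per_record = [verification_rate / f for f in population_sizes]
    return {
        "max": max(per_record),
        "average": sum(per_record) / len(per_record),
    }


# Hypothetical toy data: the data set is assumed to hold half the population.
records = [
    {"race": "Asian", "sex": "M", "city": "Atlanta"},
    {"race": "Asian", "sex": "M", "city": "Atlanta"},
    {"race": "White", "sex": "F", "city": "Boston"},
]
risk = reidentification_risk(records, ["race", "sex", "city"], sampling_fraction=0.5)
print(risk)
```

The maximum and average values would then be compared against whatever threshold the project documents. For an incremental release, the same function could be run on the delta and on the whole data set to see how much the estimate changes.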
April 12, 2022 Meeting
Agenda:
- Whole-slide images de-identification goals and issues
- Define project to assess statistical risk of re-identification from images reconstructed as faces
- Continue review of draft best practices document
Interim Report Best Practices And Recommendations Extract as of 20220411