NIH | National Cancer Institute | NCI Wiki  

  • Q: Alzheimer's MRI data set. If I heard that an Asian man died of Alzheimer's in Atlanta on a certain date, can I find the person's brain in the data set? What is this called?
  • El Emam: You can estimate all of these things. The methods I described can be used to estimate those values.
  • Unless your data set includes everyone who ever had Alzheimer's, you don't have the full population.
  • Worry that we underestimate populations and destroy data more than we need to.
  • El Emam: You have to use your best judgement to come up with these numbers and document the process. In practice, if you go through these methods it's hard to re-id a person. It's not impossible but very difficult.
  • We don't know who these people are. There is no registry of who has brain tumors in the US. We have death records.
  • El Emam: There's a buffer there. Only in a quarter of cases are you able to validate a match.
  • Data quality issues
  • Data sets that are published have errors in them
  • When you factor in verification of suspected matches plus data quality issues, risk goes down quite a bit.
  • Risk is below the threshold in practice.
  • How well does it work in practice?
  • El Emam: When motivated intruder attacks were done on properly de-identified data, nothing was found.
  • El Emam: Weight of evidence is on the pragmatic process. Allows us to release useful data. A model that is too conservative doesn't allow you to release useful data; it is an exercise in theater. You can make your threshold more strict over time.
  • Are there off-the-shelf tools one can use to do this, i.e., estimating population and computing risk?
  • How do we determine the real risk? Who would want to find out who an image belongs to in TCIA? How do I define the real risk? 
  • El Emam: It's a legal requirement and there's motivation by academics and the media. Can build a career.
  • All re-id attacks are done by academics and the media.
  • There's a risk being a soft target.
  • When a dataset is incrementally increased, such as 100 new individuals added to a dataset of 10,000, do you estimate re-id risk based on the delta or the whole population? It depends on how large the delta is. It can be a statistical/methodology argument for estimating population group sizes.
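
The discussion above (estimate population group sizes, then factor in how often a suspected match can be verified) can be sketched roughly as follows. This is a minimal illustration and not the method presented in the meeting: the function names, the quasi-identifier fields, the sample records, and the `sampling_fraction` value are all hypothetical, and dividing the sample group size by a sampling fraction is a deliberately naive population estimate. The `verification_rate` of 0.25 reflects the "quarter of cases" remark in the notes.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, count how many records share its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def adjusted_risks(records, quasi_identifiers, sampling_fraction, verification_rate):
    """Per-record risk: 1 / estimated population group size,
    scaled by the chance that a suspected match can be verified."""
    risks = []
    for sample_size in equivalence_class_sizes(records, quasi_identifiers):
        # Naive assumption: population group = sample group / sampling fraction.
        population_size = sample_size / sampling_fraction
        risks.append(verification_rate / population_size)
    return risks


# Hypothetical records; the fields are illustrative only.
records = [
    {"age_band": "70-79", "sex": "M", "region": "GA"},
    {"age_band": "70-79", "sex": "M", "region": "GA"},
    {"age_band": "80-89", "sex": "F", "region": "GA"},
]

risks = adjusted_risks(
    records,
    ["age_band", "sex", "region"],
    sampling_fraction=0.01,   # assumption: data set covers 1% of the population
    verification_rate=0.25,   # "only in a quarter of cases" can a match be validated
)
max_risk = max(risks)
print(risks, max_risk)
```

The maximum per-record risk would then be compared against whatever threshold the release policy documents. For the incremental-release question, one option under this sketch is to rerun the calculation on the combined data set, since adding records changes the group sizes.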

April 12, 2022 Meeting

Agenda:

  • Whole-slide images de-identification goals and issues
  • Define project to assess statistical risk of re-identification from images reconstructed as faces
  • Continue review of draft best practices document

Interim Report Best Practices And Recommendations Extract as of 20220411