July 20, 2021 Meeting
WebEx recording of 7/20/2021 meeting
- Introduction: Medical Image De-Identification Initiative (MIDI)
- Task Group goals
- Steering Committee
- Timeline
- Discussion
August 10, 2021 Meeting
WebEx recording of 8/10/2021 meeting
- Instructions to access the MIDI Task Group wiki page
- Accept Mendeley invitation to access private group for literature review/annotated bibliography
- Outline of approach
- metadata vs. pixel data
- structured (strongly typed) vs. text
- burned-in text ("printed" and hand-written)
- identifiable features (e.g., faces, iris, retina)
- with or without "public" data to compare with
- Challenging topics
- evaluation of success of de-identification
- quantitative comparison of performance
- quantifying re-identification risk
- creating test data sets
- faces (etc.) reconstructed from cross-sections
- burned-in text - detection, removal, cleaning
- cleaning text descriptors (metadata or burned in)
- buried metadata (e.g., EXIF, geotags in JPEG inside DICOM)
- dates (incl. preserving temporal relationships)
- pseudonym consistency across separate submissions
- risks of hashing to create pseudonymous identifiers
- uniqueness of images limits statistical approaches
- loss allowable during de-identification (e.g., age fuzzing, pixels)
- private data element preservation to retain utility
- ultrasound - still frames and cine loops, lossy compressed
- photographs and video
- gross pathology and whole slide images (incl. labels)
- IRB/ethics committee messaging wrt. de-identification decisions
- IT security approval/audits of de-identification
- regulatory requirements: HIPAA Privacy Rule, GDPR, CCPA, others?
- sufficiency of standards, e.g., DICOM PS3.15 Annex E
- risk of not following a standard (home-grown decisions)
- threat of image "signatures", private set intersection methods
- policy versus the technical details of recompression/decompression artifacts for JPEG
- data minimization
- Inventory of tools
- user interface vs. scripted (bulk, service)
- configurable - user vs. installer vs. hard-coded
- platform, language
- open source, free, commercial, service
- on-site vs. outside (e.g., [IP]II needs to leave walls for AI on cloud)
- Roadmap and deliverables
- interim report
- full report
- "primer" on medical image de-identification for newbies/execs
- confirm what is out of scope (non-goals) - consent, data use agreements, ...
- Tasking: Members to think about which task they would like to contribute to.
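A rough sketch of three outline topics above: dates (preserving temporal relationships), pseudonym consistency across separate submissions, and the risks of hashing. It assumes a keyed hash (HMAC-SHA256) with a secret key held by the de-identifying site. All function names, the 16-character truncation, and the 365-day shift range are hypothetical choices for illustration, not something the Task Group agreed on. An unkeyed hash of a small identifier space (e.g., medical record numbers) can be reversed by brute force, which is one of the hashing risks listed.

```python
import hashlib
import hmac
from datetime import date, timedelta


def pseudonym(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Return a keyed-hash pseudonym for a patient ID (hypothetical helper).

    The same ID and key always give the same pseudonym, so separate
    submissions from one site stay consistent without a lookup table.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def date_offset_days(patient_id: str, secret_key: bytes, max_shift: int = 365) -> int:
    """Derive a per-patient shift of 1..max_shift days from the keyed hash."""
    digest = hmac.new(
        secret_key, b"date:" + patient_id.encode("utf-8"), hashlib.sha256
    ).digest()
    return int.from_bytes(digest[:4], "big") % max_shift + 1


def shift_date(d: date, patient_id: str, secret_key: bytes) -> date:
    """Shift a date back by the patient's offset.

    Every date for one patient moves by the same amount, so the
    intervals between studies are preserved.
    """
    return d - timedelta(days=date_offset_days(patient_id, secret_key))


if __name__ == "__main__":
    key = b"example-key-kept-secret-by-the-site"  # assumption: managed securely
    baseline = date(2021, 7, 20)
    follow_up = date(2021, 10, 12)
    print(pseudonym("MRN-0001", key))
    print(shift_date(follow_up, "MRN-0001", key) - shift_date(baseline, "MRN-0001", key))
```

Deriving the date offset from the same key means no per-patient table has to be stored, but it also means that anyone who obtains the key can recompute every pseudonym and offset, so this sketch moves the risk to key management rather than removing it.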
September 14, 2021 Meeting
WebEx recording of the 9/14/2021 meeting
- Role of AI in de-identification - demand for data, opportunities, threats
- Google has a de-id tool
- Amazon Comprehend
- Identifying images at risk: which images are more likely than others to contain burned-in information?
- Problem with scalability in terms of building the ruleset. Better to identify selectively.
- Barcodes, pacemaker serial numbers, implanted devices
- There is the potential of identifying objects but not the raw data.
- Action: Describe the steps involved in imaging and the evolution of data in different levels of processing
- Case-based data: is raw data in our purview?
- Raw data is often in proprietary format and can lack a header.
- Post-processed data like 3D reconstructions
- What is the harm of reidentification? High-resolution 3D image of the face
- Penetration testing as it applies to de-ID
- How to evaluate the success of de-facing?
- Newman, L. H. (2016). AI Can Recognize Your Face Even If You’re Pixelated. Wired. https://www.wired.com/2016/09/machine-learning-can-identify-pixelated-faces-researchers-show/
- When is it okay to release information that you know is identifiable? Example of boy in NYT.
- Sometimes reidentification does not provide any new data.
- What do you now know that you didn't know before?
- Expectations of doing better deidentification and the threats of better reidentification. What can we do now and what in the future with AI?
- Do you expect that one day a machine will replace your manual deidentification process? Can a robot replace human review?
- Can you accept the risk of AI/machines/code? Get to the level of risk that is tolerable.
- Main topic for the next call: the need for human QC.
- When will you stop using humans or a targeted subset?
- What would increase your comfort level enough for you to stop using human QC?
October 12, 2021 Meeting
WebEx recording of the 10/12/2021 meeting
Discussion of this document:
- Not practical for a human to review all of the images.
- TCIA built a tool called Kaleidoscope that flattens images and saves time.
- Radiology techs can also do this work, but sensitivity goes down as you view more images.
- What is the cost of a data breach in terms of manpower?
- As screening goes up, breaches go up.
Discussion of the de-identification process:
- Did you have a formal QC process that involved you verifying the quality of the de-identification process after it was done?
- John Perry: developed a process and a test to make sure it worked, but didn't look at all of the images to confirm it was done without breaches.
- Monitor logs to make sure nothing slips through without automation applied to it. Grab a random 1% and look through the headers.
- Need a more medical model that understands the variability in what we're trying to do
- Partial vs. complete success: field or header
- Catch-22 that you can't crowd-source because there could be PHI
- Build synthetic datasets that have real street addresses in real places that don't match the actual data
- Train a model and release that but not the dataset
- Would need a statistician
- Judy: We are encountering issues that the black box models do not understand. Running experiments on adversarial networks. Surprising findings.
- Amalgamate clinical and imaging data.
- Models have already learned enough to infer age, sex, and race. We don't understand how this happens, and they might also pick up other identifying data.
- We are not trying to hide age, sex, and race. We're trying to prevent the re-identification of a person.
- Increasing the uniqueness of the image data is a threat for re-identification. But if you don't have a database of everyone's fingerprints, for example, it's useless.
- At some point we have to be clear about what we are trying to reidentify and what the practical limits are.
- Clearview.ai
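The "grab a random 1% and look through the headers" idea from the discussion above could be scripted roughly as follows. This is only a sketch: the function name, the `percent` and `seed` parameters, and the choice to always review at least one file are assumptions for illustration, not a description of any site's actual QC process.

```python
import random


def sample_for_review(file_ids, percent=1, seed=None):
    """Pick a random subset of de-identified files for manual header review.

    `percent` is a whole-number percentage; the sample size is rounded up,
    and at least one file is chosen whenever the input is non-empty.
    """
    items = list(file_ids)
    if not items:
        return []
    k = max(1, (len(items) * percent + 99) // 100)  # integer ceiling
    k = min(k, len(items))
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(items, k)


if __name__ == "__main__":
    batch = [f"image_{i:05d}.dcm" for i in range(1000)]  # hypothetical file names
    for name in sample_for_review(batch, percent=1, seed=2021):
        print(name)
```

Recording the seed alongside the batch lets an auditor reproduce which files were reviewed.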
Tasking
Justin Kirby: report back on what TCIA encounters that is part of their human review processes
David Clunie: organize report topics in an outline
Judy: Write up some content (not the overview) on defacing
TJ and Ying: Can help with defacing