NIH | National Cancer Institute | NCI Wiki  

...

Brian Bialecki on De-identification

  • His team at ACR is trying to find a way to release this data publicly.
  • He would like to get patient consent to share some identifying data.
  • They'd like to see the real data to assess both the real world risk of doing nothing and the real world risk of various mitigation approaches. 

Other Discussion

  • Skull stripping is not a replacement for de-facing.
  • The value of SynthStrip is not in how well the model performs, but rather how the model is created.
  • It's difficult to find things that work across different data sets and modalities. This is something we desperately need.
  • SynthStrip does not work on slices, it is fully 3D.
  • Everyone is converging on a registered or restricted access model, with a data use agreement, for this kind of data.
  • Repositories should record and track who received the data, but some repositories have no such tracking.

Next Meeting

  • Skipping July
  • August 9, 2022 at 1 p.m. EST

August 9, 2022 Meeting

WebEx recording of the 08/09/2022 meeting: https://cbiit.webex.com/cbiit/ldr.php?RCID=2318a677560a423446955128af17413c

Agenda

Applicability of SDC tools to medical image metadata – ARX

Discussion

  • Statistical re-identification of radiology and pathology data. Data includes metadata and spreadsheets accompanying the images.
  • Review of past discussion:
    • Statistical disclosure control
    • Statistical approaches are mentioned in HIPAA. The HIPAA Privacy Rule allows a statistical method (Expert Determination) as an alternative to the Safe Harbor mechanism.
    • Presentation by Dr. El Emam.
    • Estimating re-identification risk and attempting to reduce it.
    • Asked Dr. El Emam which tools exist to help do this. He named ARX and sdcMicro.
  • Today David demonstrated the ARX Anonymization Tool, a Java-based package that runs on any platform. See the tool's YouTube channel at https://www.youtube.com/channel/UCcGAF5nQ_O6ResEF-ivsbVQ/videos
    • David demonstrated how to use this tool, based on his limited understanding of it. 
    • Imported a spreadsheet of CPTAC proteomic metadata. This dataset has around 65000 records, so it's large enough to use for this demonstration. This data is already in IDC.
    • This dataset has both actual- and quasi-identifiers.
    • He selected quasi-identifiers.
    • You have to classify each attribute, for example as sensitive or as a quasi-identifier. David set Gender and Age as quasi-identifiers.
    • The prosecutor, journalist, and marketer attacker models are shown, along with the estimated risk that each kind of attack succeeds. The risk is low for the selected dataset.
    • Prosecutor risk applies when the adversary knows that the target is in the data set.
    • The distribution of risks is shown as a histogram, with prosecutor re-identification risk on the X axis and records affected on the Y axis.
    • ARX is an optimization tool that uses numerical methods to find changes to the data that reduce the risk while preserving utility. It can use more than one privacy model.
    • David tested two models: k-anonymity with k = 2, and average re-identification risk, which is a more complicated model.
    • David provided a generalization pattern by which the tool can aggregate the data as it chooses.
    • He demonstrated creating a hierarchy.
    • After anonymization, there are no more uniques and the quality has not been reduced.
    • The tool makes the data a bit fuzzier without getting rid of data.
    • You can see which rows have been anonymized using the Analyze Utility. Ages are binned or data is omitted according to the model's constraints.
    • Demonstration of adding in race and age. The tool is more aggressive in the handling of age. 
    • This tool also has an API.
    • A small data set increases re-identification risk since there are more uniques.
  • Adam Taylor will explore this tool for HTAN.
  • David is working on drafting the report and hopes to have it done by the next meeting. He will include a section on statistical disclosure control and microdata.
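
The ideas behind the demonstration (quasi-identifiers, prosecutor risk, k-anonymity, and binning ages) can be illustrated with a small sketch. This is not ARX's algorithm or API: ARX searches over generalization hierarchies to balance risk and utility, whereas the code below only applies one fixed 10-year age bin. The records, field names, and helper functions are hypothetical and chosen for illustration. It assumes the usual definition of prosecutor risk for a record as 1 divided by the number of records that share its quasi-identifier values.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, count how many records share its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def prosecutor_risks(records, quasi_identifiers):
    """Per-record prosecutor risk: 1 / size of the record's equivalence class."""
    return [1.0 / n for n in equivalence_class_sizes(records, quasi_identifiers)]


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every record is indistinguishable from at least k - 1 others."""
    return all(n >= k for n in equivalence_class_sizes(records, quasi_identifiers))


def bin_age(age, width=10):
    """Generalize an exact age to a range label such as '60-69'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_age(records, width=10):
    """Return copies of the records with 'age' replaced by its bin label."""
    return [{**r, "age": bin_age(r["age"], width)} for r in records]


# Hypothetical toy records; not taken from the CPTAC metadata.
records = [
    {"gender": "F", "age": 61},
    {"gender": "F", "age": 64},
    {"gender": "M", "age": 52},
    {"gender": "M", "age": 58},
    {"gender": "M", "age": 55},
]
qi = ["gender", "age"]

raw_risks = prosecutor_risks(records, qi)      # every record is unique
binned = generalize_age(records)
binned_risks = prosecutor_risks(binned, qi)    # records now share classes

print("k=2 before binning:", is_k_anonymous(records, qi, 2))
print("k=2 after binning: ", is_k_anonymous(binned, qi, 2))
print("highest risk after binning:", max(binned_risks))
```

In this toy case every raw record is unique, so each has a prosecutor risk of 1. After binning, no record is unique and the data satisfies k-anonymity with k = 2, at the cost of less precise ages. With fewer records, the bins would need to be wider to reach the same k, which matches the note above that small data sets carry more re-identification risk.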