arx-deidentifier/arx

[BUG] GUI utilizes hierarchy incorrectly

ZachHaber opened this issue · 8 comments

The problem with the priority-based hierarchy builder raised in #403 appears to be only tangentially related to that issue, so I was asked to file it separately.

The hierarchy created via the priority-based hierarchy builder produces a different output than making the same data transformation manually.

In the case mentioned, the anonymization step applied a level that does not exist in the hierarchy shown under the data transformation: null is kept while White is dropped, which produced a sub-optimal output.

Original Message:


@prasser There's something funky with the underlying hierarchy currently.

The hierarchy creation functionality works great and exactly as I'd like, though it is a bit backwards. Highest to lowest is the default option, which goes in the exact opposite order of what I'd expect: with frequency prioritization (highest to lowest), the mode of the data should be the last value to be dropped instead of the first.
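The ordering described above can be sketched in a few lines. This is a minimal illustration, not ARX's implementation: the function name `build_frequency_hierarchy`, the tie-breaking rule, and the sample column are all assumptions made for the example. It suppresses the least frequent value first, so the mode is the last value to be dropped.

```python
from collections import Counter


def build_frequency_hierarchy(column, suppressed="*"):
    """Hypothetical helper: one hierarchy row per distinct value.

    Level 0 keeps every value. Each further level suppresses one more value,
    starting with the least frequent, so the mode is dropped last.
    """
    counts = Counter(column)
    # Least frequent first; ties are broken by the value itself (an assumption).
    drop_order = sorted(counts, key=lambda value: (counts[value], value))
    levels = len(drop_order) + 1
    hierarchy = {}
    for rank, value in enumerate(drop_order):
        # A value with rank r survives on levels 0..r and is suppressed above.
        hierarchy[value] = [value if level <= rank else suppressed
                            for level in range(levels)]
    return hierarchy


# Made-up sample column, not the data from the issue.
race = ["White", "White", "White", "", ""]
for value, row in build_frequency_hierarchy(race).items():
    print(repr(value), row)
```

On this made-up column, level 1 maps "White" to "White" and "" to "*", which is the behavior I would expect from the frequency-based option.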

The issue I'm seeing is that the anonymization doesn't apply the hierarchy as it is displayed in the "data transformation" tab.

On a dataset I'm using (unfortunately I can't paste it for easy reference), the hierarchy generated with lowest to highest (to ensure that the mode of the data is the last value dropped) looks like this:
image
If I generate the results, I see that it decides to drop the Race column instead of keeping "White" only, which is the result I got when I manually created this hierarchy earlier.

I can find the exact value that I used originally for this data set when I look for "Non-anonymous" transformations, which is really odd, because like I said, it should be the same result.

Instead of keeping "White", it kept "" (empty) as "Level-3" Race. I also tried the "Highest to Lowest" option, and the same result occurred.

When I edit the hierarchy and tell it to remove the underlying representation of the current hierarchy, it suddenly works exactly as it should, with the same results as v3.9.0 and the manual transformation settings I put together.

Originally posted by @ZachHaber in #403 (comment)

I've gotten the problem down to a test set:
issue-405-test-set.zip

In the zip file, there are 4 .csv files and 5 .deid project files (created from those .csv files). Everything was generated using a .jar built from the #403 PR code (https://github.com/arx-deidentifier/arx/tree/8064f2d77606ad4e4cc02b5cb342ff10ad639e17 for future reference).

I did some experimenting to try and see where the problem lies and where it doesn't. The issue isn't due to empty values (issue-405-no-empties), but still has something to do with the order of the data. Between issue-405-ordered-input and issue-405-non-empty-first, you can see that the results are completely the opposite, so the order (which value comes first) in the original data-set definitely matters.

I included issue-405-k-2 and issue-405-k-3 because they were my first attempts, but also to highlight how different k values change the result for the same input. k-3 has an interesting case where the level-0 result has the empty values suppressed, so level-0 and level-1 both end up applying the same transformation; for some reason, the level-1 suppression in this k-3 result is what it should be. In the k-2 result, by contrast, the level-1 suppression is wrong: only the non-empty values are suppressed, while the transformation says the empty values should be suppressed. The level-0 suppression for k-2 is correct, though, with no values suppressed.
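One way to reason about why k matters at level-0: under k-anonymity with record suppression, a group of identical quasi-identifier values with fewer than k rows cannot be released as-is. The sketch below illustrates that idea only. It is not ARX's code, `suppress_small_classes` is a hypothetical helper, and the column is made up rather than taken from the zip file.

```python
from collections import Counter


def suppress_small_classes(column, k, suppressed="*"):
    """Hypothetical helper: suppress values whose group has fewer than k rows."""
    counts = Counter(column)
    return [value if counts[value] >= k else suppressed for value in column]


# Made-up column: four "White" rows and two empty rows.
race = ["White"] * 4 + [""] * 2

print(suppress_small_classes(race, k=2))  # both groups have at least 2 rows
print(suppress_small_classes(race, k=3))  # the empty group has only 2 rows
```

With a distribution like this one, k=2 leaves level-0 untouched while k=3 suppresses the empty values, which would be consistent with the level-0 results described above.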

I took a look now and I am not able to reproduce the problem. Could you maybe provide me with step-by-step instructions that I can use to reproduce the issue with one of the files you provided in the zip archive?

You should be able to see the problem by opening up one of the project files and comparing the "Race" column transformation in "Analyze Utility" with the hierarchy from the "Configure Transformation" tabs.

Step by step instructions (most of these should apply to the other .csv files in the archive as well, some of them just require different settings, like reversing the order of the hierarchies):

  1. Create new Project
  2. Import "issue-405.csv" with default settings
  3. Set "Race" as Quasi-Identifying
  4. Create a Hierarchy on Race: "Use Priorities" => "Prioritize by frequency (highest to lowest)" (or lowest to highest if using the previous commit from the feature-frequency-hierarchy branch) => Finish
  5. Observe that Level-1 should have "White" => "White" and "" => "*" as the transformation
  6. Set up 2-Anonymity privacy model
  7. Edit -> Anonymize with default settings (Optimal/Global transformation)
  8. In "Explore Results" view, apply Transformation "1" (which isn't the optimal one in this case)
  9. In Analyze Utility, note that the transformation that was applied was "" => "" and "White" => "*"
  10. In Configure Transformation, edit the data transformation for "Race" to remove the underlying hierarchy without changing it, then redo steps 7-8, and notice that the applied transformation is now correct for Transformation 1.
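The comparison between steps 5 and 9 can also be done mechanically on exported data. The sketch below is a hypothetical check, not part of ARX: `find_mismatches` and the sample rows are assumptions, and it only compares the mapping a hierarchy level promises with the mapping observed between an input column and an output column.

```python
def find_mismatches(expected_level, input_column, output_column):
    """Hypothetical check: compare a hierarchy level with the applied output.

    expected_level maps each original value to its generalized value.
    Returns {original: (expected, sorted observed outputs)} for every
    original value whose observed outputs differ from the expectation.
    """
    observed = {}
    for original, transformed in zip(input_column, output_column):
        observed.setdefault(original, set()).add(transformed)
    return {
        original: (expected_level.get(original), sorted(outputs))
        for original, outputs in observed.items()
        if outputs != {expected_level.get(original)}
    }


# Level 1 as shown in step 5, and the mapping reported in step 9.
level_1 = {"White": "White", "": "*"}
original = ["White", "White", "", ""]  # made-up rows
applied = ["*", "*", "", ""]           # made-up rows mirroring step 9

print(find_mismatches(level_1, original, applied))
```

An empty result means the applied transformation matches the displayed hierarchy level; with the rows above, both "White" and "" are reported as mismatches.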

Note: it could very well be that this whole bug is because I don't know what I'm doing with building Java projects, and I've only managed to get it to build with the default Eclipse Java 2022-06 which runs/builds via Java 17. I have no idea how to build it such that it works with Java 8 JRE.

Ok, thanks! I can now see that problem. I'm sure that it's not related to compilation.

Should be fixed by commit 291821f. If possible, please check whether it now works as expected.

So far, it's working as expected! On the test datasets (after rebuilding the hierarchies, because it fails to generate with the existing ones), it works. On Monday, I'll try again on the large real dataset I have to make sure :)
Thanks!

@prasser It's working on my original dataset that I found the issue on as well!

Thanks. Resolved.