realrolfje/anonimatron

How to anonymise data within CSV files using Anonimatron?

Hi
I need to anonymise CSV files consistently, so that the same identifier (such as a name) always gets the same code. What is the correct JDBC URL for CSV files?
Any help is greatly appreciated.
Thank you.

Hello Maryam192, sorry for the late reply.

Anonymizing CSV files is possible by configuring an input and output file instead of a database. When you run Anonimatron with the option --configexample, you will find the CSV example at the end. If you just leave the database stuff out, your configuration file could look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <file inFile="default_types.in.csv"
        reader="com.rolfje.anonimatron.file.CsvFileReader"
        outFile="default_types.out.csv" writer="com.rolfje.anonimatron.file.CsvFileWriter">
        <column name="A_JAVA_LANG_STRING_COLUMN" type="java.lang.String" size="-1"/>
        <column name="A_JAVA_SQL_DATE_COLUMN" type="java.sql.Date" size="-1"/>
    </file>
</configuration>

You can have multiple CSV files in one configuration file, and you can reuse the synonym file between runs.
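To illustrate what consistent anonymization with a reusable synonym file means, here is a minimal Python sketch. It is not Anonimatron's implementation or its synonym file format: the `SynonymStore` class, the JSON layout, and the `NAME-0001` style of code are all hypothetical and only show the idea that a value seen in an earlier run gets the same replacement again.

```python
import json
import os
import tempfile


class SynonymStore:
    """Hypothetical consistent mapping from original values to codes."""

    def __init__(self, path):
        self.path = path
        self.synonyms = {}
        # Reload earlier synonyms so a later run reuses the same codes.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.synonyms = json.load(fh)

    def anonymize(self, value):
        # A value seen before always gets the code it was given first.
        if value not in self.synonyms:
            self.synonyms[value] = f"NAME-{len(self.synonyms) + 1:04d}"
        return self.synonyms[value]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.synonyms, fh, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "synonyms.json")
    first_run = SynonymStore(path)
    code = first_run.anonymize("Alice")
    first_run.save()
    # A second "run" loads the saved file and returns the same code.
    second_run = SynonymStore(path)
    print(code == second_run.anonymize("Alice"))
```

Keeping the mapping in a file, rather than only in memory, is what makes the codes stable across separate runs and across several CSV files.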

Hello Maryam, the files you attach to a mail are not forwarded by GitHub, and they are not attached to the GitHub issue. Can you please attach the examples to the issue? I'll be glad to have a look at them. (Don't forget to remove passwords and personal info first.)

Hello Rolf,
Thank you again for your consideration. I have attached the files. CSV attachments are not supported, so I had to convert them to xlsx files. XML is not supported either, so I had to convert the configuration to a txt file.
Thank you.
Regards,
Maryam
out.xlsx
config.txt
IncidentOverview1.xlsx

Hello Maryam, I see that the CSV reader implementation is not that robust. It cannot handle "header rows", as your file has. It treats all the rows the same, and although the configuration file uses "name" for the columns, it is actually a column number that needs to be filled in. This needs to be fixed to make it usable. I also just noticed that it cannot handle commas or semicolons inside a field.

For now, a workaround to get it running in your case is:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <file 
      inFile="IncidentOverview1.csv" reader="com.rolfje.anonimatron.file.CsvFileReader"
      outFile="out.csv"              writer="com.rolfje.anonimatron.file.CsvFileWriter">
      <column name="1" type="ROMAN_NAME"   size="100"/>
      <column name="5" type="ROMAN_NAME"   size="100"  />
      <column name="6" type="ROMAN_NAME"   size="100"  />
      <column name="7" type="RANDOMDIGITS" size="20"/>
  </file>
</configuration>

This is not ideal, as the 7th (last) column in your file contains a comma in the data, and it will be treated as columns 7 and 8. I'll see what I can do about this, but I need a bit of time to fix it (and also keep it backward compatible).
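The effect of a comma inside a field can be reproduced outside Anonimatron. The Python sketch below uses a made-up row and assumes the field with the comma is wrapped in double quotes, which may not match your file. It contrasts a plain split on commas (roughly the behaviour described above) with Python's standard `csv` module.

```python
import csv
import io

# Made-up sample row: the 7th field is quoted and contains a comma.
ROW = 'Jan,a,b,c,Piet,Klaas,"12345, ext 6"'


def naive_split(line):
    # Splits on every comma, including the one inside the quoted field.
    return line.split(",")


def quote_aware_split(line):
    # csv.reader keeps a quoted field together as one column.
    return next(csv.reader(io.StringIO(line)))


if __name__ == "__main__":
    print(len(naive_split(ROW)), "columns with a plain split")
    print(len(quote_aware_split(ROW)), "columns with the csv module")
```

Running a check like this on a few lines of the input shows quickly which rows would end up with an extra column.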

I hope this helps you a bit. Thanks for your patience, the examples and the config.
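Since the reader treats every row the same, one way to keep a header row out of the anonymization is to split it off beforehand and put it back afterwards. The Python sketch below shows that pre- and post-processing step. It is not a feature of Anonimatron; the function names are hypothetical, and it assumes the header is exactly the first line of the file.

```python
def split_header(text):
    """Return (header_line, remaining_rows) for CSV text with one header line."""
    header, _, body = text.partition("\n")
    return header, body


def restore_header(header, anonymized_body):
    """Put the untouched header back on top of the anonymized rows."""
    return header + "\n" + anonymized_body


if __name__ == "__main__":
    # Made-up sample data; the real rows would come from the input CSV.
    text = "name,phone\nJan,123\nPiet,456\n"
    header, body = split_header(text)
    # 'body' is what would be written to the file given as inFile;
    # here it is passed through unchanged to show the round trip.
    print(restore_header(header, body) == text)
```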

I think the garbled output has to do with file encoding. Both the input and the output file should be encoded as UTF-8. Is there a way you can check that?
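One possible way to check this is to try decoding the raw bytes of each file as UTF-8. The Python sketch below does that; the script name `check_utf8.py` and the function name are just examples. Note that a file passing this check is only shown to be valid UTF-8 (plain ASCII passes too), while a failure does show the file was saved in some other encoding.

```python
import sys


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding raises on invalid bytes
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Example: python check_utf8.py IncidentOverview1.csv out.csv
    for name in sys.argv[1:]:
        with open(name, "rb") as fh:
            verdict = "valid UTF-8" if is_valid_utf8(fh.read()) else "NOT valid UTF-8"
        print(name, verdict)
```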