UAlbanyArchives/mailbagit

Fully preserve PST folder structure including empty folders

gwiedeman opened this issue · 6 comments

The problem the component solves

The PST parser currently reads the email folder structure within the file and uses this path to write single message derivatives, such as EML, PDF, etc. However, the parser only returns messages to the model with their paths as attributes. Thus, empty folders are not returned to the model (they just log an error) and do not get written to the mailbag. I think empty directories should be created in a mailbag for empty folders in a PST.

There are two possible approaches. The first is to adjust the model to handle an empty-directory Message object, with an attribute denoting whether to treat it as an actual message or not. The second is to build a list of empty folders in all parsers that gets returned to the model.
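To make the two options concrete, here is a rough sketch of each. Everything in it is hypothetical: `MessageRecord`, `is_empty_folder`, `messages()` and `empty_folders()` are illustrative names, not mailbagit's actual Email model or parser API, and the PST is faked as a dict of folder paths to subjects.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class MessageRecord:
    """Hypothetical stand-in for the per-message model; not mailbagit's Email class."""
    folder_path: str
    subject: Optional[str] = None
    is_empty_folder: bool = False  # approach 1: flag marks a placeholder, not a real message


def messages(folders: Dict[str, List[str]]) -> Iterator[MessageRecord]:
    """Approach 1: yield a flagged placeholder record for each empty folder."""
    for folder_path, subjects in folders.items():
        if not subjects:
            yield MessageRecord(folder_path, is_empty_folder=True)
            continue
        for subject in subjects:
            yield MessageRecord(folder_path, subject=subject)


def empty_folders(folders: Dict[str, List[str]]) -> List[str]:
    """Approach 2: build a separate list of empty folder paths to hand back."""
    return [path for path, subjects in folders.items() if not subjects]


# Fake PST: one folder with two messages, one empty folder.
pst = {"Inbox": ["Hello", "Re: Hello"], "Deleted Items": []}
records = list(messages(pst))
real = [r for r in records if not r.is_empty_folder]
```

With approach 1, every consumer of the generator has to check the flag before treating a record as a message. With approach 2, the message stream stays untouched but each parser needs a second return channel.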

Relevant part of mailbag spec?

4.2 Additional subdirectories within format subdirectories

Type of component

  • Core
  • Input
  • Attachments
  • Derivatives conversion
  • Reporting/Exporting
  • GUI
  • Distribution

Expected contribution

  • Pull Request
  • Comment with proposed solution

Major challenges or things to keep in mind

In looking at this, I discovered that we are not really handling PST folder structure correctly. #175 fixes this. Looking at the full folder structure of an account, PSTs have a bunch of other folders outside of Inbox which look like they may contain search indexes or something. It actually might be good practice to ignore all folders that don't contain messages, so this is on hold until we have a decision.

It would be quite challenging to make empty folders in derivatives for empty PST folders, as the messages() generator would have to return a list of these directories to the controller or something, which adds a good amount of complexity for all formats.

As discussed with the Advisory board, a basic step to minimally address this could be to raise a warning for empty folders and include that in the error reports. Now that #198 splits warnings into a separate error report, swamping the error reports with warnings is no longer a problem. An enhanced solution would be to also create empty folders in the mailbag for message level derivatives (PDF, EML, WARC, etc.). Thus, if there was an empty Deleted Items folder in the PST, there would be an empty folder in the mailbag: my_mailbag/data/pdf/Top of Outlook data file/Deleted Items.
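For the enhanced solution, creating the folder itself is the easy part once the empty folder path reaches whatever writes the derivatives. A minimal sketch, assuming "/"-separated PST folder paths and the data/<format>/ layout from the example above; `create_empty_derivative_folder` is a hypothetical helper, not an existing mailbagit function:

```python
import os
import tempfile


def create_empty_derivative_folder(mailbag_dir, format_name, pst_folder_path):
    """Create data/<format>/<PST folder path> inside the mailbag, even with no messages."""
    # Assumes PST folder paths are "/"-separated, as in the example above.
    parts = [p for p in pst_folder_path.split("/") if p]
    target = os.path.join(mailbag_dir, "data", format_name, *parts)
    os.makedirs(target, exist_ok=True)  # no error if the folder already exists
    return target


# Try it in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    mailbag = os.path.join(tmp, "my_mailbag")
    created = create_empty_derivative_folder(
        mailbag, "pdf", "Top of Outlook data file/Deleted Items"
    )
    print(os.path.relpath(created, tmp))
```

The hard part, getting the list of empty folders from the parser to this call, is not addressed by the sketch.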

Both of these solutions are challenging, as the parsers currently only return data to the controller through the Email model, which is per message. The controller writes the error/warn reports and the derivatives modules write the mailbag directories from the controller per message.

#201 turned the log warnings back on, but still doesn't return the empty folder paths to the controller, so they still do not get added to the warnings report, nor do empty folders get created for message-level derivatives.

To be clear, the empty folders are handled (or not handled) here and here.

Just thinking about this, the easiest solution might be to just manually open() and write() a report to my_mailbag_warnings. It would be outside of the error reports process that the controller manages, but it would work for the minimal solution. The existing reports are all message based, which wouldn't apply anyway. Maybe just my_mailbag_warnings/folder_name.txt with a warning in it would do it? Would need to use the normalize_path() helper.
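Roughly what that minimal version could look like. This is only a sketch: `safe_name()` is a hypothetical stand-in for the normalize_path helper (whose real signature and behavior are not shown here), and the warning text and function name are made up for illustration.

```python
import os
import re
import tempfile


def safe_name(folder_path):
    # Hypothetical stand-in for mailbagit's normalize_path helper; the real one may differ.
    # Replaces characters that are unsafe in file names with underscores.
    return re.sub(r'[\\/:*?"<>|]+', "_", folder_path).strip("_ ")


def write_empty_folder_warning(warnings_dir, folder_path):
    """Write one small text file per empty folder, outside the controller's report process."""
    os.makedirs(warnings_dir, exist_ok=True)
    report = os.path.join(warnings_dir, safe_name(folder_path) + ".txt")
    with open(report, "w", encoding="utf-8") as f:
        f.write(f"WARNING: empty folder in source PST: {folder_path}\n")
    return report


# Try it in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_empty_folder_warning(
        os.path.join(tmp, "my_mailbag_warnings"),
        "Top of Outlook data file/Deleted Items",
    )
    print(os.path.basename(path))
```

Flattening the folder path into a single file name keeps my_mailbag_warnings flat, but two folders whose paths differ only in the replaced characters would collide, so a real version might need something more careful.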

I still wish there was a better solution, but that might be better than having to rewrite how the parsers work or anything large like that.

TODO:

  • fix account_data across base class and all concrete formats
  • number_of_messages rationalization

Addressed by #216