carderne/signal-export

Merging old and new

Closed this issue · 6 comments

Current behaviour

Currently the --old=/path/to/previous/export feature does the following:

  1. Get the messages and attachments from the old directory
  2. Extract the new messages and attachments from the Signal DB
  3. Merge these and output to whatever directory you have specified
  4. The old export is NOT touched.

The point of this is to merge an export with an export from a previous phone or Signal account, or after deleting all your messages in Signal.
Basically, the assumption is that there is NO overlap between the new and old.
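
To make the "no overlap" assumption concrete, here is a minimal sketch of what such a merge amounts to. It is not the project's actual code: the `Message` record and the `merge_no_overlap` function are hypothetical names made up for illustration.

```python
from __future__ import annotations

from typing import NamedTuple


class Message(NamedTuple):
    # Hypothetical minimal record; a real export carries more fields.
    date: str    # ISO date, e.g. "2022-09-05"
    time: str    # 24h time, e.g. "20:31"
    sender: str
    body: str


def merge_no_overlap(old: list[Message], new: list[Message]) -> list[Message]:
    """Concatenate old and new messages and sort them chronologically.

    No de-duplication is attempted, so a message present in both
    inputs shows up twice in the result.
    """
    return sorted(old + new, key=lambda m: (m.date, m.time))
```

If the two inputs do overlap, the duplicates are simply kept, which is why this behaviour only suits exports that cover disjoint periods.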

Idea for a new feature

It would probably be more useful for most people to have a feature such as --update that did the following:

  1. Check the messages and attachments at the specified output directory (if any)
  2. Read the Signal DB
  3. SKIP any messages/attachments from the DB that are already in the output dir
  4. Export the new messages and attachments to that dir

This would make it much easier to keep a single "source of truth" of your conversations, just updating it once every x time period.
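
The four steps above could be sketched roughly as follows. This is only an illustration of the idea, not an existing sigexport feature: `Message` and `update_export` are hypothetical names, and treating (date, time, sender, body) as a message's identity is an assumption (the real DB presumably has proper IDs).

```python
from __future__ import annotations

from typing import NamedTuple


class Message(NamedTuple):
    # Hypothetical record; identity is assumed to be all four fields.
    date: str
    time: str
    sender: str
    body: str


def update_export(existing: list[Message], from_db: list[Message]) -> list[Message]:
    """Return the existing export plus any DB messages it does not contain yet."""
    seen = set(existing)                               # step 1: what is already exported
    fresh = [m for m in from_db if m not in seen]      # steps 2-3: skip known messages
    # Step 4: nothing is ever removed from `existing`, only appended to.
    return sorted(existing + fresh, key=lambda m: (m.date, m.time))
```

Because the existing export is only ever extended, a message that has since vanished from the DB stays in the result.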

Caveats

  1. The only reason I haven't done this yet is that there is the potential for data destruction of very valuable data, and so I have preferred to leave it as an exercise to the reader to figure out how to merge the stuff...
  2. Does this provide an advantage over simply exporting, checking that all is well, and then deleting the old export?

@coppolab FYI

Concerning the current behaviour, I think it addresses extremely relevant use cases and should definitely be kept. In addition to those cases, I have to report that unintentionally losing the whole Signal DB is easier than expected. I recently lost mine on iPhone due to a failed/interrupted app update, and by design there is no way of restoring old data on the device. That's when I understood that no Signal DB instance (device/Desktop) of mine can be regarded as the "source of truth", and that's why signal-export and its exported snapshot(s) have gained even more relevance for me. I would suggest giving the current behaviour a self-explanatory functional name. Not easy, but candidate keywords might refer to db/account/instance migration/replacement.

Concerning the "new feature", I totally support it and its focus on realizing a "single source of truth" that (as discussed above) would be even more reliable than the Signal DB for the sophisticated user. However, defining exactly what counts as a legitimate "update" (set) to be added to the previous export is not trivial.
In short, I see some complexity in dealing with messages/attachments that a correspondent may delete from a conversation a long time after they were sent. That would be critical in many work/business scenarios. Should future message/attachment deletions be considered updates to be enforced on the "source of truth"? My strong belief is that they should not, and preserving them would be a top feature of this project.
A simple way to implement this would be to define exact points in time and assume that conversations in the export can never be retroactively modified. A more refined way would be to mark those messages/attachments as deleted, without actually deleting them. Such a refined model would also take into account a possible future Signal behaviour of allowing message editing, as other messaging platforms do.
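
The "mark as deleted, don't delete" variant might look something like the sketch below. Everything here is hypothetical: it assumes each archived message has a stable ID and is stored as a dict with a `deleted_at` field, which is not how signal-export stores things today.

```python
from __future__ import annotations


def mark_deleted(archive: dict[str, dict], db_ids: set[str], snapshot: str) -> list[str]:
    """Flag archived messages that are no longer present in the Signal DB.

    archive  -- hypothetical mapping of message ID to its archived record
    db_ids   -- IDs currently found in the DB
    snapshot -- label for this run, e.g. an ISO date
    Returns the IDs newly flagged in this run. Nothing is ever removed.
    """
    newly_flagged = []
    for msg_id, record in archive.items():
        if msg_id not in db_ids and record.get("deleted_at") is None:
            # Keep the content; only record when the absence was first noticed.
            record["deleted_at"] = snapshot
            newly_flagged.append(msg_id)
    return newly_flagged
```

A renderer could then show flagged messages with a "deleted" badge instead of dropping them, and the same pattern could extend to edits by keeping earlier versions alongside the latest one.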

As a side note, for the demanding user who needs such refined features, it is probably already perfectly reasonable to keep a history of plain, non-incremental, independent exports run on some schedule. I am currently doing that, and the only serious drawback is the rapidly growing total size of the attachments, because every attachment is re-exported on each new run, forever.

My current workaround for the attachment explosion is periodically running the excellent https://github.com/pauldreik/rdfind on the set of my highly redundant signal-export outputs, to replace the duplicates with Unix hard links. For macOS users: please don't be tempted to run it over system/app folders as well, because it would break things, as explained here: https://developer.apple.com/forums/thread/73260.
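
For anyone curious what this kind of de-duplication does, here is a simplified Python stand-in. It is not how rdfind itself is implemented, and `hardlink_duplicates` is a made-up name; it just hashes every file under a folder and replaces later copies with hard links to the first one seen. It assumes all the exports live on the same filesystem, since hard links cannot cross filesystems.

```python
from __future__ import annotations

import hashlib
import os
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root: Path) -> int:
    """Replace duplicate files under root with hard links; return how many."""
    first_seen: dict[str, Path] = {}
    replaced = 0
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        digest = file_digest(path)
        original = first_seen.get(digest)
        if original is None:
            first_seen[digest] = path
            continue
        if os.path.samefile(original, path):
            continue  # already hard-linked together
        # Link under a temporary name first, then swap it into place,
        # so the duplicate is never missing if something fails midway.
        tmp = path.with_name(path.name + ".tmp-link")
        os.link(original, tmp)
        os.replace(tmp, path)
        replaced += 1
    return replaced
```

Try it on a throwaway copy first. Hard-linked files share their content, so editing one copy in place changes all of them, which is fine for read-only exports but not for folders that apps write to.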

I have a use case where I occasionally like to back up disappearing messages (before they disappear). The current --old behavior does not fit this use case, as far as I can tell (or maybe I used the flags wrong?). I'm not really sure what happened, but here it is:

1) I started with a backup performed on 2022-09-07 at 12:56 pm local time. If I look at ./signal-chats-new/contactname/index.html, the last entry is:
 <div class="msg">
                <span class="date">
                    2022-09-05
                </span>
                <span class="time">
                    20:31
                </span>
                <span class="sender">
                    ContactName
                </span>
                <span class="body">
                    <p>
                        Yes
                    </p>
                </span>
                <span class="reaction">
                </span>
            </div>

1.9) In the intervening time, this message passed its lifespan and disappeared from the chat in the Signal DB.
2) Then, on 2022-09-08 at 12:09 pm local time, I ran sigexport ./signal-chats-new2 --source /mnt/c/Users/Eli/AppData/Roaming/Signal/ --old ./signal-chats-new --paginate=0

3) If I look at ./signal-chats-new2/contactname/index.html, I see the following last two entries:
<div class="msg">
                <span class="date">
                    2022-09-02
                </span>
                <span class="time">
                    13:53
                </span>
                <span class="sender">
                    ContactName
                </span>
                <span class="body">
                    <p>
                        Incoming call
                    </p>
                </span>
                <span class="reaction">
                </span>
            </div>
            <div class="msg">
                <span class="date">
                    2022-09-07
                </span>
                <span class="time">
                    23:26
                </span>
                <span class="sender">
                    ContactName
                </span>
                <span class="body">
                    <p>
                        What’s your schedule like tomorrow?
                    </p>
                </span>
                <span class="reaction">
                </span>
            </div>

The disappeared message is gone. The older phone call existed in the export from step 1) as well; I just didn't include it there. I'm including it here to show clearly that the missing message sits between messages that are not missing.
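
In case it helps anyone checking for the same problem, here is a rough sketch for comparing two exported index.html files and listing messages that are in the old one but not the new one. It assumes the exact layout shown in the snippets above (a div with class "msg" containing spans with classes "date", "time", "sender" and "body", with no nested divs or spans); `MsgParser`, `parse_export` and `missing_from_new` are names I made up.

```python
from __future__ import annotations

from html.parser import HTMLParser

FIELDS = ("date", "time", "sender", "body")


class MsgParser(HTMLParser):
    """Collect (date, time, sender, body) tuples from div.msg blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[tuple[str, ...]] = []
        self._current: dict[str, str] | None = None  # message being read
        self._field: str | None = None               # span currently open

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if tag == "div" and cls == "msg":
            self._current = {name: "" for name in FIELDS}
        elif tag == "span" and self._current is not None:
            self._field = cls if cls in FIELDS else None

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            # Collapse the indentation whitespace around each value.
            row = tuple(" ".join(self._current[n].split()) for n in FIELDS)
            self.messages.append(row)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._field is not None:
            self._current[self._field] += data


def parse_export(html: str) -> list[tuple[str, ...]]:
    parser = MsgParser()
    parser.feed(html)
    parser.close()
    return parser.messages


def missing_from_new(old_html: str, new_html: str) -> list[tuple[str, ...]]:
    """Messages present in the old export but absent from the new one."""
    new = set(parse_export(new_html))
    return [m for m in parse_export(old_html) if m not in new]
```

Feeding it the contents of the two index.html files from steps 1) and 3) should list the 2022-09-05 "Yes" message if it really was dropped.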

I do like the idea of an export process that appends newer messages to an older copy of a backup, which it sounds like you're proposing as an enhancement. If my interpretation is correct, then +1 to this idea.

Edit: I also tried --old=/path/to/old/export vs. --old /path/to/old/export because I'm not familiar with Python conventions, and it failed in exactly the same way. The main page has --old /path and this issue has --old=/path.
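
On the two spellings: mainstream Python command-line parsers generally treat `--old=/path` and `--old /path` the same way, so the identical failure is what you would expect. The snippet below shows this with the standard library's argparse, which is used here purely as an illustration and may not be the parser sigexport actually uses; the option names just mirror the command above.

```python
import argparse

# Illustrative parser only; not sigexport's real command-line definition.
parser = argparse.ArgumentParser(prog="example")
parser.add_argument("dest")
parser.add_argument("--old", default=None)

with_equals = parser.parse_args(["./out", "--old=/path/to/old/export"])
with_space = parser.parse_args(["./out", "--old", "/path/to/old/export"])

# Both spellings produce the same parsed value.
print(with_equals.old == with_space.old)
```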

Is this enhancement in a quasi-implemented state? I can't currently get the merge to work; it seems to always just output the latest/current export and does not use anything from the --old path. I am creating an output from an older Signal backup folder whose last message is several months earlier than the earliest message in the current Signal DB. Both have the same config.json file, but I'm not sure what else is involved in the merge.

The output of each export works great, but the merging does not. Please let me know if I should create a separate issue, thank you.

i was able to get it working by changing line 428 from:

dir_new = old / name
to

dir_new = dest / name

but I don't know what else that will affect. That export looks like what I would expect. Not sure if it's a bug or not, thanks.

@zephyr707 Ouch, thanks for finding that bug! Will fix right now.

I don’t use it often enough myself (especially this code path), and haven’t put enough effort into the tests to confidently avoid mistakes like that… And yes the merge feature is quasi-implemented, in that I double check the results after using it because I don’t really trust it.

nice one, thanks! have to say this proj is saving me, so thanks for all your work. i upgraded to a new phone and didn't realize signal didn't migrate, so i'm piecing together my signal desktop history. i was originally going to try to merge an old backup db.sqlite with the current one, but this proved way beyond my abilities haha, so this is a really good way to preserve chat history