ragibson/SMS-MMS-deduplication

Deduplication misses messages with inconsistent timestamp precision

ragibson opened this issue · 1 comment

It looks like some backup and/or recovery agents trim timestamps to whole seconds rather than keeping millisecond precision.

That can cause two copies of the same message to be treated as different, since they technically appear to have been received at different times, even though one of them should be removed as a duplicate.
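To make the mismatch concrete, here is a minimal Python sketch. The `dedup_key` helper and its `ignore_ms` parameter are hypothetical names chosen for illustration and are not taken from the tool's source; the sketch assumes that two messages are duplicates when address, body, and timestamp all match.

```python
def dedup_key(address: str, body: str, date_ms: int, ignore_ms: bool = False) -> tuple:
    """Hypothetical comparison key (illustration only, not the tool's code)."""
    # Floor-divide by 1000 to drop the millisecond component when requested.
    date = date_ms // 1000 if ignore_ms else date_ms
    return (address, body, date)


address = "+11111111111"
body = "Some text content here"

# Same message, once with full precision and once trimmed to whole seconds.
full = dedup_key(address, body, 1600000001234)
trimmed = dedup_key(address, body, 1600000001000)
print(full == trimmed)  # False: the millisecond timestamps differ

full_s = dedup_key(address, body, 1600000001234, ignore_ms=True)
trimmed_s = dedup_key(address, body, 1600000001000, ignore_ms=True)
print(full_s == trimmed_s)  # True: both fall in second 1600000001
```

Comparing at second granularity trades precision for recall: two genuinely distinct messages with identical content sent within the same second would also collapse under this key.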

Now, when running on

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<smses count="8" type="full">
  <sms protocol="0" address="+11111111111" date="1600000001234" type="1" subject="null" body="Some text content here" />
  <sms protocol="0" address="+11111111111" date="1600000000345" type="1" subject="null" body="Some text content here" />
  <sms protocol="0" address="+11111111111" date="1600000001000" type="1" subject="null" body="Some text content here" />
  <sms protocol="0" address="+11111111111" date="1600000001789" type="1" subject="null" body="Some text content here" />
  <mms date="1700000001234" address="+11111111111~+11111111112">
    <parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
    <addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
  </mms>
  <mms date="1700000000345" address="+11111111111~+11111111112">
    <parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
    <addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
  </mms>
  <mms date="1700000001000" address="+11111111111~+11111111112">
    <parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
    <addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
  </mms>
  <mms date="1700000001789" address="+11111111111~+11111111112">
    <parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
    <addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
  </mms>
</smses>

with default arguments, nothing is deduplicated since the timestamps are different.

However, running with the --ignore-date-milliseconds flag deduplicates to

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<smses count="4" type="full">
  <sms protocol="0" address="+11111111111" date="1600000001234" type="1" subject="null" body="Some text content here"/>
  <sms protocol="0" address="+11111111111" date="1600000000345" type="1" subject="null" body="Some text content here"/>
  <mms date="1700000001234" address="+11111111111~+11111111112">
    <parts><part seq="0" ct="text/plain" name="null" chset="106"/></parts>
    <addrs><addr address="+11111111110" type="137" charset="106"/><addr address="+11111111111" type="151" charset="106"/><addr address="+11111111112" type="151" charset="106"/></addrs>
  </mms>
  <mms date="1700000000345" address="+11111111111~+11111111112">
    <parts><part seq="0" ct="text/plain" name="null" chset="106"/></parts>
    <addrs><addr address="+11111111110" type="137" charset="106"/><addr address="+11111111111" type="151" charset="106"/><addr address="+11111111112" type="151" charset="106"/></addrs>
  </mms>
</smses>

For example, the four SMS messages are correctly reduced to two: one from Sun Sep 13 08:26:40 AM EDT 2020 and one from Sun Sep 13 08:26:41 AM EDT 2020.

$ date -d @1600000000.345
Sun Sep 13 08:26:40 AM EDT 2020

$ date -d @1600000001.234
Sun Sep 13 08:26:41 AM EDT 2020
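The same check can be sketched in Python by grouping the four SMS timestamps from the example by whole second. This only illustrates the expected grouping and is not the tool's implementation; `group_by_second` is a hypothetical helper, and the fixed UTC-4 offset labeled "EDT" is an assumption made so the output matches the `date` calls above.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for EDT in this example.
EDT = timezone(timedelta(hours=-4), "EDT")

# The "date" attributes of the four SMS entries, in input order.
sms_dates_ms = [1600000001234, 1600000000345, 1600000001000, 1600000001789]


def group_by_second(dates_ms):
    """Group millisecond timestamps by their whole-second value."""
    groups = defaultdict(list)
    for date_ms in dates_ms:
        groups[date_ms // 1000].append(date_ms)
    return dict(groups)


groups = group_by_second(sms_dates_ms)
for second, members in sorted(groups.items()):
    stamp = datetime.fromtimestamp(second, tz=EDT).strftime("%Y-%m-%d %H:%M:%S %Z")
    print(stamp, members)
# 2020-09-13 08:26:40 EDT [1600000000345]
# 2020-09-13 08:26:41 EDT [1600000001234, 1600000001000, 1600000001789]
```

The first timestamp in each group (1600000000345 and 1600000001234) is the one that appears in the deduplicated output above.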