Deduplication misses messages with inconsistent timestamp precision
ragibson opened this issue · 1 comment
ragibson commented
It looks like some backup and/or recovery agents trim timestamps to second precision (rather than millisecond).
That can make two copies of the same message look different, since they technically appear to have been received at different times, but they should be deduplicated anyway.
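A minimal illustration of the mismatch, assuming the `date` attribute holds epoch milliseconds (as in the sample backup in this thread). The helper name `to_seconds` is hypothetical and only used for this sketch:

```python
def to_seconds(date_ms: int) -> int:
    """Drop the millisecond component of an epoch-millisecond timestamp."""
    return date_ms // 1000

original = 1600000001234  # timestamp with full millisecond precision
trimmed = 1600000001000   # same message after an agent trimmed it to seconds

# An exact comparison treats these as two different messages.
print(original == trimmed)                          # False
# Comparing at second precision treats them as the same message.
print(to_seconds(original) == to_seconds(trimmed))  # True
```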
ragibson commented
Now, when running on
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<smses count="7" type="full">
<sms protocol="0" address="+11111111111" date="1600000001234" type="1" subject="null" body="Some text content here" />
<sms protocol="0" address="+11111111111" date="1600000000345" type="1" subject="null" body="Some text content here" />
<sms protocol="0" address="+11111111111" date="1600000001000" type="1" subject="null" body="Some text content here" />
<sms protocol="0" address="+11111111111" date="1600000001789" type="1" subject="null" body="Some text content here" />
<mms date="1700000001234" address="+11111111111~+11111111112">
<parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
<addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
</mms>
<mms date="1700000000345" address="+11111111111~+11111111112">
<parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
<addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
</mms>
<mms date="1700000001000" address="+11111111111~+11111111112">
<parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
<addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
</mms>
<mms date="1700000001789" address="+11111111111~+11111111112">
<parts><part seq="0" ct="text/plain" name="null" chset="106" /></parts>
<addrs><addr address="+11111111110" type="137" charset="106" /><addr address="+11111111111" type="151" charset="106" /><addr address="+11111111112" type="151" charset="106" /></addrs>
</mms>
</smses>
with default arguments, nothing is deduplicated since the timestamps are different.
However, running with the --ignore-date-milliseconds flag deduplicates to
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<smses count="4" type="full">
<sms protocol="0" address="+11111111111" date="1600000001234" type="1" subject="null" body="Some text content here"/>
<sms protocol="0" address="+11111111111" date="1600000000345" type="1" subject="null" body="Some text content here"/>
<mms date="1700000001234" address="+11111111111~+11111111112">
<parts><part seq="0" ct="text/plain" name="null" chset="106"/></parts>
<addrs><addr address="+11111111110" type="137" charset="106"/><addr address="+11111111111" type="151" charset="106"/><addr address="+11111111112" type="151" charset="106"/></addrs>
</mms>
<mms date="1700000000345" address="+11111111111~+11111111112">
<parts><part seq="0" ct="text/plain" name="null" chset="106"/></parts>
<addrs><addr address="+11111111110" type="137" charset="106"/><addr address="+11111111111" type="151" charset="106"/><addr address="+11111111112" type="151" charset="106"/></addrs>
</mms>
</smses>
For example, the four SMS messages get correctly deduplicated into those from Sun Sep 13 08:26:40 AM EDT 2020 and those from Sun Sep 13 08:26:41 AM EDT 2020.
$ date -d @1600000000.345
Sun Sep 13 08:26:40 AM EDT 2020
$ date -d @1600000001.234
Sun Sep 13 08:26:41 AM EDT 2020
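The behavior above can be approximated with a short sketch. This is not the tool's actual implementation. It assumes that two SMS entries are duplicates when all of their attributes match, that the first occurrence is kept, and that ignoring milliseconds means truncating `date` to whole seconds. The names `dedup_sms` and `ignore_ms` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The four SMS entries from the sample input in this thread.
SAMPLE = """<smses count="4" type="full">
<sms protocol="0" address="+11111111111" date="1600000001234" type="1" subject="null" body="Some text content here" />
<sms protocol="0" address="+11111111111" date="1600000000345" type="1" subject="null" body="Some text content here" />
<sms protocol="0" address="+11111111111" date="1600000001000" type="1" subject="null" body="Some text content here" />
<sms protocol="0" address="+11111111111" date="1600000001789" type="1" subject="null" body="Some text content here" />
</smses>"""

def dedup_sms(root, ignore_ms=False):
    """Return the <sms> elements under root, keeping the first of each duplicate group."""
    seen = set()
    kept = []
    for sms in root.iter("sms"):
        fields = dict(sms.attrib)
        if ignore_ms:
            # Assumption: compare timestamps at whole-second precision by truncation.
            fields["date"] = str(int(fields["date"]) // 1000)
        key = tuple(sorted(fields.items()))
        if key not in seen:
            seen.add(key)
            kept.append(sms)
    return kept

root = ET.fromstring(SAMPLE)
print([s.get("date") for s in dedup_sms(root)])
print([s.get("date") for s in dedup_sms(root, ignore_ms=True)])
```

With exact comparison all four entries survive. With `ignore_ms=True` only `1600000001234` and `1600000000345` remain, which matches the deduplicated SMS output shown above.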