(No commits for a long time doesn't mean this project is stalled: it has reached a stable state and I have switched to other activities.)
When talking about ultra-long-term storage, data integrity quickly becomes a challenge. Software bugs, human errors and hardware failures are the most obvious causes of data loss... but sometimes data degrades "on its own", with no visible external cause: a phenomenon known as data decay, data rot or bit rot.
A common strategy is to keep an odd number of backups, compare them and apply a "majority wins" rule. But this leads to a costly and slow solution.
Mer-de-Glace stores a signature (a checksum) when a file is created. Users can check the health of their data by comparing this archived signature with the current state.
As a consequence, replicas are needed only to recover from failures of the master, and no longer to check data integrity.
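The principle can be sketched with standard tools. This is only an illustration, not how Mer-de-Glace is implemented: it assumes GNU md5sum is available, and the function names sign_file and check_file as well as the .md5 suffix are made up for the example.

```shell
# Record a file's signature, then check it later (principle only).
# md5sum comes from GNU coreutils; the ".md5" suffix is an arbitrary choice.

sign_file() {
    # Store the MD5 digest of "$1" next to it, in "$1.md5".
    md5sum < "$1" | cut -d' ' -f1 > "$1.md5"
}

check_file() {
    # Succeed only if the current digest still matches the archived one.
    current=$(md5sum < "$1" | cut -d' ' -f1)
    [ "$current" = "$(cat "$1.md5")" ]
}
```

A file signed once with sign_file can then be checked at any time with check_file: a failure with no legitimate modification in between is the kind of silent corruption described above.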
❗ | See the Use Cases directory for comprehensive examples. |
---|
Mer-de-Glace is a command-line tool designed to run on slow headless machines: my goal was to recycle obsolete boxes as backup servers. Unfortunately, mainstream Linux distributions are dropping support for such old hardware, which leads to compilation problems because Mer-de-Glace requires a C++20-compliant compiler.
But alternatives exist:
- Some distributions, like TinyCoreLinux, still support old hardware (32-bit, low memory, low processor power). However, before attempting to install from source, check whether a binary package already exists (I will make one for TinyCoreLinux x86-32b).
- At the extreme, the Mer-de-Glace binary can be installed on a single machine only: states will then be calculated remotely (as explained in "use cases").
- OpenSSL (development version)
- C++20-compliant toolchain
- From a scratch directory, clone the Mer-de-Glace repository:
git clone https://github.com/destroyedlolo/Mer-de-Glace.git
cd Mer-de-Glace
make
You will get 2 binaries:
- MerDeGlaced is the master daemon managing the data
- MdG is the command-line tool
With:
- rootDirectory= the root directory of the documents to track
- DBFile= where the state backup is stored
💡 Different kinds of data (photos, music, films)? Run one MerDeGlaced for each of them, each using its own configuration file with a dedicated rootDirectory, DBFile and rendez-vous.
MerDeGlaced &
Note: loading an existing backup for a large amount of data may take a long time. Add the verbosity option -v to know when the application is ready. Alternatively, watch for the "rendez-vous" socket, which is created only once the daemon is ready.
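For scripted start-up, waiting for the socket could look like the sketch below. It is a minimal example under assumptions: the function name wait_for_socket is made up, and the socket path has to be the rendez-vous configured for your daemon (the RENDEZVOUS variable in the usage comment is a placeholder, not a Mer-de-Glace setting).

```shell
# Wait until a Unix socket exists, or give up after a timeout (in seconds).
# Usage: wait_for_socket /path/to/socket [timeout]
#
# Typical use (RENDEZVOUS is a placeholder for your configured socket path):
#   MerDeGlaced &
#   wait_for_socket "$RENDEZVOUS" 300 && ./MdG scan
wait_for_socket() {
    sock=$1
    remaining=${2:-60}
    while [ ! -S "$sock" ]; do
        # Stop polling once the timeout is exhausted.
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```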
./MdG scan
Retrieve the current status of your monitored directory tree.
This can take a long time, depending on your disk speed, CPU power, and the number and size of the files to handle.
./MdG report
[D][Deleted] /home/laurent/Images/Brute/_AArchiver/test/tst
[F][Deleted] /home/laurent/Images/Brute/_AArchiver/test/tst/truc
- If it's the initial scan, it will report all files as [Created]: you're starting from an empty database, so all files seem new.
- If it's not the initial scan, each discrepancy needs to be investigated: accept those that are legitimate.
💡 | Accepting the deletion of a directory also commits the deletion of all its sub-objects. |
---|
Note:
- Don't forget to save the state after validating all the discrepancies, otherwise they will reappear when the daemon is restarted.
- It's not possible to validate checksum issues: they highlight potential hardware problems that can lead to severe data loss.
./MdG save
Note: duplicates are detected by comparing numerical signatures, not the files themselves. It's up to YOU to decide whether some cleaning is needed.
./MdG -f ~/Config/Musiques.mdg duplicate
Potential duplicate found :
/mnt/sda4/Musiques/Noir Désir/Noir Désir - 1993 - Tostaky/03 - Oublié.mp3
/mnt/sda4/Musiques/Noir Desir/1993 - Tostaky/03 - Oublie.mp3
Potential duplicate found :
/mnt/sda4/Musiques/Noir Désir/Noir Désir - 1994 - Dies Irae/11 - It Spurts.mp3
/mnt/sda4/Musiques/Noir Desir/1994 - Dies Irae/11 - It Spurts.mp3
Potential duplicate found :
/mnt/sda4/Musiques/Noir Désir/Noir Désir - 2001 - Des Visages Des Figures/Noir Désir - 2001 - Des Visages Des Figures - Back.jpg
/mnt/sda4/Musiques/Noir Desir/2001 - Des Visages Des Figures/Noir Désir - 2001 - Des Visages Des Figures - Back.jpg
...
I made the mistake of converting my CDs to MP3 twice, using a different naming convention each time: I have to delete the ".../Noir Désir" directory (using the shell's rm -rf, a graphical interface or whatever).
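Because potential duplicates are matched on signatures only, a byte-for-byte comparison before deleting anything can be reassuring. The sketch below is not part of Mer-de-Glace: the function name confirm_duplicates is made up, and it assumes the listing format shown above (a "Potential duplicate found" line followed by absolute paths, one per line, without leading spaces).

```shell
# Read a "duplicate" listing on stdin and compare each file of a group
# with the first file of that group, byte for byte (cmp -s is silent).
confirm_duplicates() {
    first=
    while IFS= read -r line; do
        case $line in
            "Potential duplicate found"*)
                # A new group starts: forget the previous reference file.
                first= ;;
            /*)
                if [ -z "$first" ]; then
                    first=$line
                elif cmp -s -- "$first" "$line"; then
                    printf 'identical: %s\n' "$line"
                else
                    printf 'DIFFERENT: %s\n' "$line"
                fi ;;
        esac
    done
}
```

It could be fed with something like ./MdG -f ~/Config/Musiques.mdg duplicate | confirm_duplicates; any other line, such as the trailing "...", is simply ignored.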
To speed up the operation, restrict the scan to the directory that changed.
./MdG -f ~/Config/Musiques.mdg restrict "/mnt/sda4/Musiques/Noir Désir/"
And finally, launch a new scan.
./MdG -f ~/Config/Musiques.mdg scan
./MdG -f ~/Config/Musiques.mdg report
[D][Deleted] /mnt/sda4/Musiques/Noir Désir
[D][Deleted] /mnt/sda4/Musiques/Noir Désir/Noir Désir - 1994 - Dies Irae
[F][Deleted] /mnt/sda4/Musiques/Noir Désir/Noir Désir - 1994 - Dies Irae/13 - The Holy Economic War.mp3
[F][Deleted] /mnt/sda4/Musiques/Noir Désir/Noir Désir - 1994 - Dies Irae/21 - I Want You (She'S So Heavy).mp3
...
./MdG -f ~/Config/Musiques.mdg accept '/mnt/sda4/Musiques/Noir Désir'
./MdG -f ~/Config/Musiques.mdg save
MdG is the command-line client used to communicate with the MerDeGlaced daemon.
./MdG [-opt] command [arguments ...]
with
- ./MdG -h to get the list of supported options.
- ./MdG help to get the list of commands known by the daemon.
MdG issues the following return codes:
- 0: everything went right
- 100: something has been found by the report or duplicate commands
- 1: technical issue
So the following code can be used for automation needs:
if ./MdG -f tst.conf report > /dev/null
then
    echo "everything is fine"
else
    if [[ $? -eq 100 ]]
    then
        echo "something found"
    else
        echo "technical issue"
    fi
fi
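When one daemon runs per kind of data, the same return codes can be aggregated over several configuration files. This is a sketch under assumptions: report_all and the MDG variable are names invented for the example, and any return code other than 0 or 100 is treated as a technical issue.

```shell
# Path to the client; override it if MdG is installed elsewhere.
MDG=${MDG:-./MdG}

# Run "report" against each configuration file given as argument and
# return the worst status seen: 0 (clean), 100 (findings) or 1 (failure).
report_all() {
    worst=0
    for conf in "$@"; do
        rc=0
        "$MDG" -f "$conf" report > /dev/null || rc=$?
        if [ "$rc" -eq 100 ]; then
            # Findings only matter if no failure was seen before.
            if [ "$worst" -eq 0 ]; then worst=100; fi
        elif [ "$rc" -ne 0 ]; then
            # Assumption: anything else than 0 or 100 is a technical issue.
            worst=1
        fi
    done
    return "$worst"
}
```

For example, report_all ~/Config/Musiques.mdg tst.conf returns 100 if at least one report found something and none of them failed.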
Discrepancies are reported as:
[D][Created] /home/laurent/Images/Brute/_AArchiver/test/new
[F][Changed] /home/laurent/Images/Brute/_AArchiver/test/toto
With
- the object type can be [D] for a directory or [F] for a file.
- [Created], [Deleted], [Changed] as the names say
- [Replica only], [Master only], [Discrepancy] while comparing an alternate root
- [Bad CS] the checksum doesn't correspond to the signature, highlighting server corruption
- [ERROR] an issue has been encountered while processing (typically, a file that is not readable)
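For large reports, a per-status count gives a quick overview. The helper below is only a sketch: summarize_report is a made-up name, and it assumes every report line starts with the [D] or [F] type followed by the bracketed status, as in the examples above.

```shell
# Count report lines per status, reading the report on stdin.
# Lines are expected to look like: [F][Changed] /some/path
summarize_report() {
    # Keep only the status tag, then count identical tags.
    sed -n 's/^\[[DF]\]\[\([^]]*\)\].*$/\1/p' | sort | uniq -c |
    while read -r n status; do
        # Re-print without the padding added by uniq -c.
        printf '%s %s\n' "$n" "$status"
    done
}
```

It can be used as ./MdG report | summarize_report, which prints one "count status" line per status found.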
Mer-de-Glace keeps the state of the files in memory. You can (and have to) save it in order to retrieve it at restart and check whether the data remains safe.
Note:
- When the state is saved using the save command, all file and directory creations are de facto accepted.
- Modifications and deletions remain pending, as they may highlight a storage issue.
The RESET command resets the state of each file and directory to a clean item.
As discrepancies will be lost, this command is dangerous and needs to be used with caution!
👀 | After a RESET, the in-memory state is no longer consistent until the next scan, and all identified discrepancies are lost. If in doubt, restart MerDeGlaced without saving and then launch a scan: this resets the data to match the real situation. |
---|
Mer-de-Glace keeps internal checksums to ensure that neither the in-memory state nor the backup is corrupted. On very rare occasions, they need to be rebuilt: that is the purpose of RECS (for "recalculate checksum").
This command is very dangerous, as a checksum discrepancy is proof that something is going very wrong (disk being corrupted, memory fault, hardware failure, ...). Consequently, this command is allowed ONLY if the daemon has been started in debug mode. |
---|
This is the list of identified tasks/behaviors.
- data management
  - Recursively scan a directory with MD5 checksum (v0.1)
  - smart status reset before scanning to avoid the need for an explicit RESET command (v0.9)
  - Save / load state (v0.2)
  - Restrict scanning to a sub directory (v0.3)
  - re-scan and issue a report (v0.4)
  - Accept a discrepancy (v0.6)
  - Guess duplicate entries (v0.8)
  - verify in-memory and backup integrity (v0.7)
  - Can use alternate root (v0.8)
- interfaces
  - accept commands via a socket (v0.4)
  - grouped acceptance (i.e.: accept all deletions, all creations, all modifications, ...). Restrictions apply.
  - daemonize (avoid exiting as much as possible in case of issue) (v0.11)
  - Command line tool (v0.5)
  - long-running commands are aborted when the client connection is lost (v0.11)
  - Shell file name completion
  - Generate return code to make automatic scripts easier (v0.10)
- for the future
  - access to remote stats (is it really useful? Mounting the remote FS and using an alternate root already does the job, see Use Cases)
  - local configuration file (à la .access)
  - versioning
  - Daemon dashboard: GUI, replicas auto-discovery, status of each replica, central place for replication management ...
- Questionable: stuff I'm thinking about, but which would have big impacts, raise issues or potentially not be useful.
  - Asynchronous actions ➡️ Would require a deep architecture review and make the source code more complex (semaphores, how to handle the processing of a file when the user has already asked for its deletion, ...). And in any case, as disk I/O is definitely the bottleneck, is it really useful?
  - File system notification ➡️ preliminary tests show that notification is not fully reliable. As processing a file may take a long time, asynchronous actions would be needed, and probably an action queue as well. Frankly speaking, it would also encourage user laziness, leading to less frequent full scans.
Mer-de-Glace is covered by the Creative Commons BY-NC license, which prevents vampires from abusing open-source developers' kindness: please raise a ticket if you want to integrate it into a commercial product.
Feel free to participate: code improvements, new feature implementations, beer for the developers, gifts, thank-you messages 👏 ... Contributions help keep projects alive.