boramalper/magnetico

Community Database Dump

anindyamaiti opened this issue · 64 comments

I started fresh last month and have ~1.7M torrents in my database after about 3 weeks. I plan to keep my magneticod running for the foreseeable future, expecting to add about 50K per day after the initial spike levels off.

There have been requests for a database dump before, but to my knowledge no one has shared theirs. So I thought I'd take the initiative. Here is my website, where I will share my database dump (via .torrent) 1-2 times a month: https://tnt.maiti.info/dhtd/

You can use it as-is or to get a head start with magneticod. And, don't forget to seed!

Huge thanks to @boramalper for making this project happen.

I will soon create a page under https://kescher.at/magneticod-db for sharing my own SQLite database backups.

Thanks @kescherCode. I suggest you take the same approach of sharing via .torrent.

As far as GitHub is concerned, I am confident that sharing an external webpage that contains .torrent files of self-created databases is not in violation of any rules. Neither the webpage posted here nor the torrent/database contains any copyrighted material.

That's good news.

Is anyone brave enough to write a small script to merge 2 databases?

That's good news.

Is anyone brave enough to write a small script to merge 2 databases?

https://github.com/kd8bny/sqlMerge

Alternatively:
https://stackoverflow.com/a/37138506/11519866

I have put up a page at https://kescher.at/magneticod-db now :)

That's good news.
Is anyone brave enough to write a small script to merge 2 databases?

https://github.com/kd8bny/sqlMerge

Alternatively:
https://stackoverflow.com/a/37138506/11519866

There is a real issue here.
First, the sqlMerge repository doesn't work even after merging kd8bny/sqlMerge#3 and kd8bny/sqlMerge#4, since it relies on str(row), which doesn't work on BLOBs.

Then I explored manual merging. Obviously, torrents.id needs to be modified, but the issue is that there is a foreign key from files.torrent_id to torrents.id, so the script has to:

  • Read a record from merged_db
  • Insert it into original_db and record the insertion id
  • Modify merged_db.files.torrent_id according to the new id, and merge the merged_db.files rows.

This is not so difficult, but I don't think it can be done with a generic script.
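
For the magnetico schema specifically, a minimal sketch of that procedure could look like the following. The column names (info_hash, name, total_size, discovered_on, size, path) are assumptions based on the stock tables and should be checked against your own database before running anything.

import sqlite3

# Hedged sketch: merge one magnetico SQLite database into another, remapping
# files.torrent_id to the id newly assigned in the destination database.
src = sqlite3.connect("merged_db.sqlite3")
dst = sqlite3.connect("original_db.sqlite3")

for info_hash, name, total_size, discovered_on, old_id in src.execute(
        "SELECT info_hash, name, total_size, discovered_on, id FROM torrents"):
    cur = dst.execute(
        "INSERT OR IGNORE INTO torrents (info_hash, name, total_size, discovered_on) "
        "VALUES (?, ?, ?, ?)", (info_hash, name, total_size, discovered_on))
    if cur.rowcount == 0:
        continue  # same infohash already present: skip this torrent and its files
    new_id = cur.lastrowid
    for size, path in src.execute(
            "SELECT size, path FROM files WHERE torrent_id = ?", (old_id,)):
        dst.execute(
            "INSERT INTO files (torrent_id, size, path) VALUES (?, ?, ?)",
            (new_id, size, path))

dst.commit()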

It may be ideal to compress the database before sharing.

Compressing with LZMA2 (7zip) on the 'fastest' preset (64K dictionary, 32 word size) yields a file 23.3% of the size of the uncompressed database.

Compressing on the 'normal' preset (16M dictionary, 32 word size) yields a file 15-18.2% of the size of the uncompressed database.

I'd suggest using the following command on Linux systems with xz-utils:

xz -7vk --threads=12 database.sqlite3

-7 is the compression level (out of 9)
-v is for 'verbose'; it shows you the progress of the compression in your TTY
-k is for 'keep', meaning it won't delete your database after the compression is done >_>
--threads=12 tells xz to use 12 threads; you can use --threads=0 to use as many threads as there are CPUs on your server.

This will produce a file named database.sqlite3.xz

It can then be decompressed after being downloaded using unxz -v database.sqlite3.xz (or -vk if you want to keep the archive).

I've taken the liberty of compressing both of the shared databases so far: https://public.tool.nz/DHT/

@AlphaDelta I agree, and I will compress my future torrents containing a database.sqlite3 with xz, though probably with the most aggressive preset you've ever seen:

xz --lzma=dict=1536Mi,nice=273

@kescherCode That is indeed the most aggressive preset I've ever seen 😂

Just keep in mind the entire dictionary is loaded into memory when decompressing, so it would require allocating 1536MiB of memory just to decompress the database.

Probably not worth riding the exponential-cost train that far.

That's a very significant size reduction using xz!

I will share both compressed and uncompressed versions starting next time. Those who host magneticow on a low-memory VPS may want to download the uncompressed database directly.

Is anyone brave enough to write a small script to merge 2 databases?

BTW I was writing a simple and dirty tool to migrate old magneticod (Python) database data to the new version (including PostgreSQL and any other supported engine). It's not optimized yet and uses too much memory (it loads all torrents from the database at once).

But if you're not scared of bad and not optimized code you can try to use it.

UPD: I just pushed a small README.md update and added the ability to use the not-yet-merged postgres and beanstalk engines, as well as upstream's sqlite3 and stdout.

Here it is: https://framagit.org/Glandos/magnetico_merge
It's hosted on another GitLab instance, but you can log in with your GitHub account if you want to contribute. If you want to fork it on GitHub, please let me know so I can follow your improvements :)

@Glandos It's worth mentioning that it works only with SQLite databases.

Yes indeed, but it is the only database currently supported :) And the only database that is shared. Sharing a PostgreSQL database for merging is not complex, but it is different.

Merging Magneticod bootstrap 2019-10-14/database.sqlite3 into database.sqlite3
Gathering database statistics: 4835749 torrents to merge.
  [######################################################]  4836000/4835749
Committing… OK. 4835749 torrents processed. 2820832 torrents were not merged due to errors.

Here it is. Now I have a big merged database combining both of your databases: more than 7 million torrents with more than 216 million file entries.
I will share this database when I have time.

7 million torrents

Did you not remove duplicates? I would guess that there would be significant overlap between the databases.

There is a lot of overlap, as you can see in the merge report: 2820832 torrents were not merged due to errors. The message is not clear, but it usually means that a constraint was violated and the insert was skipped.

Did you not remove duplicates?

What do you call a duplicate? Torrents with the same infohash cannot be inserted again.
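
To get an idea of the overlap before merging, a small check like the one below counts the infohashes shared by two dumps. It assumes the stock torrents table with an info_hash column, so treat it as a sketch rather than a ready-made tool.

import sqlite3

# Hedged sketch: count how many infohashes two magnetico SQLite dumps share.
conn = sqlite3.connect("database_a.sqlite3")
conn.execute("ATTACH DATABASE 'database_b.sqlite3' AS other")
(overlap,) = conn.execute(
    "SELECT COUNT(*) FROM torrents t1 "
    "JOIN other.torrents t2 ON t1.info_hash = t2.info_hash").fetchone()
print(f"{overlap} torrents appear in both databases")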

Here is mine: https://antipoul.fr/dhtd/ This is very basic.
It includes databases from https://tnt.maiti.info/dhtd/ and https://kescher.at/magneticod-db at the time of writing.
It is huge (21GB after decompression), but it works on my Atom D2550, so it should work anywhere.

@anindyamaiti Thanks for your regular updates. Your page is very nice. Do you think you can add an RSS feed? I know it's another thing to do :)

Pinned! I think once we implement import & export functionality, it'd be even easier (and portable across different databases). =)

Closing because it's not an issue but feel free to keep the discussion & sharing going.

Do you think you can add an RSS feed?

@Glandos I was thinking of the same. Here is a basic automated RSS feed of the ten most recently added files: https://tnt.maiti.info/dhtd/rss.php

Nothing fancy, just the filenames, but it should be good enough for a notification.

If anyone else is interested in incorporating RSS for their shares, here is my (dirty) PHP code: https://tnt.maiti.info/dhtd/rss.php.txt

Pinned!

@boramalper thanks for the pin! 😊

@boramalper suggested I point to torrents.csv, an open repository of torrents / global search engine. Here's the issue for potentially adding people's data to this.

I have a new dump of my own: https://antipoul.fr/dhtd/20200203_9.2M_magnetico-merge.torrent

Since my server is really low on CPU, I didn't use xz and switched to zstandard. The output is larger, but much faster to compress/decompress.

No tracker inside, so I will be the first swarm.

I, too, have released a new dump on https://kescher.at/magneticod-db.
It has around 6.4 million torrents in it.
It uses zstandard compression from now on as well.

Obviously not relying on trackers either, just the DHT.

In case your client allows manual adding of peers and your client doesn't seem to find a connection, feel free to add 185.21.216.171:55451 as peer.

Also, I may seed other dumps here in order to increase availability for people who want to bootstrap their db, which is why I call my torrents "Magneticod bootstrap".

I have released a new dump, having roughly 10.8 million torrents.

You can get it here.

If you can't find any peers through DHT, add 185.21.216.171:55451 as peer if your client allows it.

19h commented

@kescherCode could you update your dump? I'd offer to host it as a direct download on one of my servers.

I considered sharing my version, but figured your public magnetico instance has over 11.3 million torrents now, which makes my 8 million look rather pale in comparison.

@19h I will soon create a new torrent, yes. However, do feel free to share your dump as well, as dumps can be combined together to create a bigger database. Some people find torrents my instance can't find ;)

If you're interested, I can share a dump of an instance which uses PostgreSQL to store the collected data.

magneticod=# SELECT COUNT(id) FROM magneticod.torrents;
  count   
----------
 14546639
19h commented

@kescherCode I tar-balled my dump for you here: https://r1.darknet.dev/magnetico-20200705-22b4c048c924e825d147a3fce7cb43f826fae221.tar (24'993'610 KB ~ 24G; the server has an unmetered 10Gbit uplink, go wild).

@skobkin I'd love to have that! If we merge all our dumps, we may get a new super database. Would be exciting!

@skobkin I'll be glad to try to merge a dump from your postgres. I wrote the merger in Python, and it can only read sqlite for now. I guess that importing the dump into postgres and then merging would be too heavy, so I'll try to read the dump itself. But for now, the output will still be sqlite3.

EDIT: if you can go with pg_dump in custom format, I'll be able to use https://github.com/gmr/pgdumplib

@19h I am currently merging your dump, thanks! But next time, you should at least compress it with gzip or zstd :)

19h commented

@Glandos sorry about that. The server has more bandwidth than CPU.. :-)

@Glandos The dump is being made right now, but it's in plain format. I can probably make another dump in custom format later.

I've also thought that I could add an import/export feature to my magnetico-web. As I said in #197, the already-supported JSON Lines format would be one of the best solutions for maximum interoperability between database backends and other implementations, even ones not working with magnetico itself.

I've already created an issue for myself :)
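
To illustrate the JSON Lines idea, here is a minimal sketch that dumps torrents from a stock SQLite database as one JSON object per line. The output field names and the table/column names are assumptions for illustration, not the actual magnetico-web format.

import json
import sqlite3
import sys

# Hedged sketch: export a magnetico SQLite database as JSON Lines on stdout.
conn = sqlite3.connect("database.sqlite3")
for info_hash, name, total_size, discovered_on in conn.execute(
        "SELECT info_hash, name, total_size, discovered_on FROM torrents"):
    sys.stdout.write(json.dumps({
        "infoHash": info_hash.hex(),  # info_hash is stored as a BLOB
        "name": name,
        "size": total_size,
        "discoveredOn": discovered_on,
    }) + "\n")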

OK, my server also has a small CPU (Atom D2550), but here is my fresh dump, consolidated with all known databases to date: https://antipoul.fr/dhtd/20200706_13.6M_magnetico-merge.torrent

19h commented

@Glandos Amazing! Thanks!

@Glandos I almost wrote you a question about the possibility of implementing JSON export in your tool, to be able to convert SQLite databases to JSON files.

But then I realized that I already wrote a migration tool which uses magnetico's own database persistence layer to store data. So it should also work with your SQLite dumps, and if it doesn't, it should be easy to fix, because there is only a very slight difference between the old and new magnetico database schemas.

When I have time, I'll try to download one of your dumps and check if it works with my migrator tool.

If it works, then we should also be able to export any (new) magnetico database to any of the supported formats: JSON (stdout), PostgreSQL (postgres, if using my fork), or any other backend that gets implemented in magnetico.

In the end I should be able to convert your SQLite dumps to JSON using the stdout driver and then import them into my instance. So I'll be able to benefit from these community dumps too :)

19h commented

@Glandos I can't find any peers to download from -- can you provide me with a direct download link? I'll make sure it's seeded.

@Glandos I remembered yesterday that I have backups of that database in pg_dump custom format. So here it is:
https://mega.nz/file/U15QSCpD#DzCfMNQNRJX21vkGb6gbcAMWLf6ZFmg4ej7JsMXDsAc

Let me know when I can delete it.

@Glandos Your torrent has no peers.

Also, can you take a look at the merge requests for magnetico_merge? I made two a while ago.

@19h Small hint for the future: if you want to share your sqlite3 file, be sure to manually open it with sqlite3 and execute PRAGMA wal_checkpoint(TRUNCATE); first. That way, the -shm and -wal files are written into the database and deleted afterwards.
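
If you prefer to script it, the same checkpoint can be run from a few lines of Python; the file name here is just a placeholder.

import sqlite3

# Hedged sketch: flush the WAL into the main database file before sharing it,
# so the -wal and -shm sidecar files are no longer needed.
conn = sqlite3.connect("database.sqlite3")
conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
conn.close()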

@19h and @kescherCode I recreated the torrent with announce URIs in it. My client announced it, so now you should be able to find me. But you need to reimport the torrent (from https://antipoul.fr/dhtd/20200706_13.6M_magnetico-merge.torrent) as I've updated it.

@skobkin I can't download from MEGA because the file is too big, and it requires me to install extra software. I won't do that, sorry :)

@Glandos I'm removing it then 🤷

My latest dump, containing 13.8 million torrents.
Magnet link available separately here

I will make sure this file is well-seeded by a fast connection as well as my home connection.

@19h see updated dump above.

19h commented

@Glandos @kescherCode that's amazing, thanks both of you!

19h commented

@Glandos @kescherCode jfyi I fetched both dumps and my seedbox is seeding them.

19h commented

I'm seeding your torrents.

Also ... I'm currently writing a merging tool in Rust so that it's a bit faster, but my ideal future for this would be migrating off SQLite to LevelDB (or the Facebook fork, RocksDB). I'm also playing with the idea of building a frontend that searches the database using tantivy, but that's a bit of a stretch goal ..

It would be cool if we could have a semi-DHT where we can interconnect our instances so that they act as isolated satellites for each other..

How do I install this project on a VPS?

19h commented

@sunnymme this isn't the right place for this question. Check the readme, check other issues or create one ..

DyonR commented

Here is the dump of my database. 2.64M torrents. My database is not merged with any other database.
Compressed zst file is 2.5GB. The sqlite3 file is 8.3GB.
You can find my database at https://dyonr.nl/magnetico/
Preferably, use the .torrent to download it instead of downloading the .zst file directly.
The torrent is loaded on my seedbox (1Gbit/s); the .zst is on my server, which is limited to 200Mbit/s.

@DyonR I'm seeding your database now; I won't merge it in until the next time I dump.

Here is my fresher dump: https://antipoul.fr/dhtd/20201112_14.1_magnetico-merge.torrent

Unfortunately, discovery seems to be stalling a bit. Sometimes I get 0.1 torrents per second, but it is usually 10 times less…

Since some of you have millions of torrents, maybe you are interested in adding support for other databases that scale better than SQLite. Some users are having request timeouts in magneticow due to poor SQLite performance.

I don't have time to work on this issue, but maybe some of you do. Jackett/Jackett#10174 (comment)

UPDATE: Of course, having a faster backend will increase the discovery/indexing speed too. There is an attempt to add Postgres support, but I think it's abandoned: #214

@ngosang It's not abandoned; it's been working for me for more than a year now 😄

I just forgot about it because Bora didn't answer my question. I think I can make the last change he asked for soon, but I'm not sure he'll merge it, because he hasn't been supporting magnetico for a long time.

UPD: You can test it using this Docker image: https://hub.docker.com/r/skobkin/magneticod

@ngosang commented on 15 Nov 2020 at 13:01 UTC+1:

UPDATE: Of course, having a faster backend will increase the discovery/indexing speed too.

Since magneticod is not using 100% of a CPU, I don't think this is the current bottleneck.

@Glandos SQLite is really a bottleneck sometimes. It won't use 100% of the CPU because it's most likely using 100% of the disk.
You can probably tweak SQLite when initializing the client to use very big caches and so on, but I'm not sure it would outperform MySQL or PostgreSQL.

I don't have time to check it, so I may be wrong. If someone can check the disk usage (IOPS, throughput, latency) when searching torrents in a VERY LARGE database, let us know.
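
For reference, the kind of tweaking mentioned above usually means setting a few PRAGMAs when the connection is opened. The sketch below is illustrative only; the values are placeholders, and this is not what magneticod does by default.

import sqlite3

# Hedged sketch: SQLite PRAGMAs commonly used to favour read performance on a
# large database. Tune the values to your hardware.
conn = sqlite3.connect("database.sqlite3")
conn.execute("PRAGMA journal_mode=WAL;")      # readers don't block the writer
conn.execute("PRAGMA cache_size=-1048576;")   # ~1 GiB page cache (negative = KiB)
conn.execute("PRAGMA mmap_size=8589934592;")  # allow up to 8 GiB memory-mapped I/O
conn.execute("PRAGMA temp_store=MEMORY;")     # keep temporary tables in RAM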

From my experience as a software architect, if you have a 10GB database, SQLite read performance is between 100 and 1000 times slower than other relational databases like MySQL, Postgres, or Oracle.
If the entire database does not fit in memory, then every database has to read from disk at some point. The difference is that SQLite does not keep several levels of cache in memory with the table indexes, most common queries, etc., so with each query you have to read much more data from disk than with other databases. I saw 32GB exports in this thread. You should notice an amazing improvement in both indexing and search.

BTW, I've just updated the PostgreSQL PR, eliminating the last "problem" that was pointed out a year ago.

It was merged!

@skobkin Now, how do I migrate my data from SQLite to Postgres? lol

@kescherCode See this comment.

Be aware that magneticow does not work with PostgreSQL as of now.