Rust program for extracting most URLs from Discord scrapes. Works with Discord History Tracker, discard2, and DiscordChatExporter.

discord-urls-extractor

Maintenance Level: Updated as I need it.

I'm actually using this program in production! Updates are made when I'm interested, or when they're necessary.

Information

Created for ingesting URLs from DHT scrapes (and now discard2 ones, too!) (oh wait: now DiscordChatExporter ones, three!) into the Archiveteam URLs project. But you can use the URL lists for whatever you please! (Except DDoSing. Please don't do that. I don't want to be your partner in crime, believe it or not.)

Might not extract all URLs correctly. #8 improved on this, though.

⚠️ Every time you run the script, it will overwrite urls.url with the URLs from the current script. A good idea is to use a separate directory for each scrape (Cargo's --manifest-path can help with that) or back up the urls.url file before you run the script.

Tip: ignores.url is a list of URLs that should NOT be extracted. This allows you to run the script on different scrapes without having duplicates. It's also helpful if you have previously used other tools to scrape URLs - just add the URLs you scraped to ignores.url and they won't be scraped again! It's not perfect, but it should work 99% of the time. ignores.url is read into memory, new URLs are added to it, and then ignores.url is overwritten with the loaded values. As such, don't modify ignores.url while the script is running - your changes will have no effect, and when the script finishes they will be overwritten!
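The load, merge, and overwrite cycle described above could be sketched roughly like this. This is a minimal illustration, not the program's actual code: the function name filter_and_update_ignores is hypothetical, and it assumes ignores.url holds one URL per line.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not taken from the real program): returns the
/// candidates that are not yet in the ignore file, then rewrites the ignore
/// file with the old and new URLs combined. Assumes one URL per line.
fn filter_and_update_ignores(ignore_path: &Path, candidates: &[&str]) -> io::Result<Vec<String>> {
    // Load the whole ignore list into memory; a missing file means "ignore nothing".
    let mut ignored: HashSet<String> = match fs::read_to_string(ignore_path) {
        Ok(text) => text.lines().map(str::to_owned).collect(),
        Err(e) if e.kind() == io::ErrorKind::NotFound => HashSet::new(),
        Err(e) => return Err(e),
    };

    let mut fresh = Vec::new();
    for url in candidates {
        // `insert` returns true only when the URL was not already present.
        if ignored.insert((*url).to_owned()) {
            fresh.push((*url).to_owned());
        }
    }

    // Overwrite the file from the in-memory set. Any edits made to the file
    // while this function was running are lost at this point.
    let mut all: Vec<&str> = ignored.iter().map(String::as_str).collect();
    all.sort_unstable();
    let mut out = all.join("\n");
    if !out.is_empty() {
        out.push('\n');
    }
    fs::write(ignore_path, out)?;
    Ok(fresh)
}
```

Because the file is only read once at the start and written once at the end, anything you change in it between those two points never reaches the in-memory set, which is why manual edits during a run get clobbered.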

(Also, I think this goes without saying, but don't run the app multiple times at the same time in the same folder. That's a recipe for turnabout disaster.)

Usage with Discord History Tracker (DESKTOP APP ONLY)

⚠️ Note that the DHT extractor is pretty much unmaintained since I no longer use it. I'll fix bugs, but it doesn't support embeds or any new DHT features. It only supports extracting attachment URLs (which can now be done via DHT!), finding URLs in messages using a regex, and getting avatar URLs.

Once you've got your .dht file, run:

cargo r <file path> dht

Of course, if the file path has spaces, wrap it in quotes (") as your shell requires.

Example: cargo r /home/thetechrobo/Discordbackups/dsicord_data/SteamgridDB/SteamGridDB.dht dht

Obviously replace the file path with the actual one. Unless, of course, you have the exact same path as me - in which case, twinsies!

Usage with discard2

⚠️ These steps have changed! (The old steps will still work, but the new ones are simpler and get more URLs.)

  1. Use the raw-jsonl reader from discard2. Save its output into a file. Do not name that file urls.url or ignores.url.
  2. Run cargo r messages.jsonl discard2, assuming you named the file in Step 1 messages.jsonl.
  3. Use a program to get rid of any duplicate URLs. (There shouldn't be any, but I'm not perfect.) On *nix you can use sort -u, or sort followed by uniq (uniq on its own only removes adjacent duplicates).
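If you'd rather keep the URLs in their original order while removing duplicates, something like the following would do it. This is only a sketch: dedup_lines is a hypothetical helper, not part of this program, and it assumes one URL per line.

```rust
use std::collections::HashSet;

/// Hypothetical helper: drops blank lines and repeated lines, keeping the
/// first occurrence of each and preserving the original order.
fn dedup_lines(input: &str) -> Vec<&str> {
    let mut seen = HashSet::new();
    input
        .lines()
        .filter(|line| !line.is_empty())
        // `insert` returns false for a line that has been seen before.
        .filter(|line| seen.insert(*line))
        .collect()
}
```

You would call it on the contents of urls.url (for example via std::fs::read_to_string) and write the returned lines back out to a new file.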

To get even more data (server emojis, role icons, and more), pass --parse-websockets. Note that you then have to specify the --guild-id (server ID; can be found in the state.json file or by right clicking on the server in Discord's UI and hitting "Copy ID") because I'm wayyy too lazy to try to autodetect what server the crawl is in.

Usage with DiscordChatExporter

⚠️ DiscordChatExporter's extractor will use a ton of memory for large channels. This is due to both limitations in the file format, and limitations in the JSON library I'm using.

⚠️ If you go this route, you CANNOT run DiscordChatExporter with the --media option! Doing so will replace the URLs in the JSON with the path to the local file, which will cause the URL list to have paths to local files instead of HTTP resources. So don't, I don't know, do a day-long crawl of a huge server until you realise that the attachment URLs are all screwed up. (Ask me how I know.)
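A quick way to catch this early is to check whether a URL list contains anything that isn't an HTTP(S) URL. The sketch below is an assumption-laden illustration: non_http_lines is a hypothetical helper, not something this program ships, and it simply flags every non-empty line that doesn't start with http:// or https://.

```rust
/// Hypothetical sanity check: returns the lines that do not look like
/// HTTP(S) URLs, e.g. local file paths left behind by a media download.
fn non_http_lines(url_list: &str) -> Vec<&str> {
    url_list
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .filter(|line| !(line.starts_with("http://") || line.starts_with("https://")))
        .collect()
}
```

If this returns anything for the contents of urls.url, the export is worth a second look before you commit to a long crawl.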

DiscordChatExporter is now supported! You can only run one channel at a time, though, and you must use the JSON output format. CSV may be supported in the future. Usage is:

cargo run /path/to/channel.json dce

To run an entire folder of JSONs, you could run a script. For example, here's the script I use (tested on zsh, probably won't work on Windows, might work on bash):

for i in *.json ; do echo "$i" ; <PATH_TO_EXECUTABLE> "$i" dce && { cat urls.url >> urls.url_finished; rm urls.url; continue; }; echo FAILED; break ; done

Usage with plain text

If you have some plain text files, you can use them directly. That will find all URLs saved in the file, or at least most of them. I think.

cargo r <file path> plaintext
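To give a rough idea of what "finding URLs in text" involves, here is a deliberately simplified scanner using only the standard library. It is not the program's real extraction logic; find_urls is a hypothetical function, and it only picks up at most one http:// or https:// URL per whitespace-separated token.

```rust
/// Hypothetical, simplified scanner: not the program's real extraction logic.
/// Finds at most one http:// or https:// URL per whitespace-separated token.
fn find_urls(text: &str) -> Vec<&str> {
    let mut urls = Vec::new();
    for token in text.split_whitespace() {
        // Find where a URL scheme starts inside the token, if anywhere.
        let start = match (token.find("http://"), token.find("https://")) {
            (Some(a), Some(b)) => a.min(b),
            (Some(a), None) | (None, Some(a)) => a,
            (None, None) => continue,
        };
        // Strip trailing punctuation that usually belongs to the sentence,
        // not to the URL itself.
        let url = token[start..].trim_end_matches(|c: char| {
            matches!(c, '.' | ',' | ';' | ':' | '!' | '?' | ')' | '>' | '"' | '\'')
        });
        urls.push(url);
    }
    urls
}
```

Heuristics like the trailing-punctuation trim are exactly why extraction tends to get most URLs rather than all of them: a URL that legitimately ends in ")" would be cut short here.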

Licence

Licenced under the Apache 2.0 licence. Copyright (C) TheTechRobo, 2021-2022.

Copyright 2021-2023 TheTechRobo

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.