I'm actually using this program in production! Updates are made when I'm interested, or when they're necessary.
Created for ingesting URLs from DHT scrapes (and now discard2 ones, too!) (oh wait: now DiscordChatExporter ones, three!) into the ArchiveTeam URLs project. But you can use the URL lists for whatever you please! (Except DDoSing. Please don't do that. I don't want to be your partner in crime, believe it or not.)
Might not extract all URLs correctly. #8 improved on this, though.
Output is written to `urls.url`, containing the URLs from the current scrape. A good idea is to use a separate directory for each scrape (Cargo's `--manifest-path` can help with that) or to back up the `urls.url` file before you run the script.
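For example, a minimal sketch of the separate-directory approach (the repository path, scrape path, and directory name below are made up, and this assumes `urls.url` is written to the directory you run the command from):

```sh
# One working directory per scrape, with the tool checked out elsewhere.
mkdir steamgriddb-scrape && cd steamgriddb-scrape
cargo run --manifest-path ~/src/discord-url-extractor/Cargo.toml -- ~/backups/SteamGridDB.dht dht
# urls.url now lives in steamgriddb-scrape/, separate from other scrapes.
```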
Tip: `ignores.url` is a list of URLs that should NOT be extracted. This allows you to run the script on different scrapes without having duplicates. It's also helpful if you have previously used other tools to scrape URLs - just add the URLs you scraped to `ignores.url` and they won't be extracted again! It's not perfect, but it should work 99% of the time. `ignores.url` is read into memory, new URLs are added to it, and then `ignores.url` is overwritten with the loaded values. As such, don't modify `ignores.url` while the script is running - your changes will have no effect, and when the script finishes they will be overwritten!
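For example, to seed `ignores.url` with URLs you already archived using other tools (the input file name here is just a placeholder):

```sh
# Append previously-scraped URLs so this run skips them.
cat old_urls.txt >> ignores.url
```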
(Also, I think this goes without saying, but don't run the app multiple times at the same time in the same folder. That's a recipe for turnabout disaster.)
It doesn't support embeds or any newer DHT features. It only extracts attachment URLs (which can now be done via DHT itself!), avatar URLs, and URLs found in message text using a regex; embeds and everything else are ignored.
Once you've got your .dht file, run:

`cargo r <file path> dht`

Of course, if the file path has spaces, wrap it in quotes (`"`) if your shell requires it.

Example: `cargo r /home/thetechrobo/Discordbackups/dsicord_data/SteamgridDB/SteamGridDB.dht dht`

Obviously, replace the file path with the actual one. Unless, of course, you have the exact same path as me - in which case, twinsies!
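And if your path does contain spaces, quoting it looks like this (the path is only an illustration):

```sh
cargo r "/home/me/Discord Backups/My Server.dht" dht
```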
- Use the `raw-jsonl` reader from discard2. Save its output into a file. Do not name that file `urls.url` or `ignores.url`.
- Run `cargo r messages.jsonl discard2`, assuming you named the file from the previous step `messages.jsonl`.
- Use a program to get rid of any duplicate URLs. (There shouldn't be any, but I'm not perfect.) On *nix you can use `sort -u`, or `sort` piped into `uniq` (plain `uniq` only removes adjacent duplicates); see the sketch after this list.
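Here's the dedup sketch mentioned above (file names are examples):

```sh
# Sort the list and drop exact duplicate lines in one pass.
sort -u urls.url > urls_deduped.url
mv urls_deduped.url urls.url
```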
To get even more data (server emojis, role icons, and more), pass `--parse-websockets`. Note that you then have to specify the `--guild-id` (server ID; it can be found in the state.json file or by right-clicking on the server in Discord's UI and hitting "Copy ID"), because I'm wayyy too lazy to try to autodetect which server the crawl is in.
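Putting that together, an invocation might look like the following - the exact argument order is my assumption, so check the program's help output if it complains (the guild ID here is fake):

```sh
cargo r messages.jsonl discard2 --parse-websockets --guild-id 123456789012345678
```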
Don't use the `--media` option! Doing so will replace the URLs in the JSON with the path to the local file, which will cause the URL list to have paths to local files instead of HTTP resources. So don't, I don't know, do a day-long crawl of a huge server only to realise that the attachment URLs are all screwed up. (Ask me how I know.)
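If you're not sure whether a crawl was affected, one quick sanity check (my suggestion, not a built-in feature) is to list any lines in the output that aren't HTTP(S) URLs:

```sh
# Anything printed here is suspect - e.g. a local file path instead of a URL.
grep -Ev '^https?://' urls.url
```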
DiscordChatExporter is now supported! You can only run one channel at a time, though, and you must use the JSON output format. CSV may be supported in the future. Usage is:
`cargo run /path/to/channel.json dce`
To run an entire folder of JSONs, you could run a script. For example, here's the script I use (tested on zsh, probably won't work on Windows, might work on bash):

`for i in *.json ; do echo "$i" ; <PATH_TO_EXECUTABLE> "$i" dce && { cat urls.url >> urls.url_finished; rm urls.url; continue; }; echo FAILED; break ; done`
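If you want something that's definitely bash-compatible, here's an untested sketch with the same logic (the executable path is a placeholder):

```bash
for f in *.json; do
    echo "$f"
    if /path/to/executable "$f" dce; then
        # Collect this channel's URLs and clear the way for the next file.
        cat urls.url >> urls.url_finished
        rm urls.url
    else
        echo "FAILED on $f"
        break
    fi
done
```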
If you have some plain text files, you can use them directly. That will find all URLs saved in the file, or at least most of them. I think.
`cargo r <file path> plaintext`
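For example, to run it over a whole pile of text files in one go (file names are made up):

```sh
# Concatenate the text files, then extract URLs from the combined file.
cat notes/*.txt > combined.txt
cargo r combined.txt plaintext
```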
Licensed under the Apache 2.0 licence. Copyright (C) TheTechRobo, 2021-2022.
Copyright 2021-2023 TheTechRobo
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.