English | 简体中文
chats-crawler crawls chat data from Discourse-based websites and parses it on the fly into a form ready for LLM instruction finetuning. The data include the texts, images (crucial for multimodal finetuning), and links. Support for more than Discourse-based websites is coming soon.
Run

```bash
git clone https://github.com/jackfsuia/chats-crawler.git && cd chats-crawler
```

Then install the dependencies:

```bash
npm i
```
Before crawling, please read the Notice below. Configure the target website in config.ts: edit the `url` and `rex` properties to match your needs, i.e., replace the two `https://discuss.pytorch.org` occurrences there with your target Discourse-based website. (Discourse-based websites all share the same basic layout.)
To start crawling, run

```bash
npm start
```
That's all! The Discourse chat data are saved in `storage/datasets/default` as .json files, and the images in `storage/datasets/imgs`.
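Since the output is plain JSON, post-processing takes only a few lines of Node. A minimal sketch for loading the results (the `conversations` field is the one documented below; treat any other fields as assumptions):

```typescript
import * as fs from "fs";
import * as path from "path";

const dir = "storage/datasets/default";

// Each crawled topic is saved as one .json file in the dataset directory.
for (const file of fs.readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const record = JSON.parse(fs.readFileSync(path.join(dir, file), "utf-8"));
  console.log(record.conversations); // the parsed chat text, as in the example below
}
```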
Let's say we are crawling https://discuss.pytorch.org. We should edit config.ts as:

```typescript
...
url: "https://discuss.pytorch.org/",
...
rex: "https://discuss.pytorch.org/t/[^/]+/[0-9]+$",
```
Take one of the crawled chat pages. In its .json file under `storage/datasets/default`, the `"conversations"` property will be:
```
<# ztf-ucasTengfei Zhang #>:
How to delete a Tensor in GPU to free up memory?
I can get a Tensor in GPU by Tensor.cuda(), but it just returns a copy in GPU. I wonder how can I delete this Tensor in GPU? I try to delete it with “del Tnesor” but it doesn’t work.
Quote:"
Could you show a minimum example? The following code works for me for PyTorch 1.1.0:
import torch
a = torch.zero(300000000, dtype=torch.int8, device='cuda')
b = torch.zero(300000000, dtype=torch.int8, device='cuda')
# Check GPU memory using nvidia-smi
del a
torch.cuda.empty_cache()
# Check GPU memo…
"
<# smth #>:
del Tensor will delete it from GPU memory. Why do you think it doesn’t work?
<# ztf-ucasTengfei Zhang #>:
Thank you very much!
I loaded an OrderedDict of pre-trained weights to gpu by torch.load(), then used a for loop to delete its elements, but there was no change in gpu memory.
Besides, it is strange that there was no change in gpu memory even I deleted the OrderedDict of pre-trained weights.
Pytorch version is 0.4.0.2
...
```
`<# ztf-ucasTengfei Zhang #>` and `<# smth #>` are the two posters' names, formatted this way so you can easily template them for instruction-finetuning LLMs (e.g., replace `<# smth #>` with `<assistant>`, and `<# ztf-ucasTengfei Zhang #>` with `<user>`, etc.; see the sketch below). If there are images interspersed in the texts, they will not only be downloaded and saved in `storage/datasets/imgs` under a new FILENAME, but also replaced in place with `[img FILENAME]` in the texts. If there are links interspersed in the texts, they will be replaced in place with `[link LINK]` in the texts. All other elements are deleted.
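As a sketch of that templating step (the `toTemplate` helper and the asker-vs-answerer role mapping below are our own illustration, not part of this repo):

```typescript
// Rewrite "<# name #>" speaker markers into chat-template roles.
// Assumes the thread opener maps to <user> and every other poster to <assistant>.
function toTemplate(conversations: string, asker: string): string {
  return conversations.replace(/<# (.+?) #>/g, (_match, name: string) =>
    name === asker ? "<user>" : "<assistant>"
  );
}

// For the thread above:
// toTemplate(record.conversations, "ztf-ucasTengfei Zhang")
// turns "<# ztf-ucasTengfei Zhang #>:" into "<user>:" and "<# smth #>:" into "<assistant>:".
```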
Do you like this repo? Give us a ⭐
Notice

Make sure the crawling is legal yourself; check the website's robots.txt if you are not sure. We are not responsible for any legal risks or issues.
Future plans:

- Support automatic OCR of image data into text, inserted among the original text data. This makes the data complete in text form, and saves some space too, since the OCR happens during crawling rather than after.
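If that lands, the step might look roughly as follows (a sketch only: tesseract.js is one possible OCR library, the `[img-text ...]` placeholder is our invention, and none of this ships in the repo today):

```typescript
import Tesseract from "tesseract.js"; // hypothetical dependency, not in this repo yet
import * as path from "path";

// Replace each "[img FILENAME]" placeholder with the OCR'd text of that image,
// so the conversation becomes complete in pure text form.
async function inlineOcr(text: string, imgDir: string): Promise<string> {
  for (const m of [...text.matchAll(/\[img ([^\]]+)\]/g)]) {
    const { data } = await Tesseract.recognize(path.join(imgDir, m[1]), "eng");
    text = text.replace(m[0], `[img-text ${data.text.trim()}]`);
  }
  return text;
}
```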
chats-crawler is licensed under the MIT License found in the LICENSE file in the root directory of this repository.
If this work is helpful, please kindly cite as:
```bibtex
@article{chats-crawler,
  title={chats-crawler: discourse chat data crawling and parsing for LLM instruction finetuning.},
  author={Yannan Luo},
  year={2024},
  url={https://github.com/jackfsuia/chats-crawler}
}
```
We learned a lot from gpt-crawler and crawlee. Thanks for their wonderful work.