⚡ Fully automated web crawler. Crawl any information you want on the Internet with GPT-3.5. Built with 🦜️🔗LangChain👍👍⚡
*(Demo video: demo.mp4)*
- Fully automated web crawler that simulates, as closely as possible, how a human searches for data.
- Automatically collects all specified details across the entire Internet, or within a given web domain, based on a given theme.
- Automatically searches the Internet for answers to fill in missing specified details while crawling.
- ✍️👇A simple example👇✍️
- Input:
- the theme you want to crawl:
Cases of mergers and acquisitions of fast food industry enterprises in America after 2010
- 0-th specific detail:
When the merger occurred
- 1-th specific detail:
Acquirer
- 2-th specific detail:
Acquired party
- 3-th specific detail:
The CEO of acquirer
- 4-th specific detail:
The CEO of acquired party
- (Optional) Limited web domain:
["nytimes.com", "cnn.com"]
- Output: JSON containing all specified details about the theme. The format of the output is:
```
{
  "events_num": N,
  "details": [   ### The length of this list is N.
    {
      "When the merger occurred": <answer>,
      "Acquirer": <answer>,
      "Acquired party": <answer>,
      "The CEO of acquirer": <answer>,
      "The CEO of acquired party": <answer>,
      "source_url": <url>
    },
    ......
  ]
}
```
- GPT can extract the necessary information by directly understanding the content of each webpage, rather than requiring hand-written crawling rules.
- GPT can connect to the Internet to verify the accuracy of crawl results or to supplement missing information.
- Think up suitable Google search queries for the theme with GPT-3.5.
- Simulate a Google search across the entire Internet, or within the given web domains (if any), using each query.
- Browse every resulting website.
- Extract the specified details of the theme from the content of each website with GPT-3.5.
- Similar to Auto-GPT, independently search the Internet for missing details, based on the LangChain implementations of MRKL and ReAct.
- Encapsulate all results into a JSON (a minimal sketch of this loop follows).
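The sketch below illustrates the loop above. It is not the repository's actual code: the function names (`make_queries`, `crawl`) are hypothetical, fetching via `requests`/`BeautifulSoup` may differ from the repo's approach, and the pre-0.1 `langchain` API (`ChatOpenAI`, `GoogleSerperAPIWrapper`) that this project was built against is assumed.

```python
# Illustrative sketch only; names are hypothetical and error handling is omitted.
import json
import requests
from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from langchain.utilities import GoogleSerperAPIWrapper

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
search = GoogleSerperAPIWrapper()  # reads SERPER_API_KEY from the environment

def make_queries(theme: str, n: int) -> list[str]:
    """Step 1: ask GPT-3.5 for n different Google search queries."""
    text = llm.predict(f"Write {n} Google search queries, one per line, for: {theme}")
    return text.strip().splitlines()[:n]

def crawl(theme: str, details: list[str], n_queries: int = 2) -> list[dict]:
    events = []
    for query in make_queries(theme, n_queries):
        # Step 2: search (a domain restriction could be added via Google's site: operator).
        for hit in search.results(query).get("organic", []):
            # Step 3: browse the result page and strip it down to plain text.
            html = requests.get(hit["link"], timeout=10).text
            page = BeautifulSoup(html, "html.parser").get_text()
            # Step 4: extract the requested details as JSON.
            prompt = (
                f"Theme: {theme}\nDetails wanted: {details}\n"
                f"Page text:\n{page[:4000]}\n"
                "Return a JSON list of {detail: answer} objects, or [] if none."
            )
            for event in json.loads(llm.predict(prompt)):
                event["source_url"] = hit["link"]
                events.append(event)
    return events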
- `OPENAI_API_KEY`: You must have an OpenAI API key and modify `os.environ["OPENAI_API_KEY"]` in `pipeline.py`.
- `SERPER_API_KEY`: To search for accurate, real-time information, you need a Google Serper API key (registration takes only a short time, and you get 1000 free queries every month). Modify `os.environ["SERPER_API_KEY"]` in `pipeline.py`.
- Hyperparameters (an example configuration follows this list):
  - `QUERY_NUM`: The number of Google searches with different queries. Default is 2.
  - `QUERY_RESULTS_NUM`: The number of results returned per search. Default is 4.
  - `THEME`: The theme of the web crawler.
  - `DETAIL_LIST`: The specific details of the web crawler theme.
  - `URL_DOMAIN_LIST` (optional): The valid web domains or URL prefixes.
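Putting these together, the top of `pipeline.py` might be configured like this for the M&A example above (the exact layout is illustrative, and the key values are placeholders):

```python
import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # your OpenAI API key (placeholder)
os.environ["SERPER_API_KEY"] = "..."     # your Serper API key (1000 free queries/month)

QUERY_NUM = 2          # number of Google searches with different queries
QUERY_RESULTS_NUM = 4  # results returned per search
THEME = "Cases of mergers and acquisitions of fast food industry enterprises in America after 2010"
DETAIL_LIST = [
    "When the merger occurred",
    "Acquirer",
    "Acquired party",
    "The CEO of acquirer",
    "The CEO of acquired party",
]
URL_DOMAIN_LIST = ["nytimes.com", "cnn.com"]  # optional
```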
- Install python3.11.
- Install the necessary dependencies: `pip install -r requirements.txt`
- Run it: `python pipeline.py > output.txt`
- Read the results from `final_dict.json`.
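Once the run finishes, the results can be loaded like any JSON file, using the output format documented above:

```python
import json

# Load the crawl results written by pipeline.py.
with open("final_dict.json") as f:
    results = json.load(f)

print(f'{results["events_num"]} events found')
for event in results["details"]:
    print(event["source_url"])
```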
- Support crawling within a given list of web domains.
- The LangChain implementation of MRKL and ReAct carries a risk of divergent output; that is, the model's response may exceed our limits.
- Automatically write research reports based on crawl results.
- GPT consumes a huge number of tokens while browsing webpages😢. Reduce this consumption.
- Browse PDF files linked from websites.
- Make the entire pipeline registration-free (except for OpenAI).
I am currently working as an AI engineer at Alibaba in Beijing, China. I believe communication can eliminate information gaps.
I am interested in LLM applications such as conversational search, AI agents, and external data enhancement. Feel free to reach out via email (hanxyz1818@gmail.com) or WeChat (if you have one).
I am also preparing to build an actual product. My recent (still immature) ideas are about LLM+crypto and LLM+code. If you are interested in these, or have other ideas of your own, you are also welcome to contact me.