Main Idea: A LangChain for data collection
DataCollector is a Python library for collecting data from unstructured sources using language models. The goal is to provide a simple API for collecting data for specific tasks (summarization, custom NER, ...) and for crawling web pages.
In the future, we plan to integrate a tool like Argilla to supervise the data generation process.
The Collector

The Collector is designed to collect data for a specific task, such as sentiment analysis. Here's an example demonstrating how to use it to collect sentiment labels for texts from the XSum dataset:
```python
from datacollector.functions import Classifier
from datacollector import Collector, CollectorArgs, Task

# Define the labeling function the model will call.
function = Classifier(
    name="print_sentiment",
    description="A function that prints the given sentiment.",
    output="sentiment",
    output_description="The sentiment.",
    labels=[
        "positive",
        "negative",
        "neutral",
    ],
)

# Describe the task: which fields go in, which come out, and the prompts.
task = Task(
    function=function,
    inputs=["text", "role"],
    outputs=["sentiment"],
    system="You are a {role}.",
    prompt="Classify the following text:\n{text}",
    language="en",
)

# Configure the collection run.
args = CollectorArgs(
    task=task,
    dataset="xsum",
    model="gpt-3.5-turbo-0613",
    max_items=1000,
    batch_size=10,
    output_dir="output",
    save_every=100,
    push_to_hub=True,
)

collector = Collector(args)
collector.build()
collector.push_to_hub()
```
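The model configured above, gpt-3.5-turbo-0613, supports OpenAI function calling, so a classifier like print_sentiment plausibly gets translated into a function schema behind the scenes. Here is a minimal sketch of what that schema could look like, assuming a straightforward mapping (illustrative only, not the library's confirmed internals):

```python
# Hypothetical sketch: how the Classifier above might be rendered as an
# OpenAI function-calling schema (assumed mapping, not confirmed internals).
print_sentiment_schema = {
    "name": "print_sentiment",
    "description": "A function that prints the given sentiment.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentiment": {
                "type": "string",
                "description": "The sentiment.",
                # The enum restricts the model to the Classifier's labels.
                "enum": ["positive", "negative", "neutral"],
            },
        },
        "required": ["sentiment"],
    },
}
```

Constraining the sentiment parameter with an enum is what keeps the model's answers limited to the three labels defined on the Classifier.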
The Crawler

The Crawler is designed to crawl web pages and collect data based on specified criteria. The following example demonstrates how to crawl a single domain and collect data from its pages:
```python
from datacollector import Crawler, CrawlerArgs, UrlFilter

# Configure the crawl: seed URLs, depth limit, output location, and a
# politeness delay between requests.
crawler_args = CrawlerArgs(
    start_urls=["https://www.example.com"],
    max_depth=2,
    output_path="output",
    delay=0.5,
)

# Restrict the crawl to a single domain.
url_filter = UrlFilter()
url_filter.add_domain("example.com")

crawler = Crawler(args=crawler_args)
crawler.run(url_filter=url_filter)
crawler.push_to_hub()
```
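To make the filtering step concrete, here is a standalone sketch of the kind of domain check a UrlFilter might perform; the is_allowed helper is hypothetical and not part of the library's API:

```python
from urllib.parse import urlparse

# Hypothetical helper illustrating a domain check; `is_allowed` is not
# part of the DataCollector API.
def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    """Accept a URL if its host is an allowed domain or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_allowed("https://blog.example.com/post", {"example.com"}))  # True
print(is_allowed("https://other.org/page", {"example.com"}))         # False
```

Accepting subdomains alongside exact matches keeps pages like https://blog.example.com inside the crawl while still excluding other hosts.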
Features

- Collect data from various sources for specific tasks
- Crawl web pages and apply custom filters
- Easy-to-use API for data collection and processing
Contributing

We welcome contributions! If you have any ideas or suggestions, please open an issue to discuss them. If you'd like to contribute code, fork the repository, make your changes, and submit a pull request.
License

This project is licensed under the MIT License.