Primary language: Python. License: MIT.

Seldonite

A News Article Collection and Processing Library

Define a news source, set your search method, and collect news articles or create news graphs.

Usage:

import os

from seldonite import sources, collect, graphs, run

aws_access_key = os.environ['AWS_ACCESS_KEY']
aws_secret_key = os.environ['AWS_SECRET_KEY']

source = sources.news.CommonCrawl(aws_access_key, aws_secret_key)

collector = collect.Collector(source) \
    .on_sites(['cbc.ca', 'bbc.com']) \
    .by_keywords(['afghanistan', 'withdrawal'])

graph = graphs.Graph(collector) \
    .build_tfidf_graph()

articles_df, words_df, edges_df = run.Runner(graph) \
    .to_pandas()

Please see the wiki for more detail on sources and methods.
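
The usage example reads the AWS credentials straight from `os.environ`, which raises a bare `KeyError` if a variable is missing. A minimal sketch of a helper that fails with a clearer message is shown here; `require_env` is a hypothetical name used for illustration and is not part of seldonite.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration; not part of the seldonite API.
    """
    value = os.environ.get(name)
    if not value:
        # Covers both an unset variable and one set to an empty string.
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Possible use with the variable names from the usage example:
# aws_access_key = require_env('AWS_ACCESS_KEY')
# aws_secret_key = require_env('AWS_SECRET_KEY')
```

Checking both variables before constructing the source means a misconfigured environment is reported before any collection work starts.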

Setup

To install seldonite in editable mode, along with its dependencies, via conda:

conda env create -f ./environment.yml

This library relies on several third-party libraries. Brief setup instructions for them follow:

Spacy

To use NLP methods that require spaCy, download the English model:

python -m spacy download en_core_web_sm
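
Because spaCy models are installed as regular Python packages, it may be useful to check that the download succeeded before running any NLP methods. The sketch below uses only the standard library; `model_installed` is a hypothetical helper name, not something seldonite or spaCy provides.

```python
import importlib.util


def model_installed(package_name: str = "en_core_web_sm") -> bool:
    """Return True if the named package can be found by the import system."""
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if model_installed():
        print("en_core_web_sm is installed")
    else:
        print("en_core_web_sm is missing; run: python -m spacy download en_core_web_sm")
```

This only confirms that the package is importable in the active environment; it does not load the model or verify version compatibility with the installed spaCy.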

Spark

To make Python dependencies available to Spark executors, use the dependency packaging script:

bash ./seldonite/spark/package_pyspark_deps.sh

Tests

We use pytest.

To run the tests, run this command from the top-level directory:

pytest

Credits