A news aggregator web app using BeautifulSoup4, Django, Django REST framework, Elasticsearch, and periodic tasks for automated updates.

Primary language: Python. License: MIT.

News Aggregator

A web app that aggregates news articles from multiple sources. It uses BeautifulSoup4 for web scraping, Django for the web application, Django REST framework for the API, and Elasticsearch for search.

Installation

Set up the project

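  • Add a crontab job that runs the scraper every 30 minutes and writes its output to a log file. cron runs jobs under /bin/sh by default, which may not provide source, so the environment is activated with a dot:
    */30 * * * * . yourenv/activate && cd news-aggregator && python3 manage.py scrape > /path/to/log 2>&1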
  • Migrate your database and build the search index:
    python3 manage.py makemigrations
    python3 manage.py migrate
    
    python3 manage.py search_index --rebuild

Run the app!

After everything is set up, run the Django app as usual:

python3 manage.py runserver

Or use gunicorn to serve the WSGI app:

gunicorn newsaggregator.wsgi

Finally, you can browse the API at api/.
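
To query the API from a script instead of a browser, a minimal sketch follows. It assumes the development server is listening on runserver's default address (127.0.0.1:8000) and that api/ itself answers with JSON; neither the response shape nor any endpoint below api/ is documented here, so the helper names are purely illustrative.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Assumption: the default address used by `python3 manage.py runserver`.
BASE_URL = "http://127.0.0.1:8000/"


def api_root_url(base_url: str = BASE_URL) -> str:
    """Build the URL of the API root (the api/ path mentioned above)."""
    return urljoin(base_url, "api/")


def fetch_api_root(base_url: str = BASE_URL, timeout: float = 5.0):
    """Request the API root as JSON and return the decoded body.

    Hypothetical helper: it only works if api/ responds with JSON.
    """
    request = Request(
        api_root_url(base_url),
        headers={"Accept": "application/json"},  # ask for JSON, not HTML
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, fetch_api_root() returns whatever JSON the API root exposes; pass a different base_url when the app is served elsewhere, for example behind gunicorn.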

Customize Scraper

You can also write your own scraper. You only need to tell it where to find each article's title, content, and date.

For example, suppose an article page contains markup like this:

<p class="date">13 Apr 2023</p>
<h1 class="title">This is Example title of the news article</h1>
<div class='detail-in' id='isi'>
    <p>Lorem ipsum dolor sit amet</p>
    <p>Azaret metrio zintos!</p>
</div>

All you need to do is inherit from the `Spider` class in utils/core/base.py and set the title_attr, content_attr, and date_attr attributes.

For example, in utils/modules/tempo.py:

from utils.core.base import Spider

class TempoSpider(Spider):
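    def __init__(self):
        # Tempo sites handed to the base Spider.
        self.base_url = [
            'https://www.tempo.co',
            'https://nasional.tempo.co',
            'https://gaya.tempo.co',
            'https://dunia.tempo.co',
        ]

        super().__init__(self.base_url)

        # Element holding the article title: <h1 class="title">
        self.title_attr = {
            "name": "h1",
            "attrs": {
                "class": "title"
            }
        }

        # Element holding the article body: <div class="detail-in" id="isi">
        self.content_attr = {
            "name": "div",
            "attrs": {
                "class": "detail-in",
                "id": "isi"
            }
        }

        # Element holding the publication date: <p class="date">
        self.date_attr = {
            "name": "p",
            "attrs": {
                "class": "date"
            }
        }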
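
The keys in these attribute dictionaries (name and attrs) match the arguments of BeautifulSoup's find() method. The sketch below shows how such dictionaries select elements from the sample markup shown earlier. It assumes the base Spider unpacks them into find(); the README does not confirm this, so treat it as an illustration of the selectors, not of the Spider internals.

```python
from bs4 import BeautifulSoup

# Sample article markup from the example above.
html = """
<p class="date">13 Apr 2023</p>
<h1 class="title">This is Example title of the news article</h1>
<div class='detail-in' id='isi'>
    <p>Lorem ipsum dolor sit amet</p>
    <p>Azaret metrio zintos!</p>
</div>
"""

# Same dictionaries as in TempoSpider.
title_attr = {"name": "h1", "attrs": {"class": "title"}}
content_attr = {"name": "div", "attrs": {"class": "detail-in", "id": "isi"}}
date_attr = {"name": "p", "attrs": {"class": "date"}}

soup = BeautifulSoup(html, "html.parser")

# Assumption: each dictionary is unpacked into find(name=..., attrs=...).
title = soup.find(**title_attr).get_text(strip=True)
content = soup.find(**content_attr).get_text(" ", strip=True)
date = soup.find(**date_attr).get_text(strip=True)

print(title)    # This is Example title of the news article
print(date)     # 13 Apr 2023
print(content)  # Lorem ipsum dolor sit amet Azaret metrio zintos!
```

Running a check like this against a saved copy of a real article page is a quick way to verify the selectors for a new site before wiring them into a Spider subclass.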