article-extractor

There are 72 repositories under the article-extractor topic.

  • adbar/trafilatura

    Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

    Language:Python4.1k31404286
  • extractus/article-extractor

    Extract the main article from a given URL with Node.js

    Language:JavaScript1.7k16168147
  • scotteh/php-goose

    Readability / HTML Content / Article Extractor & Web Scraping library written in PHP

    Language:PHP461200119
  • Strumenta/SmartReader

    SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla

    Language:C#166103937
  • hipstermojo/paperoni

    An article extractor in Rust

    Language:Rust133396
  • artiomn/markdown_articles_tool

    Parse a Markdown article, download its images, and replace the image URLs with local paths

    Language:Python12142325
  • fterh/sneakpeek

    Reddit bot to preview and post hyperlinks as comments

    Language:Python102111717
  • web64/nlpserver

    NLP Web Service

    Language:Python94111026
  • inaridiy/webforai

    The best HTML to Markdown library: ESM-native, with useful utilities that are simple, lightweight, and of epic quality.

    Language:TypeScript58105
  • web64/laravel-nlp

    Laravel wrapper for common NLP tasks

    Language:PHP55558
  • myifeng/article-parser

    Extract an article or news item by URL or HTML, parse the title and content, and output in Markdown format.

    Language:Python49137
  • johnbumgarner/newshound

    This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.

  • Creator-SN/IKFB

    Involution King Fun Book (IKFB, Chinese: 快卷, 卷王快乐本) is an integrated management system for papers and literature. Powered by Electron.

    Language:Vue32264
  • clarivate/wos-excel-converter

    This is a small and easy-to-use desktop application that allows exporting Web of Science API Expanded and InCites API data in Excel/CSV/JSON/XML with a configurable and flexible data export structure.

    Language:Vue313267
  • KotlinSpringBoot/saber

    [Spring Boot hands-on development] Quickly build your own technical article blog in 10 minutes

    Language:Kotlin312121
  • woojubb/html-article-extractor

    A web page content extractor

    Language:JavaScript20011
  • lord-alfred/dnlp

    📚 A collection of useful Natural Language Processing tools: detecting the language of a text, splitting text into sentences, extracting the main content from an HTML document

    Language:Python19305
  • pgh268400/Dcinside_Explorer_Python

    A client-side post search tool for DCinside.

    Language:Python18102
  • kwaziidev/textractor

    Extract the main text from HTML, intended for news web pages

    Language:Go16114
  • Sathish-Vasudev/Article-Scraper

    Scrapes article content from the web, given either a single URL or a text file containing a set of URLs. This project uses the newspaper3k and python-docx libraries. The output is a neatly formatted Word document in '.docx' format containing the contents of the article.

    Language:Python16134
  • eneiromatos/NebulaExpiredArticleHunter

    Nebula Expired Article Hunter is a marketing tool for retrieving expired content from www.archive.org (the Wayback Machine). You can use this content to grow your blog with evergreen information, improve your marketing campaigns without investing in writing services, or for whatever else you find useful.

    Language:Python13208
  • sanidhyy/ai-summarizer

    Modern OpenAI GPT-4 Article Summarizer

    Language:JavaScript12204
  • KhanShaheb34/ProthomAloScraper

    A Python script to scrape articles from Prothom Alo with the headline, category, URL, and summary

    Language:Python11404
  • metalwarrior665/actor-article-extractor-smart

    Combines Apify's crawling system and article parsing with the unfluff library.

    Language:JavaScript11375
  • pavlovtech/article-parser

    Simple HTTP API endpoint that takes a URL to any article and returns a JSON object containing information about the article.

    Language:Python11100
  • bharathvaj-ganesan/artixtractor

    Extract articles/blogs from websites such as medium.com, inc42.com, etc.

    Language:JavaScript10304
  • victormartinez/ferret

    A modern pythonic lib to extract data from news pages

    Language:HTML9100
  • gadzan/generatoc

    Automatically generate a table of contents from the headings of an HTML document

    Language:TypeScript8142
  • jpjacobpadilla/Google-Docs-To-Clean-HTML

    Transform messy HTML from Google Docs into well-structured HTML!

    Language:Python8101
  • AndyTheFactory/article-extraction-dataset

    Article title, authors, date and body extraction dataset.

    Language:HTML6201
  • soberbichler/Notebooks4Historical_Newspapers

    Notebooks that use LLMs to work with historical documents and artefacts

    Language:Jupyter Notebook6300
  • AbdulMoizAli/Extractive-Text-Summarization

    Automatic Extractive Text Summarization using TF-IDF Frequency Analysis. This is a Node.js web application using Express.js on the server side.

    Language:JavaScript5104
  • mccallofthewild/alexandrias-revenge

    🔥The bold new archive that can’t be burned, bulldozed or battering-rammed #PoweredByArweave

    Language:TypeScript4201
  • pgh268400/Dcinside_ImageCrawler

    DCinside image crawler

    Language:Jupyter Notebook4101
  • Quantlight/AI-Powered-News-Summarizer

    The main goal of an AI-Powered News Summarizer is to assist users in quickly understanding the main points and essential information from a large volume of news articles or textual content. By automatically summarizing news articles, it saves time and effort by providing users with a brief overview without having to read the entire article.

    Language:Python4131
  • RobinMillford/Cortex-AI-Multi-Model-Insights-Hub

    Cortex AI: Multi-Model Insights Hub is an advanced platform that leverages cutting-edge AI to empower your research, analysis, and data exploration. By integrating multiple Large Language Models (LLMs) with a sophisticated Retrieve-and-Generate (RAG) system

    Language:Python4201