adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Python, Apache-2.0
Issues
- 6
focused_crawl returns nothing
#589 opened by bezir - 14
LXML 5.2.0 breaks import
#532 opened by marban - 2
No timeout in urllib.robotparser with focused_crawler
#566 opened by JER-CE - 2
- 1
<main> Content gets missed out
#588 opened by alroythalus - 1
Extracting content from a URL is getting None
#586 opened by Fabiha15 - 2
Wrong links position in text from telegram post
#585 opened by RedHotUnicorn - 0
Regroup deduplication functions in same submodule
#576 opened by adbar - 0
Deprecate functions and arguments
#480 opened by adbar - 7
Question: check if page is readable?
#572 opened by zirkelc - 7
Scraping websites which are protected by WAF
#558 opened by thebigbone - 0
Update XML-TEI reference data
#577 opened by adbar - 4
Content extraction failure on dozens of related sites
#569 opened by praveng - 1
Extract text from buttons for semantic elements
#573 opened by zirkelc - 1
Content failed to be extracted
#568 opened by alroythalus - 5
List element inside a table is lost
#531 opened by mikhainin - 1
Markdown tables have incorrect format
#562 opened by zirkelc - 1
Readme.md table is broken.
#557 opened by AnishPimpley - 3
Preserve horizontal space in code blocks
#553 opened by mittsommer - 1
strikethrough text is returned as normal
#555 opened by snarb - 1
Make markdown an explicit output format
#489 opened by adbar - 1
Add download/processing date to metadata
#490 opened by adbar - 5
Why lzma for data compression?
#559 opened by Yomguithereal - 0
- 6
- 3
Wrong encoding detected: gb2312
#541 opened by s-jse - 0
Refactor and improve readability-lxml syntax
#546 opened by adbar - 0
CLI: raise an error if `--config-file` doesn't exist
#482 opened by adbar - 4
- 1
Regroup functions dedicated to output conversion
#500 opened by adbar - 0
- 0
- 0
Link proportion heuristic fails for link paragraph
#529 opened by adbar - 25
Change of license? GPLv3+ → Apache 2.0
#512 opened by adbar - 1
Doesn't extract links in table
#523 opened by obeone - 3
Link section missed at bottom of page
#518 opened by adbar - 0
PDF as output format?
#519 opened by adbar - 5
- 7
License
#475 opened by fakerybakery - 1
- 6
Extract more text
#488 opened by vulinh48936 - 0
Sitemaps: implement sleep and/or backoff strategy
#505 opened by adbar - 0
For all the articles from the source https://ognnews.com/ the extracted title is not right.
#502 opened by rithvikshetty - 1
Update LXML to version 5.1+
#477 opened by adbar - 1
save cookies on redirect
#478 opened by zeliboba7 - 6
include_links option mixes texts and links
#476 opened by hugoobauer - 5
fetch_url('spiegel.de/....') returns None
#474 opened by robertour - 0
Add support for Netscape cookies file format
#473 opened by adbar - 1
Add HTML output option
#472 opened by adbar - 0
Missing Yoast FAQ block headers
#471 opened by adbar