adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Python · Apache-2.0
Issues
Turning on "--keep-dirs" gives no output
#771 opened by DesBw - 5
Question regarding title extraction
#770 opened by unsleepy22 - 1
Loss of format or data when `li` contains `p`
#769 opened by ezscode - 1
Documentation: on precision
#766 opened by DesBw - 2
Extracting full text from a URL returns None
#739 opened by vrnch - 2
CLI: better control of output file names
#754 opened by DesBw - 2
AttributeError in prune_unwanted_nodes
#760 opened by PLPeeters - 1
Backticks produce extra line breaks
#755 opened by klvbdmh - 1
Explicitly and fully support type hinting
#738 opened by adbar - 0
Documentation about settings could use examples
#746 opened by georgedorn - 2
Download multiple urls with download timeout
#703 opened by vodkaslime - 0
Extraction: move `max_tree_size` to config file
#741 opened by adbar - 0
setup: set `__all__` in `__init__.py`
#718 opened by adbar - 6
`bare_extraction()`: deprecate `as_dict` parameter
#729 opened by adbar - 0
`extract()`: replace `no_fallback` argument by `fast`
#725 opened by adbar - 1
Deprecate `fetch_url(decode=False)`
#722 opened by adbar - 2
HTML_TAG_MAPPING error during scrape
#701 opened by beefyandbeef - 0
Review HTML element list and conversion
#720 opened by adbar - 0
setup: use `pyproject.toml` file
#712 opened by adbar - 0
Remove deprecations (mostly CLI)
#676 opened by adbar - 1
JavaScript version has landed. 🚀
#688 opened by vtempest - 5
Docs: add page explaining how to run tests
#698 opened by adbar - 0
Downloads: add support to switch between proxies
#697 opened by adbar - 2
ValueError in xml
#681 opened by Honesty-of-the-Cavernous-Tissue - 2
How can I set the proxy IP, port, and User-Agent to avoid websites' anti-crawler mechanisms?
#666 opened by coderwpf - 0
spider: restrict search to given URL pattern
#672 opened by adbar - 3
trafilatura version > 1.10.0 doesn't fetch images
#670 opened by rkiacnhg - 3
Investigate spacing in element tails
#661 opened by adbar - 3
Bug or feature, I'm not sure!
#662 opened by szj2ys - 4
Faulty extraction for very short documents
#660 opened by Psynbiotik - 0
Extraction with `include_images=True` takes too much time
#651 opened by Honesty-of-the-Cavernous-Tissue