adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python · Apache-2.0

Pinned issues

List of smaller extraction bugs (text & metadata)

#4 opened 5 years ago by adbar

Open 30

Issues

Trafilatura fails to extract structured heading tags (h2, h3)
#774 opened 4 days ago by LeMoussel
0
Turning on "--keep-dirs" gives no output
#771 opened 22 days ago by DesBw
3
Duplicated lines when nested in <article> and <main>, with <br> in front
#768 opened a month ago by ibestvina
5
Question regarding title extraction
#770 opened a month ago by unsleepy22
1
Loss of format or data when li contains p
#769 opened a month ago by ezscode
1
Documentation: on precision
#766 opened a month ago by DesBw
1
Extracting full text from an URL returns None
#739 opened a month ago by vrnch
2
CLI: better control of output file names
#754 opened a month ago by DesBw
2
AttributeError in prune_unwanted_nodes
#760 opened a month ago by PLPeeters
2
Backticks produce extra line breaks
#755 opened a month ago by klvbdmh
1
Support for sitemap parsing from text instead of URLs
#751 opened a month ago by NiClassic
1
Performance bottleneck in `prune_unwanted_nodes` causing 200ms per call
#750 opened 2 months ago by thsunkid
1
Explicitly and fully support type hinting
#738 opened 2 months ago by adbar
0
Review input type for `is_probably_readerable()` function
#749 opened 2 months ago by adbar
0
Documentation about settings could use examples
#746 opened 2 months ago by georgedorn
1
Download multiple urls with download timeout
#703 opened 2 months ago by vodkaslime
2
Extraction: move `max_tree_size` to config file
#741 opened 2 months ago by adbar
0
setup: set `__all__` in `__init__.py`
#718 opened 2 months ago by adbar
0
Crawler doesn't extract any links from Google Cloud documentation website
#680 opened 4 months ago by Guthman
6
Downloads: fully use information from both `config` and `options` variables
#733 opened 2 months ago by adbar
0
CLI downloads: make sure all user-specified options are used
#732 opened 2 months ago by andyskipper
4
`bare_extraction()`: deprecate `as_dict` parameter
#729 opened 2 months ago by adbar
0
`extract()`: replace `no_fallback` argument by `fast`
#725 opened 2 months ago by adbar
0
feat(cli/lib): Add tqdm based progress bar as an option
#663 opened 5 months ago by chitralverma
1
I can't extract main content from this HTML, could anyone help me?
#702 opened 3 months ago by CNXDZS
1
extract function runs indefinitely on large HTML body content
#704 opened 3 months ago by hitesh1997
1
Focused crawler returns 404 response for robots.txt and stops crawling
#726 opened 3 months ago by Guthman
1
Deprecate `fetch_url(decode=False)`
#722 opened 3 months ago by adbar
0
HTML_TAG_MAPPING error during scrape
#701 opened 3 months ago by beefyandbeef
2
Review HTML element list and conversion
#720 opened 3 months ago by adbar
0
setup: use `pyproject.toml` file
#712 opened 3 months ago by adbar
0
Remove deprecations (mostly CLI)
#676 opened 3 months ago by adbar
0
Trafilatura crashing due to `options` variable not backfilled yet
#705 opened 3 months ago by rgeronimi
1
Javascript Version has landed. 🚀
#688 opened 3 months ago by vtempest
3
Empty Results When Using Spider Function with Category URL
#696 opened 4 months ago by felipehertzer
5
Link on the quickstart page to the overview notebook is broken
#695 opened 4 months ago by cdfuller
1
Docs: add page explaining how to run tests
#698 opened 4 months ago by adbar
0
Downloads: add support to switch between proxies
#697 opened 4 months ago by adbar
0
ImportError: lxml.html.clean module is now a separate project
#693 opened 4 months ago by regstuff
2
ValueError in xml
#681 opened 4 months ago by Honesty-of-the-Cavernous-Tissue
3
How can I set the proxy IP port and userAgent to avoid the web anti-crawler mechanism?
#666 opened 5 months ago by coderwpf
2
spider: restrict search to given URL pattern
#672 opened 5 months ago by adbar
0
trafilatura version > 1.10.0 doesn't fetch images
#670 opened 5 months ago by rkiacnhg
3
Investigate spacing in element tails
#661 opened 6 months ago by adbar
3
AttributeError in prune_unwanted_sections
#667 opened 5 months ago by Honesty-of-the-Cavernous-Tissue
3
Bug or feature, I'm not sure!
#662 opened 5 months ago by szj2ys
1
Faulty extraction for very short documents
#660 opened 6 months ago by Psynbiotik
4
Duplicating sections, removing spaces between words, simple example
#659 opened 6 months ago by nthomas-whistic
0
MemoryError in table conversion
#657 opened 6 months ago by Honesty-of-the-Cavernous-Tissue
2
Extraction with `include_images=True` takes too much time
#651 opened 6 months ago by Honesty-of-the-Cavernous-Tissue
3