adbar
Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.
Berlin-Brg. Academy of Sciences (BBAW), Berlin
Pinned Repositories
awesome-crawler
A collection of awesome web crawlers and spiders in different languages
awesome-web-scraper
A collection of awesome web scrapers and crawlers.
courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
geokelone
Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation and visualization
German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
german-reddit
Extraction of a German Reddit Corpus
htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
py3langid
Faster, modernized fork of the language identification tool langid.py
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
adbar's Repositories
adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
adbar/German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
adbar/htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
adbar/courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
adbar/py3langid
Faster, modernized fork of the language identification tool langid.py
adbar/geokelone
Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation and visualization
adbar/german-reddit
Extraction of a German Reddit Corpus
adbar/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
adbar/awesome-web-scraper
A collection of awesome web scrapers and crawlers.
adbar/flux-toolchain
Filtering and Language-identification for URL Crawling Seeds (FLUCS), a.k.a. FLUX-Toolchain
adbar/tweets-tools
Diverse tools used with Twitter data
adbar/coronakorpus
Material for building a German-language COVID-19 web corpus
adbar/jlcl-style
Experiments to modernize the LaTeX class of the JLCL
adbar/jusText
Heuristic-based boilerplate removal tool
adbar/toponyms
Old prototype for toponym extraction in historical texts written in German
adbar/trafilatura_gui
adbar/vardial-experiments
Experiments conducted on the occasion of the VarDial shared tasks
adbar/adbar
adbar/archiveis
A simple Python wrapper for the archive.is capturing service
adbar/btw21
Visualization of the most frequent words in the German federal election in 2021
adbar/cChardet
Universal character encoding detector
adbar/datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
adbar/dateparser
Python parser for human-readable dates
adbar/dwdsmor
SFST/SMOR/DWDS-based German Morphology
adbar/jparser
A readability parser which can extract title, content, images from html pages
adbar/python-readability
Fast Python port of arc90's readability tool, updated to match the latest readability.js
adbar/shoten
adbar/valency-oriented-chunker
A one-pass FSA valency-oriented chunker for German (proof of concept)
adbar/wee-benchmarking-tool