adbar
Research scientist and data engineer – Open source enthusiast, mostly in Python
Berlin-Brg. Academy of Sciences (BBAW), Berlin
Pinned Repositories
awesome-crawler
A collection of awesome web crawlers and spiders in different languages
courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
flux-toolchain
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
geokelone
Integrates spatial and textual data processing tools into a modular software package which features preprocessing, geocoding, disambiguation and visualization
German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
german-reddit
Extraction of a German Reddit Corpus
htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
py3langid
Faster, modernized fork of the language identification tool langid.py
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
adbar's Repositories
adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
adbar/German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
adbar/courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
adbar/htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
adbar/py3langid
Faster, modernized fork of the language identification tool langid.py
adbar/geokelone
Integrates spatial and textual data processing tools into a modular software package which features preprocessing, geocoding, disambiguation and visualization
adbar/german-reddit
Extraction of a German Reddit Corpus
adbar/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
adbar/flux-toolchain
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
adbar/tweets-tools
Diverse tools used with Twitter data
adbar/coronakorpus
Material for building a German-language web corpus dedicated to COVID-19
adbar/jlcl-style
Experiments to modernize the LaTeX class of the JLCL
adbar/microblog-explorer
Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language
adbar/toponyms
Old prototype for toponym extraction in historical texts written in German
adbar/vardial-experiments
Experiments conducted on the occasion of the VarDial shared tasks
adbar/zeitcrawler
Automatically exported from code.google.com/p/zeitcrawler
adbar/adbar
adbar/awesome-digital-humanities
Software for humanities scholars using quantitative or computational methods.
adbar/awesome-web-scraping
List of libraries, tools and APIs for web scraping and data processing.
adbar/btw21
Visualization of the most frequent words in the German federal election in 2021
adbar/corpus-visualizer
Explore, visualize and publish corpora as CSS/XHTML documents
adbar/equipe-crawler
Automatically exported from code.google.com/p/equipe-crawler
adbar/gps-corpus-builder
Automatically exported from code.google.com/p/gps-corpus-builder
adbar/haystack-integrations
🚀 A list of Haystack Integrations, maintained by the community or deepset.
adbar/jparser
A readability parser which can extract the title, content and images from HTML pages
adbar/laclos
LAnguage-CLassified OpenSubtitles
adbar/lichess-bot
A bridge between Lichess bots and chess engines
adbar/python-chess
A chess library for Python, with move generation and validation, PGN parsing and writing, Polyglot opening book reading, Gaviota tablebase probing, Syzygy tablebase probing, and UCI/XBoard engine communication
adbar/valency-oriented-chunker
A one-pass FSA valency-oriented chunker for German (proof of concept)