proycon
Research software engineer - NLP - AI - 🐧 Linux & open-source enthusiast - 🐍 Python/ 🌊C/C++ / 🦀 Rust / 🐚 Shell - 🔐 InfoSec - https://git.sr.ht/~proycon
KNAW Humanities Cluster & CLST, Radboud University
Eindhoven, the Netherlands
Pinned Repositories
stam
Stand-off Text Annotation Model (STAM) is a data model for stand-off-text annotation where any information on a text is represented as an annotation. This repository contains the model's full specification, extensions, schemas, examples and documentation.
analiticcl
an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
clam
Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
codemetapy
A Python package for generating and working with codemeta
colibri-core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.
dotfiles
My dotfiles (mirror of https://git.sr.ht/~proycon/dotfiles)
flat
FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. Flat allows users to view annotated FoLiA documents and enrich these documents with new annotations. A wide variety of linguistic annotation types is supported through the FoLiA paradigm.
folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
vocage
A minimalistic spaced-repetition vocabulary trainer (flashcards) for the terminal
proycon's Repositories
proycon/vocage
A minimalistic spaced-repetition vocabulary trainer (flashcards) for the terminal
proycon/clam
Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
proycon/colibri-core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.
proycon/flat
FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. Flat allows users to view annotated FoLiA documents and enrich these documents with new annotations. A wide variety of linguistic annotation types is supported through the FoLiA paradigm.
proycon/folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
proycon/python-frog
Python bindings to the Dutch NLP tool Frog (PoS tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)
proycon/analiticcl
an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
proycon/python-ucto
This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the Ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
proycon/codemetapy
A Python package for generating and working with codemeta
proycon/dotfiles
My dotfiles (mirror of https://git.sr.ht/~proycon/dotfiles)
proycon/homeassistant-config
My elaborate home automation configuration + scripts
proycon/foliapy
An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic annotation finding application in Natural Language Processing (NLP). This library was formerly part of PyNLPl.
proycon/python-timbl
python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. With this module, all functionality exposed through the C++ interface is also available to Python scripts. Being able to access the API from Python greatly facilitates prototyping TiMBL-based applications.
proycon/foliatools
A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
proycon/codemeta-harvester
Harvest and aggregate codemeta/schema.org software metadata from source repositories and service endpoints, automatically converting from known metadata schemes in the process
proycon/unilangforum
UniLang Language Community - Forum
proycon/lingua-cli
Very small, simple command-line interface for language detection using lingua-rs
proycon/sesdiff
Generates a shortest edit script (Myers' diff algorithm) to indicate how to get from the strings in column A to the strings in column B. Also provides the edit distance (Levenshtein).
proycon/alpino_clam_webservice
A CLAM-powered webservice for Alpino, a dependency parser for Dutch
proycon/vocadata
Data for vocabulary learning
proycon/colibri-utils
NLP utilities that rely on Colibri Core: currently only language identification
proycon/lexmatch
Simple lexicon matcher against a text
proycon/charfreq
Very simple command-line tool that counts (Unicode) character frequency from standard input
proycon/homepage
My website (mirror of https://git.sr.ht/~proycon/homepage)
proycon/cli-apps
The largest Awesome Curated list of CLI/TUI applications with source data organized into CSV files
proycon/codemeta2mp
codemeta to SSHOC Open Marketplace converter
proycon/globalise-tools
tools for globalise tasks
proycon/lighthome
Lightweight home automation scripts and programs, over MQTT (mirror of https://git.sr.ht/~proycon/lighthome)
proycon/switchboard-tool-registry
The Switchboard Tool Registry
proycon/ucto_webservice
Webservice for ucto, a rule-based tokeniser for multiple languages