Phraser is a DSL for recognizing English phrases. A phrase is a sequence of consecutive subsequences, each defined by a list of tokens with embedded token-matching expressions. Expressions consist of a type, arguments, and attribute filters.
Contents:

* Demo
* Expressions
  * All-at-once expressions
    * Penn part-of-speech tag — (tag TAG) or (TAG)
  * Dynamic expressions
    * Number — (number +type +polarity)
    * Regular expression — (regex regex)
  * Precomputable expressions
    * Custom token group — (oneof tokens...) or (token1|token2...)
    * Personal pronoun — (perspro +case +gender +number +person +personhood)
    * Possessive determiner — (posdet +gender +number +person +personhood)
    * Possessive pronoun — (pospro +case +gender +number +person +personhood)
    * Possessive token — (pos)
    * Verb — (to lemma +fieldtype +number +person)
  * Raw tokens
* Configuration
  * Expression syntax
  * Phrase file syntax
* Architecture
* Preprocessing
  * HTML entity parsing
  * Destuttering
  * Unicode to ASCII normalization
  * Sentence boundary detection
  * Tokenization
  * Token normalization
  * Tagging
  * Contraction reversing
  * Textspeak normalization
This phrase file:
```
threat = subject, aux verb, intensifier, verb, object
----------
(perspro +subj +3rd +thing)
(DT) borg
----------
will
----------
fucking
----------
assimilate
----------
(posdet +thing) (butt|ass)
(perspro +obj)
```
Plus this input text:
```
The Borg will assimilate your ass.
```
Results in:
TODO
All expressions are checked for validity by the expression evaluator of their type during initialization.
All-at-once expressions require all the input tokens at once to make their judgments about whether each token is a match. Used for filtering on Penn part-of-speech tags.
All-at-once expression evaluators contain an AnalyzeTokens() method which generates some opaque metadata about each token, and an IsMatch() method which makes a judgment about a token with metadata.
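As an illustration, the two-phase interface might look like the following Python sketch. The method names mirror the `AnalyzeTokens()`/`IsMatch()` pair described above; the dictionary tagger is a made-up stand-in for LAPOS, which needs the whole sentence at once to tag accurately.

```python
class TagEvaluator:
    """Sketch of an all-at-once evaluator (names follow the C++
    AnalyzeTokens()/IsMatch() methods; the toy tagger is a stand-in)."""

    TOY_TAGS = {'the': 'DT', 'borg': 'NNP', 'will': 'MD'}

    def analyze_tokens(self, tokens):
        # Phase 1: compute per-sentence metadata (here, one POS tag per
        # token; unknown words default to NN).
        return [self.TOY_TAGS.get(t.lower(), 'NN') for t in tokens]

    def is_match(self, tag_arg, index, metadata):
        # Phase 2: judge a single token, eg against (tag DT).
        return metadata[index] == tag_arg
```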
(tag <uppercase Penn POS tag>) or (<uppercase Penn POS tag>)
Dimension | Possible filter values |
---|---|
N/A | N/A |
Dynamic expressions are open-class. Each expression is evaluated against each input token at call time.
Dynamic expression evaluators contain a MightMatch() method which may rule out all expressions of its type.
(number ...)
Dimension | Possible filter values |
---|---|
class | +float +int |
polarity | +neg +nonneg |
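As a sketch, a dynamic evaluator for (number ...) could judge each token at call time, with a MightMatch()-style pre-check to rule out all number expressions cheaply. The exact filter semantics below are assumptions, not the library's definition.

```python
def might_match(token):
    # MightMatch()-style pre-check: a token with no digit can't be a number.
    return any(c.isdigit() for c in token)

def number_matches(token, filters):
    # Per-token dynamic check. Filter semantics are assumptions:
    # +int requires an integer value, +float accepts any numeral,
    # +neg / +nonneg filter on sign.
    try:
        value = float(token)
    except ValueError:
        return False
    if '+int' in filters and not value.is_integer():
        return False
    if '+neg' in filters and value >= 0:
        return False
    if '+nonneg' in filters and value < 0:
        return False
    return True
```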
(regex <regex>)
Dimension | Possible filter values |
---|---|
N/A | N/A |
Precomputable expressions are closed-class, so we enumerate every possible match and put these matches (literal tokens) in a lookup table during initialization.
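For illustration, precomputation might look like this sketch. The pronoun table is a tiny made-up subset; the point is that matching later reduces to a lookup-table membership test.

```python
# Precomputable expressions are closed-class, so every token that can
# match is enumerated once at init time.
PERSPRO = [
    # (token, case, number, person) -- illustrative subset only
    ("i",  "subj", "sing", "1st"),
    ("me", "obj",  "sing", "1st"),
    ("it", "subj", "sing", "3rd"),
    ("it", "obj",  "sing", "3rd"),
]

def precompute(filters):
    # Keep tokens whose attributes satisfy every +filter.
    wanted = {f.lstrip('+') for f in filters}
    return {tok for (tok, *attrs) in PERSPRO if wanted <= set(attrs)}
```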
(oneof <space-separated list of tokens>)
Dimension | Possible filter values |
---|---|
N/A | N/A |
(perspro ...)
Dimension | Possible filter values |
---|---|
case | +obj +refl +subj |
gender | +female +male +neuter |
number | +plur +sing |
person | +1st +2nd +3rd |
personhood | +person +thing |
(posdet ...)
Dimension | Possible filter values |
---|---|
gender | +female +male +neuter |
number | +plur +sing |
person | +1st +2nd +3rd |
personhood | +person +thing |
(pospro ...)
Dimension | Possible filter values |
---|---|
case | +obj +refl +subj |
gender | +female +male +neuter |
number | +plur +sing |
person | +1st +2nd +3rd |
personhood | +person +thing |
(pos)
Dimension | Possible filter values |
---|---|
N/A | N/A |
(to <verb lemma> ...)
Dimension | Possible filter values |
---|---|
field type | +lemma +past +pastpart +pres +prespart |
number | +plur +sing |
person | +1st +2nd +3rd |
Everything that is not an expression is a raw token which is matched verbatim.
```
(<type> <0+ whitespace-separated args> <0+ whitespace-separated filters>)
```

or

```
(<uppercase Penn POS tag>)
```

or

```
(<2+ args separated by '|'>)
```

where

* `(<uppercase Penn POS tag>)` will be normalized to `(tag <uppercase Penn POS tag>)`
* `(<2+ args separated by '|'>)` will be normalized to `(oneof <2+ args separated by '|'>)`
* an arg is arbitrary text not containing whitespace, with `+`, `(`, and `)` escaped with `\`
* a filter is `^\+[a-z0-9]+$` (note the `+` prefix)
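The two normalization rules could be sketched as follows, assuming the parenthesized expression text has already been isolated:

```python
import re

def normalize(expr):
    # Apply the two rewrite rules: a bare uppercase Penn tag becomes a
    # (tag ...) expression, and '|'-separated args become (oneof ...).
    body = expr.strip('()')
    if re.fullmatch(r'[A-Z$]+', body):      # (DT) -> (tag DT)
        return '(tag %s)' % body
    if '|' in body:                         # (a|b) -> (oneof a b)
        return '(oneof %s)' % ' '.join(body.split('|'))
    return expr
```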
```
<phrase name> = <1+ comma-separated subsequence names>
<1+ newline-separated sequences>
```

a subsequence is

```
<dash divider>
<1+ newline-separated item lists>
```

where

* a phrase name is `^[a-z ]+$`
* a subsequence name is `^[a-z ]+$`
* subsequence names will be trimmed on both sides
* the number of subsequence names must match the number of sequences
* a dash divider is `^\-+$`
* an item list is 0+ space-separated items (ie, lines can be blank)
* an item is either a token or an expression
* a token is a string separable by whitespace
* an expression is a string of arbitrary text delimited by `(` and `)`
* occurrences of `(` and `)` inside an expression must be escaped with `\`
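A minimal sketch of this grammar as a parser. Item lists are kept as raw lines here, since real item splitting must respect parenthesized expressions, which can contain spaces.

```python
import re

def parse_phrase(text):
    # "name = a, b" header, then one dash divider before each
    # subsequence's item lists.
    lines = text.splitlines()
    name, _, names = lines[0].partition('=')
    subseq_names = [n.strip() for n in names.split(',')]
    subseqs = []
    for line in lines[1:]:
        if re.fullmatch(r'-+', line.strip()):
            subseqs.append([])        # a divider opens a new subsequence
        elif subseqs:
            subseqs[-1].append(line.strip())
    return name.strip(), subseq_names, subseqs
```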
```
Analyzer (cc/analysis/)
+-- Frontend (cc/frontend/)
|   +-- HTMLEntityParser (cc/frontend/html/)
|   +-- Destutterer (cc/frontend/destutter/)
|   +-- AsciiNormalizer (cc/frontend/ascii/)
|   +-- SentenceSplitter (cc/frontend/sbd/)
|   +-- Tokenizer (cc/frontend/tokenize/)
|   +-- Americanizer (cc/frontend/americanize/)
|   +-- Tagger (cc/frontend/tag/)
|   +-- Uncontractor (cc/frontend/contractions/)
|   +-- TextSpeakNormalizer (cc/frontend/textspeak/)
+-- PhraseDetector (cc/phrase_detection/)
    +-- EnglishExpressionEvaluator (cc/expression/, cc/english/, cc/tagging/)
    |   +-- PrecomputableEvaluators
    |   +-- DynamicEvaluators
    +-- SequenceDetector (cc/sequence_detection/)
```

Class hierarchies:

* SequenceDetector
  * EqualitySequenceDetector
  * VectorMembershipSequenceDetector
* ExpressionTypeEvaluator
  * PrecomputableEvaluator
  * DynamicEvaluator
Raw text is transformed into tagged tokens for use by the phrase detectors.
Conversions: HTML → Unicode → ASCII → list of tokens → list of (possible tokens, tag)
We use code from LAPOS for tokenization and especially tagging.
Some of the Unicode normalization and token normalization is designed to behave like the Stanford parser.
Example: `&copy; &#169; ©` → `© © ©`
Example: Whooooooooooooooa!!!!!! → Whoooa!
We drop overly repeated characters.
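A simplified sketch of the rule. The caps chosen here, three repeats for letters and one for anything else, are inferred from the example above; the shipped destutterer also handles repeated bigrams ("hahahaha") and symbols.

```python
import re

def destutter(text):
    # Cap runs of a repeated letter at three ("Whooooooa" -> "Whoooa")
    # and runs of any other repeated character at one ("!!!!!!" -> "!").
    text = re.sub(r'([A-Za-z])\1{3,}', r'\1\1\1', text)
    text = re.sub(r'([^A-Za-z])\1+', r'\1', text)
    return text
```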
Essentially, we want to strip accents, map symbols to ASCII equivalents, and use LaTeX quotes.
The following steps are applied, in order, to every Unicode code point to generate a static mapping:

- Replace nonprintable ASCII with space (U+0020).
- Normalize the various Unicode open/close quote styles to smart quotes (eg, « » to “ ”).
- Normalize currency symbols to `$` and `cents` (to match WSJ training data).
- Convert smart quotes to spaced Penn Treebank tokens (eg, “ ” to `` '').
- Decompose the Unicode code points according to NFKD.
- Replace non-ASCII Unicode code points with visually confusable code point sequences of type SA (same script, any case) that contain at least one ASCII code point, per confusables.txt from ICU.
- Filter out non-ASCII characters.
- Join into a string.
- Condense spaces.
- Drop parenthesized non-Latin characters that don't map to ASCII (eg, U+3208 ㈈).
We use a custom rule-based classifier written for web comments.
The result of the previous steps is then fed to the LAPOS tokenizer.
We make some changes in order to match the tagger's training data.
- Certain punctuation tokens are escaped (eg, `(` to `-LRB-`).
- Commonwealth spellings are Americanized.
Respelled tokens are fed to the LAPOS tagger, which uses a model pretrained on WSJ sections 2-21.
We reverse contractions, using the part-of-speech tag to disambiguate verb `'s` and possessive `'s`. This results in multiple possible words for some contractions (ie, `'s` = is/has, `'d` = did/had/would).
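A sketch of what a POS-keyed expansion table might look like. The tag pairings below are illustrative assumptions, not the real table; they only show how the tag picks the candidate expansions.

```python
# Hypothetical (token, tag) -> expansions table.
EXPANSIONS = {
    ("'s", "VBZ"): ["is", "has"],
    ("'s", "POS"): ["'s"],            # possessive: left alone
    ("'d", "MD"):  ["would"],
    ("'d", "VBD"): ["did", "had"],
}

def uncontract(token, tag):
    # Unknown (token, tag) pairs fall back to the token itself.
    return EXPANSIONS.get((token, tag), [token])
```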
We list alternate forms of tokens ("ur" → "ur", "your", "you're").
Important:
New features:
- "oneof" expressions. Including the "|" syntax.
- Regex expressions. Parsing those out of the phrase configs will be fun. Update expression syntax section of README.md when done.
- Multiple possible tags in tag expressions?
- Support "(TAG)" syntax.
- Implement (pos) expression type for ' and 's.
Correctness:
List the possible Penn tags for checking tag expression arguments (ie, the tag the user wants to filter on) in TagEvaluator.
Multi-token expressions. Internally preprocess the precomputable expressions to generate normal single-token expressions. Needed for some possessive personal edge cases to work. Example:
```
phrase config:   "I have (posdet +you) homework"
internal config: "I have (posdet-1-of-1 +you) homework"
                 "I have (posdet-1-of-2 +you) (posdet-2-of-2 +you) homework"
user input:      "I have yall's homework"
tokenized:       "I have yall 's homework"
```
I do not foresee having a use for non-precomputable multi-word expressions.
Destuttering: Make it work on canonically equivalent code point sequences. Can't solve it by just NFC normalization, have to take combining diacritical marks into account. Also do NFC normalization before calling it.
Phraser is an optional python extension around a C++ codebase that uses several C++11 features. You'll need python and a recent C++ compiler.
We support clang++ (Ubuntu and Mac OS X) and g++ (Ubuntu only), going back to g++ 4.7 (released March 2012) and clang++ 3.4 (released January 2014). Earlier clang versions could probably be supported by dropping some flags. Further than that would require nontrivial code changes.
Currently, the build flags are much stricter when using clang. It's geared toward development being done against clang, and deployment using g++ on an older system.
- Destuttering handles bigrams ("hahahahaha" → "haha").
- Destuttering handles symbols ("😋😋😋" → "😋").
- Added basic textspeak normalization.
User visible:
- Phrase configs are now defined in YAML (before, a custom text format).
- Boolean operators on expressions are added (and, or, not, etc.).
Backend:
- Integrated a rule-based sentence boundary detector for web comments (before, assumed one sentence per input).
- English contractions are automatically replaced with their uncontracted equivalents.
- All-at-once expressions removed (use dynamic expressions instead).
- Tagging is now done automatically in the frontend.
- Fix release: fix build_ext for more recent Ubuntu releases. Chooses compiler based on /etc/lsb-release.
- Fix release: setup.py defaults to building the python extension using g++-4.7 when not on Darwin in order for build_ext to work on an older system. build_ext is now broken on python 2.7.8 due to flags setup.py automatically inserts.
- Add support for g++ 4.7 and 4.8 when on Linux (tested versions: 4.7.4-2ubuntu1, 4.8.3-12ubuntu3).
- Add support for g++ when on Linux (tested version: 4.9.1-16ubuntu6).
- Fix release: add graft command.
- Fix release: package the header files as well.
- Rewrite the python extension to return an object that contains the state, instead of calling init at the module level.
- Add valgrind invocations.
- Fix release.
- Phraser is now importable via pip as a python module.
- Initial release. Written in C++11. Also builds a python extension. Compile with clang on Xubuntu or OS X. Tested versions:
Xubuntu:
clang version 3.6.0 (trunk 223446) Target: x86_64-unknown-linux-gnu Thread model: posix
OS X:
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn) Target: x86_64-apple-darwin13.4.0 Thread model: posix