Note: We don't recommend using this package and have archived it. Most of its functionality is covered by other packages such as asent, spaCy, etc.
You shouldn't text your X, but you should extract from text. Text To X is a quick and easy-to-use NLP pipeline for converting text to topics, tokens, sentiment and more.
Install by running the following line:
pip3 install git+https://github.com/centre-for-humanities-computing/text_to_x
To reinstall the package, use the following:
pip3 install --force-reinstall --no-deps git+https://github.com/centre-for-humanities-computing/text_to_x
Please note that polyglot is required for language detection, but it has been removed from the dependencies due to multiple installation issues. We thus leave the installation of polyglot to the user and refer questions regarding the installation to the polyglot forum. text_to_x works without polyglot as long as the language is specified.
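If you do want language detection, you can install polyglot yourself. One common route, assuming a working system ICU library is available (the part that tends to cause the installation issues in the first place), is:
pip3 install pyicu pycld2
pip3 install polyglot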
For tokenization and token tagging, text_to_x uses stanza, flair and keras, covering tokenization, lemmatization, POS tagging, dependency parsing and NER tagging.
Let's start off by loading some data. We will use the Danish "Fyrtårnet" by H.C. Andersen, split into 3 short chunks of 50 sentences each. For the full text, set short_splits=False. It is also possible to set data='origin_of_species'.
import text_to_x as ttx
texts = ttx.get_test_data(data="fyrtårnet", short_splits=True)
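texts is now a list of three strings, one per split.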
And the usage is very simple:
ttt = ttx.TextToTokens()
dfs = ttt.texts_to_tokens(texts)
Currently at text: 0
Use device: cpu
---
Loading: tokenize
With settings:
...
Currently at text: 1
Currently at text: 2
Currently at text: 3
Examining the output, we see that dfs has a length equal to the number of strings in texts, and that each item is a pandas DataFrame.
len(dfs)
3
df = dfs[0] # take the first item
df.head()
   n_sent        token        lemma  upos  xpos dependency relation
0       0          Der          der   ADV     _                expl
1       0          kom        komme  VERB     _                root
2       0           en           en   DET     _                 det
3       0       soldat       soldat  NOUN     _                 obj
4       0  marcherende  marcherende  VERB     _               xcomp
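Since each item is an ordinary pandas DataFrame, standard pandas operations apply. A minimal sketch, assuming only the column names shown above:
df = dfs[0]
nouns = df.loc[df["upos"] == "NOUN", "token"]  # all noun tokens
lemma_counts = df["lemma"].value_counts()      # lemma frequencies
n_sentences = df["n_sent"].nunique()           # number of sentences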
In more extended use cases, you can modify the arguments further, e.g.:
ttt = ttx.TextToTokens(lang = ["da", "da", "da"],
tokenize="stanza",
lemmatize="stanza",
stemming=None,
pos="stanza",
mwt="stanza",
depparse="stanza",
ner="stanza")
dfs = ttt.texts_to_tokens(texts = texts)
Note that lang can also be a list of languages, one per text; if left out, the language is detected using polyglot.
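For example, a small sketch of a mixed Danish/English corpus (the texts here are just placeholders):
mixed_texts = ["En dansk tekst.", "An English text.", "Endnu en dansk tekst."]
ttt = ttx.TextToTokens(lang=["da", "en", "da"])  # one language code per text, in order
dfs = ttt.texts_to_tokens(mixed_texts)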
For sentiment analysis, Text to X uses either a dictionary-based approach or a multilingual BERT. For the dictionary approach, it uses an altered version of vaderSentiment that allows for multiple languages and for the use of tokens, lemmas or stems derived from TextToTokens.
The simple use case is (using the same texts as above):
tts = ttx.TextToSentiment(lang="da", method="danlp_bert_tone")
df = tts.texts_to_sentiment(texts)
print(df)
polarity_pos polarity_neu polarity_neg analytic_obj analytic_subj
0 0.002590 0.003864 0.993547 0.999044 0.000956
1 0.082117 0.027139 0.890743 0.999934 0.000066
2 0.349517 0.565464 0.085019 0.999946 0.000055
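As the column names suggest, each row corresponds to one input text: the polarity_* columns hold the BERT tone model's probabilities for positive, neutral and negative polarity, and the analytic_* columns its probabilities for the text being objective versus subjective.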
If we want to use it with TextToTokens, we can do as follows:
# create the TextToTokens
ttt = ttx.TextToTokens()
ttt.texts_to_tokens(texts)
# initialize the TextToSentiment
tts = ttx.TextToSentiment(method="dictionary")
# simply pass the ttt as the first argument
df = tts.texts_to_sentiment(ttt)
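Passing the TextToTokens object rather than the raw texts lets the dictionary method score the tokens, lemmas or stems it has already computed, instead of tokenizing the texts again.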
Planned additions, in estimated order:
- Make a class TextToTopic for topic modelling using gensim, mallet and LDA
- Add a fast tokenizer for TextToTokens
- Add an entity tagger
- Add an entity tagger for Danish
- Update TextToDf to use Stanza instead of stanfordnlp
- Additions to the TextToSentiment class
  - Add token_to_sentiment, which gives the sentiment of each token
  - Add sentence_to_sentiment, which gives the sentiment of each sentence
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. arXiv preprint.
Hutto, C. J., & Gilbert, E. E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI.
Lauridsen, G. A., Dalsgaard, J. A., & Svendsen, L. K. B. (2019). SENTIDA: A New Tool for Sentiment Analysis in Danish. Journal of Language Works-Sprogvidenskabeligt Studentertidsskrift, 4(1), 38-53.