This package is an R wrapper to the spaCy "industrial strength natural language processing" Python library from http://spacy.io.
-
Install or update Python on your system.
macOS and Linux typically come with Python installed, although you may wish to install a newer or different version from https://www.python.org/downloads/.
Windows only: If you have not yet installed Python, Download and install Python for Windows. We recommend using Python 3, although the Python 2.7.x also works. During the installation process, be sure to scroll down in the installation option window and find the "Add Python.exe to Path", and click on the small red "x."
For the installation of
spaCy
and spacyr in macOS (in homebrew and for the default Python) and Windows you can find more detailed instructions for a Mac OS X Installation and Windows Installation. -
Install additional command-line compiler tools.
- Windows:
- Install Virtual Studio Express 2015.
- Install RTools.
- macOS:
Either install XCode from the App Store, or an abbreviated version using the following from Terminal:
xcode-select --install
- Linux: no additional tools are required.
- Windows:
-
Install spaCy.
Installation instructions for spaCy are available from spacy.io. In short, once Python is installed on your system:
pip install -U spacy spacy download en
You can test your installation at the command line using:
python -c "import spacy; spacy.load('en'); print('OK')"
Additional instructions are available from the spaCy website for installing using a
virtualenv
or an Anaconda installation. -
Install the spacyr R package:
- From GitHub:
To install the latest package from source, you can simply run the following.
devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)
- From CRAN:
install.packages("spacyr")
If you have multiple Python executables in your systems (for instance if you are a macOS user and have installed Python 3, you will also have the system-installed Python 2.7.x), then the spacy_initialize()
function will check whether each of them have spaCy installed or not.
You can also specify the python executable directly, when calling spacy_initialize()
. For instance, if your installation of spaCy uses /usr/local/bin/python
, then you could use:
library("spacyr")
spacy_initialize()
We welcome your comments and feedback. Please file issues on the issues page, and/or send us comments at kbenoit@lse.ac.uk and A.Matsuo@lse.ac.uk.
To allow R to access the underlying Python functionality, it must open a connection by being initialized within your R session.
We provide a function for this, spacy_initialize()
, which attempts to make this process as painless as possible by searching your system for Python executables, and testing which have spaCy installed. For power users (such as those with multiple installations of Python), it is possible to specify the path manually through the python_executable
argument, which also makes initialization faster. (You will need to change the value on your system of the Python executable.)
library("spacyr")
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in more than one python
## spacyr will use /anaconda/bin/python (because ask = FALSE)
## successfully initialized (spaCy Version: 2.0.2, language model: en)
The spacy_parse()
is spacyr's main function. It calls spaCy both to tokenize and tag the texts. It provides two options for part of speech tagging, plus options to return word lemmas, entity recognition, and dependency parsing. It returns a data.frame
corresponding to the emerging text interchange format for token data.frames.
The approach to tokenizing taken by spaCy is inclusive: it includes all tokens without restrictions, including punctuation characters and symbols.
Example:
txt <- c(d1 = "spaCy excels at large-scale information extraction tasks.",
d2 = "Mr. Smith goes to North Carolina.")
# process documents and obtain a data.table
parsedtxt <- spacy_parse(txt)
parsedtxt
## doc_id sentence_id token_id token lemma pos entity
## 1 d1 1 1 spaCy spacy ADJ
## 2 d1 1 2 excels excel NOUN
## 3 d1 1 3 at at ADP
## 4 d1 1 4 large large ADJ
## 5 d1 1 5 - - PUNCT
## 6 d1 1 6 scale scale NOUN
## 7 d1 1 7 information information NOUN
## 8 d1 1 8 extraction extraction NOUN
## 9 d1 1 9 tasks task NOUN
## 10 d1 1 10 . . PUNCT
## 11 d2 1 1 Mr. mr. PROPN
## 12 d2 1 2 Smith smith PROPN PERSON_B
## 13 d2 1 3 goes go VERB
## 14 d2 1 4 to to ADP
## 15 d2 1 5 North north PROPN GPE_B
## 16 d2 1 6 Carolina carolina PROPN GPE_I
## 17 d2 1 7 . . PUNCT
Two fields are available for part-of-speech tags. The pos
field returned is the Universal tagset for parts-of-speech, a general scheme that most users will find serves their needs, and also that provides equivalencies across langages. spacyr also provides a more detailed tagset, defined in each spaCy language model. For English, this is the OntoNotes 5 version of the Penn Treebank tag set.
spacy_parse(txt, tag = TRUE, entity = FALSE, lemma = FALSE)
## doc_id sentence_id token_id token pos tag
## 1 d1 1 1 spaCy ADJ JJ
## 2 d1 1 2 excels NOUN NNS
## 3 d1 1 3 at ADP IN
## 4 d1 1 4 large ADJ JJ
## 5 d1 1 5 - PUNCT HYPH
## 6 d1 1 6 scale NOUN NN
## 7 d1 1 7 information NOUN NN
## 8 d1 1 8 extraction NOUN NN
## 9 d1 1 9 tasks NOUN NNS
## 10 d1 1 10 . PUNCT .
## 11 d2 1 1 Mr. PROPN NNP
## 12 d2 1 2 Smith PROPN NNP
## 13 d2 1 3 goes VERB VBZ
## 14 d2 1 4 to ADP IN
## 15 d2 1 5 North PROPN NNP
## 16 d2 1 6 Carolina PROPN NNP
## 17 d2 1 7 . PUNCT .
For the German language model, the Universal tagset (pos
) remains the same, but the detailed tagset (tag
) is the TIGER Treebank scheme.
spacyr can extract entities, either named or "extended".
parsedtxt <- spacy_parse(txt, lemma = FALSE)
entity_extract(parsedtxt)
## doc_id sentence_id entity entity_type
## 1 d2 1 Smith PERSON
## 2 d2 1 North Carolina GPE
entity_extract(parsedtxt, type = "all")
## doc_id sentence_id entity entity_type
## 1 d2 1 Smith PERSON
## 2 d2 1 North Carolina GPE
Or, convert multi-word entities into single "tokens":
entity_consolidate(parsedtxt)
## doc_id sentence_id token_id token pos entity_type
## 1 d1 1 1 spaCy ADJ
## 2 d1 1 2 excels NOUN
## 3 d1 1 3 at ADP
## 4 d1 1 4 large ADJ
## 5 d1 1 5 - PUNCT
## 6 d1 1 6 scale NOUN
## 7 d1 1 7 information NOUN
## 8 d1 1 8 extraction NOUN
## 9 d1 1 9 tasks NOUN
## 10 d1 1 10 . PUNCT
## 11 d2 1 1 Mr. PROPN
## 12 d2 1 2 Smith ENTITY PERSON
## 13 d2 1 3 goes VERB
## 14 d2 1 4 to ADP
## 15 d2 1 5 North_Carolina ENTITY GPE
## 16 d2 1 6 . PUNCT
Detailed parsing of syntactic dependencies is possible with the dependency = TRUE
option:
spacy_parse(txt, dependency = TRUE, lemma = FALSE, pos = FALSE)
## doc_id sentence_id token_id token head_token_id dep_rel entity
## 1 d1 1 1 spaCy 2 nsubj
## 2 d1 1 2 excels 2 ROOT
## 3 d1 1 3 at 2 prep
## 4 d1 1 4 large 6 amod
## 5 d1 1 5 - 6 punct
## 6 d1 1 6 scale 9 compound
## 7 d1 1 7 information 8 compound
## 8 d1 1 8 extraction 9 compound
## 9 d1 1 9 tasks 3 pobj
## 10 d1 1 10 . 2 punct
## 11 d2 1 1 Mr. 2 compound
## 12 d2 1 2 Smith 3 nsubj PERSON_B
## 13 d2 1 3 goes 3 ROOT
## 14 d2 1 4 to 3 prep
## 15 d2 1 5 North 6 compound GPE_B
## 16 d2 1 6 Carolina 4 pobj GPE_I
## 17 d2 1 7 . 3 punct
By default, spacyr loads an English language model. You also can load SpaCy's other language models or use one of the language models with alpha support by specifying the model
option when calling spacy_initialize()
. We have sucessfully tested following language models with spacy version 2.0.1.
Language | ModelName |
---|---|
German | de |
Spanish | es_core_news_sm |
Portuguese | pt |
French | fr |
Italian | it |
Dutch | nl |
This is an example of parsing German texts.
## first finalize the spacy if it's loaded
spacy_finalize()
spacy_initialize(model = "de")
## Python space is already attached. If you want to swtich to a different Python, please restart R.
## successfully initialized (spaCy Version: 2.0.2, language model: de)
txt_german <- c(R = "R ist eine freie Programmiersprache für statistische Berechnungen und Grafiken. Sie wurde von Statistikern für Anwender mit statistischen Aufgaben entwickelt.",
python = "Python ist eine universelle, üblicherweise interpretierte höhere Programmiersprache. Sie will einen gut lesbaren, knappen Programmierstil fördern.")
results_german <- spacy_parse(txt_german, dependency = TRUE, lemma = FALSE, tag = TRUE)
results_german
## doc_id sentence_id token_id token pos tag
## 1 R 1 1 R PROPN NE
## 2 R 1 2 ist AUX VAFIN
## 3 R 1 3 eine DET ART
## 4 R 1 4 freie ADJ ADJA
## 5 R 1 5 Programmiersprache NOUN NN
## 6 R 1 6 für ADP APPR
## 7 R 1 7 statistische ADJ ADJA
## 8 R 1 8 Berechnungen NOUN NN
## 9 R 1 9 und CONJ KON
## 10 R 1 10 Grafiken NOUN NN
## 11 R 1 11 . PUNCT $.
## 12 R 2 1 Sie PRON PPER
## 13 R 2 2 wurde AUX VAFIN
## 14 R 2 3 von ADP APPR
## 15 R 2 4 Statistikern NOUN NN
## 16 R 2 5 für ADP APPR
## 17 R 2 6 Anwender NOUN NN
## 18 R 2 7 mit ADP APPR
## 19 R 2 8 statistischen ADJ ADJA
## 20 R 2 9 Aufgaben NOUN NN
## 21 R 2 10 entwickelt VERB VVPP
## 22 R 2 11 . PUNCT $.
## 23 python 1 1 Python NOUN NN
## 24 python 1 2 ist AUX VAFIN
## 25 python 1 3 eine DET ART
## 26 python 1 4 universelle ADJ ADJA
## 27 python 1 5 , PUNCT $,
## 28 python 1 6 üblicherweise ADV ADV
## 29 python 1 7 interpretierte ADJ ADJA
## 30 python 1 8 höhere ADJ ADJA
## 31 python 1 9 Programmiersprache NOUN NN
## 32 python 1 10 . PUNCT $.
## 33 python 2 1 Sie PRON PPER
## 34 python 2 2 will VERB VMFIN
## 35 python 2 3 einen DET ART
## 36 python 2 4 gut ADJ ADJD
## 37 python 2 5 lesbaren ADJ ADJA
## 38 python 2 6 , PUNCT $,
## 39 python 2 7 knappen ADJ ADJA
## 40 python 2 8 Programmierstil NOUN NN
## 41 python 2 9 fördern VERB VVFIN
## 42 python 2 10 . PUNCT $.
## head_token_id dep_rel entity
## 1 2 sb
## 2 2 ROOT
## 3 5 nk
## 4 5 nk
## 5 2 pd
## 6 5 mnr
## 7 8 nk
## 8 6 nk
## 9 8 cd
## 10 9 cj
## 11 2 punct
## 12 2 sb
## 13 2 ROOT
## 14 10 sbp
## 15 3 nk LOC_B
## 16 4 mnr
## 17 5 nk
## 18 10 mo
## 19 9 nk
## 20 7 nk
## 21 2 oc
## 22 2 punct
## 23 2 sb MISC_B
## 24 2 ROOT
## 25 9 nk
## 26 9 nk
## 27 4 punct
## 28 7 mo
## 29 4 cj
## 30 9 nk
## 31 2 pd
## 32 2 punct
## 33 2 sb
## 34 2 ROOT
## 35 8 nk
## 36 5 mo
## 37 8 nk
## 38 5 punct
## 39 5 cj
## 40 9 oa
## 41 2 oc
## 42 2 punct
Note that the additional language models must first be installed in spaCy. The German language model, for example, can be installed (python -m spacy download de
) before you call spacy_initialize()
.
A background process of spaCy is initiated when you ran spacy_initialize()
. Because of the size of language models of spaCy, this takes up a lot of memory (typically 1.5GB). When you do not need the Python connection any longer, you can finalize the python connection (and terminate the process) by calling the spacy_finalize()
function.
spacy_finalize()
By calling spacy_initialize()
again, you can restart the backend spaCy.
If you want to skip spacyr searching for Python intallation with spaCy, you can do so by permanently setting the path to the spaCy-enabled Python by specifying it in an R-startup file, which is read every time a new R
is launched. For Mac/Linux, the file is ~/.Rprofile
and for
The syntax is:
options(spacy_python_setting = list(type = "python_executable",
py_path = "/the/path/to/python")) # e.g. "/usr/local/bin/python"
These lines can be directly inserted by a text editor. Or from R, enter the following (for Mac/Linux):
option_string <- 'options(spacy_python_setting = list(type = "python_executable",
py_path = "/the/path/to/python")) '
write(option_string, file = "~/.Rprofile", append = TRUE)
Once the file is appropriately set up, the message from spacy_initialize()
changes to something like:
## The python path is already set
## spacyr will use: python_executable = /usr/local/bin/python
## successfully initialized (spaCy Version: 2.0.1, language model: en)
Some of the token- and type-related standard methods from quanteda also work on the new tagged token objects:
require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
## quanteda version 0.99.22
## Using 3 of 4 threads for parallel computing
docnames(parsedtxt)
## [1] "d1" "d2"
ndoc(parsedtxt)
## [1] 2
ntoken(parsedtxt)
## d1 d2
## 10 7
ntype(parsedtxt)
## d1 d2
## 10 7
The Text Interchange Format is an emerging standard structure for text package objects in R, such as corpus and token objects. spacy_initialize()
can take a TIF corpus data.frame or character object as a valid input. Moreover, the data.frames returned by spacy_parse()
and entity_consolidate()
conform to the TIF tokens standard for data.frame tokens objects. This will make it easier to use with any text analysis package for R that works with TIF standard objects.