rstudio/reticulate

Code gets hung up trying to run get_coherence() on gensim coherence model

Closed this issue · 14 comments

mlpost commented

I have a Quarto document in RStudio in which I'm using reticulate. I can train an LDA model and create a coherence model, but when I try to get the coherence score, the code never finishes and produces no error. I have to force quit R. The same code runs in a Jupyter Notebook without issue.

reprex:

import gensim
from gensim import corpora, models, similarities, downloader

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

# Create Dictionary
id2word = gensim.corpora.Dictionary(texts)

# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Create lda model
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=2, id2word = id2word, iterations=50)

# Create coherence model
cm = gensim.models.coherencemodel.CoherenceModel(model=ldamodel, texts=texts, dictionary=id2word, coherence='c_v')

#get coherence - code gets hung up here
result = cm.get_coherence()

My memory usage still shows 50% unused.
I'm using R version 4.2.3, RStudio 2022.12.0 Build 353, Quarto version 1.2, Python 3.10.4, and reticulate 1.30.

Hi,

Sorry, I can't reproduce the problem in my setup.
Can you provide your reticulate::py_config()? I wonder if this could be related to using a conda environment?
Also what's your version of gensim?

Does this happen in a quarto notebook, or simply executing the code line by line in the IDE?

mlpost commented

Hi, thanks so much for looking into this.

This also happens when running chunk by chunk in an R Markdown document, so I don't believe the issue is directly related to Quarto. It doesn't happen when running chunk by chunk in a Jupyter Notebook in VS Code, without reticulate.

reticulate::py_config()

python: C:/Program Files/Python310/python.exe
libpython: C:/Program Files/Python310/python310.dll
pythonhome: C:/Program Files/Python310
version: 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)]
Architecture: 64bit
numpy: C:/Users/myUserName/AppData/Roaming/Python/Python310/site-packages/numpy
numpy_version: 1.25.1

NOTE: Python version was forced by use_python function

pip list

gensim 4.3.1

Thanks @mlpost, I can reproduce the issue on Windows. It doesn't seem related to the Python or R version: I have different versions on my Windows machine and the problem still happens.

It could be related to #1346 in the sense that parallelism is causing the problem.

mlpost commented

I'm not sure if this is helpful, but I just tried running the Quarto doc in VS Code. result = cm.get_coherence() yielded some odd output that I didn't get from RStudio:

#get coherence - code gets hung up here
result = cm.get_coherence()

WARNING: unknown option '-c'

WARNING: unknown option '--multiprocessing-fork'

[the same two warnings repeat several more times]

R version 4.2.3 (2023-03-15 ucrt) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[several more copies of this startup banner follow, interleaved with each other and with further copies of the two warnings]

Hmm, that's helpful. I believe this may be related to:

reticulate/R/package.R

Lines 248 to 251 in 383d4e7

if (is_windows()) {
  # patch sys.executable to point to python.exe, not Rterm.exe or rsession-utf8.exe, #1258
  py_run_string_impl("import sys; sys.executable = sys.argv[0]", local = TRUE)
}

not propagating correctly.
It seems that gensim is trying to spin up other python processes using the sys.executable but that is pointing to the R executable.
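A quick way to check this hypothesis from inside the embedded Python session is to look at what sys.executable actually names. The sketch below is only a heuristic: the helper looks_like_python() is hypothetical (it is not part of reticulate or gensim) and simply tests whether the file name starts with "python".

```python
import ntpath
import sys


def looks_like_python(executable):
    """Heuristic: True if the file name of `executable` starts with 'python'.

    ntpath is used so that both '/' and '\\' separators are handled,
    regardless of the platform this runs on.
    """
    if not executable:
        return False
    name = ntpath.basename(str(executable)).lower()
    return name.startswith("python")


# In a reticulate session on Windows, a result of False here would suggest
# that child processes may be launched with the R executable instead.
print("sys.executable:", sys.executable)
print("looks like Python:", looks_like_python(sys.executable))
```

If this prints False under reticulate but True in a plain Python session, that would be consistent with the explanation above, though it does not prove it.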

To reproduce the error we just need to call multiprocessing.Process() on Windows. The RStudio IDE swallows the stderr output in this instance, but running R in cmd.exe shows this:

> reticulate::repl_python()
Python 3.11.2 (C:/Users/kalin/Documents/.virtualenvs/r-reticulate/Scripts/python.exe)
Reticulate 1.31.0.9000 REPL -- A Python interpreter in R.
Enter 'exit' or 'quit' to exit the REPL and return to R.
>>> from multiprocessing import Process
>>> import os
>>>
>>> def info(title):
...     print(title)
...     print('module name:', __name__)
...     print('parent process:', os.getppid())
...     print('process id:', os.getpid())
...
>>> def f(name):
...     info('function f')
...     print('hello', name)
...
>>> if __name__ == '__main__':
...     info('main line')
...     p = Process(target=f, args=('bob',))
...     p.start()
...     p.join()
...
main line
module name: __main__
parent process: 20360
process id: 10172
WARNING: unknown option '-c'

WARNING: unknown option '--multiprocessing-fork'


R version 4.3.0 (2023-04-21 ucrt) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Here is a snippet (copy-pasted from https://docs.python.org/3/library/multiprocessing.html) that produces the error:

from multiprocessing import Process
import os

def info(title):
    print(title)
    print('module name:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())

def f(name):
    info('function f')
    print('hello', name)

if __name__ == '__main__':
    info('main line')
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

A more minimal reproducible example:

Process <- reticulate::import("multiprocessing")$Process
p <- Process(target = function() cat("HI\n"))
p$start()

Which gives output:

> p$start()
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
  AttributeError: Can't pickle local object 'make_python_function.<locals>.python_function'
Run `reticulate::py_last_error()` for details.
> WARNING: unknown option '-c'

WARNING: unknown option '--multiprocessing-fork'


R version 4.3.0 (2023-04-21 ucrt) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Interestingly, it drops you into a fresh R session, which you can interact with, and then exit, returning back to the launching R session.

Calling commandArgs() in this new session gives this:

> commandArgs(trailingOnly=F)
[1] "C:\\PROGRA~1\\R\\R-43~1.1\\bin\\x64\\Rterm.exe"
[2] "-c"
[3] "from multiprocessing.spawn import spawn_main; spawn_main(parent_pid=18284, pipe_handle=820)"
[4] "--multiprocessing-fork"

It looks related to #1458. It should find the correct Python executable, right?

I think that issue was specific to the subprocess module and the IDE - this is something different since we can reproduce outside the IDE.

My current guess is that https://github.com/python/cpython/blob/c494fb333b57bdf43fc90189fc29a00c293b2987/Lib/multiprocessing/spawn.py#L88C61-L88C61 is creating a cmdline that launches R instead of Python.
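To see the command line that multiprocessing would build, one can call the helper at the linked line directly. This is a sketch under assumptions: get_command_line() lives in multiprocessing/spawn.py but is not a documented public API, and the parent_pid and pipe_handle values are just the ones copied from the commandArgs() output above.

```python
from multiprocessing import spawn

# Illustrative values taken from the commandArgs() output earlier in the thread.
cmd = spawn.get_command_line(parent_pid=18284, pipe_handle=820)

# First element: the executable multiprocessing intends to launch.
print("executable:", cmd[0])

# Last three elements: '-c', the bootstrap code, and '--multiprocessing-fork'.
for part in cmd[-3:]:
    print(part)
```

In a non-frozen interpreter the last three elements match entries [2] to [4] of the commandArgs() output, so the first element is the piece worth comparing between a plain Python session and a reticulate session.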

Maybe it's because the multiprocessing module caches the executable before we have a chance to modify it: https://github.com/python/cpython/blob/c494fb333b57bdf43fc90189fc29a00c293b2987/Lib/multiprocessing/spawn.py#L45
We may have to patch the module directly, or install a hook that calls multiprocessing.set_executable() on first import.
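A minimal sketch of the set_executable() idea, assuming the path of the real Python interpreter is known. The helper point_multiprocessing_at() is hypothetical, and the example reuses the current interpreter's path only so that it runs anywhere; whether this call alone resolves the hang is not established here.

```python
import multiprocessing
import os
import sys
from multiprocessing import spawn


def point_multiprocessing_at(python_path):
    """Tell multiprocessing which interpreter to launch for child processes.

    Returns the value multiprocessing has stored, decoded to str, because
    get_executable() can return bytes on some platforms and versions.
    """
    multiprocessing.set_executable(python_path)
    return os.fsdecode(spawn.get_executable())


# Stand-in path: in a reticulate session this would be the python.exe
# reported by reticulate::py_config(), not sys.executable.
chosen = point_multiprocessing_at(sys.executable)
print("multiprocessing will launch:", chosen)
```

multiprocessing.set_executable() is documented in the standard library, so a hook of this kind would not depend on private module state.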

It would seem that the multiprocessing module does not use sys.executable on Windows, which would be the source of the issue.
Instead, it uses the Windows API directly, using either the PID of the launching process to resolve the handle:
https://github.com/python/cpython/blob/c494fb333b57bdf43fc90189fc29a00c293b2987/Lib/multiprocessing/spawn.py#L108C50-L108C50 or even more directly: https://github.com/python/cpython/blob/c494fb333b57bdf43fc90189fc29a00c293b2987/Lib/multiprocessing/reduction.py#L74C45-L74C45

Figuring out the best way to patch this correctly will require some investigation.

This is where the executable gets reset to the 'wrong' Rterm.exe. It looks like a special branch for virtual environments on Windows:
https://github.com/python/cpython/blob/546cab84448b892c92e68d9c1a3d3b58c13b3463/Lib/multiprocessing/popen_spawn_win32.py#L63C55-L63C55
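To compare the candidate paths side by side, a small diagnostic like the one below can be run in both a plain Python session and a reticulate session. It rests on an assumption not confirmed in this thread: one reading of the linked branch is that, inside a virtual environment on Windows, the child is launched with sys._base_executable rather than the value from get_executable(). sys._base_executable is a private attribute, so it is read defensively, and the helper interpreter_paths() is hypothetical.

```python
import sys
from multiprocessing import spawn


def interpreter_paths():
    """Collect the paths that may decide which program a spawned child runs."""
    return {
        "sys.executable": sys.executable,
        # Private attribute; may be missing on some builds, hence getattr.
        "sys._base_executable": getattr(sys, "_base_executable", None),
        "spawn.get_executable()": spawn.get_executable(),
    }


for name, value in interpreter_paths().items():
    print(f"{name}: {value!r}")
```

If any of these paths names Rterm.exe or rsession-utf8.exe under reticulate, that entry is a likely candidate for what ends up being launched.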

Fixed on main now. Please try the development version:

remotes::install_github("rstudio/reticulate")

mlpost commented

I just tried it and it worked! Thanks so much!!