plyr::llply
over base::lapply
for tm_map.VCorpus
. Passing tm_map
any of plyr's options for llply
is supported.
Benchmark for removeWords
on a 10,000 document corpus with and without parallelization (2011 iMac 2.8 GHz Intel Core i5, 8 GB RAM, OS X 10.6.7, R version 2.13.0 Patched (2011-04-23 r55622), with 4 parallel workers). (Not necessarily representative of anything.)
Non-parallel:
system.time(tm::tm_map(inboundCorpus[1:10000], removeWords, myStopWords)) elapsed = 100.681
With llply
:
library(doMC) registerDoMC(cores=4) system.time(tmparallel::tm_map(inboundCorpus[1:10000], removeWords, myStopWords, .parallel=T, .progress='text')) elapsed = 53.809
With equivalent size snow
MPI cluster:
library(snow) makeCluster(4, type="MPI") system.time(tm::tm_map(inboundCorpus[1:10000], removeWords, myStopWords)) elapsed = 117.673This adds an option to
stemDocument
to specify stemmer="Rstem"
to use the implementation from Omegahat.org's Rstem package. This eliminates nasty dependencies on Rjava, etc. and has some performance advantages.
Benchmark for stemDocument
on a 48,415 document corpus with Rstem vs. Snowball stemmer and with Rstem and parallelization (2011 iMac 2.8 GHz Intel Core i5, 8 GB RAM, OS X 10.6.7, R version 2.13.0 Patched (2011-04-23 r55622), with 4 parallel workers). (Not necessarily representative of anything.)
With SnowballStemmer, non parallel:
system.time(tm_map(inboundCorpus, stemDocument, .progress='text')) user system elapsed 731.575 4.220 730.456
With Rstem, non parallel:
system.time(tm_map(inboundCorpus, stemDocument, "english", stemmer="Rstem", .progress='text')) user system elapsed 180.282 0.626 181.013
With Rstem, parallel:
system.time(tm_map(inboundCorpus, stemDocument, "english", stemmer="Rstem", .progress='text', .parallel=T)) user system elapsed 240.981 3.216 152.029
(See "http://download.r-forge.r-project.org/manuals/R-Forge_Manual.pdf" for detailed information on registering a new project.
- Introduction
R is free software distributed under a GNU-style copyleft. R-Forge is a central platform for the development of R packages, R-related software and further projects. Among many other web-based features it provides facilities for collaborative source code management via Subversion (SVN).
- The directory you're in
This is the repository of your project. It contains two important pre-defined directories namely 'www' and 'pkg'. They must not be deleted otherwise R-Forge's core functionality will not be available (daily checking and building of your package or the project websites). These two directories are standardized and therefore are going to be described in this README. The rest of your repository can be used as you like.
- 'pkg' directory
To make use of the package building and checking feature the package source code has to be put into the 'pkg' directory of your repository (i.e., 'pkg/DESCRIPTION', 'pkg/R', 'pkg/man', etc.) or, alternatively, a subdirectory of 'pkg'. The latter structure allows to have more than one package in a single project, e.g., if a project consists of the packages foo and bar then the source code is located in 'pkg/foo' and 'pkg/bar', respectively.
R-Forge automatically examines the 'pkg' directory of every repository and builds the package sources as well as the package binaries on a daily basis for Mac OSX and Windows (if applicable). The package builds are provided in the 'R Packages' tab for download or can be installed directly in R from a CRAN-style repository using 'install.packages("foo", repos="http://R-Forge.R-project.org")'. Furthermore, in the 'R Packages' tab developers can examine logs of the build and check process on different platforms.
- 'www' directory
Developers may present their work on a subdomain of R-Forge, e.g., 'http://foo.R-Forge.R-project.org', or via a link to an external website.
This directory contains the project homepage which gets updated hourly on R-Forge, so please take into consideration that it will not be available right after you commit your changes or updates.
- Help
If you need help don't hesitate to contact us (R-Forge@R-project.org)