piccolbo/dplyr.spark.hive

Performing CRAN CHECK

Closed this issue · 10 comments

I've downloaded the source code of the master branch and run a CRAN check, and it looks like there is a serious warning with the tests. Is this only the case on my local machine (e.g. misconfigured Java), or has this package never passed a CRAN check before? Here are the results of a CRAN check on my local machine:

==> devtools::check(document = FALSE)

Setting env vars ---------------------------------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building dplyr.spark.hive ------------------------------------------------------
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD  \
  build '/home/mkosinski/dplyr.spark.hive/pkg' --no-resave-data --no-manual 

* checking for file '/home/mkosinski/dplyr.spark.hive/pkg/DESCRIPTION' ... OK
* preparing 'dplyr.spark.hive':
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building 'dplyr.spark.hive_0.5.0.tar.gz'

Setting env vars ---------------------------------------------------------------
_R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
_R_CHECK_CRAN_INCOMING_           : FALSE
_R_CHECK_FORCE_SUGGESTS_          : FALSE
Checking dplyr.spark.hive ------------------------------------------------------
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD  \
  check '/tmp/RtmpQr8pYx/dplyr.spark.hive_0.5.0.tar.gz' --as-cran --timings 

* using log directory '/home/mkosinski/dplyr.spark.hive/dplyr.spark.hive.Rcheck'
* using R version 3.2.2 (2015-08-14)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* using option '--as-cran'
* checking for file 'dplyr.spark.hive/DESCRIPTION' ... OK
* checking extension type ... Package
* this is package 'dplyr.spark.hive' version '0.5.0'
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package 'dplyr.spark.hive' can be installed ... WARNING
Found the following significant warnings:
  Warning: changing locked binding for 'over' in 'dplyr' whilst loading 'dplyr.spark.hive'
  Warning: changing locked binding for 'partial_eval' in 'dplyr' whilst loading 'dplyr.spark.hive'
  Warning: changing locked binding for 'default_op' in 'dplyr' whilst loading 'dplyr.spark.hive'
  Warning: replacing previous import by 'purrr::order_by' when loading 'dplyr.spark.hive'
  Warning: replacing previous import by 'purrr::%>%' when loading 'dplyr.spark.hive'
See '/home/mkosinski/dplyr.spark.hive/dplyr.spark.hive.Rcheck/00install.out' for details.
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... NOTE
Unexported objects imported by ':::' calls:
  'dplyr:::auto_copy' 'dplyr:::build_query' 'dplyr:::collect.tbl_sql'
  'dplyr:::common_by' 'dplyr:::copy_to.src_sql'
  'dplyr:::db_save_query.DBIConnection' 'dplyr:::over'
  'dplyr:::partition_group' 'dplyr:::sql_vector'
  'dplyr:::update.tbl_sql' 'dplyr:::uses_window_fun'
  See the note in ?`:::` about the use of this operator.
package 'methods' is used but not declared
* checking S3 generic/method consistency ... WARNING
Warning: declared S3 method 'intersect.tbl_HS2' not found
Warning: declared S3 method 'union.tbl_HS2' not found
See section 'Generic functions and methods' in the 'Writing R
Extensions' manual.
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'load_to':
  'load_to.src_Hive' 'load_to.src_SparkSQL'
Objects in \usage without \alias in documentation object 'tbls':
  'tbls.src_sql'
Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking examples ... OK
* checking for unstated dependencies in tests ... OK
* checking tests ...
  Running 'databases.R' ERROR
Running the tests in 'tests/databases.R' failed.
Last 13 lines of output:
  Warning: changing locked binding for 'over' in 'dplyr' whilst loading 'dplyr.spark.hive'
  Warning: changing locked binding for 'partial_eval' in 'dplyr' whilst loading 'dplyr.spark.hive'
  Warning: changing locked binding for 'default_op' in 'dplyr' whilst loading 'dplyr.spark.hive'
  Warning messages:
  1: replacing previous import by 'purrr::order_by' when loading 'dplyr.spark.hive' 
  2: replacing previous import by 'purrr::%>%' when loading 'dplyr.spark.hive' 
  > 
  > copy_to_from_local = dplyr.spark.hive:::copy_to_from_local
  > 
  > my_db = src_SparkSQL()
  Error in .jfindClass(as.character(driverClass)[1]) : class not found
  Calls: src_SparkSQL -> src_HS2 -> JDBC -> is.jnull -> .jfindClass
  Execution halted
* checking PDF version of manual ... OK
* DONE
Status: 1 ERROR, 3 WARNINGs, 1 NOTE

See '/home/mkosinski/dplyr.spark.hive/dplyr.spark.hive.Rcheck/00check.log' for details.

Error: Command failed (1)
Execution halted

Exited with status 1.

Session Info (package versions)

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=pl_PL.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pl_PL.UTF-8        LC_COLLATE=pl_PL.UTF-8    
 [5] LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=pl_PL.UTF-8   
 [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] quickcheck_3.5.1 nycflights13_0.1 ggplot2_2.0.0   
 [4] Lahman_4.0-1     lazyeval_0.1.10  purrr_0.2.0     
 [7] dplyr_0.4.3      RJDBC_0.2-5      rJava_0.9-6     
[10] DBI_0.3.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2      magrittr_1.5     MASS_7.3-45     
 [4] munsell_0.4.2    colorspace_1.2-6 R6_2.1.1        
 [7] stringr_1.0.0    plyr_1.8.3       tools_3.2.2     
[10] parallel_3.2.2   functional_0.6   grid_3.2.2      
[13] gtable_0.1.2     htmltools_0.3    yaml_2.1.13     
[16] assertthat_0.1   digest_0.6.8     crayon_1.3.1    
[19] pryr_0.1.2       codetools_0.2-14 bitops_1.0-6    
[22] testthat_0.11.0  memoise_0.2.1    rmarkdown_0.9   
[25] stringi_1.0-1    scales_0.3.0 

I don't use devtools, but you most likely have HADOOP_JAR unset. I run R CMD check and it passes. The warnings, unfortunately, are a consequence of monkey patching; the alternatives are a fork of dplyr, or Hadley acting on the issues I submit, both unlikely events. It was not easy to make this backend work, and I had to employ some "advanced techniques" -- aka hacks. The missing-doc-entry warnings are a choice on my part not to expose methods to the end user. I think that's a bug in R CMD check, but it may be fixed in my next life, so I just live with the warnings.
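
For context, the "changing locked binding" warnings come from reassigning unexported dplyr functions inside dplyr's sealed namespace at load time, which R permits during package loading but warns about. A minimal sketch of the general technique (illustrative names, not the package's actual code):

# Replace a function inside another package's sealed namespace.
# Outside of a load hook the binding must be unlocked explicitly;
# when the same reassignment happens from a package's .onLoad, R
# allows it but emits the "changing locked binding ... whilst
# loading ..." warning seen in the check log above.
patch_namespace = function(pkg, name, value) {
  ns = asNamespace(pkg)
  unlockBinding(name, ns)
  assign(name, value, envir = ns)
  lockBinding(name, ns)
}

# hypothetical use: swap dplyr's internal over() for a Hive-aware variant
# patch_namespace("dplyr", "over", my_hive_over)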

What if I need to pass more than one .jar via HADOOP_JAR?
I think I'd need to specify such .jars in the classPath argument of the JDBC() call on this line: https://github.com/piccolbo/dplyr.spark.hive/blob/master/pkg/R/src-HS2.R#L38

JDBC(driverclass,
     classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
                   # "/opt/hive/lib/commons-configuration-1.6.jar",
                   "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
                   "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"))

OK, then HADOOP_JAR should be assigned as

Sys.setenv(HADOOP_JAR = paste0(classPath, collapse = .Platform$path.sep))

since JDBC splits classPath like this:

classPath <- path.expand(unlist(strsplit(classPath, .Platform$path.sep)))
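
Putting the two together, a quick sanity check of the round trip (using the jar paths from above):

classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
              "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
              "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar")

# collapse into a single env var value, separated by ':' on Linux
Sys.setenv(HADOOP_JAR = paste0(classPath, collapse = .Platform$path.sep))

# JDBC() recovers the original vector by splitting on the same separator
recovered = path.expand(unlist(strsplit(Sys.getenv("HADOOP_JAR"), .Platform$path.sep)))
stopifnot(identical(recovered, classPath))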

Is this a theory or did the package clear the checks with this setting?

R CMD check can't be performed in my case, since I cannot pass user authentication in the current implementation of src_HS2, as described in #18.

R CMD check runs a standalone Spark instance and doesn't require any authorization. From my review of your changes, the authentication params are optional, so I don't understand your explanation. If you are correct, we may need to change things a bit. By the way, dev failed on unrelated issues; better merge from master first. I will let you know shortly.

It took me a while to figure out what should be specified in HADOOP_JAR and how to pass the host and port to create src_SparkSQL():

classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
              "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
              "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar",
              "/opt/spark-1.5.2-bin-hadoop2.4/lib/spark-assembly-1.5.2-hadoop2.4.0.jar",
              "/usr/share/java/slf4j/log4j-over-slf4j.jar",
              "/opt/wpusers/r-wpitula/hadoop-conf/log4j.properties")
Sys.setenv(HADOOP_JAR = paste0(classPath, collapse=.Platform$path.sep))

For src_Hive() I only use the first 3 elements of the classPath vector. Then, after setting the following environment variables:

Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
Sys.setenv(HIVE_SERVER2_THRIFT_PORT = "10000/loghost;auth=noSasl")
Sys.setenv(SPARK_HOME = "/opt/spark-1.5.2-bin-hadoop2.4/")
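
With those set, creating the source reduces to something like this (a sketch; "my_table" stands in for an existing table):

library(dplyr.spark.hive)

# src_SparkSQL() reads the host and port from the
# HIVE_SERVER2_THRIFT_BIND_HOST / HIVE_SERVER2_THRIFT_PORT variables above
my_db = src_SparkSQL()

# a simple select through dplyr verbs
tbl(my_db, "my_table") %>% head()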

This way I've managed to create src_SparkSQL() and run a simple SELECT statement (the same with src_Hive()). I am, however, encountering an issue with the tests in the tests directory, for which the CRAN check throws an error on the first example:

==> devtools::check(document = FALSE)

Setting env vars ---------------------------------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building dplyr.spark.hive ------------------------------------------------------
'/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD  \
  build '/var/wpusers/mkosinski/dplyr.spark.hive/pkg' --no-resave-data  \
  --no-manual 

* checking for file '/var/wpusers/mkosinski/dplyr.spark.hive/pkg/DESCRIPTION' ... OK
* preparing 'dplyr.spark.hive':
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building 'dplyr.spark.hive_0.5.0.tar.gz'

Setting env vars ---------------------------------------------------------------
_R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
_R_CHECK_CRAN_INCOMING_           : FALSE
_R_CHECK_FORCE_SUGGESTS_          : FALSE
Checking dplyr.spark.hive ------------------------------------------------------
'/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD  \
  check '/tmp/RtmprIc8WK/dplyr.spark.hive_0.5.0.tar.gz' --as-cran --timings 

* using log directory '/var/wpusers/mkosinski/dplyr.spark.hive/dplyr.spark.hive.Rcheck'
* using R version 3.1.3 (2015-03-09)
* using platform: x86_64-redhat-linux-gnu (64-bit)
* using session charset: UTF-8
* using option '--as-cran'
* checking for file 'dplyr.spark.hive/DESCRIPTION' ... OK
* checking extension type ... Package
* this is package 'dplyr.spark.hive' version '0.5.0'
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package 'dplyr.spark.hive' can be installed ... WARNING
Found the following significant warnings:
  Warning: class "JDBCConnection" is defined (with package slotRJDBC’) but no metadata object found to revise subclass information---not exported?  Making a copy in packagedplyr.spark.hiveWarning: class "DBIConnection" is defined (with package slotDBI’) but no metadata object found to revise subclass information---not exported?  Making a copy in packagedplyr.spark.hiveWarning: class "DBIObject" is defined (with package slotDBI’) but no metadata object found to revise subclass information---not exported?  Making a copy in packagedplyr.spark.hiveWarning: changing locked binding foroverindplyrwhilst loadingdplyr.spark.hiveWarning: changing locked binding forpartial_evalindplyrwhilst loadingdplyr.spark.hiveWarning: changing locked binding fordefault_opindplyrwhilst loadingdplyr.spark.hiveWarning: replacing previous import bypurrr::%>%’ when loadingdplyr.spark.hiveWarning: replacing previous import bypurrr::order_bywhen loadingdplyr.spark.hiveSee/var/wpusers/mkosinski/dplyr.spark.hive/dplyr.spark.hive.Rcheck/00install.outfor details.
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... NOTE
Unexported objects imported by ':::' calls:
  'dplyr:::auto_copy' 'dplyr:::build_query' 'dplyr:::collect.tbl_sql'
  'dplyr:::common_by' 'dplyr:::copy_to.src_sql'
  'dplyr:::db_save_query.DBIConnection' 'dplyr:::over'
  'dplyr:::partition_group' 'dplyr:::sql_vector'
  'dplyr:::update.tbl_sql' 'dplyr:::uses_window_fun'
  See the note in ?`:::` about the use of this operator.
package 'methods' is used but not declared
* checking S3 generic/method consistency ... WARNING
Warning: declared S3 method 'intersect.tbl_HS2' not found
Warning: declared S3 method 'union.tbl_HS2' not found
See section 'Generic functions and methods' in the 'Writing R
Extensions' manual.
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'load_to':
  'load_to.src_Hive' 'load_to.src_SparkSQL'
Objects in \usage without \alias in documentation object 'tbls':
  'tbls.src_sql'
Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking examples ... OK
* checking for unstated dependencies in tests ... OK
* checking tests ...
  Running 'databases.R' ERROR
Running the tests in 'tests/databases.R' failed.
Last 13 lines of output:
  +     tbl(my_db, "flights")
  +   else{
  +     copy_to_from_local(my_db, flights, "flights")}}
  > flights
  Source: Spark at: tools-1.hadoop.srv:10000/loghost;auth=noSasl
  From: flights [0 x 16]

  Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  : 
    Unable to retrieve JDBC result set for CREATE TABLE `ebkxyszehg` AS SELECT `year`, `month`, `day`, `dep_time`, `dep_delay`, `arr_time`, `arr_delay`, `carrier`, `tailnum`, `flight`, `origin`, `dest`, `air_time`, `distance`, `hour`, `minute`
  FROM `flights`
  LIMIT 0 (The query did not generate a result set!)
  Calls: print ... dbSendQuery -> dbSendQuery -> .local -> .verify.JDBC.result
  Execution halted
 [3s/45s]
Error: Command failed (1)
Execution halted

Exited with status 1.

But this might be because I performed the CRAN check on the dev branch instead of master.

You are right, dev may be in an odd state sometimes, but it checks clean for me after I merged from dev. You may need to pull from dev once more.

To be continued on the rzilla fork.