piccolbo/dplyr.spark.hive

[Proposition] New 'classPath' parameter in 'src_Hive' -> 'src_HS2' functions

Closed this issue · 3 comments

Please forgive me for my technical/development ignorance. I am just a regular R user with statisical knowledge. I have a proposition of a small improvement (at least, I think so :) ) which might help the possible future users.

When I tried to use dplyr.spark.hive for the first time with src_Hive() function I didn't know I had to somehow specify any global environment/variable such as HADOOP_JAR. When the function didn't work, the first step I took, to figure out what I might have missed, was a quick look at the documentation page. I thought I might missed special parameter but there wasn't such possibility. That's why I have posted questions on stackoverflow and created issues here.

Maybe to avoid such problems with the user's lack of knowledge someone could add additional new parameter to src_Hive so that it would be easier to understand that user has to specify a .jar to create a JDBC driver. The name classPath would correspond to the name of the parameter in JDBC.

I think this might look like:

src_HS2 =
  function(host, port, class, final.env, classPath, ...) {
    driverclass = "org.apache.hive.jdbc.HiveDriver"
    dr = JDBC(driverclass, classPath = classPath)
    url = paste0("jdbc:hive2://", host, ":", port)
    con.class = paste0(class, "Connection")
    con =
      new(con.class, dbConnect_retry(dr, url, retry = 100, ...))
    pf = parent.frame()
    src_sql(
      c(class, "HS2"),
      con,
      info = list("Spark at", host = host, port = port),
      env = final.env,
      call = match.call(),
      calling.env = pf)}

src_Hive =
  function(
    host =
      first.not.empty(
        Sys.getenv("HIVE_SERVER2_THRIFT_BIND_HOST"),
        "localhost"),
    port =
      first.not.empty(
        Sys.getenv("HIVE_SERVER2_THRIFT_PORT"),
        10000),
   classPath = Sys.getenv("HADOOP_JAR"),
   ...){
    src_HS2(host, port, "Hive", NULL, classPath, ...)}

Also it would require some updates in the documentation page. I think such update will make the use of this package more clear and easier. If that proposition is OK I can proceed with a PR :)

I can follow your reasoning but this is breaking with a RHadoop project (the Revolution-sponsored project within which this package was started) convention, that is that env variables are the way packages are made aware of external dependencies. Some arguments against providing an alternate mechanism

  1. Two things to document and to explain, redundant. People may wonder if both are needed. In my APIs I aim for a canonical form, not many ways of doing the same thing when they don't add anything as far as fuctionality
  2. Some env variable information may be shared with other sw, for instance, any sw using RJDBC may tap into the same env variable
  3. It's an additional argument to a function and additional logic for something that arguably (but not certainly) is most likely always set to the same value (same argument though may apply to host and port though)
  4. It's part of a correct installation. Changing the API because some people skip installation steps, that's a slippery slope. A better solution is to figure out why it wasn't as clear as the fact that you need spark or hive installed

That said, I am not totally sold on my own arguments and need to think about it. Feel free to push back. Maybe there is an argument for R-only-shell-may-not-exist assumption. There is may be a point of view for instance of a windows user that's not as familiar with env variables, and that clearly doesn't come natural to me. The only thing I would like to see is a general answer to "how do we connect R with external dependencies" rather a case by case decision that may be determined by an accident like incomplete or unclear installation instructions.

You have convinced me it is not necessary. In my opinion it'd be better to write a manual/vignette for dummies (me) rather than adding new parameter. Since philosophy of parameter's simplicity is consistent with other RHadoop packages, this topic can be closed.

Thanks, feel free to reopen were there new elements to this discussion.