dfdx/Spark.jl

New version does not work as stated in docs

Closed this issue · 6 comments

While I would really like to see this project succeed, every time I try to use it, new issues surface.

  1. I previously reported in issue #108 that line 86 in init.jl was causing an error:
    p = split(read(spark_defaults_conf, String), '\n', keepempty=false)
  2. The fix provided by Andrei worked:
    p = split(Base.read(spark_defaults_conf, String), '\n', keepempty=false)
  3. Unfortunately the issue resurfaced in the new version. Line 86 is back to
    p = split(read(spark_defaults_conf, String), '\n', keepempty=false)
    which is obviously easy to 'fix'.
  4. Unfortunately, the changes in the new version v0.61 cause failures elsewhere.
  5. Before adding/building/using the new code from v0.61, I removed the Spark package and also erased the source files on my local machine where the Spark files were maintained.
  6. Reinstalled the new version of Spark:
using Pkg; Pkg.add("Spark")
ENV["BUILD_SPARK_VERSION"] = "3.2.1"   # version you need
Pkg.build("Spark")
  7. Tried to follow the code example as given in the documentation of the new version:
using Spark.SQL #### ERROR: SQL does not exist

using Spark
spark = Spark.SparkSession.builder.appName("Main").master("local").getOrCreate()
df = spark.createDataFrame([["Alice", 19], ["Bob", 23]], "name string, age long")

###### Exception in thread "main" java.lang.NoSuchMethodError: make
###  JavaCall.JavaCallError("Error calling Java: java.lang.NoSuchMethodError: make")

Stacktrace:
  [1] geterror(allow::Bool)
    @ JavaCall ~/.julia/packages/JavaCall/MlduK/src/core.jl:418
  [2] jcall(typ::Type{JavaCall.JavaObject{Symbol("scala.collection.mutable.ArraySeq")}}, method::String, rettype::Type, argtypes::Tuple{DataType}, args::JavaCall.JObject)
    @ JavaCall ~/.julia/packages/JavaCall/MlduK/src/core.jl:226
  [3] convert
    @ ~/.julia/packages/Spark/0luxD/src/convert.jl:81 [inlined]
  [4] Row
    @ ~/.julia/packages/Spark/0luxD/src/row.jl:16 [inlined]
  [5] iterate
    @ ./generator.jl:47 [inlined]
  [6] _collect(c::Vector{Vector{Any}}, itr::Base.Generator{Vector{Vector{Any}}, Type{Row}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:744
  [7] collect_similar
    @ ./array.jl:653 [inlined]
dfdx commented

Can you post the output of this (in Julia console)?

] st Spark
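
(If you prefer plain Julia code over Pkg REPL mode, Pkg.status should print the same information:)

using Pkg
Pkg.status("Spark")    # shows the installed version of Spark.jl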
dfdx commented

Let's go through your code piece by piece:

using Pkg; Pkg.add("Spark")
Pkg.add("CSV"); using CSV

I'm not sure why you refer to CSV here - it is a completely different and independent package. Please check its docs and report any issues in that package's repository.

ENV["BUILD_SPARK_VERSION"] = "3.2.1"   # version you need
Pkg.build("Spark")

You don't need to do it every time. In fact, Apache Spark version 3.2.1 is the default in the current version of Spark.jl, so you can skip these lines completely.

using Spark
import Spark.SparkSession
Spark.init()

using Spark already exports SparkSession and calls init(), so just the first line - using Spark - is enough.
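
So a minimal start is just (the session line below is copied from the snippet at the top of this issue):

using Spark    # exports SparkSession and calls init() automatically
spark = SparkSession.builder.appName("Main").master("local").getOrCreate()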

Then you have 4 examples of DataFrame creation that work, and you seem to be happy with them. Then:

#  code beow DID NOT WORK
# df = CSV.File(csvDir*f1*".csv"; header=true) |> DataFrame

As I said earlier, Spark.jl doesn't have any integration with CSV.jl and thus can't convert a CSV.File to a Spark.DataFrame. We may add such a converter, but it would load the whole file into memory, which is generally discouraged in Spark. The right way to do it is to use Spark's own methods, e.g. this:

# code below worked
# fCsv = spark.read.option("header", true).csv(csvDir*f1*".csv")
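
In other words, something along these lines (the path below is only illustrative; any local CSV with a header row works):

fCsv = spark.read.option("header", true).csv("/path/to/your_file.csv")   # hypothetical path
fCsv.show()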

Next:

dfGroceries = spark.createDataFrame(rowGroceries, StructType("type string, color string, price double, amount int")) isa DataFrame

Here you create an object and check that it's a DataFrame, so the result of the expression is of type Bool, not DataFrame. Instead, try this:

dfGroceries = spark.createDataFrame(rowGroceries)   # schema will be inferred automatically
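
And if you do want to pass the schema explicitly, just drop the trailing isa DataFrame check. A quick sketch, following the createDataFrame(rows, schema_string) pattern from the docs example quoted earlier in this thread, with rows taken from the table below:

# replace these sketch values with your actual data
rowGroceries = [["banana", "yellow", 1.0, 10],
                ["apple", "yellow", 2.0, 20],
                ["carrot", "yellow", 0.5, 30],
                ["grape", "blue", 0.25, 50]]
dfGroceries = spark.createDataFrame(rowGroceries, "type string, color string, price double, amount int")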

Now you can call .first() and .show() on this data frame:

julia> dfGroceries.first()
[banana,yellow,1.0,10]

julia> dfGroceries.show()
+------+------+-----+------+
|  type| color|price|amount|
+------+------+-----+------+
|banana|yellow|  1.0|    10|
| apple|yellow|  2.0|    20|
|carrot|yellow|  0.5|    30|
| grape|  blue| 0.25|    50|
+------+------+-----+------+

Note that there are still no methods .first(num) and .size(). Instead of .first(num) you probably want .head(num), and instead of .size() you can use length(dfGroceries.columns()) to count columns or dfGroceries.count() to count rows.
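
For example (a quick sketch on the same data frame; output omitted):

dfGroceries.head(2)                    # first 2 rows, instead of .first(num)
length(dfGroceries.columns())          # number of columns, instead of .size()
dfGroceries.count()                    # number of rows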


As a bottom line:

  1. Read the docs. If you see an inconsistency between Spark.jl and its docs, please report it. Most docs are now self-tested, but mistakes still happen. The easiest way to get them fixed is to report the specific entry in the docs and the actual behavior.
  2. Don't assume integrations. Spark.jl currently doesn't work with CSV.jl, DataFrames.jl or any other similar package. These integrations may be added later, but they are not on the table right now.
  3. Don't assume methods. If something is not directly mentioned in the Spark.jl's docs, you can check out PySpark's docs or even read the source, which is intentionally split into files with self-describing names (e.g. all methods that you can call on a DataFrame are grouped in dataframe.jl).
dfdx commented

Closing due to inactivity.