databricks/Spark-The-Definitive-Guide

Ch 6 Regex. Code is wrong

resnad opened this issue · 3 comments

I'm learning Spark using the book and I ran into this code which returns an error:

from pyspark.sql.functions import expr, locate

simpleColors = ["black", "white", "red", "green", "blue"]
def color_locator(column, color_string):
  return locate(color_string.upper(), column)\
          .cast("boolean")\
          .alias("is_" + c)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*")) # has to be a Column type

df.select(*selectedColumns).where(expr("is_white OR is_red"))\
  .select("Description").show(3, False)

Error: NameError: name 'c' is not defined

Even if I define a dummy 'c' within the function, I get the following:

AnalysisException: "cannot resolve '`is_white`' given input columns: [Quantity, StockCode, is_, InvoiceNo, is_, is_, CustomerID, is_, Description, InvoiceDate, UnitPrice, is_, Country]; line 1 pos 0;\n'Filter ('is_white || 'is_red)\n+- Project [cast(locate(BLACK, Description#14, 1) as boolean) AS is_#1632, cast(locate(WHITE, Description#14, 1) as boolean) AS is_#1633, cast(locate(RED, Description#14, 1) as boolean) AS is_#1634, cast(locate(GREEN, Description#14, 1) as boolean) AS is_#1635, cast(locate(BLUE, Description#14, 1) as boolean) AS is_#1636, InvoiceNo#12, StockCode#13, Description#14, Quantity#15, InvoiceDate#16, UnitPrice#17, CustomerID#18, Country#19]\n   +- Relation[InvoiceNo#12,StockCode#13,Description#14,Quantity#15,InvoiceDate#16,UnitPrice#17,CustomerID#18,Country#19] csv\n"

Any suggestions?

Thank you!

c should be color_string. I renamed the variable at the last minute and ... now the code is forever slightly off :P.
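
For anyone hitting the same thing, here is a minimal sketch of why both errors show up and what the rename changes. In Python 3 the c in the list comprehension is local to the comprehension, so color_locator never sees it, hence the NameError. alias_with_dummy and alias_fixed are hypothetical helpers used only for illustration, and the empty-string dummy is an assumption about the workaround in the report (it matches the is_ column names in the AnalysisException). color_locator is the function from the report with only the alias changed; it needs pyspark, so the import sits inside the function and the two helpers run without Spark.

```python
simpleColors = ["black", "white", "red", "green", "blue"]

def alias_with_dummy(color_string):
    # Reproduces the workaround from the report: a dummy c defined inside
    # the function (assumed here to be an empty string).
    c = ""
    return "is_" + c

def alias_fixed(color_string):
    # Uses the function's own parameter, so each color gets its own name.
    return "is_" + color_string

def color_locator(column, color_string):
    # Imported here so the helpers above can run without Spark installed.
    from pyspark.sql.functions import locate
    # locate() gives the 1-based position of the substring, 0 when absent;
    # the boolean cast turns that into a found / not-found flag.
    return locate(color_string.upper(), column)\
        .cast("boolean")\
        .alias(alias_fixed(color_string))

print([alias_with_dummy(c) for c in simpleColors])
print([alias_fixed(c) for c in simpleColors])
```

The first print gives five identical is_ names, which is why is_white cannot be resolved. The second gives is_black through is_blue, so the where(expr("is_white OR is_red")) filter in the original snippet has columns to refer to. The rest of the snippet should work unchanged once color_locator uses color_string.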

Can you make a pull request to fix it?

I just checked and the code is correct in the repository haha. I must have an old version of the book? Thanks for the swift response anyway!

Errata happens. I'll close this.