Ch 6 Regex. Code is wrong
resnad opened this issue · 3 comments
resnad commented
I'm learning Spark using the book and I ran into this code which returns an error:
from pyspark.sql.functions import expr, locate
simpleColors = ["black", "white", "red", "green", "blue"]
def color_locator(column, color_string):
  return locate(color_string.upper(), column)\
          .cast("boolean")\
          .alias("is_" + c)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*")) # has to be a Column type
df.select(*selectedColumns).where(expr("is_white OR is_red"))\
  .select("Description").show(3, False)
Error: NameError: name 'c' is not defined
Even if I define a dummy 'c' within the function, I get the following:
AnalysisException: "cannot resolve '`is_white`' given input columns: [Quantity, StockCode, is_, InvoiceNo, is_, is_, CustomerID, is_, Description, InvoiceDate, UnitPrice, is_, Country]; line 1 pos 0;\n'Filter ('is_white || 'is_red)\n+- Project [cast(locate(BLACK, Description#14, 1) as boolean) AS is_#1632, cast(locate(WHITE, Description#14, 1) as boolean) AS is_#1633, cast(locate(RED, Description#14, 1) as boolean) AS is_#1634, cast(locate(GREEN, Description#14, 1) as boolean) AS is_#1635, cast(locate(BLUE, Description#14, 1) as boolean) AS is_#1636, InvoiceNo#12, StockCode#13, Description#14, Quantity#15, InvoiceDate#16, UnitPrice#17, CustomerID#18, Country#19]\n +- Relation[InvoiceNo#12,StockCode#13,Description#14,Quantity#15,InvoiceDate#16,UnitPrice#17,CustomerID#18,Country#19] csv\n"
Any suggestions?
Thank you!
bllchmbrs commented
c should be color_string.
I renamed the variable at the last minute and ... now the code is forever slightly off :P.
Can you make a pull request to fix it?
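For anyone hitting the same error, here is a minimal sketch of why the rename matters. It only models how the alias string is built, so it runs without a Spark session. The helper names alias_with_dummy_c and alias_with_color_string are hypothetical, and the empty-string dummy is an assumption about the workaround described above, chosen because it reproduces the repeated is_ columns in the AnalysisException.

```python
simple_colors = ["black", "white", "red", "green", "blue"]

# Assumption: the "dummy c" workaround was an empty string defined inside
# the function. It has no connection to the colour being processed.
def alias_with_dummy_c(color_string):
    c = ""
    return "is_" + c

# The suggested fix: build the alias from the function's own parameter.
def alias_with_color_string(color_string):
    return "is_" + color_string

buggy_aliases = [alias_with_dummy_c(color) for color in simple_colors]
fixed_aliases = [alias_with_color_string(color) for color in simple_colors]

print(buggy_aliases)  # every alias is the same string, so is_white never exists
print(fixed_aliases)  # one distinct alias per colour, including is_white and is_red
```

In the snippet from the book, the equivalent change is replacing .alias("is_" + c) with .alias("is_" + color_string) inside color_locator, after which expr("is_white OR is_red") should have columns to resolve against.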
resnad commented
I just checked and the code is correct in the repository haha. I must have an old version of the book? Thanks for the swift response anyway!
bllchmbrs commented
Errata happen. I'll close this.