helgeho/ArchiveSpark

Question on using "filterExists"

xw0078 opened this issue · 2 comments

Hi,

I want to count the records that include a <script> tag in the page content.

I get 0 results from the following procedure:

val filtered = records.filterExists(Html.first("script"))
val filtered_count = filtered.count()

On the other hand, I get the correct result with the following:

val enriched = records.enrich(Html.first("script"))
val filtered = enriched.filterExists(Html.first("script"))
val filtered_count = filtered.count()

As the documentation states:
"Filters the records in the dataset based on whether the given field exists. If the field is specified by an Enrich Function, it checks whether the Enrich Function has returned a result or has resulted in an enrichment."

Am I using "filterExists" correctly?

And for the version that gives the correct result, do the enrich step and the filter step repeat the computation? Since enrich has already returned the result, does filterExists evaluate the enrich function again?

Hi, the way you are using it in your second example is exactly right. You can think of a workflow in ArchiveSpark as a process that starts from small records (metadata only) that are extended with every enrich function you apply. You basically add the information of interest to the records. So without enriching your records with the <script> tags, this field does not exist and filterExists would filter out all records. No enrichment is applied twice, as ArchiveSpark keeps track of what's available in a record and reuses this information, so you don't need to worry about any wasted reads / writes.
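To illustrate the idea that a field only exists once a record has been enriched with it, here is a minimal toy model in plain Scala. It is a sketch, not ArchiveSpark's implementation: `Record`, `ToyEnrich`, `firstScript`, and the string-keyed `fields` map are all hypothetical names made up for this example.

```scala
// Toy model only: these types are hypothetical and are NOT ArchiveSpark classes.
final case class Record(url: String, html: String, fields: Map[String, String] = Map.empty)

object ToyEnrich {
  // Hypothetical stand-in for an enrich function such as Html.first("script"):
  // returns the first <script> element as text, or None if the page has none.
  def firstScript(r: Record): Option[String] = {
    val start = r.html.indexOf("<script")
    if (start < 0) None
    else {
      val close = "</script>"
      val end = r.html.indexOf(close, start)
      Some(if (end < 0) r.html.substring(start) else r.html.substring(start, end + close.length))
    }
  }

  // Enriching adds the field only for records where the function returned a result.
  def enrich(rs: Seq[Record], name: String, f: Record => Option[String]): Seq[Record] =
    rs.map(r => f(r).fold(r)(v => r.copy(fields = r.fields + (name -> v))))

  // The filter only inspects fields that are already present on the record.
  def filterExists(rs: Seq[Record], name: String): Seq[Record] =
    rs.filter(_.fields.contains(name))

  val records: Seq[Record] = Seq(
    Record("a", "<html><script>x()</script></html>"),
    Record("b", "<html><p>no scripts</p></html>")
  )

  // Filtering before enriching: no record carries the field yet, so nothing is kept.
  val withoutEnrich: Int = filterExists(records, "script").size

  // Enriching first, then filtering: only the page that contains a script remains.
  val withEnrich: Int = filterExists(enrich(records, "script", firstScript), "script").size
}
```

In this toy model `withoutEnrich` is 0 and `withEnrich` is 1, which mirrors the difference between your two examples.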

To make it a bit more readable, I recommend giving the enrich functions a name (assigning them to a variable). Also, everything is usually applied lazily until you perform an action (like count) that has to run the previous steps in order to deliver a result. So in order to use your filtered dataset after the count without rerunning all previous steps, it makes sense to use cache here:

val ScriptTag = Html.first("script")
val enriched = records.enrich(ScriptTag).filterExists(ScriptTag).cache
enriched.count()
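The effect of lazy evaluation and caching can be sketched without Spark by using a lazy collection view from the Scala standard library. This is only an analogy, assuming Scala 2.13 or later: `LazyDemo`, `expensiveEnrich`, and the call counter are hypothetical, and `toVector` merely plays the role that cache has above.

```scala
object LazyDemo {
  // Counts how often the (pretend) expensive enrichment step runs.
  var enrichCalls = 0

  def expensiveEnrich(html: String): Option[String] = {
    enrichCalls += 1
    if (html.contains("<script")) Some("script") else None
  }

  val pages = Vector("<script>a()</script>", "<p>text</p>", "<script>b()</script>")

  // A view is lazy: defining the pipeline does not run expensiveEnrich.
  val lazyPipeline = pages.view.map(expensiveEnrich).filter(_.isDefined)
  val callsBeforeAction = enrichCalls

  // Each action traverses the view again and re-runs the whole pipeline.
  val firstCount = lazyPipeline.size
  val secondCount = lazyPipeline.size
  val callsWithoutCache = enrichCalls

  // Materialising once keeps the results, so later actions reuse them.
  val cached = lazyPipeline.toVector
  val callsAfterMaterialising = enrichCalls
  val thirdCount = cached.size
  val fourthCount = cached.size
  val callsAfterCachedReuse = enrichCalls
}
```

Here the two counts on the uncached view cost two full passes over the pages, while the counts on the materialised result cost none, which is the reason for adding cache before an action whose result you want to reuse.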

Thank you for the clarification.