Enhancement - Ability to Parse a list
Marcel0024 opened this issue · 1 comments
Hi, I've been looking at this library and it's really promising. It saves a lot of time writing boilerplate.
But I'm missing one feature to really be able to use it for my use case.
Is your feature request related to a problem? Please describe.
The issue I'm running into is that I don't need to open each link to scrape the listings.
My first page is the listings page itself, and it has pagination.
For example:
Page 1
    [Listing 1]
        [Name]
        [Amount]
        [Rating]
        [Link]
    [Listing 2]
        [Name]
        [Amount]
        [Rating]
        [Link]
    [Listing 3]
        [Name]
        [Amount]
        [Rating]
        [Link]
    pages [1] 2 3 4 5 6 ... 234
Page 2
    [Listing 1]
        [Name]
        [Amount]
        [Rating]
        [Link]
    [Listing 2]
        [Name]
        [Amount]
        [Rating]
        [Link]
    [Listing 3]
        [Name]
        [Amount]
        [Rating]
        [Link]
    pages 1 [2] 3 4 5 6 ... 234
The way the library is set up, I have to .Follow(...) each link and .Parse(...) each opened page. But in my case I don't have to: the data I need is already on the listing page.
Describe the solution you'd like
Ability to parse a List, maybe use a JArray for the object returned in the entity.
Describe alternatives you've considered
I didn't find a workaround. I did try something like this:
// Attempted workaround: one schema per listing, all sharing the same selectors.
.Parse([..Enumerable.Range(0, 10).Select(x =>
    new Schema($"Listing{x}")
    {
        new SchemaElement("Name", "div.min-w-0 > a > h2"),
        new SchemaElement("Amount", "div.min-w-0 > p.font-semibold")
    })])
But all the listings come out the same, since the query selector just grabs the first match: https://github.com/pavlovtech/WebReaper/blob/master/WebReaper/Core/Parser/Concrete/AngleSharpContentParser.cs#L85
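To illustrate why every schema ends up with the same values, here is a minimal sketch using AngleSharp directly. The markup, the `div.listing > h2` selector, and the `SelectorDemo` helper names are made up for illustration and are not taken from WebReaper or the real site. `QuerySelector` returns only the first matching element, whereas `QuerySelectorAll` returns every match.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class SelectorDemo
{
    // Hypothetical markup standing in for a page with three listings.
    public const string Html =
        "<div class='listing'><h2>Listing 1</h2></div>" +
        "<div class='listing'><h2>Listing 2</h2></div>" +
        "<div class='listing'><h2>Listing 3</h2></div>";

    // QuerySelector stops at the first element that matches the selector,
    // so repeating the same selector always yields the same value.
    public static string FirstMatch(string html, string selector)
    {
        var document = new HtmlParser().ParseDocument(html);
        return document.QuerySelector(selector)?.TextContent;
    }

    // QuerySelectorAll returns every matching element in document order.
    public static List<string> AllMatches(string html, string selector)
    {
        var document = new HtmlParser().ParseDocument(html);
        return document.QuerySelectorAll(selector)
                       .Select(element => element.TextContent)
                       .ToList();
    }
}
```

Calling `SelectorDemo.FirstMatch(SelectorDemo.Html, "div.listing > h2")` ten times would give the first listing's heading ten times, which matches what I'm seeing with the workaround.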
Additional context
To keep backwards compatibility, I think this needs to be implemented on SchemaElement with a new property, maybe IsList or IsArray.
In FillOutput() (https://github.com/pavlovtech/WebReaper/blob/master/WebReaper/Core/Parser/Concrete/AngleSharpContentParser.cs#L43), inside the try block, we can differentiate whether the element is a list or not. If it is, GetListData() returns a list of data to add to a JArray.
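Roughly what I have in mind, as a standalone sketch rather than a patch against the real parser. `ElementSketch`, `ListParsingSketch`, `Extract`, and `Parse` are hypothetical names that only stand in for SchemaElement and the FillOutput() logic; the AngleSharp and Newtonsoft.Json calls are the only real APIs used here.

```csharp
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
using Newtonsoft.Json.Linq;

// Hypothetical stand-in for SchemaElement with the proposed IsList flag.
// Defaulting IsList to false keeps existing single-value schemas unchanged.
public sealed record ElementSketch(string Field, string Selector, bool IsList = false);

public static class ListParsingSketch
{
    // Returns a JArray of all matches when IsList is set,
    // otherwise a single value taken from the first match (or JSON null).
    public static JToken Extract(IParentNode root, ElementSketch element)
    {
        if (element.IsList)
        {
            var values = root.QuerySelectorAll(element.Selector)
                             .Select(node => new JValue(node.TextContent.Trim()));
            return new JArray(values);
        }

        var first = root.QuerySelector(element.Selector);
        return first is null
            ? JValue.CreateNull()
            : new JValue(first.TextContent.Trim());
    }

    // Builds one JObject for the page, one property per schema element.
    public static JObject Parse(string html, params ElementSketch[] schema)
    {
        var document = new HtmlParser().ParseDocument(html);
        var output = new JObject();
        foreach (var element in schema)
        {
            output[element.Field] = Extract(document, element);
        }
        return output;
    }
}
```

One caveat with this shape: each list field becomes its own array, so Name and Amount are only tied together by index. Grouping the fields per listing would probably need a container selector on the schema, which I have not sketched here.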
I'm willing to work on a PR with some guidance/approval.
Just realized you would have to change the Job implementation as well (WebReaper/WebReaper/Domain/Job.cs, line 17 in 988ea8c), because every page would have to become a TargetPage.
Damn, there's no way to override this. I thought a custom IContentParser would do the trick, but I ran into this.