[BUG] Web crawler searches through matches from the 1900s
Closed this issue · 4 comments
Describe the bug
The web crawler in feature-crawler takes in match records from the 1900s. This wastes a lot of time and reduces the efficiency of the crawler.
To Reproduce
Steps to reproduce the behavior:
- Follow the instructions in the README file to run the crawler
- Wait for the IDs crawl to finish and notice that it includes match records from the 1900s
Expected behavior
The crawler should apply a filter that takes match records only from 2017 onward.
Possible solution
In crawler/cricketcrawler/spiders/howstat.py, in the function parse_scorecard:
    # Only yield matches whose date starts with a year of 2017 or later
    if int(date[0:4]) >= 2017:
        item = MatchidItem(name=url[startint+10:], folder=folder, matchid=matchid, date=date)
        yield item
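The year check above can be pulled into a small helper so that a missing or malformed date does not raise inside the spider. This is only a sketch: is_recent is a hypothetical name, and it assumes, as the snippet above does, that the date string starts with a four-digit year.

```python
def is_recent(date, min_year=2017):
    """Return True if `date` starts with a four-digit year >= min_year.

    Hypothetical helper; assumes the year-first date format implied by
    the date[0:4] slice in the proposed fix.
    """
    try:
        return int(date[0:4]) >= min_year
    except ValueError:
        # Empty or non-numeric prefix: treat the record as not recent.
        return False
```

In parse_scorecard the condition would then read `if is_recent(date):` before building and yielding the item.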
Desktop (please complete the following information):
- Version [feature-crawler]
Additional context
The starting point for this might be crawler/cricketcrawler/spiders/howstat.py
Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.97. Please mark this comment with 👍 or 👎 to give our bot feedback!
Ehm, I might have to shed some light here as well: yielding items does not decrease performance; requesting pages does. I just found a neat hack for our problem:
http://www.howstat.com/cricket/Statistics/Matches/MatchList_T20.asp?Group=2017010130001231
The URL encodes the date range of the matches we need as two YYYYMMDD dates back to back: from 1 Jan 2017 to 31 Dec 3000 gives 2017010130001231.
So we only need to crawl these links:
http://www.howstat.com/cricket/Statistics/Matches/MatchList_T20.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/Matches/MatchList_ODI.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/Matches/MatchList.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/IPL/MatchList.asp?Group=2017010130001231
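A minimal sketch of how these links could be generated rather than hard-coded, assuming the Group value is simply the start and end dates concatenated as YYYYMMDD (inferred from the single example above, not from any howstat documentation). The names LISTING_PAGES, group_param and build_start_urls are made up for illustration.

```python
from datetime import date

# Listing pages quoted in this comment, without the query string.
LISTING_PAGES = [
    "http://www.howstat.com/cricket/Statistics/Matches/MatchList_T20.asp",
    "http://www.howstat.com/cricket/Statistics/Matches/MatchList_ODI.asp",
    "http://www.howstat.com/cricket/Statistics/Matches/MatchList.asp",
    "http://www.howstat.com/cricket/Statistics/IPL/MatchList.asp",
]


def group_param(start, end):
    """Join two dates as YYYYMMDD + YYYYMMDD (assumed format)."""
    def fmt(d):
        return f"{d.year:04d}{d.month:02d}{d.day:02d}"
    return fmt(start) + fmt(end)


def build_start_urls(start, end):
    """Return one listing URL per page, restricted to the given date range."""
    group = group_param(start, end)
    return [f"{page}?Group={group}" for page in LISTING_PAGES]


# Example: everything from 1 Jan 2017 up to a far-future end date.
urls = build_start_urls(date(2017, 1, 1), date(3000, 12, 31))
```

The resulting list could be assigned to the spider's start_urls, so that only pages for 2017 and later are requested in the first place.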
This is an amazing hack. It would reduce the searching by a lot. Thanks, I'll implement this once I'm free.
I'm not too sure if this will be needed.
I am adding a wontfix label for now, until it's figured out.