HackerSpace-PESU/Best11-Fantasycricket

[BUG] Web crawler searches through matches from the 1900s

Closed this issue · 4 comments

Describe the bug
The web crawler in feature-crawler takes in match records going back to the 1900s. This wastes a lot of time and reduces the efficiency of the crawler.
To Reproduce
Steps to reproduce the behavior:

  1. Follow the instructions in the README file to run the crawler
  2. Wait for the IDs crawl to finish and notice that match records from the 1900s are included

Expected behavior
The crawler should apply a filter that keeps only match records from 2017 onward.
Possible solution
In crawler/cricketcrawler/spiders/howstat.py, in the function parse_scorecard:

# Only yield matches from 2017 or later (the first four characters of date are the year)
if int(date[0:4]) >= 2017:
    item = MatchidItem(name=url[startint + 10:], folder=folder, matchid=matchid, date=date)
    yield item
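
For illustration, the year check proposed here could be pulled into a small helper so it can be tested outside the spider. This is only a sketch: the name `is_recent` and the `min_year` parameter are hypothetical, and it assumes, as the slice `date[0:4]` implies, that the date string begins with a four-digit year.

```python
def is_recent(date: str, min_year: int = 2017) -> bool:
    """Return True if the match date falls in min_year or later.

    Assumes `date` begins with a four-digit year, as the slice
    date[0:4] in the proposed fix implies.
    """
    try:
        return int(date[:4]) >= min_year
    except ValueError:
        # Malformed or empty dates are treated as too old and skipped.
        return False
```

Inside parse_scorecard, the condition would then read `if is_recent(date):` before building and yielding the item.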

Desktop (please complete the following information):

  • Version [feature-crawler]

Additional context
The starting point to this might be crawler/cricketcrawler/spiders/howstat.py

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.97. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

I might have to shed some light here as well: yielding items does not decrease performance, requesting pages does. I just found a neat hack for our problem:

http://www.howstat.com/cricket/Statistics/Matches/MatchList_T20.asp?Group=2017010130001231

The URL encodes the date range of the matches we need as a start date followed by an end date, each written as YYYYMMDD. So 1 Jan 2017 to 31 Dec 3000 becomes Group=2017010130001231.
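
As a rough sketch of that encoding, the Group value can be built from two dates. The function name `group_param` is hypothetical, and the YYYYMMDD-plus-YYYYMMDD layout is inferred from the single example in this comment, not from any howstat documentation.

```python
from datetime import date


def group_param(start: date, end: date) -> str:
    """Concatenate start and end dates as YYYYMMDD + YYYYMMDD.

    The layout is inferred from the example URL in this thread.
    """
    def fmt(d: date) -> str:
        # Zero-pad explicitly so the result is always eight digits.
        return f"{d.year:04d}{d.month:02d}{d.day:02d}"

    return fmt(start) + fmt(end)


# Range used in this thread: 1 Jan 2017 to 31 Dec 3000.
group = group_param(date(2017, 1, 1), date(3000, 12, 31))
```

Changing the start date is then a one-line edit if the cut-off year ever moves.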

So we only need to crawl these links:
http://www.howstat.com/cricket/Statistics/Matches/MatchList_T20.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/Matches/MatchList_ODI.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/Matches/MatchList.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/IPL/MatchList.asp?Group=2017010130001231
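
One way to wire these four links into the crawler is to generate them from a single Group value. This is a sketch under assumptions: the names `BASE`, `GROUP` and `LIST_PAGES` are made up for illustration, and the resulting list is only a candidate for a Scrapy spider's `start_urls` attribute; how the existing howstat.py spider is actually seeded is not shown in this thread.

```python
# Shared prefix of the four list pages quoted above.
BASE = "http://www.howstat.com/cricket/Statistics"

# 1 Jan 2017 to 31 Dec 3000, as described in this comment.
GROUP = "2017010130001231"

# Path of each match list page, relative to BASE.
LIST_PAGES = [
    "Matches/MatchList_T20.asp",
    "Matches/MatchList_ODI.asp",
    "Matches/MatchList.asp",
    "IPL/MatchList.asp",
]

# Candidate seed URLs, e.g. for a spider's start_urls.
start_urls = [f"{BASE}/{page}?Group={GROUP}" for page in LIST_PAGES]
```

Since only these four pages are requested, the older match lists are never fetched at all, which is the point made above about requests being the expensive part.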

This is an amazing hack. It would cut down the crawling by a lot. Thanks, I'll implement this as soon as I'm free.

I'm not too sure if this will be needed.
I am adding a wontfix label for now, until it's figured out.