HackerSpace-PESU/Best11-Fantasycricket

Setup Web-Crawler for daily updates

Closed this issue · 13 comments

Describe the Issue:
The player records in the data folder are outdated and static. Thus, they may not be enough to accurately predict player performances in current matches. Previous records were created from web-scraped data from howstat.com.

Solution:
Keep the records up to date using web-scraping for daily updates and reflect those changes in the data folder.

Comment if you would like to work on this

If I could get a little more info, I'd like to work on it using Scrapy as the scraping framework, if you don't mind.

What do you mean by reflect the changes in the data folder? Should I keep a log of updates made to the data, or just make a script that updates the data daily?

Any web scraping tool is welcome. We currently have ODI records for each player, under zip (match IDs and dates), zip2 (batting records), bowl, and wk. As the current players play further matches, those folders should be updated to match.
As of now you can update only the zip folder, which contains match IDs and dates for each player: add records if there is a new non-retired player whose records have been put up, and/or if the current players have played a new series of ODI cricket. For example, the Pakistan and England players have played a series recently, and we would like you to update those records.

Everything must be scraped from howstat.com.
You can also update zip2, bowl, and wk using the scoring table in Dataset.md, but for a PR, updating the zip folder is sufficient as of now.
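To get the zip folder's (match ID, date) pairs, the crawler has to pull them out of the player pages on howstat.com. Below is a minimal sketch of that extraction step using only the standard library; the HTML snippet and the `MatchCode` link pattern are assumptions for illustration, so the selectors would need adjusting against the real page markup.

```python
# Hedged sketch: extract (match id, date) pairs from a player-page-like
# HTML table. SAMPLE is a made-up stand-in; howstat.com's real pages
# differ, so the patterns below would need checking against the live site.
import re
from html.parser import HTMLParser

SAMPLE = """
<table>
  <tr><td><a href="MatchScorecard_ODI.asp?MatchCode=4501">ODI 4501</a></td><td>14/07/2020</td></tr>
  <tr><td><a href="MatchScorecard_ODI.asp?MatchCode=4502">ODI 4502</a></td><td>18/07/2020</td></tr>
</table>
"""

class MatchRowParser(HTMLParser):
    """Collect (match_code, date) pairs from anchor hrefs and cell text."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._pending_code = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            m = re.search(r"MatchCode=(\d+)", href)
            if m:
                self._pending_code = m.group(1)
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        text = data.strip()
        # A dd/mm/yyyy date cell following a match link completes one row.
        if self._in_td and self._pending_code and re.fullmatch(r"\d{2}/\d{2}/\d{4}", text):
            self.rows.append((self._pending_code, text))
            self._pending_code = None

parser = MatchRowParser()
parser.feed(SAMPLE)
print(parser.rows)  # [('4501', '14/07/2020'), ('4502', '18/07/2020')]
```

In a Scrapy spider the same logic would live in the parse callback, with the response fetched from the player's page instead of a hard-coded string.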

I would suggest making a script which updates it.

Two things:

  1. I'd like to be assigned.
  2. Is it a problem if I use the shortened version of the player name as the filename for now? For the full name I'd need to also crawl the player page and link the information together, which would be kind of complicated for now.

Also, it's kind of finished; I'm just fixing bugs at the moment:
https://github.com/scientes/Best11-Fantasycricket/tree/webcrawler

Currently it recrawls everything, but that is a problem I need to fix later. (At the moment I'm using an HTTP cache for development so pages aren't crawled twice, but that doesn't really help.)

So: do I need to filter out retired players, or do you want all of them?

  1. Done.
  2. Not at all; I just used the long player names because I scraped them like that.

Retired players I don't mind keeping; it's your call if you want to remove them.

For non-retired players, check this out: http://www.howstat.com/cricket/Statistics/Players/PlayerListCurrent.asp
As players get removed from this list, they are considered retired, as per http://www.howstat.com/cricket/Statistics/Players/PlayerMenu.asp
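One way the crawler could act on this, assuming it keeps the set of player names it has already stored, is to diff that set against the names scraped from the PlayerListCurrent page. A minimal sketch with plain set operations (the player names below are made-up placeholders):

```python
# Hedged sketch: a player who exists in our stored records but is missing
# from howstat's "current players" list is treated as retired. The two
# input sets are placeholders standing in for scraped data.
def classify_players(stored, current):
    """Split players into still-active, newly-retired, and newly-listed sets."""
    active = stored & current
    retired = stored - current
    new = current - stored  # current players we have no records for yet
    return active, retired, new

stored = {"V Kohli", "MS Dhoni", "R Sharma"}
current = {"V Kohli", "R Sharma", "S Gill"}

active, retired, new = classify_players(stored, current)
print(sorted(active))   # ['R Sharma', 'V Kohli']
print(sorted(retired))  # ['MS Dhoni']
print(sorted(new))      # ['S Gill']
```

The `retired` set can then be skipped (or kept but no longer recrawled), and the `new` set tells the crawler which player pages to fetch for the first time.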

Ah thx
Other issue:
Git is creating problems for me because the total number of files is very large, there being 5038 players in total. Wouldn't it be better to make one file per folder and just filter on usage, or should I just push all 5k files (in the future roughly 3x as many due to zip2, bowl, and wk)?
With zip2 and the rest I'm a bit lost on how to calculate the values, because I'm not familiar with cricket at all, but those categories are easy to implement with the current crawler.

I didn't understand what you mean by one file per folder. If you could elaborate on that, it would be helpful.
Once the zip folder is implemented, zip2, bowl, and wk are just a simple function away, so it's fine if you don't implement them.

Well, I mean currently you generate one file per player per folder (a bit fewer, because not everyone is in bowl and wk to my knowledge). With 5000 or so total players, that's approximately 15k–20k files containing maybe 20–30 MB in total, which is a lot of files for this little data. It would probably be wiser to store all the data for zip in one file, all the data for zip2 in one file, and so on. You are already using pandas, so why bother splitting up the data instead of just filtering in pandas?

Ohh, you mean like one file called zip.csv, zip2.csv, bowl.csv, wk.csv?

If this is the case, then how will you adjust for each player? Are you suggesting something like:

player    matches
player1   matchid1
          matchid2
          matchid3
player2   matchid1

and so on?

This actually sounds doable.
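For illustration, the single-file layout being discussed can be sketched as one "long" zip.csv with a repeated player column, filtered per player on demand. The player names and match IDs below are placeholders, and the in-memory `StringIO` stands in for the real file on disk:

```python
# Hedged sketch of a combined zip.csv: one row per (player, match),
# filtered per player when needed. All values are made-up placeholders.
import csv
import io

zip_csv = io.StringIO()
writer = csv.writer(zip_csv)
writer.writerow(["player", "matchid", "date"])
writer.writerows([
    ["player1", "4501", "14/07/2020"],
    ["player1", "4502", "18/07/2020"],
    ["player2", "4501", "14/07/2020"],
])
zip_csv.seek(0)  # rewind so the reader sees the data from the top

def matches_for(fileobj, player):
    """Return this player's rows from the combined file."""
    return [row for row in csv.DictReader(fileobj) if row["player"] == player]

rows = matches_for(zip_csv, "player1")
print([r["matchid"] for r in rows])  # ['4501', '4502']
```

Since the project already uses pandas, the same filter would presumably be a one-liner there, along the lines of `df[df["player"] == "player1"]` after `pd.read_csv("zip.csv")`.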

Yes that was my idea.

Closed! Thanks to @scientes