This repo contains the project resulting from this Youtube tutorial published by Adam Sever.
It consists basically on fetcing information about rent price, location, size and number of views for each comercial property listed a the daft.ie website. This info is fetched using XPath reference.
The video did not cover how to add the MongoDB pipeline to the project, although the source code can be found on the blog-link listed below.
Blog page containing further info about the tutorial.
Although the source code for the original project is available here, I couldn't make my scrapper work.
The output from running the command scrapy crawl daft -t csv -o data.csv
is available at outputCLI.txt above. This command generates an empty .csv file (available at data.csv) instead of outputting the scrapped data from the website.
Further features to be implemented on this project:
- use Postman to improve the structure of the website;
- use MongoDB pipeline to automate the insertion of data into the output file;
- use matplotlib to visualise the most significant insights from the scrapped data (data science).
(ENV) (base) C:\Users\laisb\python-virtual-environments\scrapyTutorial\scrapyWebTut>scrapy crawl daft -t csv -o data.csv The command is correct.
2020-02-02 20:36:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.daft.ie/robots.txt> (referer: None)
2020-02-02 20:36:01 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://0.0.0.0:8050/robots.txt> (failed 1 times): An error occurred while connecting: 10049: The requested address is not valid in its context..
2020-02-02 20:36:02 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://0.0.0.0:8050/robots.txt> (failed 2 times): An error occurred while connecting: 10049: The requested address is not valid in its context..
2020-02-02 20:36:02 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://0.0.0.0:8050/robots.txt> (failed 3 times): An error occurred while connecting: 10049: The requested address is not valid in its context..
It is interesting that the downladermiddlewares tries to access www.daft.ie robots.txt through http://0.0.0.0:8050. This only occurs when using splashRequest. There are workarounds however which can be found in scrapy-plugins/scrapy-splash#180. Which involves changing the build order. However I do not believe this is the issue you face, but this can be debugged by setting ROBOTSTXT_OBEY to False in settings.py and check if the result is differnet.
ROBOTSTXT_OBEY = False
On my Env I did not need to.
When I turned off my splash service my output was similar to yours. splash uses docker and on linux the command is
sudo docker run -p 8050:8050 scrapinghub/splash
https://docs.docker.com/engine/installation/windows/ to setup docker on windows. I added a note to add documentation for windows installation on the scrapy-splash repo
To check if splash is running vist the http://127.0.0.1:8050/ on a browser where you should see a webpage that looks like
I believe this may be the issue
The code in parse function in /spiders/daft.py
yield rent_price
yield location
yield size
yield how_many_times_views
Should return a Request, BaseItem, dict or None So I updated it to
yield {
'rent_price': rent_price,
'location': location,
'size': size,
'how_many_times_views': how_many_times_views
}
hope that helps! scrapy : 1.8.0 OS: Ubuntu 18.04.2 LTS Splash v3.3.1
MongoDB is NoSQL that has a very fluid development process with scrapy and python.
To note we do not need to configure the database in mongodb or create it the pymongo libary will do it for us.
A example of data I scraped and host in mongoDB can be see here https://cryt.ie/API/api.html#view_data
To begin we will add two variables to the scrapy settings file
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'daft'
We will then uncomment
ITEM_PIPELINES = {
'scrapyWebTut.pipelines.ScrapywebtutPipeline': 300,
}
in the settings we need to activate the pipeline, the number is the ID and will decide what pipelines get executed first.
the URI will points to mongo database I presume you will use mongoDB localy and not configure security measures for testing and development The database name is daft and is where we will store the data.
Will we need the libary pymongo can be installed with pip
In the pipeline ScrapywebtutPipeline We will override the init method
def __init__(self, mongo_uri, mongo_db, stats):
self.mongo_uri = mongo_uri
self.mongo_db = mongo_db
self.stats = stats
This will simply set the variables from the settings file.
However we need to also override from_crawler which has access the the scrapy env and settings we set
@classmethod
def from_crawler(cls, crawler):
## pull in information from settings.py
return cls(
mongo_uri=crawler.settings.get('MONGO_URI'),
mongo_db=crawler.settings.get('MONGO_DATABASE'),
stats=crawler.stats
)
This will be called by the scrapy framework BEFORE Init and then call init with the params we give. This is standard and can be used for all your projects working with mongoDB
After we will override the open_spider method and init the pymongo libary and connection to the database
def open_spider(self, spider):
self.client = pymongo.MongoClient(self.mongo_uri)
self.db = self.client[self.mongo_db]
This is following the life cycle design pattern scrapy uses.
Now we will also override close_spider
def close_spider(self, spider):
self.client.close()
simply closing the database connection
Now we will we get to the process_item(self,item,spider) function Where we can filter, process and then store the data in our mongoDB In the case of this project we know the item will be a dict
yield {
'rent_price': rent_price,
'location': location,
'size': size,
'how_many_times_views': how_many_times_views
}
And we won't process or filter the data but simply insert the data into our mongoDB we will insert into the collection called "properties" inside the daft database using the connection we set up ealier. We will also return the item as we may add more pipelines that come after the mongoDB such as adding it to a MYSQL data etc.. not returning the item tells scrapy that you want to discard it and can cause annoying logic bugs.
def process_item(self, item, spider):
self.db['properties'].insert(item)
return item
And thats it we now stored the data in our mongoDB
To the data in mongo through the cmd We use the commands
mongo
show dbs
use daft
show collections
db.properties.find().pretty()
Pipelines are very handy as we can create differnet ones to do differnet things and reuse them in other projects A example may be
FilterDataPipeline: 10 AddDataThroughAIModalPipeline: 100 AddDataToMongoDBPipeline: 110 EmailDataToMePipeLine: 120
Feel free to ask any questions or another scraping project you would like help with!