Scrapes property details from https://fotocasa.es/ and stores them in a PostgreSQL database.
- Web scraping of property details from the fotocasa website
- Rotating proxies to bypass the site's anti-bot mechanism
- Scrapy-Splash integration for JavaScript-rendered content such as infinite scrolling
- Model/pipeline design and development for the PostgreSQL database
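
As a rough illustration of the model/pipeline item above: a Scrapy item pipeline usually implements open_spider, process_item and close_spider. The sketch below is not the project's PostgresDBPipeline. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server, and the class, table and field names (SketchDBPipeline, properties, url, price) are hypothetical.

```python
import sqlite3


class SketchDBPipeline:
    """Scrapy-style item pipeline; sqlite3 stands in for PostgreSQL (assumption)."""

    def open_spider(self, spider):
        # Called once when the spider starts: open a connection and create the table.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS properties (url TEXT PRIMARY KEY, price INTEGER)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item: upsert by URL so re-crawls do not duplicate rows.
        self.conn.execute(
            "INSERT OR REPLACE INTO properties (url, price) VALUES (?, ?)",
            (item["url"], item["price"]),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()
```

Keying rows on the listing URL is one possible way to keep repeated crawls idempotent; the real schema may use a different key.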
In settings.py, add the following configuration:
- Database connection

      import os

      DATABASE = {
          # 'postgresql' is the dialect name SQLAlchemy expects;
          # the older 'postgres' alias is rejected by SQLAlchemy 1.4 and later
          'drivername': 'postgresql',
          'host': os.environ.get('DB_HOST', 'localhost'),
          'port': os.environ.get('DB_PORT', '5432'),
          'username': os.environ.get('DB_USERNAME', 'user'),
          'password': os.environ.get('DB_PASSWORD', 'pwd'),
          'database': os.environ.get('DB_NAME', 'mydb'),
      }
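- Database pipeline and proxy middlewares

      ITEM_PIPELINES = {
          'fotocasa.pipelines.PostgresDBPipeline': 330,
      }

      # The rotating-proxies classes are downloader middlewares, not item pipelines
      DOWNLOADER_MIDDLEWARES = {
          'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
          'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
      }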
- Proxies path

      ROTATING_PROXY_LIST_PATH = 'fotocasa/proxies.txt'
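
scrapy-rotating-proxies reads the file named by ROTATING_PROXY_LIST_PATH as a plain list with one proxy per line. A small check like the following can catch an empty or malformed list before a crawl starts. It is only a sketch: load_proxies is a hypothetical helper, not part of the library, and the library's own parsing may be more lenient than the host:port rule assumed here.

```python
from pathlib import Path


def load_proxies(path):
    """Return the proxy entries in a list file, one per line (hypothetical helper)."""
    proxies = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for number, raw in enumerate(lines, 1):
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        # Assumption: every entry ends in ":<port>", e.g. host:port or http://host:port
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"line {number}: expected host:port, got {entry!r}")
        proxies.append(entry)
    return proxies
```

Calling load_proxies('fotocasa/proxies.txt') before starting the spider would raise a ValueError naming the first line that does not match.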
- Install the dependencies listed in requirements.txt (sqlalchemy, psycopg2, scrapy-rotating-proxies):

      pip install -r requirements.txt
- Create a setup.py file at the same level as the scrapy.cfg file, with the following content:

      from setuptools import setup, find_packages

      setup(
          name='fotocasa',
          version='1.0',
          packages=find_packages(),
          install_requires=[
              'psycopg2',
              'sqlalchemy',
              'scrapy-rotating-proxies',
          ],
          # Tells scrapyd where to find the project settings
          entry_points={'scrapy': ['settings = fotocasa.settings']},
      )
- Run the following command in the folder that contains scrapy.cfg:

      python setup.py bdist_egg

- Upload the egg file to the scrapyd server:

      curl http://localhost:6800/addversion.json -F project=fotocasa -F version=1.0 -F egg=@dist/fotocasa-1.0-py3.7.egg
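
The py3.7 suffix in the egg name comes from the Python version that ran bdist_egg, so the path passed to curl changes with the interpreter. The sketch below guesses the file name for a pure-Python project. expected_egg_name is a hypothetical helper, and the naming pattern is an assumption based on how setuptools usually names eggs, so check the dist folder if the upload cannot find the file.

```python
import sys


def expected_egg_name(name, version, python_version=None):
    """Guess the bdist_egg file name for a pure-Python project (hypothetical helper)."""
    if python_version is None:
        # Default to the interpreter running this script
        python_version = sys.version_info[:2]
    major, minor = python_version
    # Assumption: hyphens in the project name become underscores in the file name
    safe_name = name.replace("-", "_")
    return f"{safe_name}-{version}-py{major}.{minor}.egg"


if __name__ == "__main__":
    # Print the path to pass to curl as egg=@<path>
    print("dist/" + expected_egg_name("fotocasa", "1.0"))
```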