notoriousno/scrapy-flask

Reactor does not wait until the crawling is completed


Hi,

I am trying to use your code to scrape data in real time. I load the URLs from a pickle file and pass them to the crawler; the data is scraped and returned as a list.

My use case is this: whenever a user submits a request from the UI, the app has to load the URLs from the pickle file, pass them to the crawler, and return the scraped data to the UI. Below is the code snippet I use to wait until the crawling is completed.

import json

from flask import Flask
from scrapy.crawler import CrawlerRunner
from scrapy import log, signals

from quote_scraper import QuoteSpider
from googleapiclient.discovery import build
import pickle
import time
from twisted.python import log
from twisted.internet import endpoints, reactor,task,defer


app = Flask('Scrape With Flask')

crawl_runner = CrawlerRunner()      # requires the Twisted reactor to run
data_list=[]                 # store quotes
urls=[]


@app.route('/crawl')
def crawl_for_quotes():
  
  global urls
  global data_list
  
  with open('urls.pkl', 'rb') as handle:
      urls = pickle.load(handle)
             
  eventual=crawl_runner.crawl(QuoteSpider,url_list=urls,data_list=data_list)
  eventual.addBoth(lambda _: reactor.stop())
  reactor.run()
  return json.dumps(data_list)

if __name__=='__main__':
  from sys import stdout

  from twisted.logger import globalLogBeginner, textFileLogObserver
  from twisted.web import server, wsgi
  from twisted.internet import endpoints, reactor

  # start the logger
  globalLogBeginner.beginLoggingTo([textFileLogObserver(stdout)])

  # start the WSGI server
  root_resource = wsgi.WSGIResource(reactor, reactor.getThreadPool(), app)
  factory = server.Site(root_resource)
  http_server = endpoints.TCP4ServerEndpoint(reactor, 9000)
  http_server.listen(factory)

  # start event loop
  reactor.run()
 

The issue is that the request handler does not block until the crawling is finished: the app stops before the crawl completes. Also, every time I curl the endpoint, the scraped data is appended to the same list. Is there any way to create a fresh list on every request? As it stands, once the app is started, results keep accumulating in the list each time a user submits a request.
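To make the desired behaviour concrete, here is a minimal sketch of what I am after, not working crawler code. `run_crawl` and `fake_crawl` are hypothetical names I made up for illustration: `run_crawl` stands for anything that blocks until the crawl has finished, and `fake_crawl` only pretends to scrape so the sketch runs without Scrapy. The commented-out part is my unverified guess at how the real version might wait on the already-running reactor from the WSGI worker thread.

```python
import json


def crawl_for_quotes(urls, run_crawl):
    """Handle one request: crawl `urls` and return the results as JSON.

    `run_crawl(urls, data_list)` is a hypothetical callable that must block
    until the crawl has finished and append each scraped item to `data_list`.
    """
    data_list = []              # new list per request, not a module global
    run_crawl(urls, data_list)  # returns only once the crawl is done
    return json.dumps(data_list)


def fake_crawl(urls, data_list):
    """Stand-in for the real crawl so the sketch runs without Scrapy."""
    for url in urls:
        data_list.append({"url": url})


# Assumption, not verified: in the Twisted WSGI worker thread the real
# `run_crawl` could hand the crawl to the running reactor and wait for it:
#
#     from twisted.internet import reactor, threads
#
#     def run_crawl(urls, data_list):
#         threads.blockingCallFromThread(
#             reactor, crawl_runner.crawl, QuoteSpider,
#             url_list=urls, data_list=data_list)

first = crawl_for_quotes(["http://example.com/a"], fake_crawl)
second = crawl_for_quotes(["http://example.com/b"], fake_crawl)
print(first)
print(second)
```

With this shape the second request returns only its own items, because each call builds its own list and nothing calls `reactor.run()` or `reactor.stop()` inside the handler.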

I hope you can help me with this.