Don't insist on answer component of URL
opoudjis commented
crawler.py can be used to retrieve blog posts from Quora, not just answers. For that to work, the constraint that the fetched URL must match quora.com/answer/... needs to be relaxed:
    # Get the part of the URL indicating the question title; we will save under this name
    m1 = re.search('quora\.com/([^/]+)/answer', url)
    # if there's a context topic
    m2 = re.search('quora\.com/[^/]+/([^/]+)/answer', url)
    filename = added_time + ' '
    if not m1 is None:
        filename += m1.group(1)
    elif not m2 is None:
        filename += m2.group(1)
    else:
        print('[ERROR] Could not find question part of URL %s; skipping' % url, file=sys.stderr)
        continue
I changed the last two lines (the error message and the continue) to:
    # blog post
    m3 = re.search('quora\.com/([^/]+)', url)
    filename += m3.group(1)
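One caveat: m3 can still be None when nothing follows quora.com/ in the URL, and m3.group(1) would then raise AttributeError. Below is a minimal sketch of a guarded version, assuming the three patterns are tried in the same order as in the snippet above. The helper name url_slug and the example URLs are hypothetical, not part of crawler.py.

```python
import re

# Patterns copied from the snippet above, written as raw strings so that
# the backslash in "\." is not treated as a string escape.
ANSWER_RE = re.compile(r'quora\.com/([^/]+)/answer')
TOPIC_ANSWER_RE = re.compile(r'quora\.com/[^/]+/([^/]+)/answer')
FALLBACK_RE = re.compile(r'quora\.com/([^/]+)')  # blog posts and anything else


def url_slug(url):
    """Return the URL part to use in the filename, or None if nothing matches.

    Hypothetical helper for illustration; not a function in crawler.py.
    """
    for pattern in (ANSWER_RE, TOPIC_ANSWER_RE, FALLBACK_RE):
        m = pattern.search(url)
        if m is not None:
            return m.group(1)
    return None


if __name__ == '__main__':
    # Made-up URLs, one per case.
    for example in (
        'https://www.quora.com/What-is-X/answer/Jane-Doe',
        'https://www.quora.com/Some-Topic/What-is-X/answer/Jane-Doe',
        'https://example.quora.com/My-Post',
        'https://www.quora.com/',
    ):
        print(example, '->', url_slug(example))
```

In the crawler loop, the caller could keep the existing error message and continue for the None case, so only URLs with no path at all are skipped.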