nickirk/immo

404 on message sending page (wg-gesucht)

tbrodbeck opened this issue · 10 comments

Screenshot 2020-07-12 at 22 42 02
I think they have changed the link structure (it now includes the title string of the offer, as shown on the left side). If someone could confirm that, I might implement a fix for this!

Yes, I can confirm that. Previously the link format was fixed, with the offer id at the end. Now they also include the name of the offer, which differs from offer to offer. The solution will be to find the "NACHRICHT SENDEN" button in the page source and click on that directly. There are already some examples in the code of how to find a button. If you could follow them and fix it, that would be great.
Screenshot 2020-07-15 at 12 52 59

Alright! I just looked into the code.
Can you maybe tell me how to access the URLs of the offers? I think the scraper only returns the IDs - I am not familiar with scrapy.

Replace this line in wg-gesucht-spider.py:

for quote in response.css('div.offer_list_item::attr(data-id)').extract():

with

for quote in response.css('h3.truncate_title a::attr(href)').extract():

and you'll have all the URLs of the offers. But please pay attention to the following two points:

  1. The first few URLs are sometimes advertisements from companies like airbnb, so make sure to filter them out of the real offers. This can be done by matching keywords like "airbnb" in the link; if you find one, discard it:

    for quote in response.css('h3.truncate_title a::attr(href)').extract():
        # skip sponsored links such as airbnb banners
        if 'airbnb' in quote:
            continue
        yield {
            "data-id": quote,
        }
  2. In submit_wg.py, replace line 11,
    driver.get('https://www.wg-gesucht.de/nachricht-senden.html?message_ad_id='+ref)

with

    driver.get('https://www.wg-gesucht.de/nachricht-senden/'+ref)

This directly takes you to the message sending page. The ref variable contains the URLs you get from the scraper; they look like this: 1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html

Good luck hacking. If you need more help, don't hesitate to ask me. I will be happy to see you make this thing work again.
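The two steps above can be sketched together like this. This is just a sketch: `is_ad` and `build_message_url` are hypothetical helper names, and the keyword list is an assumption based on the airbnb example in this thread.

```python
# Hypothetical helpers combining the ad filter and the new message URL scheme.
AD_KEYWORDS = ("airbnb",)

def is_ad(href):
    """Return True for sponsored links (e.g. airbnb banners) mixed into the results."""
    return any(keyword in href.lower() for keyword in AD_KEYWORDS)

def build_message_url(ref):
    """Turn a scraped offer href into the message-sending URL."""
    return 'https://www.wg-gesucht.de/nachricht-senden/' + ref

# example hrefs as they appear in the scraped listing page
hrefs = [
    'https://airbnb.pvxt.net/c/1216694/264339/4273',
    '1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html',
]
offers = [h for h in hrefs if not is_ad(h)]
for offer in offers:
    print(build_message_url(offer))
```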

Thank you!
That was already really helpful. After fiddling around a bit with selenium, I got it working again!

But still there is one odd bug:
Scrapy does not scrape the correct webpage, and I am not sure why. It does apply the filters correctly (e.g. https://www.wg-gesucht.de/wg-zimmer-in-Berlin.8.0.1.0.html?user_filter_id=3881821&offer_filter=1&city_id=8&noDeact=1&sMin=15&wgSea=2&wgAge=28&img_only=1&ot=85079%2C163&categories%5B0%5D=0&rent_types%5B0%5D=2) - but it does not select the correct location. So I get results spread all over Berlin, including regions I had filtered out.
I could reproduce this issue (irregularly and rarely) by opening the filtered link in a private window.

Maybe you have an idea why that happens and what I could look into?

Have you compared the actual results on the webpage with the results scrapy got? Maybe it is not a problem on the scrapy side but rather with the website itself?
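One quick way to do that comparison is a set diff of the two href lists. The lists below are placeholders taken from the href format in this thread; fill them with what scrapy returned and what the browser actually shows.

```python
# Placeholder result sets: replace with the hrefs from scrapy and from the browser.
scraped = {
    'wg-zimmer-in-Berlin-Dahlem.4044959.html',
    'wg-zimmer-in-Berlin-Pankow.7771731.html',
}
on_page = {
    'wg-zimmer-in-Berlin-Mitte.8098905.html',
    'wg-zimmer-in-Berlin-Pankow.7771731.html',
}

# offers only one side saw point at where the discrepancy comes from
only_scrapy = sorted(scraped - on_page)
only_browser = sorted(on_page - scraped)
print('only scrapy saw:', only_scrapy)
print('only the browser shows:', only_browser)
```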

Also, did you update the link inside the spider, which should be in your working directory?

Can you please try to press reload in your browser and tell me if the website changes?

Oh okay, then the filter actually works in this case (I filtered for Mitte and FHain/XBerg).

The issue that sometimes happens (and I think happens to scrapy every time) is that the location filter is not loaded correctly.
After I simply refreshed the page on my iPad, the "STADTTEILE" filter was loaded:
RPReplay-Final1594902713

Interesting. Could you make a pull request so that I can merge your code? I can then look into it by running the script.

I just reproduced it with Beautiful Soup:

import bs4
import requests

# the filtered search URL from above
baseUrl = 'https://www.wg-gesucht.de/wg-zimmer-in-Berlin.8.0.1.0.html?user_filter_id=3881821&offer_filter=1&city_id=8&noDeact=1&sMin=15&wgSea=2&wgAge=28&img_only=1&ot=85079%2C163&categories%5B0%5D=0&rent_types%5B0%5D=2'
page = requests.get(baseUrl)
soup = bs4.BeautifulSoup(page.content, 'html.parser')
# print the href of every offer title link
for h3 in soup.find_all('h3', class_='truncate_title'):
    for a in h3.find_all('a'):
        print(a['href'])
https://airbnb.pvxt.net/c/1216694/264339/4273?u=www.airbnb.de/s/Berlin/homes&p.checkin=2020-08-01&p.checkout=2020-08-31&sharedid=notemp_Berlin_1_desk&param1=de_wg_4
wg-zimmer-in-Berlin-Dahlem.4044959.html
wg-zimmer-in-Berlin-Pankow.7771731.html
wg-zimmer-in-Berlin-Mitte.8098905.html
wg-zimmer-in-Berlin-Neukoelln.5373875.html
wg-zimmer-in-Berlin-Koepenick.4691030.html
wg-zimmer-in-Berlin-Charlottenburg.8123501.html
wg-zimmer-in-Berlin-Charlottenburg.8110089.html
wg-zimmer-in-Berlin-Charlottenburg.8095287.html
wg-zimmer-in-Berlin-Zehlendorf.7384968.html
wg-zimmer-in-Berlin-Friedrichshain-Kreuzberg.8127431.html
wg-zimmer-in-Berlin-Lichtenberg.8107841.html
wg-zimmer-in-Berlin-MITTE.6126245.html
wg-zimmer-in-Berlin-Neukoelln.6365392.html
wg-zimmer-in-Berlin-Neukoelln.8042369.html
wg-zimmer-in-Berlin-Neukoelln.4934132.html
wg-zimmer-in-Berlin-Friedrichshain.8122261.html
wg-zimmer-in-Berlin-Mitte.8130226.html
wg-zimmer-in-Berlin-Adlershof.5626460.html
wg-zimmer-in-Berlin-Friedrichshain.3514132.html
wg-zimmer-in-Berlin-Zehlendorf.8127837.html
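To see at a glance which districts a response actually contains, the district can be pulled out of each href. This sketch assumes the filename pattern `wg-zimmer-in-Berlin-<District>.<id>.html` visible in the output above; the sample list here is just a few of those hrefs.

```python
import re

# A few hrefs in the format seen in the scraped output above.
hrefs = [
    'wg-zimmer-in-Berlin-Dahlem.4044959.html',
    'wg-zimmer-in-Berlin-Mitte.8098905.html',
    'wg-zimmer-in-Berlin-Friedrichshain-Kreuzberg.8127431.html',
]

# Assumed pattern: district name between the city prefix and the numeric offer id.
pattern = re.compile(r'wg-zimmer-in-Berlin-(.+)\.\d+\.html')
districts = {m.group(1) for h in hrefs if (m := pattern.match(h))}
print(sorted(districts))  # prints ['Dahlem', 'Friedrichshain-Kreuzberg', 'Mitte']
```

If districts you filtered out show up in this set, the location filter was indeed not applied to that response.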