JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng
evb-gh opened this issue · 18 comments
Description
Running funnel load -s settings_USA.yml gives the following error:
[2021-03-16 18:34:58,123] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng']
[2021-03-16 18:34:59,720] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng
[2021-03-16 18:34:59,720] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.
[2021-03-16 18:34:59,882] [INFO] JobFunnel: Done. View your current jobs in demo_job_search_results/demo_search.csv
Environment
- Build: 3.0.1
- macOS 10.14
I would like to debug further but I'm not sure how.
Also receiving this. Monster is working fine but Indeed fails every time, even with different search keywords. Using DEBUG logging, I was able to get the URL it was trying to hit and it seemed fine.
Environment:
- Build 3.0.1
- Ubuntu 18.04.4 LTS
- Python 3.9.2
Thanks for opening an issue. I think we have some long-outstanding issues with parsing of the search URL for certain queries; if you are open to sharing your search URLs from the logs, it would be very helpful for identifying what the issue is.
We currently have CI for the US Indeed scraper, but it only performs a basic search.
Additionally, can you confirm that you are able to obtain results (non-advertisement results) for the search you are performing on the Indeed website?
Sure. My JobFunnel has also been failing the Monster scrape the past few days (using crontab to run once daily). I would also try to debug if I could, but I'm not very familiar with running Python projects and I couldn't figure out how to run from PyCharm with the source 😅
URL: https://www.indeed.com/jobs?q=Software Engineer&l=tulsa%2C+OK&radius=25&limit=50&filter=0
I also used the URL: https://www.indeed.com/jobs?q=Software&l=tulsa%2C+OK&radius=25&limit=50&filter=0
Just to see if maybe the space was throwing things off. That URL also failed.
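For reference, that unencoded space is what proper query encoding would normally handle. A quick stdlib sketch (the parameters mirror the URL above; this is illustrative, not JobFunnel's code):

```python
# Illustrative only: stdlib encoding of the same query parameters as the
# failing URL above. Note the space in "Software Engineer" becomes "+",
# matching the already-encoded "tulsa%2C+OK" portion of that URL.
from urllib.parse import urlencode

params = {
    "q": "Software Engineer",
    "l": "tulsa, OK",
    "radius": 25,
    "limit": 50,
    "filter": 0,
}
url = "https://www.indeed.com/jobs?" + urlencode(params)
print(url)
# https://www.indeed.com/jobs?q=Software+Engineer&l=tulsa%2C+OK&radius=25&limit=50&filter=0
```

Whether the raw space alone is what breaks the scraper is unclear (the single-keyword URL failed too), but a mixed encoded/unencoded URL is worth ruling out.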
Ok, yeah, looks like we need to improve the URL parsing! Can you try instead searching for two separate keywords, like this:
- Software
- Engineer
Oh, I see that you tried with a single keyword as well, ok. I think this might be some other issue.
One thing to try is to use the current master of this repo. You can do that by installing it in place with pip install -e <path to this repo>
Ah ok, thanks for being so responsive, we’ll have to take a deeper look.
If you are feeling confident, I invite you to break execution in the scraper where we collect the number of pages of results from the search URL; I suspect the issue is there, since it ends up scraping no jobs.
I would be interested in doing some debugging, but I may need some advice on how to do so from something like PyCharm (open to another IDE if you recommend one). This is a tad out of scope for the issue, so pardon my intrusion.
I am trying to run JobFunnel-master\jobfunnel\__main__.py but doing so gets me an import error
Like I mentioned, I'm not super familiar with running Python, especially in a project like this, so this may be completely the wrong place to try to start running 😅 but if you can point me in the right direction for how I might get to a point where I can set breakpoints and such, I'd be happy to play around.
Unfortunately PyCharm doesn't work for this project due to the use of abstract base classes.
The best way to debug is to add import pdb; pdb.set_trace() in the code where you would like to debug; then you have access to a complete Python interpreter, i.e. pp var_im_interested_in
You should be able to debug modules, such as jobfunnel, in PyCharm like this:
https://stackoverflow.com/a/51268846
If anyone reading this has the time and knowledge, can I ask you to write a step-by-step example of how to debug this code?
I would like to understand how to debug this repo by running it from a local directory with either PyCharm, the CLI, or Emacs.
RE PyCharm, users have had issues using it with this repository in the past due to the ABC implementation: #90 (comment)
I highly recommend just adding the line import pdb; pdb.set_trace() anywhere in the base scraper or Indeed scraper and playing around with the available methods and variables (pp vars(self))
NOTE: to use pdb with multiprocessing.pool, you will additionally want to set the number of workers to 1.
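For anyone new to pdb, here is a tiny self-contained sketch; the function name and the page-counter format are invented for illustration, not JobFunnel's actual code:

```python
# Minimal pdb sketch; parse_num_pages and its input format are made up
# for illustration. The set_trace() call is commented out so the script
# runs non-interactively; uncomment it to drop into the debugger, where
# commands like `pp counter_text`, `n` (next), and `c` (continue) work.
import pdb  # noqa: F401


def parse_num_pages(counter_text):
    # e.g. "Page 1 of 42 jobs" -> 42
    # pdb.set_trace()  # <- uncomment to inspect variables interactively
    return int(counter_text.split()[3])


print(parse_num_pages("Page 1 of 42 jobs"))  # prints 42
```

Once the trace fires, pp vars(self) from inside a scraper method shows every attribute on the instance, which is usually enough to see where the scrape goes wrong.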
Thanks for the quick reply. I apologize if my questions seem lazy (I have very little experience with Python), but how do I run the code with test parameters (location, keywords) from a locally cloned repository?
Totally fine, happy to help!
You should be able to run with test parameters by doing this:
wget https://git.io/JUWeP -O my_settings.yaml
funnel load -s my_settings.yaml
Running funnel load -s my_settings.yaml, doesn't it run the code from /usr/local/bin/funnel, which then executes code from /usr/local/lib/python3.9/site-packages/jobfunnel?
What I'm trying to do is:
- Clone the repo locally to ~/jobfunnel
- Add import pdb; pdb.set_trace() to indeed.py or base.py
- Run the code from ~/jobfunnel with my_settings.yml
- Debug
Right, I recommend doing this to have a test version of JobFunnel:
- git clone this repo somewhere
- checkout the branch you want to test
- virtualenv venv
- source venv/bin/activate
- pip3 install -e ./jobfunnel
When done you can exit the virtualenv with deactivate
Ok, so I think the best place to start is indeed.py line 303 in the current master: query_resp.find returns None, and I believe this is due to the encoding of the request_html being incorrect somehow. I'm taking a look as well since I want this to work for everyone :P
<bound m�D������]���nd of <html><body><p>�J ��_�~�ް��уƽ����� O�
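One common way to get exactly this symptom (binary noise where HTML should be, and find() returning None) is a response body that was never decompressed. A hedged stdlib sketch of that failure mode, not necessarily JobFunnel's actual bug:

```python
# Hedged sketch: if an HTTP response body is still gzip-compressed when
# it reaches the HTML parser, the "text" looks like the noise pasted
# above and any search of it finds nothing. All names here are stdlib.
import gzip

html = b"<html><body><div id='searchCountPages'>Page 1 of 3 jobs</div></body></html>"
raw = gzip.compress(html)  # simulate a response body nobody decompressed

# The compressed bytes are not valid text; decoding them fails outright
# (the gzip magic bytes \x1f\x8b are not valid UTF-8):
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print("decodes as UTF-8:", decoded_ok)  # decodes as UTF-8: False

# Decompressing first recovers the page counter the scraper needs:
text = gzip.decompress(raw).decode("utf-8")
assert "Page 1 of 3" in text
```

If the fix in master forces the right encoding/decompression on the response before parsing, that would explain why the garbled bytes disappear.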
Didn't mean to close this abruptly, but I think the encoding was causing this. Please pull the latest changes and try; this has resolved the issue on my end.