Python project to download web pages and search them.
Don't commit the actual input file. oovwords.csv is ignored via .gitignore.
python3 -m get_suggested_spellings -in_dir "data/input" -in_file "oovwords.csv" -out_dir "data/output" -out_file "suggested_spelling_output.csv"
python3 -m get_suggested_spellings
python3 -m get_top_hits
python3 -m concatenate_regex
python3 -m download_web
Storing pages enables searching multiple times without re-downloading.
Search is similar to the Unix/Linux grep command.
Note: Use different values for download_directory and the output file directory.
Otherwise subsequent searches might accidentally search an output file.
python3 -m search_web -expression "ython" -search_directory "data/downloads" -out_dir "data/output" -out_file "websearcher_output.txt"
python ./websearcher/search_web.py
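The grep-like search described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual search_web implementation: the function name grep_directory and the shape of the returned tuples are hypothetical, and it assumes the downloaded pages are stored as UTF-8 text files directly inside the search directory.

```python
import re
from pathlib import Path


def grep_directory(expression, directory):
    """Return (file name, line number, line) for each line matching expression.

    Hypothetical helper, roughly comparable to `grep -n expression directory/*`.
    """
    pattern = re.compile(expression)
    matches = []
    # sorted() keeps the result order the same on every platform
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the search going if a page contains odd bytes
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                matches.append((path.name, line_number, line))
    return matches
```

For example, grep_directory("ython", "data/downloads") would list every line in the stored pages that contains "ython", without downloading anything again.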
http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python
http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/
http://stackoverflow.com/questions/11804497/python-3-web-scraping-and-javascript-oh-my?rq=1
Free use is limited, then pay: https://code.google.com/archive/p/google-api-spelling-java/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
http://stackoverflow.com/questions/11331071/get-class-name-and-contents-using-beautiful-soup
Python 3 project to search local file directories
https://github.com/beepscore/searcher
Many web requests return a combination of HTML and JavaScript. In these cases, we can use a web browser to run the JavaScript and get more HTML.
Use Selenium WebDriver to load the page in a browser. Have Selenium wait until the browser executes the JavaScript and gets more HTML. Then parse and search the page, e.g. with Beautiful Soup.
http://stackoverflow.com/questions/11331071/get-class-name-and-contents-using-beautiful-soup
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
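The Selenium plus Beautiful Soup approach described above might look roughly like this. It is a sketch under assumptions, not the project's code: the function names, the 10 second timeout and the css_selector argument are hypothetical, and it assumes the selenium and beautifulsoup4 packages and a matching browser driver are installed. The third-party imports sit inside the functions so the module can still be imported where they are missing.

```python
import re


def fetch_rendered_html(url, css_selector, timeout_seconds=10):
    """Load url in Chrome, wait until css_selector appears, return the HTML."""
    # Third-party imports (assumed installed): pip install selenium
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Block until the element exists, e.g. content added by JavaScript
        WebDriverWait(browser, timeout_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        # Always close the browser, even if the wait times out
        browser.quit()


def search_rendered_text(html, expression):
    """Return the lines of page text that match expression."""
    # Third-party import (assumed installed): pip install beautifulsoup4
    from bs4 import BeautifulSoup

    text = BeautifulSoup(html, "html.parser").get_text("\n")
    return [line.strip() for line in text.splitlines() if re.search(expression, line)]
```

Waiting for a specific element is one choice among several. Which selector signals that the page has finished loading depends on the site being searched.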
Can use Anaconda or virtualenv.
cd websearcher
Supply path to websearcher, e.g.
source ./websearcher/venv/bin/activate
venv\Scripts\activate
To run tests, open a terminal shell and cd to the project directory. Run tests via a python command or the bash script.
Runs all test modules. Works on OS X. On Windows it may work with Cygwin, but this is unverified.
$ ./bin/run_tests
This command lists and tests all modules.
python3 -m unittest discover -s tests/
Alternatively, you can supply test module names as args.
This command lists and tests all modules except web_downloader_arg_reader and web_searcher_arg_reader.
python -m unittest tests.test_page_reader tests.test_file_writer tests.test_web_downloader tests.test_web_searcher
Attempting to run test_web_downloader_arg_reader and test_web_searcher_arg_reader runs into a conflict between the arguments for unittest and the arguments for argparse.
e.g. python -m unittest discover says "unrecognized arguments: discover" and wants the argparse arguments.
TODO: Consider alternative solutions.
http://stackoverflow.com/questions/35270177/passing-arguments-for-argparse-with-unittest-discover
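One possible way around the unittest/argparse conflict, offered as a sketch and not as the project's chosen fix: let the argument reader accept an explicit argument list, so tests pass their own list and argparse never sees unittest's command line. The function name read_args is hypothetical. The flags are the ones used in the search_web command earlier in this README.

```python
import argparse


def read_args(argv=None):
    """Parse search arguments.

    With argv=None, argparse falls back to sys.argv[1:], so normal
    command-line use is unchanged. Tests pass an explicit list instead.
    """
    parser = argparse.ArgumentParser(description="Search downloaded web pages.")
    parser.add_argument("-expression", help="regular expression to search for")
    parser.add_argument("-search_directory", help="directory of downloaded pages")
    parser.add_argument("-out_dir", help="directory for the output file")
    parser.add_argument("-out_file", help="name of the output file")
    return parser.parse_args(argv)
```

A test can then call read_args(["-expression", "ython"]) and check the returned namespace, regardless of how unittest itself was started.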
When searching by CSS class in Beautiful Soup, use class_, not class, which is a reserved Python keyword.
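A short illustration of the class_ note, assuming the beautifulsoup4 package is installed. The function name and the sample HTML in the usage note are made up for the example. Writing find_all(class="hit") is a SyntaxError because class is a reserved word, so Beautiful Soup accepts class_ instead.

```python
def texts_with_class(html, css_class):
    """Return the text of each tag whose CSS class matches css_class."""
    # Third-party import (assumed installed): pip install beautifulsoup4
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # class_ (trailing underscore) avoids the reserved word "class"
    return [tag.get_text(strip=True) for tag in soup.find_all(class_=css_class)]
```

For example, texts_with_class('<p class="hit">first</p><p class="miss">second</p>', "hit") returns ['first'].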
oovlist.csv
The file from Windows had line endings that show as ^M in vim. Changed to Unix line endings.
http://stackoverflow.com/questions/811193/how-to-convert-the-m-linebreak-to-normal-linebreak-in-a-file-opened-in-vim
At the vim command line, type the command below, including ^V and ^M.
:%s/<Ctrl-V><Ctrl-M>/\r/g
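The same cleanup can also be scripted. The sketch below is an alternative to the vim command, not something the project ships, and the function names are hypothetical. It works on bytes so that Python's own newline translation does not interfere, and it treats both CRLF and a lone CR as line endings.

```python
from pathlib import Path


def to_unix_line_endings(data):
    """Return data (bytes) with CRLF and lone CR line endings replaced by LF."""
    # Replace CRLF first so it does not turn into two line feeds
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path):
    """Rewrite the file at path in place with Unix line endings."""
    file_path = Path(path)
    file_path.write_bytes(to_unix_line_endings(file_path.read_bytes()))
```

Because convert_file overwrites the file in place, it is worth running it on a copy first.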
Select within the desired anaconda environment, e.g.
> Python 3.6.1 (~/anaconda/envs/beepscore/bin/python)
NOTE: On Windows, may need to click "eye" icon to show hidden files e.g.
C:\Users\KLittle\AppData\Local\Continuum\anaconda3\envs
If using Poetry, select within desired virtual environment, e.g.
> ~/Library/Caches/pypoetry/virtualenvs/websearcher-NBsQj66t-py3.7/bin
select add content roots to python path
select add source roots to python path
can leave this blank
beepscore02:websearcher stevebaker$ conda activate beepscore
C:\Users\KLittle\AppData\Local\Continuum\anaconda3\Scripts\activate my_env_name
Notice command prompt shows anaconda environment is active
(beepscore) beepscore02:websearcher stevebaker$
(beepscore) beepscore02:websearcher stevebaker$ which python
/Users/stevebaker/anaconda/envs/beepscore/bin/python
(beepscore) beepscore02:websearcher stevebaker$ python --version
Python 3.6.2 :: Continuum Analytics, Inc.
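The active environment can also be checked from inside Python. This is a minimal sketch using only the standard library, with a made-up function name. CONDA_DEFAULT_ENV is an environment variable that conda sets on activation, and comparing sys.prefix with sys.base_prefix is a common way to detect a venv-style virtual environment (very old virtualenv releases may not be detected this way).

```python
import os
import sys


def describe_environment():
    """Return facts about the interpreter that is currently running."""
    return {
        # Path of the running interpreter, comparable to `which python`
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Name of the active conda environment, or None if conda is not active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # True inside a venv-style virtual environment
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, value)
```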
In shell run conda deactivate
(beepscore) beepscore02:websearcher stevebaker$ conda deactivate
https://python-poetry.org/docs/basic-usage/
cd to virtual environment e.g.
cd /Users/stevebaker/Library/Caches/pypoetry/virtualenvs/websearcher-NBsQj66t-py3.7/bin
source activate
Notice command prompt shows virtual environment is active
(websearcher-NBsQj66t-py3.7)
Selenium version 3 needs a driver to launch a browser.
https://www.seleniumeasy.com/selenium-tutorials/launching-firefox-browser-with-geckodriver-selenium-3
https://github.com/mozilla/geckodriver
install via homebrew
brew install geckodriver
Then in the Python file:
browser = webdriver.Firefox()
2016-10-23 Firefox with current geckodriver works, but logs warning 'NoneType' object has no attribute 'path'
install via homebrew
brew install chromedriver
Then in the Python file:
browser = webdriver.Chrome()
2016-10-23 Chrome with chromedriver, log doesn't show a warning
https://stackoverflow.com/questions/38876281/anaconda-selenium-and-chrome
In the terminal program "Anaconda Prompt", activate the desired conda environment, e.g.
C:\Users\KLittle\AppData\Local\Continuum\anaconda3\Scripts\activate my_env_name
Then to install
conda install -n my_env_name -c conda-forge python-chromedriver-binary