Goudan (狗蛋) is a tunnel proxy. In theory it supports any TCP-based proxy, such as HTTP, HTTPS, and SOCKS. By default, Goudan crawls free proxies from several websites, so you can use it out of the box.
When I write a spider to crawl websites, most of them have some anti-crawling measures in place.
So I have to change my IP address from time to time to keep crawling.
The usual way is to set a proxy address in the HTTP request library, such as "Requests", "urllib", "aiohttp", and so on.
But then I have to write that code in every project, and I don't want to do that.
This is why I started this project.
docker run -p 1991:1991 -d --restart always --name goudan daoye/goudan
or
docker run -p 1991:1991 -d --restart always --name goudan daoye/goudan --log_level 10 -r 10 -i 60 -t socks
If you want to see the help text:
docker run daoye/goudan -h
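Once the container is running, a client only needs to be pointed at the published port. The following is a minimal sketch using the standard library, assuming Goudan is reachable at 127.0.0.1:1991 and accepts HTTP proxy requests on that port (both are assumptions; adjust them to your setup). The names `GOUDAN_URL` and `make_opener` are hypothetical and not part of Goudan.

```python
import urllib.request

# Assumption: Goudan runs locally on the port published by "docker run -p 1991:1991"
# and speaks the HTTP proxy protocol on it.
GOUDAN_URL = "http://127.0.0.1:1991"


def make_opener(proxy_url=GOUDAN_URL):
    """Build a urllib opener that routes http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Calling `make_opener().open(url, timeout=10)` then sends the request through the proxy instead of connecting directly.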
git clone https://github.com/daoye/goudan.git
cd goudan
git checkout develop
python3 main.py
Running it inside a virtualenv is recommended.
If you have proxies of your own, you can add them to the proxy pool.
To do this, you must create a new spider. For example:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
class MySpider():
    def run(self):
        return [
            {"host": "127.0.0.1", 'port': 1080, 'type': 'socks', 'loc': 'jp'},
            {"host": "127.0.0.1", 'port': 1087, 'type': 'http', 'loc': 'jp'}
        ]
This spider returns a list of proxies.
Alternatively, you can collect proxies from another website:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree
from spiders.baseSpider import BaseSpider
import logging
class MySpider(BaseSpider):
    def __init__(self):
        BaseSpider.__init__(self)
        # Target URLs to crawl.
        self.urls = [
            'http://www.xxx.xxx/'
        ]
        # Crawl once every 10 minutes (value in seconds).
        self.idle = 10 * 60
    def _parse(self, results, text):
        # Parse "text" into rows, then append each proxy to "results".
        # "rows" is a placeholder for the records you extracted from "text".
        for r in rows:
            results.append({
                'host': r.ip,
                'port': r.port,
                'type': 'http',
                'loc': 'cn'
            })
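The parsing step depends entirely on the target site. As an illustration only, here is a sketch that assumes the page body is plain text with one "ip:port" entry per line; the function name `parse_host_port_lines` and the input format are assumptions, not something Goudan provides.

```python
import re

# Matches lines such as "1.2.3.4:8080". The plain-text format is an assumption.
_LINE = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\s*$")


def parse_host_port_lines(text, proxy_type="http", loc="cn"):
    """Turn 'ip:port' lines into proxy items; lines that do not match are skipped."""
    items = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue
        port = int(match.group(2))
        if not 0 < port < 65536:
            # Skip ports outside the valid TCP range.
            continue
        items.append({
            "host": match.group(1),
            "port": port,
            "type": proxy_type,
            "loc": loc,
        })
    return items
```

Inside `_parse`, something like `results.extend(parse_host_port_lines(text))` would then fill the result list. For HTML pages you would replace the regular expression with an lxml query instead.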
A proxy item is a dictionary with these keys:
host: The IP address.
port: The port; it must be an integer.
type: The proxy's type; it can be: http, https, http/https, socks.
loc: Location of the proxy (not important; reserved for future use).
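The rules above can be checked before a spider hands its items over. The following is a small sketch; `validate_proxy_item` and `ALLOWED_TYPES` are hypothetical helpers written for this example, and Goudan itself may validate items differently or not at all.

```python
# The type values listed above.
ALLOWED_TYPES = {"http", "https", "http/https", "socks"}


def validate_proxy_item(item):
    """Return a list of problems; an empty list means the item looks well-formed."""
    problems = []
    host = item.get("host")
    if not isinstance(host, str) or not host:
        problems.append("host must be a non-empty string")
    port = item.get("port")
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(port, bool) or not isinstance(port, int) or not 0 < port < 65536:
        problems.append("port must be an integer between 1 and 65535")
    ptype = item.get("type")
    if not isinstance(ptype, str) or ptype not in ALLOWED_TYPES:
        problems.append("type must be one of: " + ", ".join(sorted(ALLOWED_TYPES)))
    return problems
```

A common mistake this catches is a port scraped as a string, for example `'1080'` instead of `1080`.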
After you create a spider, you must register it in "setting.py".
Open the file "setting.py", find the "spiders" variable, and add your spider to it:
spiders = [
    ...
    'spiders.mySpider.MySpider'  # This is your spider.
]
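Each entry is a dotted path of the form "package.module.ClassName". As a rough sketch of how such a string can be turned into a class in Python (this is the general technique, not necessarily how Goudan loads its spiders; `load_class` is a hypothetical name):

```python
import importlib


def load_class(dotted_path):
    """Import the module part of 'pkg.module.ClassName' and return the class."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_path)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

With this layout, 'spiders.mySpider.MySpider' would refer to the class `MySpider` in the file `spiders/mySpider.py`, so the file name and class name in the setting must match what you created.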
Enjoy!
MIT License