
goudan(狗蛋)

Goudan (狗蛋) is a tunnel proxy. Theoretically it supports any TCP proxy protocol, such as HTTP, HTTPS, and SOCKS. By default, goudan crawls free proxies from some websites, so you can use it out of the box.

Why do this

When I develop a spider to crawl websites, most of the time they have anti-crawling measures.

So I have to change my IP address from time to time while crawling.

The usual way is to set a proxy address in a web request library such as Requests, urllib, or aiohttp.

But then I would need to write that code in every project, and I don't want to do that.

That is why I started this project.

How to use

Use with Docker (recommended)

docker run -p 1991:1991 -d --restart always --name goudan daoye/goudan

or

docker run -p 1991:1991 -d --restart always --name goudan daoye/goudan --log_level 10 -r 10 -i 60 -t socks

If you want to see the help documentation:

docker run daoye/goudan -h
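
Once the container is running, point your HTTP client at goudan like any other proxy. Below is a minimal sketch with the requests library, assuming goudan is listening on 127.0.0.1:1991 (the port mapped above) and the pool is serving HTTP proxies:

import requests

# goudan listens on port 1991 (mapped in the docker run command above)
# and forwards each request through a proxy from its pool.
proxies = {
    'http': 'http://127.0.0.1:1991',
    'https': 'http://127.0.0.1:1991',
}

resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
print(resp.text)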

From source (requires Python 3.7)

git clone https://github.com/daoye/goudan.git
cd goudan
git checkout develop
python3 main.py

The best way is to use a virtualenv.
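
For example (a sketch; the requirements.txt file name is an assumption, check the repository for the actual dependency file):

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt   # assumes a requirements.txt exists in the repo
python3 main.py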

Add your proxies

If you have other proxies of your own, you can add them to the proxy pool.

To do this, create a new spider. For example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-


class MySpider():
    def run(self):
        # Return a list of proxy items to add to the pool.
        return [
            {'host': '127.0.0.1', 'port': 1080, 'type': 'socks', 'loc': 'jp'},
            {'host': '127.0.0.1', 'port': 1087, 'type': 'http', 'loc': 'jp'}
        ]

This spider returns a list containing some proxies.

Alternatively, you can collect proxies from other websites:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


from lxml import etree
from spiders.baseSpider import BaseSpider


class MySpider(BaseSpider):
    def __init__(self):
        BaseSpider.__init__(self)

        # These are the target urls to crawl.
        self.urls = [
            'http://www.xxx.xxx/'
        ]

        # Crawl every 10 minutes.
        self.idle = 10 * 60

    def _parse(self, results, text):
        # Parse "text" (the raw HTML of a target url),
        # then append each proxy found to "results".
        doc = etree.HTML(text)

        # Adjust the XPath expressions to match the markup of the target site.
        for row in doc.xpath('//table//tr'):
            cells = row.xpath('./td/text()')
            if len(cells) < 2:
                continue

            results.append({
                'host': cells[0].strip(),
                'port': int(cells[1].strip()),
                'type': 'http',
                'loc': 'cn'
            })

A proxy item is a dictionary with these keys:

host: The IP address.

port: The port; it must be an integer.

type: The proxy type; one of http, https, http/https, or socks.

loc: The location of the proxy (not important; reserved for a future feature).

When you create a spider, you must register it in "setting.py".

Open "setting.py", find the "spiders" variable, and add your spider to it:

spiders = [
    ...
    'spiders.mySpider.MySpider'  # This is your spider.
]

The end

Enjoy!

License

MIT License