/awesome-python

Python is widely used in data analysis and machine learning; recommender systems 🐵 also interest me.


Python grammar

“Python is an easy to learn, powerful programming language.” Those are the first words of the official Python Tutorial. Python is widely used in data analysis, machine learning, and big-data computing (PySpark); these notes walk through its syntax while reading Fluent Python.

Basic Python variable types

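Python has several standard built-in data types; the five covered here are Number, String, List, Tuple, and Dictionary. Variables need no type declaration on assignment: Python determines the type from the assigned value.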

# drink is recognized as a str, price as a float
drink = 'café'
price = 10.5
# list elements may have different types (tuple, float, and str can be mixed); access an element with city_info[index]
city_info = ['newyork', 23, 10.34, (35.689722, 139.691667)]
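
The type inference described above can be checked with the built-in type() function. The following is a minimal sketch that reuses the variables from the snippet; the names inferred and element_types are introduced here only for illustration.

```python
# Python infers each variable's type from the value assigned to it.
drink = 'café'
price = 10.5
city_info = ['newyork', 23, 10.34, (35.689722, 139.691667)]

inferred = [type(drink).__name__, type(price).__name__, type(city_info).__name__]
print(inferred)  # ['str', 'float', 'list']

# A list may hold mixed types; look at each element's type in turn.
element_types = [type(item).__name__ for item in city_info]
print(element_types)  # ['str', 'int', 'float', 'tuple']

# Rebinding a name to a value of another type is allowed.
price = 10
print(type(price).__name__)  # int
```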

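There are different ways to iterate over a list. Both a list comprehension with for and lambda with filter/map can traverse the list and filter its elements (here, keeping only characters whose code point is greater than 127):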

symbols = ['o', '0', '¢', '£', '¥', '€', '¤']
# listcomps do everything the map and filter functions do
beyond_ascii = [ord(s) for s in symbols if ord(s) > 127]

# use python lambda expression
beyond_ascii = list(filter(lambda c: c > 127, map(ord, symbols)))

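Python can compute the Cartesian product of two lists: two for clauses draw elements from each list and combine them freely. The last line shows unpacking, where range(value) yields the integers from 0 to value - 1: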

# Cartesian product using a list comprehension
colors = ['black', 'white']
sizes = ['S', 'M', 'L']
tshirts = [(color, size) for color in colors for size in sizes]
# a: 0, b: 1, rest: [2, 3, 4]
a, b, *rest = range(5)
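
As a cross-check on the nested comprehension, the standard library's itertools.product yields the same pairs. This is a sketch of an alternative technique, not the approach the notes themselves use; the names tshirts_product, first, middle, and last are illustrative.

```python
from itertools import product

colors = ['black', 'white']
sizes = ['S', 'M', 'L']

# The nested comprehension from the text: the leftmost `for` is the outer loop.
tshirts = [(color, size) for color in colors for size in sizes]

# itertools.product yields the same pairs in the same order, lazily.
tshirts_product = list(product(colors, sizes))

print(tshirts == tshirts_product)  # True
print(len(tshirts))  # 6

# range(5) produces 0, 1, 2, 3, 4; the starred name collects what is left over.
first, *middle, last = range(5)
print(first, middle, last)  # 0 [1, 2, 3] 4
```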

Defining functions and classes

Functions are defined with the def keyword, and no return type needs to be declared. function.__doc__ returns the function's docstring:

def factorial(n):
    """returns n!"""
    return 1 if n < 2 else n * factorial(n - 1)

# factorial(42): 1405006117752879898543142606244511569936384000000000, function doc: returns n!,
# type(factorial): <class 'function'>
print(f"factorial(42): {factorial(42)}, function doc: {factorial.__doc__}, "
        f"type(factorial): {type(factorial)}")

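Classes are defined with the class keyword, where __init__ plays a role similar to a constructor. Inside a class definition, @classmethod decorates class methods and @staticmethod decorates static methods: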

class Document():
  WELCOME_STR = 'Welcome! The context for this book is {}.'
  def __init__(self, title, author, context):
    print('init function called')
    self.title = title
    self.author = author
    self.__context = context
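
The snippet above stops before any decorated method appears, so here is a minimal sketch of how @classmethod and @staticmethod might be used on the same class. The method names create_empty_book, get_welcome, and get_context_length, and the sample values, are hypothetical and chosen only for illustration.

```python
class Document:
    WELCOME_STR = 'Welcome! The context for this book is {}.'

    def __init__(self, title, author, context):
        self.title = title
        self.author = author
        self.__context = context  # name-mangled to _Document__context

    @classmethod
    def create_empty_book(cls, title, author):
        # Hypothetical factory: a class method receives the class itself as `cls`.
        return cls(title=title, author=author, context='nothing')

    @staticmethod
    def get_welcome(context):
        # Hypothetical helper: a static method receives neither instance nor class.
        return Document.WELCOME_STR.format(context)

    def get_context_length(self):
        # A regular instance method can read the private attribute.
        return len(self.__context)


empty_book = Document.create_empty_book('Untitled', 'Anonymous')
print(empty_book.get_context_length())  # 7
print(Document.get_welcome('indeed nothing'))
```

A class method is a common way to offer an alternative constructor, while a static method suits helpers that need no access to instance or class state.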

Inheritance is written as class BOWInvertedIndexEngine(SearchEngineBase): the base class goes in parentheses after the derived class name, and __init__(self) calls the parent constructor first:

class BOWInvertedIndexEngine(SearchEngineBase):
  def __init__(self):
    super(BOWInvertedIndexEngine, self).__init__()
    self.inverted_index = {}
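
SearchEngineBase is not shown in these notes, so the following self-contained sketch substitutes a hypothetical stand-in for it. The documents attribute and the add_document method are assumptions made for illustration, not the original base class.

```python
class SearchEngineBase:
    # Hypothetical stand-in for the base class, which the notes do not show.
    def __init__(self):
        self.documents = []

    def add_document(self, text):
        self.documents.append(text)


class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        # Run the parent constructor first so self.documents exists.
        super().__init__()
        self.inverted_index = {}

    def add_document(self, text):
        # Extend the parent behaviour, then index each word by document id.
        super().add_document(text)
        doc_id = len(self.documents) - 1
        for word in set(text.lower().split()):
            self.inverted_index.setdefault(word, []).append(doc_id)


engine = BOWInvertedIndexEngine()
engine.add_document('python is fun')
engine.add_document('fluent python')
print(engine.inverted_index['python'])  # [0, 1]
print(isinstance(engine, SearchEngineBase))  # True
```

In Python 3, super().__init__() with no arguments is equivalent to the longer super(BOWInvertedIndexEngine, self).__init__() form.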

lambda expressions pair naturally with map and reduce. As in other languages, anonymous functions are concise and readable:

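from functools import reduce  # reduce is not a builtin in Python 3

array = [1, 2, 3, 4, 5]
map_list = list(map(lambda x: x * 2, array))  # [2, 4, 6, 8, 10]; map() alone returns a lazy iterator
reduce_value = reduce(lambda x, y: x * y, array)  # 1*2*3*4*5 = 120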

Concurrency and multithreaded data processing

Tasks are usually created with asyncio.create_task(), then awaited one by one with await, or all together with asyncio.gather(*tasks):

import asyncio
import time

# crawl_page(url) is assumed to be a coroutine defined elsewhere
async def metrics():
  """Time the crawl with time.time(); asyncio.create_task() schedules each coroutine as a task."""
  start_time = time.time()
  urls = ['url_1', 'url_2', 'url_3', 'url_4']
  tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
  # for task in tasks:
  #   await task
  # Alternative: asyncio.gather(*tasks) waits until every task has finished
  await asyncio.gather(*tasks)
  print(f"total used {round(time.time() - start_time, 2)} s for crawling webpage")
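
To run the pattern above end to end, a crawl_page coroutine is needed. The sketch below assumes a hypothetical stand-in that sleeps instead of fetching anything, and returns the gathered results so they can be inspected.

```python
import asyncio
import time


async def crawl_page(url):
    # Hypothetical stand-in for a real crawler: sleep instead of doing network I/O.
    await asyncio.sleep(0.1)
    return f"{url}: done"


async def metrics():
    start_time = time.perf_counter()
    urls = ['url_1', 'url_2', 'url_3', 'url_4']
    tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
    # gather() returns the results in the same order as the tasks passed in.
    results = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start_time
    return results, elapsed


results, elapsed = asyncio.run(metrics())
print(results)
print(f"total used {elapsed:.2f} s for crawling webpage")
```

Because the four tasks wait concurrently, the total time should be close to a single 0.1 s sleep rather than the sum of all four.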

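concurrent.futures runs tasks in parallel. When a task's return value is needed, work with the Future object: its done() method reports whether the corresponding operation has completed (True if finished, False otherwise).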

import concurrent.futures

# download_one(site) is assumed to be defined elsewhere
def download_all(url_sites):
  with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Solution 2: executor.map() calls download_one on each URL in url_sites;
    # if max_workers is omitted, the default is derived from the CPU count
    # executor.map(download_one, url_sites)
    to_do = []
    for site in url_sites:
      future = executor.submit(download_one, site)
      to_do.append(future)

    for future in concurrent.futures.as_completed(to_do):
      # executor.submit() returns a Future; as_completed() yields each one as it finishes
      future.result()
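
The done() method mentioned earlier does not appear in the snippet, so here is a minimal sketch that checks it and collects the return values. download_one is a hypothetical stand-in that returns the length of the site name instead of downloading anything, and future_to_site is an illustrative name.

```python
import concurrent.futures


def download_one(site):
    # Hypothetical stand-in for a real download: return the length of the name.
    return len(site)


def download_all(url_sites):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Map each future back to the site it was submitted for.
        future_to_site = {executor.submit(download_one, site): site for site in url_sites}
        for future in concurrent.futures.as_completed(future_to_site):
            # Every future yielded by as_completed() has already finished.
            assert future.done()
            results[future_to_site[future]] = future.result()
    return results


sizes = download_all(['site_a', 'site_bb', 'site_ccc'])
print(sizes)
```

Keeping a dictionary from future to input is one way to recover which result belongs to which site, since as_completed() yields futures in completion order rather than submission order.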

Python's multiprocessing components live in the multiprocessing package and are straightforward to use: create a process pool and run the tasks with pool.map():

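import multiprocessing

# cpu_bound(number) is assumed to be defined elsewhere
def find_sums(numbers):
  # multiprocessing.Pool() creates a process pool; pool.map() applies cpu_bound to each item in numbers
  with multiprocessing.Pool() as pool:
    pool.map(cpu_bound, numbers)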