
This project recommends movies based on Douban user reviews and movie reviews.

Primary language: C++

Movie-Recommendation

Description

The movie recommendation system derives movie attributes from each movie's reviews and user attributes from the user's own reviews. It then recommends movies by matching user attributes against movie attributes.

Modules

Corpus & corpus crawling module

The corpus is built by the corpus crawling module, which crawls user reviews and movie reviews from Douban.

Data preprocessing module

The data preprocessing module generates the movie classification dictionary and the movie vocabulary from the crawled corpus.

Movie recommendation module

This module uses the preprocessed data to compute attribute scores for users and movies, and makes recommendations based on those scores.

Overall process

In the information gathering stage, movie reviews are collected from the Douban website, and the definition of each movie genre is taken from encyclopedia entries. The encyclopedia definitions are then segmented into words, stop words are removed, and a movie classification dictionary is built. The Douban reviews go through the same word segmentation and stop-word removal to build a review vocabulary. Finally, attribute scores are generated from these two vocabularies, and recommendations are made by matching on the attribute scores.
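The README does not spell out how attribute scores are computed from the two vocabularies. As a minimal sketch, assuming a score is simply the share of dictionary hits per genre, the step could look like this. The function name attribute_scores and the toy data are hypothetical, not taken from the project.

```python
from collections import Counter


def attribute_scores(review_words, genre_dictionary):
    """Score each genre by its share of dictionary hits among the review words.

    This is an assumed scoring rule, not the project's actual formula.
    """
    counts = Counter(review_words)
    # Number of review words that belong to each genre's word set.
    hits = {
        genre: sum(counts[word] for word in words)
        for genre, words in genre_dictionary.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No dictionary word appeared in the reviews at all.
        return {genre: 0.0 for genre in genre_dictionary}
    return {genre: n / total for genre, n in hits.items()}


# Hypothetical toy data standing in for the classification dictionary
# and for a cleaned review vocabulary.
genre_dictionary = {
    "sci-fi": {"space", "future", "robot"},
    "comedy": {"funny", "laugh", "joke"},
}
review_words = ["space", "robot", "funny", "space", "plot"]
scores = attribute_scores(review_words, genre_dictionary)
```

Normalizing by the total number of hits keeps scores comparable between a user with ten reviews and a movie with a hundred.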

[Figure: flowchart of the overall process]

Environment

  • Overall requirements: Python 2 and Python 3
  • Libraries used: requests, bs4, fake_useragent, and pkuseg
  • An environment that can run PE (Portable Executable) files is also required

Corpus

  • The corpus is divided into "movie reviews" and "user reviews"

  • "User reviews" are a user's ten most recent reviews, used to determine the user's attributes

  • "Movie reviews" are the first five pages of reviews of a movie, used to determine the movie's attributes

    To add data, crawl it with user_reviews.py and movie_reviews.py.
    Environment:
    python2
    requests library
    fake_useragent library (optional)

    Crawler description

    You can replace proxies with any working crawler proxies. To change the name of the file that the crawled data is written to, change the first argument of open. To crawl more data and improve recognition accuracy, set the final_page variable to the number of pages you want (user reviews have 10 per page, movie reviews 20 per page). For how to use the scripts, see the YouTube video: Crawler Demo
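The three knobs described above (proxies, the output file name passed to open, and final_page) can be illustrated with a small sketch. It is written for Python 3, whereas the project's crawler scripts target Python 2, and it is not the code of user_reviews.py or movie_reviews.py. The function names, the placeholder URL template, and the offset-based paging are all assumptions.

```python
import os
import tempfile


def fetch_with_requests(url, proxies=None):
    """Download one page. Requires the requests library."""
    import requests  # imported lazily so the rest of the sketch runs without it

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.text


def crawl_pages(url_template, final_page, per_page, out_path, fetch):
    """Fetch final_page pages and write each page's text to out_path.

    Paging by item offset (start=0, per_page, 2*per_page, ...) is an assumption.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for page in range(final_page):
            url = url_template.format(start=page * per_page)
            out.write(fetch(url))
            out.write("\n")
    # Upper bound on the number of reviews collected.
    return final_page * per_page


# Offline demonstration with a stub in place of a real download.
requested = []


def fake_fetch(url):
    requested.append(url)
    return "page for " + url


out_path = os.path.join(tempfile.mkdtemp(), "movie_reviews.txt")
max_items = crawl_pages(
    "https://example.invalid/reviews?start={start}",  # placeholder URL
    final_page=5,
    per_page=20,
    out_path=out_path,
    fetch=fake_fetch,
)
```

For a real run, pass a function such as lambda url: fetch_with_requests(url, proxies={"http": "...", "https": "..."}) as fetch, filling in your own proxy addresses.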

    demo

    • Crawling movie reviews
    • Movie review crawl results
    • Crawling user reviews
    • User review crawl results

Corpus description

| Corpus | Source | Purpose | Count |
| --- | --- | --- | --- |
| User reviews | Douban, recent reviews by the same user | Determine user attributes | 10 reviews |
| Movie reviews | Douban, first 5 pages of reviews of the same movie | Determine movie attributes | 5 pages, 20 reviews per page |

Reviews are separated from one another by a string of equal signs.
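Because the reviews are delimited by a string of equal signs, a corpus file can be split back into individual reviews with a few lines of code. The sketch below assumes the separator sits on a line of its own and is at least three equal signs long. The exact separator length is not stated here, and split_reviews is a hypothetical helper name.

```python
import re


def split_reviews(corpus_text):
    """Split corpus text into reviews on lines consisting only of '=' characters."""
    parts = re.split(r"^[ \t]*={3,}[ \t]*$", corpus_text, flags=re.MULTILINE)
    # Drop the empty pieces left around leading or trailing separators.
    return [part.strip() for part in parts if part.strip()]


sample = (
    "First review.\n"
    "==========\n"
    "Second review,\n"
    "over two lines.\n"
    "==========\n"
)
reviews = split_reviews(sample)
```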

Copyright notes

This corpus is for non-commercial purposes only. If there is any infringement, please leave a message under the issues.

Preprocessing module

The Data Pre-Processing folder contains five automation scripts:

  • seg.py: single-file word segmentation script

  • clean.py: stop-word removal script

  • dictionary.py: dictionary building script

  • count.py: word count script

  • whileseg.py: batch word segmentation script

    For a demonstration of how to use the scripts, see: Data Pre Processing demo

    Required environment

    Python 3. The pkuseg library must be installed beforehand.
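To make the roles of the cleaning and counting steps concrete, here is a minimal sketch of stop-word removal followed by word counting. It is not the code of clean.py or count.py. The function names, the sample tokens, and the stop-word set are hypothetical, and the tokens are assumed to come from a segmenter such as pkuseg.

```python
from collections import Counter


def remove_stop_words(tokens, stop_words):
    """Keep tokens that are non-blank and not in the stop-word set."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def count_words(tokens):
    """Count how often each remaining word occurs."""
    return Counter(tokens)


# With pkuseg installed, the tokens would come from the segmenter, e.g.:
#   import pkuseg
#   tokens = pkuseg.pkuseg().cut(text)
# Here a pre-segmented sample is used so the sketch runs without pkuseg.
tokens = ["这", "部", "电影", "的", "特效", "很", "震撼", ",", "特效", "好"]
stop_words = {"这", "部", "的", "很", ","}

cleaned = remove_stop_words(tokens, stop_words)
counts = count_words(cleaned)
```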

    demo

    • Movie review cleaning
    • Movie review cleaning results
    • Finding the definitions of movie genres
    • Building the movie classification dictionary

Recommendation module

For usage instructions and a video demo, see: Recommendation demo

Recommendation module flow

The inputs are movie-word and user-word, that is, the movie review vocabulary and the user review vocabulary. The outputs are the attributes of the users and movies, and the recommended movies.
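How user attributes are matched against movie attributes is not described here. One common choice, shown purely as an assumption, is to treat both as genre-score vectors and rank movies by cosine similarity to the user. The names cosine_similarity and recommend, and all of the sample scores, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse attribute vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_attrs, movies, top_n=1):
    """Return the top_n movie titles most similar to the user's attributes."""
    ranked = sorted(
        movies,
        key=lambda title: cosine_similarity(user_attrs, movies[title]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical attribute scores for one user and two movies.
user = {"sci-fi": 0.8, "comedy": 0.2}
movies = {
    "Movie A": {"sci-fi": 0.9, "comedy": 0.1},
    "Movie B": {"sci-fi": 0.1, "comedy": 0.9},
}
best = recommend(user, movies)
```

Cosine similarity compares the shape of the two score profiles rather than their magnitude, so a user with few reviews can still be matched to a heavily reviewed movie.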

Required environment

PE file execution environment.

demo

  • Movie: 流浪地球 (The Wandering Earth)
  • Movie: 夏洛特烦恼 (Goodbye Mr. Loser)
  • User: 叶子阿姨
  • User: 彩蛋君

Instructions

Run movie_attr.bat to get the movie attribute scores, and run user_attr.bat to get the user attribute scores.

Note: Before using these batch files, the full environment described above must be installed, otherwise they will fail.

Acknowledgements

References

[1]王侨云,朱广丽,张顺香.基于词间距和点互信息的影评情感词库构建[J].阜阳师范学院学报(自然科学版),2019,36(02):40-46.
[2]王婷婷.字符串模糊匹配算法的探讨[J].现代计算机(专业版),2012(01):12-15.
[3]S_H-A_N.基于情感词典的情感分析[EB/OL].https://blog.csdn.net/lom9357bye/article/details/79058946,2018-1-19.
[4]刘鹏.利用网络爬虫技术获取他人数据行为的法律性质分析[J].信息安全研究,2019,5(06):548-552.
[5]黄克敏.网站信息安全之反爬虫策略[J].保密科学技术,2018(10):62-63.