Warning
This repository is intended for learning only. Web crawling may be illegal in some circumstances, so do not put load on websites or perform unauthorized activities against them.
tanpenggood/xiaohongshu is a crawler application that extracts data from xiaohongshu pages.
Crawl scope: only the data parsed from window.__INITIAL_STATE__ of a xiaohongshu page.
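To illustrate what reading window.__INITIAL_STATE__ from a page involves, here is a minimal sketch that pulls the raw JSON object out of an HTML string using only the JDK. It is not the repository's parser: the class name InitialStateExtractor, the method extract, and the sample HTML are hypothetical, and the sketch assumes the state is assigned as a plain object literal inside a script tag.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of tanpenggood/xiaohongshu.
public class InitialStateExtractor {

    private static final String MARKER = "window.__INITIAL_STATE__";

    /**
     * Returns the raw object literal assigned to window.__INITIAL_STATE__,
     * or Optional.empty() if the marker or a balanced object is not found.
     */
    public static Optional<String> extract(String html) {
        int marker = html.indexOf(MARKER);
        if (marker < 0) {
            return Optional.empty();
        }
        // Assumption: the first '{' after the marker opens the state object.
        int start = html.indexOf('{', marker + MARKER.length());
        if (start < 0) {
            return Optional.empty();
        }

        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inString) {
                // Braces inside string values must not affect the depth count.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return Optional.of(html.substring(start, i + 1));
            }
        }
        return Optional.empty(); // braces never balanced
    }

    public static void main(String[] args) {
        // Made-up sample page; real pages are far larger.
        String html = "<html><body><script>window.__INITIAL_STATE__="
                + "{\"note\":{\"title\":\"a } in a string\",\"likes\":3}}"
                + "</script></body></html>";
        System.out.println(extract(html).orElse("not found"));
    }
}
```

The extracted string can then be handed to a JSON library for parsing. Brace matching is used here instead of a regular expression because nested objects and braces inside string values make a simple pattern unreliable.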
- windows 11
- jdk 1.8
- maven 3.6.0
Run com.itplh.xhs.XhsCrawlabUI
See:
- crawl notes: see the reference test class com.itplh.xhs.XhsCrawlabTest (xiaohongshu/src/test/java/com/itplh/xhs/XhsCrawlabTest.java, lines 9 to 21 in 075cdda)
- save notes to excel: com.itplh.xhs.excel.ExcelGenerator.writeNotes2Excel(UserInfo userInfo)
xiaohongshu
├── src/main
│ ├── java/com.itplh.xhs
│ │ ├── constant
│ │ ├── domain
│ │ ├── excel # generate excel, use easyexcel
│ │ ├── parse # parse json data (parse window.__INITIAL_STATE__)
│ │ ├── ui # build ui, use javafx
│ │ ├── util
│ │ ├── XhsCrawlab # core api
│ │ └── XhsCrawlabUI # ui
│ └── resources
│   ├── desktop # response data of desktop access xiaohongshu
│   ├── mobile # response data of mobile access xiaohongshu
│   └── logback.xml # log config
├── src/test/java # unit test
├── pom.xml
└── README.md
- Java:1.8
- auto-browser-script-engine:1.1.2
- jsoup:1.15.2
- fastjson2:2.0.15
- lombok:1.18.12
- logback-classic:1.2.3
- junit:4.13
mvn clean package -Dmaven.test.skip=true