Tumblr Personal Post Scraper
Scrape user-uploaded content from a Tumblr blog. Tumblr's website offers no built-in way to view only a blog's user-uploaded content.
Example scrape of https://support.tumblr.com:
New In V1.2.0
- Request Throttling
- Single file executables including installers (DMG, NSIS)
- Faster UI (React PureComponent and Redux Integration)
- New Icon
- Smaller JavaScript bundle
Download V1.2
Contribution
Clone the repo and install dependencies:
git clone https://github.com/lluisrojass/tumblr-scraper.git
cd tumblr-scraper
npm install
Run npm run watch to execute a development watchify script, which monitors files and rebuilds the bundle file when it notices a change. The bundle file is not present after cloning and must be generated either way. The other way to generate the bundle file is npm run min, which outputs a minified, production-ready bundle. While in development, use the npm run simulate command to launch the app with development add-ons (Chrome DevTools and electron-reload), which are useful for logging and debugging.
Tools/Libraries to be aware of:
- Electron Framework
- Browserify
- Babel
- Redux
- React
- htmlparser2
- Babel transform-class-properties plugin
- Babel ES2015 preset
- Babel React preset
What is Request Throttling?
When scraping blogs that publish original posts frequently and densely, the application could become unresponsive or place a significant burden on the CPU. Throttling was introduced to reduce this risk. When it is turned on (the default), the application keeps track of the pending image loads it has yet to fulfill and may temporarily delay the next iteration of the request loop. This leaves breathing time between page requests, which prevents a potentially overwhelming rush of front-end work. While all other application state (blog name, post types) has to be set before a scrape can begin, throttling can be turned on or off at any time.
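The idea can be illustrated with a minimal sketch. This is not the application's actual code: the class name ImageLoadThrottle, the maxPending threshold, the polling interval, and the fetchPage callback are all assumptions made for illustration.

```javascript
// Hypothetical sketch: none of these names come from the project's source.
class ImageLoadThrottle {
  constructor({ maxPending = 10, enabled = true } = {}) {
    this.maxPending = maxPending; // assumed threshold, not the app's real value
    this.enabled = enabled;       // throttling is on by default
    this.pending = 0;             // image loads requested but not yet finished
  }

  // Call when the UI starts loading an image.
  imageRequested() { this.pending += 1; }

  // Call when an image finishes loading (or fails).
  imageSettled() { this.pending = Math.max(0, this.pending - 1); }

  // Can be flipped at any time, even in the middle of a scrape.
  setEnabled(enabled) { this.enabled = enabled; }

  // True when the request loop should pause before fetching the next page.
  shouldDelay() { return this.enabled && this.pending >= this.maxPending; }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Request loop: fetch pages one by one, pausing while too many images are pending.
// `fetchPage` is a caller-supplied async function (an assumption of this sketch)
// that resolves to the number of images on a page, or null when no pages remain.
async function scrape(fetchPage, throttle, { pollMs = 50 } = {}) {
  let pages = 0;
  for (let page = 0; ; page += 1) {
    while (throttle.shouldDelay()) {
      await sleep(pollMs); // breathing time for the front end
    }
    const images = await fetchPage(page);
    if (images === null) break;
    for (let i = 0; i < images; i += 1) throttle.imageRequested();
    pages += 1;
  }
  return pages;
}
```

In this sketch the UI would call imageSettled() from each image's load or error handler, so the loop resumes once the backlog drops below the threshold. Because shouldDelay() reads the enabled flag on every check, toggling throttling takes effect immediately.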
Like what you see? Consider favoriting or following the project :)
License
MIT