
A Python PDF scraper for Nature.com.


Nature.com Web Scraper

This is a web scraper built in Python that scrapes content from Nature.com. The user specifies the number of pages to parse and the category of content to scrape. The content is then downloaded as PDF files and saved in a new directory on the local machine. Familiarity with the categories on the website is required. The scraper can also be modified to accommodate other websites.
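As an illustration of the "saved in a new directory" step, here is a minimal sketch of how downloaded PDF bytes could be written to disk. The helper names `safe_filename` and `save_pdf` are hypothetical and are not taken from scraper.py; the actual script may name and organize its files differently.

```python
import re
from pathlib import Path


def safe_filename(title: str, max_length: int = 100) -> str:
    """Turn an article title into a file-system-friendly PDF file name (hypothetical helper)."""
    # Collapse every run of characters other than letters, digits, '.', '_' and '-'
    # into a single underscore, then trim underscores from both ends.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")
    # Fall back to a placeholder name if nothing usable is left.
    return (cleaned[:max_length] or "untitled") + ".pdf"


def save_pdf(content: bytes, title: str, out_dir: Path) -> Path:
    """Write PDF bytes into out_dir, creating the directory if needed (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / safe_filename(title)
    target.write_bytes(content)
    return target
```

Sanitizing the title before using it as a file name avoids failures on characters such as "/" or "?" that are common in article titles but not allowed in file names on some systems.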

Installation

1) Clone the repository to your local machine: git clone git@github.com:P0L3/PDFnature.git

2) Open the terminal and navigate to the project directory: cd path/to/PDFnature

3) Install the required Python packages: pip install -r requirements.txt

4) Run the scraper.py file: python scraper.py <number of pages> <category>
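The run command in the last step takes two positional arguments. The sketch below shows one way such a command line could be parsed and validated with the standard argparse module. It is an assumption for illustration only: the function names `positive_int` and `parse_args` are hypothetical, and the real scraper.py may read its arguments differently.

```python
import argparse


def positive_int(value: str) -> int:
    """Convert a command-line string to an int and reject values below 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number < 1:
        raise argparse.ArgumentTypeError("number of pages must be at least 1")
    return number


def parse_args(argv=None):
    """Parse '<number of pages> <category>' (hypothetical layout mirroring the README)."""
    parser = argparse.ArgumentParser(description="Download PDFs from a Nature.com category.")
    parser.add_argument("pages", type=positive_int, help="number of listing pages to parse")
    parser.add_argument("category", help="category slug, e.g. bdjinpractice")
    return parser.parse_args(argv)


# Equivalent to running: python scraper.py 3 bdjinpractice
args = parse_args(["3", "bdjinpractice"])
print(args.pages, args.category)
```

Validating the page count up front means a typo such as "three" or "0" produces a usage message instead of an error partway through a download.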

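Note: The available categories are listed on Nature's site index page (https://www.nature.com/siteindex). The category argument is the path segment of the journal's link. For example, "BDJ In Practice" has the link https://www.nature.com/bdjinpractice/, so its category is bdjinpractice.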
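The mapping from a journal link to its category can be sketched in a few lines. The function name `category_from_url` is hypothetical and is not part of scraper.py; it simply takes the first path segment of the link, as in the example above.

```python
from urllib.parse import urlparse


def category_from_url(url: str) -> str:
    """Return the first path segment of a journal link (hypothetical helper)."""
    path = urlparse(url).path
    # Drop empty pieces produced by leading and trailing slashes.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"no category segment in URL: {url!r}")
    return segments[0]


print(category_from_url("https://www.nature.com/bdjinpractice/"))  # prints: bdjinpractice
```

The printed value can then be passed as the category argument: python scraper.py 3 bdjinpractice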

Contributing

If you want to contribute to this project, feel free to fork the repository and submit a pull request.

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.

Legal Considerations

Please note that web scraping can raise legal and ethical issues, such as copyright infringement and website terms of use violations. It is the user's responsibility to ensure that their use of this software is in compliance with applicable laws and ethical standards. The authors of this software are not responsible for any misuse or legal consequences arising from the use of this software.

Acknowledgements

This repository is based on the https://github.com/kgotsosm/nature-web-scraper.git repository.