/BioPersianWikiAnalyzer

Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor

Primary LanguageJupyter NotebookMIT LicenseMIT

BioPersianWikiAnalyzer

The Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor is a Python project designed to systematically extract and preprocess information from the bioinformatics section of the Persian Wikipedia. The goal of this project is to create a structured dataset for further analysis and research in the field of bioinformatics within the Persian-speaking community.

Project Components:

  1. Web Crawler: The web crawler is responsible for navigating the Persian Wikipedia bioinformatics page, systematically retrieving the relevant textual content. Built using Python's web scraping libraries, such as BeautifulSoup and requests, the crawler ensures the extraction of comprehensive and accurate information.

  2. Custom Preprocessing Functions: To enhance the quality of the extracted text, custom preprocessing functions have been developed. These functions include tasks such as text cleaning, removal of irrelevant characters, and normalization. Additionally, the preprocessing pipeline addresses language-specific challenges present in Persian text, ensuring the accuracy of subsequent analyses.

  3. Information Extraction: After preprocessing, the project extracts essential general information from the text. This includes statistics such as the total number of words, sentences, and paragraphs. The extracted information serves as a foundation for understanding the scope and content of the bioinformatics page.

  4. Part-of-Speech (POS) Tagging: Leveraging Natural Language Processing (NLP) techniques, the project utilizes part-of-speech tagging to categorize words in the text based on their grammatical roles. This information provides insights into the linguistic structure of the bioinformatics content, aiding in further semantic analysis.

Key Features:

  • Scalability: The crawler is designed to handle various page structures within the bioinformatics section, ensuring scalability as Wikipedia pages evolve.

  • Custom Preprocessing: The preprocessing functions are tailored to the unique characteristics of Persian text, ensuring accurate and meaningful analysis.

  • Statistical Analysis: The project provides basic statistical insights into the text, allowing researchers to quickly grasp the content's scope.

  • Part-of-Speech Tagging: Linguistic analysis through POS tagging adds an additional layer of information, facilitating more in-depth exploration of the bioinformatics content.

Applications:

This project serves as a valuable resource for researchers, educators, and enthusiasts interested in exploring and understanding the Persian-language content related to bioinformatics on Wikipedia. The structured dataset and linguistic analysis open avenues for further research and contribute to the growing body of knowledge in bioinformatics within the Persian-speaking community.