/LinkedinRDF

Data scraping on the LinkedIn platform to populate a structured RDF graph, starting from an ontology that summarizes the functional scheme of the application.


Linkedin Web Scraping: creating an ontology for Semantic Web

Authors: Lorenzo Mandelli

Università degli Studi di Firenze


The project extracts data from LinkedIn through a data scraping phase and then uses that data to populate a structured RDF graph, starting from an ontology that summarizes the functional scheme of the application.

To achieve this, the extracted data are converted into a structure compatible with an INSERT query expressed in SPARQL and are then inserted into the RDF model using the TopBraid Composer application, which can execute such queries.
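The conversion step described above can be sketched as follows. This is only an illustrative sketch, not the project's actual code: the function names, the `ln:` prefix, the `http://example.org/linkedin#` namespace, the `Person` class, and the property names are all assumptions made for the example, and the real ontology may use different terms.

```python
def escape_literal(value: str) -> str:
    """Escape a string so it can sit inside a double-quoted SPARQL literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def build_insert_query(profile_id: str, fields: dict,
                       prefix_iri: str = "http://example.org/linkedin#") -> str:
    """Build a SPARQL INSERT DATA query for one scraped profile.

    profile_id is assumed to be a valid local name (letters, digits, "_").
    The namespace, class, and property names are hypothetical.
    """
    # One triple typing the resource, then one triple per scraped field.
    triples = [f"  ln:{profile_id} a ln:Person ."]
    for prop, value in fields.items():
        triples.append(f'  ln:{profile_id} ln:{prop} "{escape_literal(value)}" .')
    body = "\n".join(triples)
    return f"PREFIX ln: <{prefix_iri}>\nINSERT DATA {{\n{body}\n}}"


if __name__ == "__main__":
    print(build_insert_query("person_1", {"name": "Mario Rossi", "city": "Firenze"}))
```

The resulting text can be pasted into any SPARQL editor that supports update queries. Escaping the literal values matters because scraped text often contains quotes or line breaks that would otherwise break the query.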

This project is strictly academic, so the resulting RDF graph, also available at this Link as an .OWL file, is small. Even so, it can be queried with SPARQL for statistical analysis.

Ontology Schema


Installation

This code was written in Python 3.8. The Selenium library is required for the data scraping phase and can be installed with pip install selenium.

Usage

To run the program, execute the file main.py. The Google Chrome browser window used for the web scraping phase must not be minimized at any point while the application is running.

To log in to LinkedIn and view profiles, the following account has been made available: email = bryansevendeadlysins@outlook.com, password = Arthur123.

If the LinkedIn account is accessed too many times, you may have to solve a CAPTCHA before the application can start correctly.

Experiments

The graph is generated from a set of LinkedIn profiles returned by a Google search for a specific category of worker, which the software takes as input. To test the software, the analysis started from the first 10 resulting profiles of Python programmers living near Florence (Google search: "site:linkedin.com/in/" AND "python developer" AND "Firenze").
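As a rough illustration of how the search string above could be assembled from a worker category and a location, here is a minimal sketch. The function names and parameters are hypothetical and are not taken from the project's code; only the query format and the standard Google search URL with its q parameter are reused.

```python
from urllib.parse import urlencode


def build_search_query(role: str, location: str,
                       site: str = "linkedin.com/in/") -> str:
    """Compose the Google query in the same format used for the experiment."""
    return f'"site:{site}" AND "{role}" AND "{location}"'


def build_search_url(query: str) -> str:
    """Turn the query into a Google search URL (the query is percent-encoded)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    query = build_search_query("python developer", "Firenze")
    print(query)
    print(build_search_url(query))
```

Keeping the query builder separate from the URL builder makes it easy to change the worker category or the location without touching the encoding logic.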

Limitations

Currently the software handles a substantial but limited number of the entities present in LinkedIn. Support for LinkedIn Groups, Posts, Videos and Jobs will be implemented in future updates. The outline of the complete ontology is already available at this Link.

Linkedin example