BDAPRO: Microservice to provide integration between two big datasets

Description

In real-world scenarios, the same entity is often mentioned in different forms in different data sources. For example, different branches of the same company, or different companies, may generate tables that describe the same set of entities. Those tables cannot be joined on their ID columns, because the IDs are not generated by the same source: identical IDs in two tables do not necessarily mean that the two rows are linked in any way. For example, two tables of company names and information are provided by different sources. Both tables have the following structure (the tables might be stored on HDFS as Parquet files):

Column	Description
id	ID of the company entity or company profile. Careful: the entity IDs are different from the profile IDs, e.g. ID 4 in company_entities may refer to DFKI while ID 4 in company_profiles may refer to Siemens.
company_name	name of the company
alternative_name	some alternative name that is used to refer to the company
website_url	URL that may link to the company website
foundation_year	year when the company was founded
city	name of the city where the company is located
country	code of the country where the company is located

Deliverables

Design the system and create an algorithm that matches rows with similar values. You might use Flink or Spark.
Create a web service (for example, in the Play Framework) that, given a company profile (row), returns the sorted list of legal rows from the other dataset that the input row may refer to.
Create an evaluator to evaluate your previous work.
Analyse the performance of your implementation. The analysis should cover both the performance and accuracy of the matching strategies and the performance of the web service in providing fast responses to a large number of users.
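
The ranking and evaluation deliverables can be sketched in a few lines. The example below is a minimal, single-machine illustration rather than the project's implementation: `rank_candidates` and `evaluate` are hypothetical names, the scoring function is passed in by the caller, and the evaluator assumes a labelled ground truth that maps each profile ID to exactly one correct entity ID. A real system would run the scoring on Flink or Spark and restrict the candidate set before ranking.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def rank_candidates(
    query: Row,
    candidates: Sequence[Tuple[Hashable, Row]],
    score: Callable[[Row, Row], float],
    top_k: int = 10,
    min_score: float = 0.0,
) -> List[Tuple[Hashable, float]]:
    """Return up to top_k (candidate_id, score) pairs, best match first."""
    scored = [(cid, score(query, row)) for cid, row in candidates]
    kept = [(cid, s) for cid, s in scored if s >= min_score]
    # Descending by score; the sort is stable, so ties keep their input order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


def evaluate(
    rankings: Dict[Hashable, List[Hashable]],
    truth: Dict[Hashable, Hashable],
    k: int = 5,
) -> Dict[str, float]:
    """Precision@1, recall@k and mean reciprocal rank over the labelled queries.

    rankings maps a query id to its ranked candidate ids;
    truth maps a query id to the single correct candidate id (an assumption).
    """
    n = len(truth)
    if n == 0:
        return {"precision_at_1": 0.0, "recall_at_k": 0.0, "mrr": 0.0}
    hits_at_1 = 0
    hits_at_k = 0
    reciprocal_sum = 0.0
    for query_id, expected in truth.items():
        ranked = rankings.get(query_id, [])
        if expected in ranked:
            position = ranked.index(expected) + 1  # 1-based rank
            reciprocal_sum += 1.0 / position
            hits_at_1 += int(position == 1)
            hits_at_k += int(position <= k)
    return {
        "precision_at_1": hits_at_1 / n,
        "recall_at_k": hits_at_k / n,
        "mrr": reciprocal_sum / n,
    }
```

The web service would return the output of something like `rank_candidates` for a single input row, while the evaluator would be run offline over all labelled rows to compare matching strategies.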

Additional Documents

Scala and Play Framework features: https://docs.google.com/presentation/d/1Dhj72Vrp7_-HL9taLZpA3UNAxryW5cgdIHGkF-rNu10/edit?usp=sharing
Databases Comparison: https://docs.google.com/document/d/1QpvsyGh_-HDVRLuwref6imqEApn0pZ8_UcRjxZcLCRs/edit?usp=sharing

Team Members

@paguos

@venkat-443

Mentors