BDAPRO: Microservice for integrating two big datasets

Description

In real-world scenarios, the same entity is often mentioned in different forms in different data sources. For example, different branches of the same company, or different companies, generate tables describing the same set of entities. Those tables cannot be joined on their ID columns, because the IDs are not generated from the same source; identical IDs in this case do not necessarily mean that the two rows are linked in any way. For example, two tables of company names and information are provided from different sources. Both tables have the following structure (the tables might be stored on HDFS in Parquet files):

Column            Description
id                ID of the company entity or company profile. Careful: the entity IDs are different from the profile IDs, e.g. ID 4 in company_entities may refer to DFKI while ID 4 in company_profiles may refer to Siemens
company_name      Name of the company
alternative_name  Some alternative name that is used to refer to the company
website_url       URL that may link to the company website
foundation_year   Year when the company was founded
city              Name of the city where the company is located
country           Code of the country where the company is located
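
As a minimal sketch, the column structure above could be modelled in Scala as a case class. The type name `Company`, the field types, and the choice of `Option` for every column except the ID and name are assumptions made for illustration; the source does not state which columns may be missing. The example values only reproduce the DFKI/Siemens case from the table to show why equal IDs do not imply a match.

```scala
// Hypothetical record type mirroring the columns listed above.
// Field types and optionality are assumptions, not part of the project spec.
final case class Company(
  id: Long,
  companyName: String,
  alternativeName: Option[String],
  websiteUrl: Option[String],
  foundationYear: Option[Int],
  city: Option[String],
  country: Option[String]
)

object CompanyExample {
  // ID 4 in company_entities and ID 4 in company_profiles (illustrative values).
  val entity: Company  = Company(4L, "DFKI", None, None, None, None, None)
  val profile: Company = Company(4L, "Siemens", None, None, None, None, None)

  // The IDs are equal, yet the rows describe different companies,
  // so the ID column cannot serve as a join key.
  val sameId: Boolean   = entity.id == profile.id
  val sameName: Boolean = entity.companyName == profile.companyName
}
```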

Deliverables

  • Design the system and create an algorithm that matches rows with similar values; you may use Flink or Spark.
  • Create a web service (for example, using the Play Framework) that, given a company profile (row), returns the sorted list of legal rows from the other dataset that the input row may refer to.
  • Create an evaluator to evaluate the work from the previous steps.
  • Analyse the performance of your implementation. The analysis should cover both the performance and accuracy of the matching strategies and the web service's ability to respond quickly to a large number of users.
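
The matching and evaluation deliverables can be illustrated with a small sketch in plain Scala. This is not the project's algorithm: the token-based Jaccard similarity, the 0.8/0.2 weights, the 0.3 threshold, and all names (`MatchSketch`, `Row`, `score`, `rank`, `precisionRecall`) are assumptions chosen for illustration. A Flink or Spark implementation would apply the same kind of scoring to distributed datasets rather than in-memory sequences.

```scala
object MatchSketch {
  // Hypothetical reduced row: only the columns used for scoring here.
  final case class Row(id: Long, companyName: String, city: String)

  // Lowercase, turn every run of characters outside a-z and 0-9 into a space,
  // then split into a set of tokens (non-ASCII letters are dropped).
  def tokens(s: String): Set[String] =
    s.toLowerCase.replaceAll("[^a-z0-9]+", " ").trim.split(" ").filter(_.nonEmpty).toSet

  // Jaccard similarity: size of the intersection over size of the union.
  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size

  // Weighted score: name similarity plus a bonus when the cities agree.
  // The weights are arbitrary illustration values.
  def score(x: Row, y: Row): Double = {
    val name = jaccard(tokens(x.companyName), tokens(y.companyName))
    val city = if (x.city.nonEmpty && x.city.equalsIgnoreCase(y.city)) 1.0 else 0.0
    0.8 * name + 0.2 * city
  }

  // Candidates scoring at or above the threshold, best match first.
  def rank(query: Row, candidates: Seq[Row], threshold: Double = 0.3): Seq[(Row, Double)] =
    candidates.map(c => (c, score(query, c))).filter(_._2 >= threshold).sortBy(-_._2)

  // Precision and recall of predicted (profileId, entityId) pairs
  // against a labelled set of true pairs.
  def precisionRecall(predicted: Set[(Long, Long)], truth: Set[(Long, Long)]): (Double, Double) = {
    val tp = (predicted intersect truth).size.toDouble
    val precision = if (predicted.isEmpty) 0.0 else tp / predicted.size
    val recall    = if (truth.isEmpty) 0.0 else tp / truth.size
    (precision, recall)
  }
}
```

A web service would call something like `MatchSketch.rank(profileRow, entityRows)` for the submitted profile and return the ranked rows; an evaluator would compare the returned pairs with labelled matches via `precisionRecall`.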

Additional Documents:

Team Members

@paguos

@venkat-443

Mentors

@akaitoua