
Go-based gRPC Server

This repository was created for a technical interview task.

Note that I had no prior Go, gRPC, or webpage-parsing experience before this task.

Update

Well, I got rejected. It seems not knowing Go at an expert level is bad for a machine learning position. There are several minor problems in this repository (from my point of view, obviously, things that could be fixed with one keystroke); however, I won't fix them =) Btw, go get is not working, because I am not an expert in Go =)

Introduction

The aim of the task is to write a gRPC server (nope, I won't give too much detail, just google it) that serves a "ParseURL" method to parse a given URL's HTML page.

The ParseURL method takes a "url" as its input parameter and returns the parsed "title", "thumbnail_url", and "content":

  • The "url" can be either a news page or a blog page.
  • "title" is the <title> of the page.
  • "thumbnail_url" is an image URL parsed from the page as a thumbnail image.
  • "content" is all the text content of the page.

This repository contains:

  • A "mock_parser" folder, which contains parser_mock.go file which is generated by using "mockgen", parser_server_test.go file for implemented unit tests, and ./test_urls/ folder which contains the basic html pages that are created for testing.
  • A "parserproto" folder, which contains the parser.proto file for this task as well as its compiled version for GO which is parser.pb.go
  • A server main code parser_server_main.go
  • A client (for tests) main code parser_client_main.go

How to Install

I have not tested it, but the go get github.com/hbahadirsahin/go-grpc-basic-url-parser command should be enough to download this repository directly. Alternatively, you can download the repository manually, or copy-paste all the code into your own workspace.

How to Run

Assuming you have installed the gRPC- and Protocol Buffers-related packages for Go:

  • Open a command window, if you are against using any kind of IDE, and type go run parser_server_main.go (a sketch of the server's flag handling follows this list).
    • You can set the server's port with the -port argument. Example: go run parser_server_main.go -port=12345
  • Open another command window and type go run parser_client_main.go.
    • You can change the server address to connect to with the -address argument and provide the input URL with the -url argument. Example: go run parser_client_main.go -address=localhost:12345 -url=https://www.xyz.com
    • As a note, you need to provide the full address the gRPC server is running at (IP and port).
  • If you are using an IDE, just press the run/build/compile button (whatever you have) for both main files.
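For reference, here is a minimal sketch of how the server's flag handling and startup might look. The service registration is commented out because the generated names are assumptions; the real code lives in parser_server_main.go:

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Default port matches the README: 50051 unless -port is given.
	port := flag.Int("port", 50051, "port for the gRPC server to listen on")
	flag.Parse()

	lis, err := net.Listen("tcp", fmt.Sprintf(":%d", *port))
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	srv := grpc.NewServer()
	// parserproto.RegisterParserServer(srv, &parserServer{}) // generated name, assumed
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}
```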

You do not need to pass any of these arguments just to check whether the code runs: by default, -port is 50051, -address is localhost:50051, and -url is a Medium blog page.
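Similarly, here is a sketch of the client-side flags with those defaults. The Medium URL below is only a placeholder, and the generated client calls are commented out as assumptions:

```go
package main

import (
	"flag"
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Defaults match the README; the Medium URL is only a placeholder here.
	address := flag.String("address", "localhost:50051", "full address (IP:port) of the gRPC server")
	url := flag.String("url", "https://medium.com/", "URL whose page will be parsed")
	flag.Parse()

	conn, err := grpc.Dial(*address, grpc.WithInsecure())
	if err != nil {
		log.Fatalf("could not connect to %s: %v", *address, err)
	}
	defer conn.Close()

	// The generated constructor and method names below are assumptions:
	// client := parserproto.NewParserClient(conn)
	// resp, err := client.ParseURL(context.Background(), &parserproto.ParseURLRequest{Url: *url})
	log.Printf("connected to %s, ready to parse %s", *address, *url)
}
```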

As a note: since compiling parser.proto was a pain for me (on Windows), I pushed the compiled version too, so you do not need to compile it. You do, however, need the gRPC-related libraries for Go installed to run this repository (for how to install gRPC, see https://grpc.io/docs/quickstart/go.html).
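If you do want to regenerate parser.pb.go yourself, the invocation current at the time was along the lines of protoc --go_out=plugins=grpc:. parser.proto, run in the folder containing the .proto file; treat this as a pointer to the quickstart linked above rather than a command tested against this repository.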

Finished Tasks

Update 25.12.2018

  • Extend unit tests.
    • Seven cases were created and tested.
  • More edge-case checks and better error handling.
    • Edge-case problems encountered while testing the parsing methods are now handled properly (with bugfixes and understandable error logging).
  • Write an extended README for installing and using this project.

Update 23.12.2018

  • Learn mocking and write unit tests.
    • A mock for the client was created using "mockgen".
    • A basic test was implemented; it is the first unit test I have written in Go (an illustrative sketch follows this list).
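For illustration, a test against the mockgen-generated client might look like the sketch below. NewMockParserClient and the message types are my assumptions about what mockgen produced from parser.pb.go:

```go
package mock_parser

import (
	"context"
	"testing"

	"github.com/golang/mock/gomock"
)

// Sketch of a unit test using the generated mock client. The mock type and
// message names are assumptions based on parser.proto.
func TestParseURLReturnsTitle(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()

	client := NewMockParserClient(ctrl)
	client.EXPECT().
		ParseURL(gomock.Any(), gomock.Any()).
		Return(&ParseURLResponse{Title: "expected title"}, nil)

	resp, err := client.ParseURL(context.Background(), &ParseURLRequest{Url: "http://example.com"})
	if err != nil || resp.Title != "expected title" {
		t.Fatalf("unexpected result: %+v, %v", resp, err)
	}
}
```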

Update 21.12.2018

  • Client code now takes "-address" and "-url" as input arguments.

Update 20.12.2018

  • Parser function to parse "content" of the page.
    • Two more specific parsers were added. By specific, I mean the added parsing functionality is probably not going to work for many different webpages.
    • A general-purpose content parser was added. It gets the text of <p> tags first, then <ol> tags, and finally <ul> tags, so it does not guarantee the order of the text if the content contains list(s) (see the sketch after this list).
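Here is a minimal sketch of that tag-by-tag approach using golang.org/x/net/html (not the repository's exact code); because each tag type is collected in its own pass, list text can fall out of its original order:

```go
package parser

import (
	"strings"

	"golang.org/x/net/html"
)

// collectText gathers the text of all <p> tags first, then <ol> tags, and
// finally <ul> tags, mirroring the approach described above. Collecting each
// tag type in a separate pass is exactly why list text may end up out of
// its original position in the page.
func collectText(doc *html.Node) string {
	var parts []string
	for _, tag := range []string{"p", "ol", "ul"} {
		var walk func(n *html.Node)
		walk = func(n *html.Node) {
			if n.Type == html.ElementNode && n.Data == tag {
				parts = append(parts, nodeText(n))
				return
			}
			for c := n.FirstChild; c != nil; c = c.NextSibling {
				walk(c)
			}
		}
		walk(doc)
	}
	return strings.Join(parts, "\n")
}

// nodeText concatenates every text node underneath n.
func nodeText(n *html.Node) string {
	if n.Type == html.TextNode {
		return n.Data
	}
	var sb strings.Builder
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		sb.WriteString(nodeText(c))
	}
	return sb.String()
}
```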

Update 19.12.2018

  • Server code now takes "-port" as an input argument.
  • A basic parser function was implemented to parse the text "content" of "Medium Blog" pages. Again, since all pages are different and I can't find a way to parse them all with a unified method, this function will be updated to cover 3-4 different page cases at most (and will return empty content for the rest).

Update 18.12.2018

  • Write and compile the .proto file to define the request and response messages.
  • Parse the input URL and get the title.
    • In case a page does not contain a <title> tag, the code checks every possible title candidate among the <h1>, <h2>, and <h3> tags (see the sketch after this list).
    • If the code cannot find any title candidate, it returns a string saying so (instead of an error, but I will probably change that).
  • Parse the input URL and get the first related image's URL.
    • For 3 possible HTML structures, the code checks the image URLs and their alt attributes; the image with the longest alt attribute value has a high probability of being related to the input page.
    • If the images in a page do not have any alt attribute or do not fit the defined structures, the code returns the first image found in the page.
  • For dumbed-down, personal tests, client code has been written.
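Here is a sketch of the title fallback and the longest-alt thumbnail heuristic described above, again using golang.org/x/net/html; the function names and details are illustrative rather than the repository's exact code:

```go
package parser

import "golang.org/x/net/html"

// findTitle returns the <title> text, falling back to <h1>, then <h2>, then
// <h3>. It naively reads the tag's first child, which is enough for plain
// text headings; the real code may handle nested markup differently.
func findTitle(doc *html.Node) string {
	for _, tag := range []string{"title", "h1", "h2", "h3"} {
		if n := findFirst(doc, tag); n != nil && n.FirstChild != nil {
			return n.FirstChild.Data
		}
	}
	// The README notes a descriptive string is returned instead of an error.
	return "no title candidate found"
}

// findThumbnail picks the <img> with the longest alt attribute, on the
// assumption that a long alt text is most likely related to the page.
// Starting bestLen at -1 means the first image wins when no alt exists.
func findThumbnail(doc *html.Node) string {
	var bestSrc string
	bestLen := -1
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "img" {
			var src, alt string
			for _, a := range n.Attr {
				switch a.Key {
				case "src":
					src = a.Val
				case "alt":
					alt = a.Val
				}
			}
			if len(alt) > bestLen {
				bestLen, bestSrc = len(alt), src
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return bestSrc
}

// findFirst returns the first node with the given tag, depth-first.
func findFirst(n *html.Node, tag string) *html.Node {
	if n.Type == html.ElementNode && n.Data == tag {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if found := findFirst(c, tag); found != nil {
			return found
		}
	}
	return nil
}
```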