This repository was created for a technical interview task.
Note that I had no prior Go, gRPC, or webpage-parsing experience before this.
Well, I got rejected. Apparently, not knowing Go at an expert level is bad for a machine learning position. There are several minor problems in this repository (from my point of view, obviously: things that can be fixed with one keystroke); however, I won't fix them =) By the way, `go get` is not working, because I am not an expert in Go =)
The aim of the task is to write a gRPC server (nope, I won't give too much detail, just google it) which will serve a `ParseURL` method to parse a given URL's HTML page.
The `ParseURL` method takes a `url` as an input parameter and returns the parsed `title`, `thumbnail_url` and `content`.
- The `url` can be either a news page or a blog page.
- `title` is the `<title>` of the page.
- `thumbnail_url` is an image URL which is parsed from the page as a thumbnail image.
- `content` is all the text content of the page.
This repository contains:
- A `mock_parser` folder, which contains the `parser_mock.go` file generated using `mockgen`, the `parser_server_test.go` file with the implemented unit tests, and the `./test_urls/` folder holding the basic HTML pages created for testing.
- A `parserproto` folder, which contains the `parser.proto` file for this task as well as its compiled version for Go, `parser.pb.go`.
- The server main code, `parser_server_main.go`.
- The client (for tests) main code, `parser_client_main.go`.
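For reference, a minimal sketch of what `parser.proto` might look like, given the method and fields described above (the service and message names besides `ParseURL`, `url`, `title`, `thumbnail_url` and `content` are assumptions, since the actual file is not reproduced here):

```protobuf
syntax = "proto3";

package parserproto;

// Service name is an assumption; the README only names the ParseURL method.
service Parser {
  rpc ParseURL (ParseURLRequest) returns (ParseURLResponse);
}

message ParseURLRequest {
  string url = 1;
}

message ParseURLResponse {
  string title = 1;
  string thumbnail_url = 2;
  string content = 3;
}
```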
I did not test it, but the `go get github.com/hbahadirsahin/go-grpc-basic-url-parser` command should be enough for you to download this repository directly. Alternatively, you can download the repository manually, or copy-paste all the code into your own workspace.
Assuming you have installed the gRPC and Protocol Buffers related packages for Go:
- Open a command window (if you are against using any kind of IDE) and type `go run parser_server_main.go`.
  - You can set the server's port with the `-port` argument. Example: `go run parser_server_main.go -port=12345`
- Open another command window and type `go run parser_client_main.go`.
  - You can change the server address to connect to with the `-address` argument and provide the input URL with the `-url` argument. Example: `go run parser_client_main.go -address=localhost:12345 -url=https://www.xyz.com`
  - As a note, you need to provide the full address the gRPC server is running on (with IP and port).
- If you are using an IDE, just press the run/build/compile (whatever) button you have for both main files.
You do not need to pass these arguments just to check whether the code runs: by default, the `-port` argument is 50051, the `-address` argument is `localhost:50051`, and the `-url` argument is a Medium blog page.
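A minimal sketch of how these defaults might be wired up with Go's `flag` package (the variable names and the sample Medium URL are assumptions; the repository's actual code may differ):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// serverFlags bundles the three arguments the README describes.
type serverFlags struct {
	port    int
	address string
	url     string
}

// parseFlags reads the arguments with the defaults stated above.
// Using a dedicated FlagSet keeps this testable without touching os.Args.
// The sample Medium URL is a placeholder, not the repo's actual default.
func parseFlags(args []string) (serverFlags, error) {
	fs := flag.NewFlagSet("parser", flag.ContinueOnError)
	port := fs.Int("port", 50051, "port the gRPC server listens on")
	address := fs.String("address", "localhost:50051", "full address (IP and port) of the gRPC server")
	url := fs.String("url", "https://medium.com/some-post", "URL to parse")
	if err := fs.Parse(args); err != nil {
		return serverFlags{}, err
	}
	return serverFlags{port: *port, address: *address, url: *url}, nil
}

func main() {
	cfg, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("port=%d address=%s url=%s\n", cfg.port, cfg.address, cfg.url)
}
```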
As a note: since compiling `parser.proto` was a pain for me (on Windows), I pushed the compiled version too. So you do not need to compile it, but you do need the gRPC-related libraries for Go installed to run this repository (for how to install gRPC: https://grpc.io/docs/quickstart/go.html).
- Extend unit tests.
  - 7 cases are created and tested.
- More edge-case checks and better error handling.
  - Problems with edge cases encountered while testing the parsing methods are now handled properly (with bugfixes and understandable error logging).
- Write an extended README for installing and using this project.
- Learn mocking and write unit tests.
  - A mock for the client is created using `mockgen`.
  - A basic test is implemented; it is the first unit test I've written in Go.
- Client code will take `-address` and `-url` as input arguments/parameters.
- Parser function to parse the `content` of the page.
  - 2 more specific parsers are added. By specific, I mean the added parsing functionality is probably not going to work for many different webpages.
  - A general-purpose content parser is added. It gets the values of `<p>` tags first, then `<ol>` tags and finally `<ul>` tags. This method does not guarantee the order of the text if the content contains list(s).
- Server code will take `-port` as an input argument/parameter.
  - A basic parser function is implemented to parse the text `content` of Medium blog pages. Again, since all pages are different and I can't find a way to parse them all with a unified method, this function will cover 3-4 different page cases at most (and will return empty content for the rest).
- Write and compile the `.proto` file to define the request and response messages.
- Parse the input URL and get the title.
  - In case a page does not contain a `<title>` tag, the code will check every possible title candidate among the `<h1>`, `<h2>` and `<h3>` tags.
  - If the code cannot find any title candidate, it will return a string saying so (instead of an error, but I will probably change that).
- Parse the input URL and get the first related image's URL.
  - For 3 possible HTML structures, the code checks the image URLs and their `alt` attributes. The image with the longest `alt` attribute value has a high probability of being related to the input page.
  - If the images in a page do not have any `alt` attribute, or do not fit the defined structures, the first image found in the page is returned.
- For dumbed-down, personal tests, a client code has been written.
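The three parsing heuristics described in the list above (the `<title>`/`<h1>`/`<h2>`/`<h3>` title fallback, the `<p>`/`<ol>`/`<ul>` content pass, and the longest-`alt` thumbnail pick) can be sketched roughly as follows. This is a naive regexp-based stand-in written under my own assumptions, not the repository's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagText returns the inner text of every <tag>…</tag> occurrence,
// with nested tags stripped out.
func tagText(page, tag string) []string {
	re := regexp.MustCompile(`(?is)<` + tag + `[^>]*>(.*?)</` + tag + `>`)
	strip := regexp.MustCompile(`(?s)<[^>]*>`)
	var out []string
	for _, m := range re.FindAllStringSubmatch(page, -1) {
		if text := strings.TrimSpace(strip.ReplaceAllString(m[1], " ")); text != "" {
			out = append(out, text)
		}
	}
	return out
}

// extractTitle follows the fallback order described above: <title>
// first, then <h1>, <h2>, <h3>; if nothing matches, it returns an
// explanatory string rather than an error, as the README notes.
func extractTitle(page string) string {
	for _, tag := range []string{"title", "h1", "h2", "h3"} {
		if texts := tagText(page, tag); len(texts) > 0 {
			return texts[0]
		}
	}
	return "no title candidate found"
}

// parseContent mirrors the general-purpose content parser: all <p>
// text first, then <ol>, then <ul> — which is why ordering is not
// preserved when paragraphs and lists interleave.
func parseContent(page string) string {
	var parts []string
	for _, tag := range []string{"p", "ol", "ul"} {
		parts = append(parts, tagText(page, tag)...)
	}
	return strings.Join(parts, "\n")
}

// pickThumbnail prefers the <img> with the longest alt attribute (the
// relevance heuristic above) and falls back to the first image on the
// page when no alt attribute is present at all.
func pickThumbnail(page string) string {
	imgRe := regexp.MustCompile(`(?is)<img[^>]*>`)
	srcRe := regexp.MustCompile(`(?i)src\s*=\s*"([^"]*)"`)
	altRe := regexp.MustCompile(`(?i)alt\s*=\s*"([^"]*)"`)
	first, best := "", ""
	bestAltLen := 0
	for _, img := range imgRe.FindAllString(page, -1) {
		src := srcRe.FindStringSubmatch(img)
		if src == nil {
			continue
		}
		if first == "" {
			first = src[1]
		}
		if alt := altRe.FindStringSubmatch(img); alt != nil && len(alt[1]) > bestAltLen {
			best, bestAltLen = src[1], len(alt[1])
		}
	}
	if best != "" {
		return best
	}
	return first
}

func main() {
	page := `<html><body><h1>A Heading</h1><p>Hello.</p>` +
		`<img src="a.png"><img src="b.png" alt="long caption"></body></html>`
	fmt.Println(extractTitle(page))  // no <title>, so the <h1> wins
	fmt.Println(parseContent(page))
	fmt.Println(pickThumbnail(page)) // "b.png": longest alt attribute
}
```

A real implementation would use a proper HTML parser rather than regexps, but the fallback ordering and the longest-`alt` heuristic carry over unchanged.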