ce_interview_project: A Python repository from toddas90

CE Interview Project Submission

This is my submission for the CE Intern position.

I would like to preface this by saying that I spent more time researching the problem than actually coding. I had never been given a problem like this before, so I wasn't even sure what this type of problem was called.

I went with a count-based approach because I thought that trying to account for meaning would result in some false positives. The "Bag of Words" approach seemed nice and easy, but it looked like it had a few shortcomings.
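To make the count-based idea concrete, here is a minimal sketch of a bag of words (an illustration only, not the code in this repository; the `bag_of_words` helper is a hypothetical name). It shows one commonly cited shortcoming: word order is thrown away, so two sentences with opposite meanings can look identical.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word appears."""
    return Counter(text.lower().split())


first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

# Same words, same counts: the two bags are indistinguishable.
print(first == second)  # True
```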

I decided not to go with the IDF approach because I thought that it would struggle with short documents. I wanted this example to work on more than one document.

I went with a BM25 formula because it seemed like the best way to tackle a problem like this. From my limited research, it seems that search engines use this formula to rank documents. This article was really helpful in deciding which way to go.

Specifically, I decided to use the BM25L formula to rank the lines because, unlike the standard BM25 formula, it addresses the problem of document length bias (standard BM25 tends to over-penalize very long documents).
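For reference, here is a minimal sketch of BM25L scoring as described by Lv and Zhai. It is not necessarily how this repository implements it: the function name `bm25l_scores`, the whitespace tokenization, and the parameter defaults (`k1=1.5`, `b=0.75`, `delta=0.5`) are all assumptions made for illustration. Each line is treated as its own document.

```python
import math
from collections import Counter


def bm25l_scores(query, lines, k1=1.5, b=0.75, delta=0.5):
    """Return one BM25L score per line for the given query string."""
    docs = [line.lower().split() for line in lines]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    if avg_len == 0:
        return [0.0] * n_docs

    # Document frequency: how many lines contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # terms absent from the line contribute nothing
            idf = math.log((n_docs + 1) / (df[term] + 0.5))
            # Term frequency normalized by relative line length.
            c = tf[term] / (1 - b + b * len(d) / avg_len)
            # Shifting c by delta keeps long lines from being over-penalized.
            score += idf * (k1 + 1) * (c + delta) / (k1 + c + delta)
        scores.append(score)
    return scores
```

Calling `bm25l_scores("cat", ["the cat sat on the mat", "dogs chase cats"])` gives a positive score for the first line and zero for the second, since this sketch matches exact tokens only and does no stemming.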