aimacode/aima-javascript

22-Natural Language Processing

prakamya-mishra opened this issue · 36 comments

I would like to work on Ch 22, Natural Language Processing.
Should we discuss what to implement in this section, and how?

@redblobgames and @Rishav159, I would like your points of view.

Great! I'll mark chapter 22 for you in issue #27.

I actually don't know much about this topic (I didn't study AI from this textbook). There's a diagram and a table in the book that could be good starting points, but I don't have a lot of good ideas about what visualizations would be useful.

OK, thank you @redblobgames. I will update you on which visualisations will be useful for this chapter.

Ch - 22 Natural Language Processing:

  • Language Models

  • Text Classification

  • Information Retrieval

  • Information Extraction

I will start with Language Models

Great! What kind of visualization do you think would be useful for language models?

I am thinking about it and will let you know when I get a good idea.

If you don't find any ideas, take a look at the Python AIMA project (aimacode/aima-python). They have a Python Notebook that has an English/German language recognizer. The notebook (nlp_apps.ipynb) has only four examples, but if you edit the notebook you can try other text. It's a neat little example that would be fun to make interactive, maybe even updating every time you type a letter.

For aima-javascript, though, we'd also want visualizations showing how the algorithm actually works. Maybe something that shows all the bigrams for English and German and the input text. Trigrams are too much to show, but bigrams are only 26 letters x 26 letters, which is not too big. This could be a "heat map" matrix like this.
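A minimal sketch of counting those letter bigrams in JavaScript (my own illustration, not code from the repository) that could feed such a heat map:

```javascript
// Count letter bigrams in a text. Assumes we only care about a-z;
// everything else is treated as a separator and skipped.
function bigramCounts(text) {
  const counts = {};  // e.g. counts["th"] = 42
  const clean = text.toLowerCase().replace(/[^a-z]/g, ' ');
  for (let i = 0; i < clean.length - 1; i++) {
    const pair = clean.slice(i, i + 2);
    if (/^[a-z]{2}$/.test(pair)) {
      counts[pair] = (counts[pair] || 0) + 1;
    }
  }
  return counts;
}
```

The 26×26 matrix for the heat map is then just `counts[a + b]` for every pair of letters `a`, `b`.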

These visualizations are experimental. We don't know which ones will work, so sometimes you have to try several different ones and then throw some of them away. Sometimes you have to throw them all away because nothing you tried worked, and then you try some other part of the chapter :(

They also have a notebook for chapter 22, text.ipynb, that might give you some ideas.

@redblobgames, thank you for these suggestions. Sorry, I am not able to work on this right now because my exams are coming up, but I will start working on it as soon as they are over.

@prakamya-mishra No worries! There is no hurry here.

@redblobgames for N-gram character models, can I make a section like this link?

I have added a character-level unigram model. @redblobgames, if you like it, I will also implement character-level bigram and trigram models, and word-level unigram, bigram, and trigram models.

It looks like this:
[screenshot]

Sounds good! For the visuals you might try something like <kbd>H</kbd> <kbd>i</kbd> <kbd> </kbd> <kbd>t</kbd> <kbd>h</kbd> instead of using the | character.

@redblobgames, thanks for the input; I will do that.
I will use this, and I will try to add character-level bigram and trigram models, plus word-level unigram, bigram, and trigram models, and then open a PR.
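For reference, extracting character-level and word-level n-grams can be sketched like this (hypothetical helpers for illustration, not the actual PR code):

```javascript
// All character n-grams of a string (sliding window of length n).
function charNgrams(text, n) {
  const grams = [];
  for (let i = 0; i + n <= text.length; i++) {
    grams.push(text.slice(i, i + n));
  }
  return grams;
}

// All word n-grams, splitting on whitespace.
function wordNgrams(text, n) {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const grams = [];
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(' '));
  }
  return grams;
}
```

With n = 1, 2, 3 these cover the unigram, bigram, and trigram cases at both levels.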

@redblobgames I don't think this looks good. What do you think?
[screenshot]

What about this?
[screenshot]

@prakamya-mishra Yes, the black background doesn't look good, but I think it's not the same as the code I suggested:

[screenshot]

I agree, the comma looks better than the vertical bar.

There's probably an even nicer style somewhere in between but that can be figured out later

[screenshot]

@redblobgames do you have suggestions for text classification?

@prakamya-mishra Sure, see this comment: the English/German language recognizer might be a fun interactive visualization. The book suggests email spam/ham classification, and that might also be fun.

If the data set is large, you might run the training offline and generate a JSON file, then load that JSON into the browser visualization.

@redblobgames great idea, thanks for the input. I will try to do this.

@redblobgames can you please guide me more on how to make a web-based language classifier? Can you provide some demo links and resources?

@prakamya-mishra I've not done a language classifier myself but I think the python and pseudocode versions of aima will have some code to look at.

It's unclear whether a web version is even feasible. That's part of the difficulty of aima-javascript: we don't know what will work, so people try things out and sometimes have to throw things away if they don't work. I'd suggest first trying a non-web JavaScript version. How much data is involved? How much computation is involved? A web version can work if the data is reasonably small (maybe a few megabytes) and the computation is fairly fast, or can be split into smaller pieces. For a classifier, it might be feasible to split the task up: training done offline (results saved to a file), and classification in the browser (loading that file and then running the classifier). However, I don't know whether it is actually feasible; it's something that would have to be tried out.
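To make the offline/online split concrete, here is a rough sketch (file and function names are hypothetical, and the model format is my own assumption): the offline step saves per-language character-bigram counts as JSON, and the browser step scores input text with a naive Bayes rule in log space, with add-one smoothing over the 26×26 possible bigrams.

```javascript
// Offline step (Node), hypothetical:
//   fs.writeFileSync('model.json',
//     JSON.stringify({ English: englishCounts, German: germanCounts }));
// Browser step: fetch model.json, then classify the typed text.
function classify(model, text) {
  const scores = {};
  for (const lang of Object.keys(model)) {
    const counts = model[lang];
    const total = Object.values(counts).reduce((a, b) => a + b, 0);
    let logProb = 0;
    for (let i = 0; i < text.length - 1; i++) {
      const bigram = text.slice(i, i + 2);
      // Add-one smoothing so unseen bigrams don't give log(0).
      logProb += Math.log(((counts[bigram] || 0) + 1) / (total + 26 * 26));
    }
    scores[lang] = logProb;
  }
  // Pick the language with the highest log-probability.
  return Object.keys(scores).reduce((a, b) => (scores[a] >= scores[b] ? a : b));
}
```

Because the heavy work (counting bigrams over a large corpus) happens offline, the browser only does a few hundred additions per keystroke, which should be fast enough to update as the user types.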

@redblobgames I have made this for the text classification part. Please give your input 😄
[GIF demo]

@prakamya-mishra that'd be a cool demo. The main goal is to show readers of the textbook what the main concepts are and how the algorithms work. What are the main ideas and algorithms that would be needed to implement that demo? How can we show those to the reader so that they understand how the classification works?

@redblobgames OK, got it; I will work on it. I am opening a pull request for this now, and then I will add the visualisation part as well.

@redblobgames now I will start working on IR scoring functions. Do you have any suggestions for that? Alongside that, I am also thinking of an innovative idea for text classification.

@prakamya-mishra The book describes the BM25 scoring function, but right now I can't think of any visualizations that would make it easier to understand. Calculating the score is easy but not very useful; I think seeing the score alone will not help the reader understand how the function works. I don't have any ideas right now :( @OmarShehata do you have any ideas for visualizations of how IR scoring works?

OK @redblobgames, I will think of something and get back to you.

Ok so having read the chapter, I like the n-gram visualization a lot @prakamya-mishra. It might be fun to add word-level n-gram models as well, generating random text from a particular source (like a website) as they do in the book; that makes it really obvious that trigram models characterize a source text much better.
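Random generation from a word-level bigram model could be sketched like this (an illustration in the spirit of the book's random-sampling examples; the helper names are my own):

```javascript
// Map each word to the list of words that follow it in the corpus.
// Repeats in the list give frequent successors a higher sampling weight.
function buildBigramModel(words) {
  const next = {};
  for (let i = 0; i < words.length - 1; i++) {
    (next[words[i]] = next[words[i]] || []).push(words[i + 1]);
  }
  return next;
}

// Walk the model, picking a random successor at each step.
function generate(model, start, length) {
  const out = [start];
  let word = start;
  for (let i = 1; i < length; i++) {
    const choices = model[word];
    if (!choices) break;  // dead end: word only appears at the end of the corpus
    word = choices[Math.floor(Math.random() * choices.length)];
    out.push(word);
  }
  return out.join(' ');
}
```

A trigram version would key the `next` table on pairs of words instead of single words, which is what makes its output read so much more like the source.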

The text classification might benefit from showing how it got that result. So perhaps showing what n-grams it used and what the probability was on each one (that way even when it produces an incorrect answer it would be really interesting to see what these partial results were and what might have gone wrong).

As for IR, I think if we really want to try having a visualization for it, then allowing users to explore the scoring function itself would be a good start. When I as a reader go through this chapter and see the complicated BM25 sum, the first thing I want to do is try plugging in different values to see how that affects the score.

What if you could plug in a query string and a URL, and it would generate the function in mathematical notation with the numbers plugged in (each part labelled, e.g. "this is the term frequency", "this is the inverse document frequency")? I think it would be a really fun exercise to go to Google, search for some query, grab the top 3 results, and then see the score breakdown for each. It would be particularly interesting to see how document size and the high/low frequency of the word impact the score (and whether the scoring matches Google's or some other search engine's ranking).

Another way to understand the behavior of this function is to just see it graphed: for a particular document length and a particular inverse document frequency, what the graph of word frequency against score looks like (or repeat this for score vs. any other isolated variable).

I'm personally kind of a fan of the first idea just because it sounds more fun and seems like a practical exercise that lets you verify how well this score works (ultimately it would be nice if users could create their own scoring function, or modify this one to try to make it better. I think that's really the active way of learning how this works. But this is starting to sound more like an exercise, though).
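For anyone picking this up, the BM25 function being discussed can be sketched in a few lines (k1 and b are the usual free parameters; I'm assuming the idf values are precomputed from the document collection, which is what the offline-training idea above would produce):

```javascript
// Okapi BM25 score of a document for a set of query terms.
// queryTerms / docTerms: arrays of words; idf: term -> idf weight.
function bm25Score(queryTerms, docTerms, avgDocLength, idf, k1 = 1.2, b = 0.75) {
  const docLength = docTerms.length;
  const tf = {};  // term frequencies in this document
  for (const t of docTerms) tf[t] = (tf[t] || 0) + 1;
  let score = 0;
  for (const q of queryTerms) {
    const f = tf[q] || 0;
    if (f === 0) continue;  // term absent: contributes nothing
    // Saturating tf term, normalized by document length relative to average.
    const norm = (f * (k1 + 1)) / (f + k1 * (1 - b + (b * docLength) / avgDocLength));
    score += (idf[q] || 0) * norm;
  }
  return score;
}
```

Every quantity the visualization would label (term frequency `f`, document length, average length, idf) appears explicitly here, so a "plug in the numbers" view could be driven directly off this calculation.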

@OmarShehata Thanks for your input; your idea for IR is great, and I will try to make that sort of visualisation.

@redblobgames I am thinking of this type of visualisation:
[screenshot]

@redblobgames This is what I have finally made:

[GIF demo]

I will open a PR after your suggestions 😄

Nice work @prakamya-mishra! Do you think you could modify this slightly so that each document can be a link or a corpus of text (instead of just a word or a phrase)? I think that will show much more interesting behavior and could be a real tool for understanding this algorithm.

@OmarShehata I was not able to find a good corpus, so I have kept a variable there to which a good corpus can be added in the future.

[screenshot]

Hey! I am interested in this chapter. Here's a snapshot of what I have done so far:

[screenshot]

The output updates in real time as the input is typed or deleted. It feels quick and nice. You can see it here. Let me know what you think! I've used Tangle to change the order, name, and calculations in real time.

[Currently it's on my blog, so I've given it styles accordingly.]