mvfsillva/dialetus-service

Hash table possibility instead of a list

mtmr0x opened this issue · 8 comments

Premise

For better performance and easier scaling of the glossary, storing the information in a hash table makes the key<->value match faster and makes it easier to scale when matching data across endpoints.

Current situation

A GET to the endpoint returns a list of objects for the dictionary. If I want to list them all, that's fine, but if I need to match one specific word I have to run through the whole list looking for it.
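
For illustration, this is roughly what looking up a single word means today; the URL and the `dialect` field are assumptions based on the examples further down, not the service's actual contract:

// Illustrative only: assumes the endpoint returns an array of entries,
// each with a `dialect` field, as in the examples further down.
// (Run inside an async context or an ES module with top-level await.)
const response = await fetch('https://dialetus-service.now.sh/dialects/baianes');
const entries = await response.json();

// Linear scan: O(n) over the whole list just to find one word.
const barril = entries.find((entry) => entry.dialect.toLowerCase() === 'barril');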

Using a hash table

If data could be organised like this:

{
  barril: { dialect: 'barril', meanings: [ ... ], examples: [ ... ] },
  migue: { dialect: 'Migué', meanings: [ ... ], examples: [ ... ] }
}

If I want to look up a word, I would just do:

const barril = baianes['barril'];

That way, key<->value matching would be faster and finding a word inside the dictionary would be easier.

Other "wins" from this decision

Organising the .json files would be way easier, and so would adding new words. Instead of having one big file, you could place everything in a folder structure like:

dialects
\_ baianes
   \_ barril.json
   \_ migue.json
    ...etc
  • The root endpoint would gather everything as a collection and expose it as a hash table, showing the full collection (see the loader sketch after the response example below);
  • And you can now easily provide a deeper "foldering" endpoint to get specific words, like:
GET https://dialetus-service.now.sh/dialects/baianes/barril-dobrado

response example:

{
  "barrilDobrado": {
    "dialect": "Barril Dobrado",
    "meanings": [
      "Problema muito grande",
      "Situação muito complicada",
      "Pessoa de grande qualidade"
    ],
    "examples": [
      "Isso ai é barril vey",
      "Você é barril dobrado meu pivete",
      "Eu sou barril dobrado"
    ]
  }
}
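
To make the folder idea concrete, here is a minimal loader sketch, assuming Node.js and the dialects/baianes/*.json layout above; the function name and paths are hypothetical:

// Hypothetical loader: one JSON file per word, keyed by its slug (file name).
const fs = require('fs');
const path = require('path');

function loadDialect(dialect) {
  const dir = path.join(__dirname, 'dialects', dialect);
  const collection = {};

  for (const file of fs.readdirSync(dir)) {
    if (!file.endsWith('.json')) continue;
    const key = path.basename(file, '.json'); // e.g. 'barril-dobrado'
    collection[key] = JSON.parse(fs.readFileSync(path.join(dir, file), 'utf8'));
  }

  return collection; // { 'barril-dobrado': { dialect, meanings, examples }, ... }
}

// Root endpoint: the whole collection as a hash table.
const baianes = loadDialect('baianes');
// Deeper endpoint: a single word is just a key lookup.
const barrilDobrado = baianes['barril-dobrado'];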

Wow, that is amazing! I like the proposal, thinking about scalability, the long term, and even other projects that may arise in the future. Please send a PR to the project!

I'll work on that, but I'm kind of slow these days since I'm working on something quite complex. I'll start writing the implementation and tech specs for it. 🎉

Hi there :)

When you're dealing with text search, it is more interesting to have not a hash table but a prefix tree or an n-gram tree, so you can run partial searches on terms.

A trie is fairly simple to implement with only vanilla libs, but specialising it for n-grams may need some external libs or considerably more work.
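
For reference, a minimal vanilla-JS trie sketch that supports prefix (partial) search; the slugs used as keys are only illustrative:

// Minimal prefix tree (trie): each node remembers which keys pass through it,
// so a prefix lookup returns all matching dictionary keys.
class Trie {
  constructor() {
    this.root = { children: {}, keys: [] };
  }

  // Index `key` (e.g. 'barril-dobrado') under the normalized `term`.
  insert(term, key) {
    let node = this.root;
    for (const ch of term.toLowerCase()) {
      node.children[ch] = node.children[ch] || { children: {}, keys: [] };
      node = node.children[ch];
      node.keys.push(key);
    }
  }

  // Partial search: every key whose indexed term starts with `prefix`.
  search(prefix) {
    let node = this.root;
    for (const ch of prefix.toLowerCase()) {
      node = node.children[ch];
      if (!node) return [];
    }
    return node.keys;
  }
}

const trie = new Trie();
trie.insert('barril', 'barril');
trie.insert('barril dobrado', 'barril-dobrado');
trie.search('barr'); // → ['barril', 'barril-dobrado']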

Some inspirational examples:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

https://whoosh.readthedocs.io/en/latest/ngrams.html

One other thing is in regard to the data/search-index model. In most search engines there is a clear separation between the data you store and the search indexes you use. So a good addition to this idea is to turn the data model into a K/V store but keep an index for it in another data structure, so that the data is easily addressable by key yet searchable via a more elaborate search index :)
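
A small sketch of that separation, assuming the slugs from the proposal above; the trigram index here is just one possible search structure:

// Data side: a plain K/V store keyed by slug.
const data = {
  'barril-dobrado': { dialect: 'Barril Dobrado', meanings: [ /* ... */ ], examples: [ /* ... */ ] },
};

// Search side: a separate trigram → keys index built from the data.
const index = new Map();
for (const [key, entry] of Object.entries(data)) {
  const term = entry.dialect.toLowerCase();
  for (let i = 0; i + 3 <= term.length; i++) {
    const gram = term.slice(i, i + 3);
    if (!index.has(gram)) index.set(gram, new Set());
    index.get(gram).add(key);
  }
}

// Lookup by key stays a direct hash access...
const byKey = data['barril-dobrado'];
// ...while partial search goes through the index and resolves back to the data.
const matches = [...(index.get('dob') || [])].map((key) => data[key]);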

@mateusduboli You're absolutely right about using trees for search purposes. I was being simplistic in my solution and didn't consider a better searchable structure for looking up words. 🤦‍♂

Instead of having a user-readable JSON file, we would design some logic to retrieve data from the searched characters. That makes sense to me. Is this aligned with the project's long-term expectations, @mvfsillva?

I found it super interesting; I did not know about n-gram trees. I think it aligns completely with the expectations of the project.

If I understand this correctly, it will improve the performance of looking up words and also the semantics of the data storage.

Hey, guys, @mtmr0x @mateusduboli let's do it \0/