mvfsillva/dialetus-service

Hash table possibility instead of a list

mtmr0x opened this issue · 8 comments

Premise

For better performance and easier scaling of the glossary, storing the information in a hash table makes the key<->value match faster and makes it easier to scale when matching data across endpoints.

Current situation

A GET to the endpoint returns a list of objects for the dictionary. If I want to list them all, that's fine, but if I need to match one specific word I have to run through the whole list looking for it.
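
For illustration, this is roughly what looking up a single word means today; the URL and the `dialect` field are assumptions based on the examples further down, not the service's actual contract:

// Illustrative only: assumes the endpoint returns an array of entries,
// each with a `dialect` field, as in the examples further down.
// (Run inside an async context or an ES module with top-level await.)
const response = await fetch('https://dialetus-service.now.sh/dialects/baianes');
const entries = await response.json();

// Linear scan: O(n) over the whole list just to find one word.
const barril = entries.find((entry) => entry.dialect.toLowerCase() === 'barril');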

Using a hash table

If data could be organised like this:

{
  barril: { dialect: 'barril', meanings: [ ... ], examples: [ ... ] },
  migue: { dialect: 'Migué', meanings: [ ... ], examples: [ ... ] }
}

If I want to look up a word, I would just do:

const barril = baianes['barril'];

That way, key<->value matching would be faster and finding a word inside the dictionary would be easier.

Other "wins" from this decision

Organising the .json files would be way easier, and so would adding new words. Instead of having one big file, you could place everything in a folder structure like:

dialects
\_ baianes
   \_ barril.json
   \_ migue.json
    ...etc
  • The root endpoint would gather everything as a collection and expose it as a hash table, showing the full collection (see the loader sketch after the response example below);
  • And you can now easily provide a deeper "foldering" endpoint to get specific words, like:
GET https://dialetus-service.now.sh/dialects/baianes/barril-dobrado

response example:

{
  "barrilDobrado": {
    "dialect": "Barril Dobrado",
    "meanings": [
      "Problema muito grande",
      "Situação muito complicada",
      "Pessoa de grande qualidade"
    ],
    "examples": [
      "Isso ai é barril vey",
      "Você é barril dobrado meu pivete",
      "Eu sou barril dobrado"
    ]
  }
}
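
To make the folder idea concrete, here is a minimal loader sketch, assuming Node.js and the dialects/baianes/*.json layout above; the function name and paths are hypothetical:

// Hypothetical loader: one JSON file per word, keyed by its slug (file name).
const fs = require('fs');
const path = require('path');

function loadDialect(dialect) {
  const dir = path.join(__dirname, 'dialects', dialect);
  const collection = {};

  for (const file of fs.readdirSync(dir)) {
    if (!file.endsWith('.json')) continue;
    const key = path.basename(file, '.json'); // e.g. 'barril-dobrado'
    collection[key] = JSON.parse(fs.readFileSync(path.join(dir, file), 'utf8'));
  }

  return collection; // { 'barril-dobrado': { dialect, meanings, examples }, ... }
}

// Root endpoint: the whole collection as a hash table.
const baianes = loadDialect('baianes');
// Deeper endpoint: a single word is just a key lookup.
const barrilDobrado = baianes['barril-dobrado'];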

Wow, that is amazing! I like the proposal, thinking about scalability, the long term, and even other projects that may arise in the future. Please send a PR to the project!

I'll work on that, but I'm kind of slow these days since I'm working on something quite complex. I'll start writing the implementation and tech specs for it. 🎉

Hi there :)

When you're dealing with text search, it is more interesting to have not a hash table but a prefix tree or an n-gram tree, so you can run partial searches on terms.

A trie is fairly simple to implement with only vanilla libs, but specialising it for n-grams may need some external libs or considerably more work.
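
For reference, a minimal vanilla-JS trie sketch that supports prefix (partial) search; the slugs used as keys are only illustrative:

// Minimal prefix tree (trie): each node remembers which keys pass through it,
// so a prefix lookup returns all matching dictionary keys.
class Trie {
  constructor() {
    this.root = { children: {}, keys: [] };
  }

  // Index `key` (e.g. 'barril-dobrado') under the normalized `term`.
  insert(term, key) {
    let node = this.root;
    for (const ch of term.toLowerCase()) {
      node.children[ch] = node.children[ch] || { children: {}, keys: [] };
      node = node.children[ch];
      node.keys.push(key);
    }
  }

  // Partial search: every key whose indexed term starts with `prefix`.
  search(prefix) {
    let node = this.root;
    for (const ch of prefix.toLowerCase()) {
      node = node.children[ch];
      if (!node) return [];
    }
    return node.keys;
  }
}

const trie = new Trie();
trie.insert('barril', 'barril');
trie.insert('barril dobrado', 'barril-dobrado');
trie.search('barr'); // → ['barril', 'barril-dobrado']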

Some inspirational examples:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

https://whoosh.readthedocs.io/en/latest/ngrams.html

One other thing is in regard to the data/search-index model. In most search engines there is a clear separation between the data you store and the search indexes you use. So a good addition to this idea is to turn the data model into a K/V store but keep an index for it in another data structure, so that the data is easily addressable by key yet searchable via a more elaborate search index :)
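
A small sketch of that separation, assuming the slugs from the proposal above; the trigram index here is just one possible search structure:

// Data side: a plain K/V store keyed by slug.
const data = {
  'barril-dobrado': { dialect: 'Barril Dobrado', meanings: [ /* ... */ ], examples: [ /* ... */ ] },
};

// Search side: a separate trigram → keys index built from the data.
const index = new Map();
for (const [key, entry] of Object.entries(data)) {
  const term = entry.dialect.toLowerCase();
  for (let i = 0; i + 3 <= term.length; i++) {
    const gram = term.slice(i, i + 3);
    if (!index.has(gram)) index.set(gram, new Set());
    index.get(gram).add(key);
  }
}

// Lookup by key stays a direct hash access...
const byKey = data['barril-dobrado'];
// ...while partial search goes through the index and resolves back to the data.
const matches = [...(index.get('dob') || [])].map((key) => data[key]);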

@mateusduboli You're absolutely right about using trees for search purposes. I was being simplistic in my solution and didn't consider a better searchable structure for looking up words. 🤦‍♂

Instead of having a user-readable JSON file, we would design some logic to retrieve data from the searched characters. That makes sense to me. Is this aligned with the project's long-term expectations, @mvfsillva?

I found it super interesting; I did not know about n-gram trees. I think it aligns completely with the expectations of the project.

If I understand this correctly, it will improve the performance of looking up words and also the semantics of the data storage.

Hey, guys, @mtmr0x @mateusduboli let's do it \0/