Memory lightweight implementation of a white/black list
gem install enf
Building a dictionnary of all terms used in 'Les misérables'
require 'enf'
require 'open-uri'
URI = 'https://www.gutenberg.org/ebooks/135.txt.utf-8'
elephant = Enf::Elephant.new
open(URI) do |file|
file.read.scan(/[[:alpha:]]*/).each do |token|
elephant.register! token.downcase
end
end
elephant.include? 'bonjour'
# => true
elephant.include? 'megadrive'
# => false
Building a shared blacklist with a rack app
require 'rack'
require 'enf'
require 'JSON'
elephant = Enf::Elephant.new
app = Proc.new do |env|
puts env.inspect
path = env.fetch('PATH_INFO')
case path
when '/'
['200', {'Content-Type' => 'text/html'}, ['A sample elephant black list rack app']]
when /^\/add\//
elephant.register!(token = path[5..-1])
['200', {'Content-Type' => 'text/html'}, ["Registered '#{token}'"]]
when /^\/know\//
result = elephant.include?(token = path[6..-1])
['200', {'Content-Type' => 'text/json'}, [{ token => result }.to_json]]
else
['404', {'Content-Type' => 'text/html'}, ['Learn to talk elephantish']]
end
end
Rack::Handler::WEBrick.run app
Instead of storing all listed values like follow:
{
"foo" => true,
"bar" => true,
"foobar" => true,
# ...
"foobob" => true
}
Those are stored in several nested hashes like follow:
{
"f" => {
"o" => {
"o" => { # leave
"b" => {
"a" => "r", # leave
"o" => "b" # leave
}
}
}
},
"b" => {
"a" => "r" # leave
}
}
Avoiding duplication of factorizable content.
Enf::Elephant being sensibly slower than using sets or hashes on small volume of data, it is recommended to use it when memory matters most and when words are easily factorizable (example : as a black list for twitter spam bots).
Enf::Elephant are now able to propose completion candidates from a word beginning, thanks to the 'suggest' method:
require 'enf/suggest'
require 'open-uri'
URI = 'https://www.gutenberg.org/ebooks/135.txt.utf-8'
elephant = Enf::Elephant.new
open(URI) do |file|
file.read.scan(/[[:alpha:]]*/).each do |token|
elephant.register! token.downcase
end
end
suggestions = elephant.suggest('ele', limit: 5, incompletes: false)
# => #<Set: {"ven", "venth", "vate", "vated", "vates", "ment",
# "ments", "gant", "gance", "gy", "onore", "phant",
# "ct", "ctor", "ctric", "cted", "usiac"}>
puts suggestions.map { |ending| 'ele' + ending }.inspect
# => ["eleven", "eleventh", "elevate", "elevated", "elevates",
# "element", "elements", "elegant", "elegance", "elegy",
# "eleonore", "elephant", "elect", "elector", "electric",
# "elected", "eleusiac"]
Parameters detail:
- The first non optional parameter is the beginning of the words to search for.
- The limit optional parameter (default: :none) is the number of characters to allow the elephant to inquire for. That parameter allow 2 kinds of value: the :none symbol, that search deeply in the graph for corresponding completion candidates, or a strictly positive integer. (Example: with the value 2, the previous example would have only returned elegy and elect).
- The incompletes optional parameter (default: false) is a boolean parameter allowing to add intermediates incompletes words to the result set. (Example: with limit = 2 and incompletes = true, the previous result set would have been: ['ve', 'va', 'me', 'ga', 'gy', 'on', 'ph', 'ct', 'us'])
I will probably be working on:
- Single path merging: Now, when a single word uses a specific path (example: bar, in the previous example), a hash is created for every letter of that specific path. We have to merge theim as a single element to avoid useless hash creation.
- tbd