KunstDerFuge/dChan

ElasticSearch probably isn't needed at all and just makes things a lot more complicated

Opened this issue · 3 comments

In #1 I mentioned how ES is probably unneeded for pagination and other simple tasks.

Actually now I think it's totally unneeded.

Running to get simple field-only searches:

CREATE EXTENSION pg_trgm;
CREATE INDEX posts_post_fk_view_body_tsvector_search_idx ON posts_post_fk_view USING GIN(TO_TSVECTOR('english', body));
CREATE INDEX posts_post_fk_view_author_search_idx ON posts_post_fk_view USING GIN(TO_TSVECTOR('english', author));
CREATE INDEX posts_post_fk_view_subject_search_idx ON posts_post_fk_view USING GIN(TO_TSVECTOR('english', subject));
CREATE INDEX posts_post_fk_view_tripcode_search_idx ON posts_post_fk_view USING GIN(TO_TSVECTOR('english', tripcode));
CREATE INDEX posts_post_fk_view_poster_hash_search_idx ON posts_post_fk_view USING GIN(TO_TSVECTOR('english', poster_hash));

And for complex pseudo-documents:

CREATE MATERIALIZED VIEW posts_post_documents AS
 SELECT posts_post_fk_view.id,
    json_build_array(ARRAY[posts_post_fk_view.platform_name, posts_post_fk_view.post_id_t, posts_post_fk_view.thread_id_t, posts_post_fk_view.board_name, posts_post_fk_view."timestamp"::text, posts_post_fk_view.replies::json::text])::text AS metadata,
    json_build_array(ARRAY[posts_post_fk_view.author, posts_post_fk_view.tripcode, posts_post_fk_view.subject, posts_post_fk_view.body::character varying, posts_post_fk_view.poster_hash])::text AS document,
    posts_post_fk_view.platform_name,
    posts_post_fk_view.board_name,
    posts_post_fk_view.thread_id,
    posts_post_fk_view.post_id
   FROM posts_post_fk_view;

CREATE INDEX posts_post_document_merged_metadata_tsvector_search_idx ON posts_post_documents USING GIN(TO_TSVECTOR('english', metadata));
CREATE INDEX posts_post_document_merged_bodies_tsvector_search_idx ON posts_post_documents USING GIN(TO_TSVECTOR('english', document));

I hardly think we can do better than this.

Document search for damnatio AND memoriae completing in 0.123ms

\pset format html
EXPLAIN ANALYZE SELECT d.document document, p.* FROM posts_post_documents d JOIN posts_post_fk_view p ON p.id = d.id WHERE TO_TSVECTOR('english'::REGCONFIG, document) @@ TO_TSQUERY('damnatio & memoriae');
QUERY PLAN
Nested Loop (cost=36.70..75.11 rows=3 width=869) (actual time=0.082..0.086 rows=1 loops=1)
  -> Bitmap Heap Scan on posts_post_documents d (cost=36.27..49.74 rows=3 width=171) (actual time=0.064..0.065 rows=1 loops=1)
        Recheck Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('damnatio & memoriae'::text))
        Heap Blocks: exact=1
        -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..36.27 rows=3 width=0) (actual time=0.054..0.054 rows=1 loops=1)
              Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('damnatio & memoriae'::text))
  -> Index Scan using posts_post_fk_view_pkey on posts_post_fk_view p (cost=0.43..8.45 rows=1 width=706) (actual time=0.012..0.014 rows=1 loops=1)
        Index Cond: (id = d.id)
Planning Time: 0.589 ms
Execution Time: 0.123 ms

(10 rows)

Body search for damnatio OR memoriae completing in 17.260ms

\pset format csv
SELECT board_name, platform_name, thread_id, post_id, author, LEFT(body, 300) FROM posts_post_fk_view WHERE TO_TSVECTOR('english'::REGCONFIG, body) @@ TO_TSQUERY('damnatio|memoriae');
board_name,platform_name,thread_id,post_id,author,left
qresearch,8kun,42982,43202,Anonymous,">>42430  possibly ""In ancient Rome, the practice of damnatio memoriae was the condemnation of Roman elites and emperors after their deaths. If the senate or a later emperor did not like the acts of an individual, they could have his property seized, HIS NAME ERASED and his statues reworked. Because "
qresearch,8kun,2818380,2818623,Anonymous,"Any occult anons noticed the similarity between Raymon Lull's wheels, intellectus, voluntas and memoria and the concentric rotating structure of the Q clock? 
Very different basis, Lulls was based on the attributes of God - perhaps an occult anon knows more?"
qresearch,8kun,10009310,10009661,Anonymous,">>10009631
==List of Holocaust movies 1990's==
Year 	Country 	Title 	Director 	Notes
1990 	United States 	C.A.N.D.L.E.S.: The Story of the Mengele Twins (TV) 	Gordon J. Murray 	
1991 	France 	Premier convoi (TV) 	Jacky Assoun and Suzette Bloch 	
1991 	United Kingdom 	Purple Triangles 	Martin Smith 	"
QUERY PLAN
Gather (cost=1166.26..46806.09 rows=14194 width=64) (actual time=0.854..17.227 rows=3 loops=1)
  Output: board_name, platform_name, thread_id, post_id, author, ("left"(body, 300))
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Bitmap Heap Scan on public.posts_post_fk_view (cost=166.26..44386.69 rows=5914 width=64) (actual time=0.176..0.196 rows=1 loops=3)
        Output: board_name, platform_name, thread_id, post_id, author, "left"(body, 300)
        Recheck Cond: (to_tsvector('english'::regconfig, posts_post_fk_view.body) @@ to_tsquery('damnatio|memoriae'::text))
        Heap Blocks: exact=3
        Worker 0: actual time=0.008..0.009 rows=0 loops=1
        Worker 1: actual time=0.008..0.008 rows=0 loops=1
        -> Bitmap Index Scan on posts_post_fk_view_body_tsvector_search_idx (cost=0.00..162.71 rows=14194 width=0) (actual time=0.487..0.488 rows=3 loops=1)
              Index Cond: (to_tsvector('english'::regconfig, posts_post_fk_view.body) @@ to_tsquery('damnatio|memoriae'::text))
Planning Time: 0.227 ms
Execution Time: 17.260 ms

(14 rows)

Prefixed words are allowed.

Counting documents w/Kekistan*, Kekistan and Kekistani in 0.141ms

has_word_that_begins_with_kekistan has_kekistan has_kekistani has_kekistan*-has_kekistan-has_kekistani
114 44 70 0
EXPLAIN ANALYZE SELECT s.*, s.has_word_that_begins_with_kekistan - s.has_kekistani - s.has_kekistan AS "has_kekistan*-has_kekistan-has_kekistani" FROM (SELECT (SELECT COUNT(id) FROM posts_post_documents d WHERE TO_TSVECTOR('english'::REGCONFIG, document) @@ TO_TSQUERY('Kekistan:*')) has_word_that_begins_with_kekistan, (SELECT COUNT(id) FROM posts_post_documents d WHERE TO_TSVECTOR('english'::REGCONFIG, document) @@ TO_TSQUERY('Kekistan')) has_kekistan, (SELECT COUNT(id) FROM posts_post_documents d WHERE TO_TSVECTOR('english'::REGCONFIG, document) @@ TO_TSQUERY('Kekistani')) has_kekistani) s;
QUERY PLAN
Result (cost=46338.92..46338.93 rows=1 width=32) (actual time=0.325..0.326 rows=1 loops=1)
  InitPlan 1 (returns $0)
    -> Aggregate (cost=7728.48..7728.49 rows=1 width=8) (actual time=0.144..0.144 rows=1 loops=1)
          -> Bitmap Heap Scan on posts_post_documents d (cost=43.51..7723.55 rows=1969 width=8) (actual time=0.047..0.137 rows=114 loops=1)
                Recheck Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan:*'::text))
                Heap Blocks: exact=113
                -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..43.02 rows=1969 width=0) (actual time=0.038..0.038 rows=114 loops=1)
                      Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan:*'::text))
  InitPlan 2 (returns $1)
    -> Aggregate (cost=7720.48..7720.49 rows=1 width=8) (actual time=0.025..0.026 rows=1 loops=1)
          -> Bitmap Heap Scan on posts_post_documents d_1 (cost=35.51..7715.55 rows=1969 width=8) (actual time=0.011..0.023 rows=44 loops=1)
                Recheck Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan'::text))
                Heap Blocks: exact=44
                -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..35.02 rows=1969 width=0) (actual time=0.007..0.007 rows=44 loops=1)
                      Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan'::text))
  InitPlan 3 (returns $2)
    -> Aggregate (cost=7720.48..7720.49 rows=1 width=8) (actual time=0.036..0.036 rows=1 loops=1)
          -> Bitmap Heap Scan on posts_post_documents d_2 (cost=35.51..7715.55 rows=1969 width=8) (actual time=0.013..0.033 rows=70 loops=1)
                Recheck Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistani'::text))
                Heap Blocks: exact=70
                -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..35.02 rows=1969 width=0) (actual time=0.008..0.008 rows=70 loops=1)
                      Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistani'::text))
  InitPlan 4 (returns $3)
    -> Aggregate (cost=7728.48..7728.49 rows=1 width=8) (actual time=0.063..0.063 rows=1 loops=1)
          -> Bitmap Heap Scan on posts_post_documents d_3 (cost=43.51..7723.55 rows=1969 width=8) (actual time=0.027..0.059 rows=114 loops=1)
                Recheck Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan:*'::text))
                Heap Blocks: exact=113
                -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..43.02 rows=1969 width=0) (actual time=0.022..0.022 rows=114 loops=1)
                      Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan:*'::text))
  InitPlan 5 (returns $4)
    -> Aggregate (cost=7720.48..7720.49 rows=1 width=8) (actual time=0.032..0.032 rows=1 loops=1)
          -> Bitmap Heap Scan on posts_post_documents d_4 (cost=35.51..7715.55 rows=1969 width=8) (actual time=0.011..0.029 rows=70 loops=1)
                Recheck Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistani'::text))
                Heap Blocks: exact=70
                -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..35.02 rows=1969 width=0) (actual time=0.007..0.007 rows=70 loops=1)
                      Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistani'::text))
  InitPlan 6 (returns $5)
    -> Aggregate (cost=7720.48..7720.49 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
          -> Bitmap Heap Scan on posts_post_documents d_5 (cost=35.51..7715.55 rows=1969 width=8) (actual time=0.008..0.020 rows=44 loops=1)
                Recheck Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan'::text))
                Heap Blocks: exact=44
                -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..35.02 rows=1969 width=0) (actual time=0.005..0.005 rows=44 loops=1)
                      Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan'::text))
Planning Time: 0.362 ms
Execution Time: 0.374 ms

(45 rows)

Showing 5 «Kekistan»-prefixed documents (in metadata or document body) in 0.069ms

EXPLAIN ANALYZE SELECT * FROM posts_post_documents d WHERE TO_TSVECTOR('english'::REGCONFIG, document) @@ TO_TSQUERY('Kekistan:*') OR TO_TSVECTOR('english'::REGCONFIG, metadata) @@ TO_TSQUERY('Kekistan:*') LIMIT 5;
QUERY PLAN
Limit (cost=76.42..98.04 rows=5 width=280) (actual time=0.049..0.054 rows=5 loops=1)
  -> Bitmap Heap Scan on posts_post_documents d (cost=76.42..10840.37 rows=2490 width=280) (actual time=0.048..0.052 rows=5 loops=1)
        Recheck Cond: ((to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan:*'::text)) OR (to_tsvector('english'::regconfig, metadata) @@ to_tsquery('Kekistan:*'::text)))
        Heap Blocks: exact=5
        -> BitmapOr (cost=76.42..76.42 rows=2491 width=0) (actual time=0.039..0.040 rows=0 loops=1)
              -> Bitmap Index Scan on posts_post_document_merged_bodies_tsvector_search_idx (cost=0.00..43.02 rows=1969 width=0) (actual time=0.034..0.034 rows=114 loops=1)
                    Index Cond: (to_tsvector('english'::regconfig, document) @@ to_tsquery('Kekistan:*'::text))
              -> Bitmap Index Scan on posts_post_document_merged_metadata_tsvector_search_idx (cost=0.00..32.16 rows=522 width=0) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (to_tsvector('english'::regconfig, metadata) @@ to_tsquery('Kekistan:*'::text))
Planning Time: 0.131 ms
Execution Time: 0.069 ms
id metadata document platform_name board_name thread_id post_id
297754 [["8kun","5553116","5552473","qresearch","2019-03-07 02:04:23-05","{}"]] [["Anonymous","","",">>5552927\nOh wow, shills and red blooded Kekistanis gonna have a field day with that one anon.","718d4f"]] 8kun qresearch 5552473 5553116
307070 [["8kun","5616186","5615981","qresearch","2019-03-10 22:54:21-04","{}"]] [["Anonymous","","",">>5616119\nOh, Zuzana! Oh Dontchyou cry for meme!\nI come from Kekistana, with the Anons on my team!","937359"]] 8kun qresearch 5615981 5616186
322926 [["8kun","5669690","5669434","qresearch","2019-03-13 21:31:53-04","{}"]] [["Anonymous","","",">>5669614\nEast Kekistan.\nThey got the island from West Kekistan about 10 years ago, when East Kekistan offered a sham peace.","a9c51f"]] 8kun qresearch 5669434 5669690
367220 [["8kun","982990","982209","qresearch","2018-04-10 11:54:29-04","{}"]] [["Anon","","https://8ch.net/qresearch/res/877198.html#q982962","==RED TEXT== \"'SAFETY & EDUCATION'\" OF NEW VISITORS TO Q Research\n\n@ ## Board Owner\n@ Kekistani Air Patrol, etc.\n@ Poster of ['Anonymous 04/03/18 (Tue) 16:35:54 9b4725 No.880021'] / [https:// 8ch.net/ qresearch/res/ 879967.html#880021]\n\nWhen I woke up this morning, it occurred to me: \"The First Principle of All Warfare is -→ \"'selection and maintenance of the aim'\"\n\nThis is war (even in this comparatively benign theatre of operations); \n==RED TEXT== The overriding [DJT/WH/Admiral/Q/whitehat/anon/autist children at play] aim in this war, in all theatres, is the protection of an innocent public from a bunch of Evil, wily, \"'murderous'\", Literally Satanic motherfuckers\n\nThe principle threat to members of the innocent public in this theatre (\"'Q Research'\", but especially \"'Q Research General'\") is foolish auto-doxxing\n\n \"It's a trap!\" – Admiral Akbar\n \"This is not a game.\" – Q\n\nIf they'll drop a plane with 100 people right out of the sky, to kill one whistleblower, plainly they won't hesitate to fuck up & fuck over any/all innocent normies __that they can identify__, in a thousand lesser ways.\n\n==RED TEXT== Baking The Safety Right In:\n\nRecommend that the attached meme be either somehow pinned to the top of every new bread (\"'and'\" at the top of the catalog, \"'and'\" somehow if possible in every other part of \"'Q Research'\" besides \"'Q Research General'\"). \n\n[Or something very much like it, but as this is an \"Achtung! Minen!\" time warning (\"'Stand back, pplz, everybody please stand back'\"), variety is bad and repetition (Pavlov) is good.]\n\nIn fact, despite the problem of evil shillclown motherfuckers trying to flood the breads and board with noise in the channel, in the interests of the safety of arriving, clueless, normie new visitors, I recommend this warning or something like it be dropped as a naked meme into the growing breads maybe roughly once every 250 posts (so, ~3 times per bread).\n\nAs this board gets increasing attention over the next weeks,","9af583"]] 8kun qresearch 982209 982990
372236 [["8kun","1006063","1005558","qresearch","2018-04-11 22:16:56-04","{}"]] [["Anonymous","","",">>1006024\n\nShe looks like Kekistani","d7acac"]] 8kun qresearch 1005558 1006063

(5 rows)

This is excellent -- I had originally been using SearchVector in the pre-elasticsearch codebase, hence the vestigial DB column, but wasn't satisfied with the default behavior (iirc, multiple keywords combined as strictly either AND or OR), but search_type='websearch' looks like it would be a great drop-in replacement for the current behavior. Will try it locally and see how it compares.