How to split document in tokens in a mixed tagged format
andreasbaumann opened this issue · 2 comments
andreasbaumann commented
Using:
[SearchIndex]
word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();
[ForwardIndex]
text = orig split /posts/post/body//para();
I get:
6 text 'Using'
7 text 'a'
8 text 'static'
9 text 'HTML'
10 text 'generator'
11 text 'now'
12 text 'called'
13 text 'Hugo'
14 text '.'
15 text 'Before'
16 text 'I'
17 text 'used'
18 text 'HTML'
19 text 'and'
20 text 'server-side-includes.'
23 text 'Synchronization'
24 text 'is'
25 text 'done'
26 text 'with'
27 text 'rsync'
28 text 'over'
29 text 'ssh.'
The documentation says it's a split on whitespace. Why do I sometimes get '.' as a token of its own, and sometimes 'word.'?
Does it depend on the way I'm analyzing for the search index?
andreasbaumann commented
Ah: the original text contains tags:
<para>
Using a static HTML generator now called
<ulink url="https://gohugo.io/">Hugo</ulink>. Before I used HTML and
server-side-includes. Synchronization is done with rsync over ssh. If
you ask yourselves, why no CMS, well, the two wikis/CMS I had before
(I don't mention names) were hacked in no time. And don't want to
spend any time doing security updates all the time.
So the single '.' I get comes right after a </ulink>. Otherwise split does indeed
separate by whitespace.
patrickfrey commented
Tokens crossing segment borders are always split.
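A minimal sketch of that behavior in plain Python (not the strus API, just an illustration): the segmenter emits one text segment per tag-delimited region, and the whitespace split runs inside each segment separately, so a token can never span a tag boundary. That is why the '.' following </ulink> stands alone while 'HTML.' keeps its dot.

```python
import xml.etree.ElementTree as ET

# Simplified version of the original document.
doc = ('<para>Using a static HTML generator now called '
       '<ulink url="https://gohugo.io/">Hugo</ulink>. Before I used HTML.</para>')

root = ET.fromstring(doc)

# itertext() yields one string per tag-delimited segment:
#   'Using a static HTML generator now called ', 'Hugo', '. Before I used HTML.'
tokens = []
for segment in root.itertext():
    tokens.extend(segment.split())  # whitespace split per segment

print(tokens)
# → ['Using', 'a', 'static', 'HTML', 'generator', 'now', 'called',
#    'Hugo', '.', 'Before', 'I', 'used', 'HTML.']
```

The '.' after Hugo is cut off from the preceding token because the </ulink> closes a segment, while 'HTML.' sits entirely inside one segment and stays together.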