patrickfrey/strusAnalyzer

How to split a document into tokens in a mixed tagged format

andreasbaumann opened this issue · 2 comments

Using:

[SearchIndex]
	word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();

[ForwardIndex]
	text = orig split /posts/post/body//para();

I get:

6 text 'Using'
7 text 'a'
8 text 'static'
9 text 'HTML'
10 text 'generator'
11 text 'now'
12 text 'called'
13 text 'Hugo'
14 text '.'
15 text 'Before'
16 text 'I'
17 text 'used'
18 text 'HTML'
19 text 'and'
20 text 'server-side-includes.'
23 text 'Synchronization'
24 text 'is'
25 text 'done'
26 text 'with'
27 text 'rsync'
28 text 'over'
29 text 'ssh.'

The documentation says it's a split on whitespace. Why do I sometimes get '.' and sometimes
'word.'?

Does it depend on the way I'm analyzing for the search index?

Ah: the original text contains tags:

<para>
  Using a static HTML generator now called
  <ulink url="https://gohugo.io/">Hugo</ulink>. Before I used HTML and
  server-side-includes. Synchronization is done with rsync over ssh. If
  you ask yourselves, why no CMS, well, the two wikis/CMS I had before
  (I don't mention names) were hacked in no time. And don't want to
  spend any time doing security updates all the time.
</para>

So the single '.' is the token I get right after a </ulink>. Otherwise split does indeed
separate by whitespace.

Tokens crossing segment borders are always split.
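A minimal sketch of that behavior (plain Python, not strus itself; the tag-stripping regex and function name are my own illustration): the segmenter first cuts the document into text segments at tag boundaries, and only then does the 'split' tokenizer whitespace-split each segment independently, so a token can never span a tag.

```python
import re

def tokenize(xml_fragment):
    # Step 1: segmenter cuts the text at tag boundaries
    segments = re.split(r"<[^>]+>", xml_fragment)
    # Step 2: 'split' tokenizer whitespace-splits each segment on its own
    tokens = []
    for seg in segments:
        tokens.extend(seg.split())
    return tokens

# '.' directly after </ulink> becomes its own token:
print(tokenize('called <ulink url="https://gohugo.io/">Hugo</ulink>. Before'))
# ['called', 'Hugo', '.', 'Before']

# No tag boundary, so the period stays attached to the word:
print(tokenize('rsync over ssh.'))
# ['rsync', 'over', 'ssh.']
```

This reproduces both observations from the forward index above: 'Hugo' followed by a lone '.', but 'ssh.' as a single token.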