200ok-ch/org-parser

Tags with underscore are not parsed correctly

jm-g opened this issue · 2 comments

jm-g commented

According to the Org user guide, tags are defined as follows:

Tags are normal words containing letters, numbers, ‘_’, and ‘@’.

But the parser seems to treat the _ as subscript markup instead of as part of the tag.

(transform (parse "* Headline :tag_a:\n")) 

;; => {:headlines
 [{:headline
   {:level 1,
    :title
    [[:text-normal "Headline :tag"]
     [:text-sub [:text-subsup-word "a"]]
     [:text-normal ":"]],
    :planning [],
    :tags []}}]}

In my opinion, the correct behavior would be

(transform (parse "* Headline :tag_a:\n")) 

;; => {:headlines
 [{:headline
   {:level 1,
    :title [[:text-normal "Headline"]],
    :planning [],
    :tags ["tag_a"]}}]}

This is with org-parser 0.1.27, running on Clojure on the JVM.

Thanks for the report.

I just tried this:

org-parser.core=> (read-str "* foo  :_:")
{:headlines [{:headline {:level 1, :title [[:text-normal "foo"]], :planning [], :tags ["_"]}}]}

But if the "_" is followed by a letter, it doesn't work. I don't yet understand why...
https://github.com/200ok-ch/org-parser/blob/master/src/org_parser/transform.cljc#L66

Oh, I think I got it. The extract-tags function does not receive the raw string but the already parsed headline text. The "_" causes that text to be parsed as plain text followed by a text-subsup-word (subscript text), so the tags are no longer found.
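One possible direction, as a rough sketch only: split the tags off the raw title string before any inline markup is parsed, so the "_" never reaches the subscript rule. The names trailing-tags-re and split-title-and-tags are hypothetical and not part of org-parser. The tag pattern follows the definition quoted above (letters, numbers, "_" and "@"), restricted to ASCII for simplicity, and it assumes the tags are separated from the title by whitespace.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of org-parser.
;; Group 1: title text. Group 2: trailing tag block such as ":a:b_c:".
;; Assumption: tags use only ASCII letters, digits, "_" and "@".
(def trailing-tags-re
  #"^(.*?)\s+(:(?:[\w@]+:)+)\s*$")

(defn split-title-and-tags
  "Splits a raw headline title into its text and its trailing tags.
  Works on the raw string, so no inline markup has been parsed yet."
  [raw-title]
  (if-let [[_ title tag-str] (re-matches trailing-tags-re raw-title)]
    {:title title
     ;; ":a:b_c:" splits into "" "a" "b_c"; drop the empty leading piece.
     :tags  (vec (remove str/blank? (str/split tag-str #":")))}
    {:title (str/trim raw-title) :tags []}))

(split-title-and-tags "Headline :tag_a:")
;; => {:title "Headline", :tags ["tag_a"]}
```

The returned :title would then still need to go through the normal inline-markup parsing. Whether this fits the existing grammar/transform split is a separate question.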

I don't currently have time to work on this. Would you like to try fixing it?

The reason we don't parse the tags directly and instead leave that to transform is documented here:
https://github.com/200ok-ch/org-parser/blob/master/resources/org.ebnf#L37