yogthos/markdown-clj

Describe best practices for sanitising non-markdown HTML properly with markdown-clj

Ashe opened this issue · 2 comments

Ashe commented

Hey there, I was wondering whether the docs could show more examples of using :replacement-transformers, as I'm not sure how to go about escaping HTML properly.

;; Body
[:div
 {:dangerouslySetInnerHTML
  {:__html (md/md->html
             (:post-summary p)
             :replacement-transformers
             (cons escape-html mdt/transformer-vector))}}]


(def ^:dynamic ^:no-doc *html-mode* :xhtml)

(defn- escape-html
  "Change special characters into HTML character entities."
  [text state]
  [ (if (and (not= :code state) (not= :codeblock state))
      (-> text
        (s/replace #"&"  "&amp;")
        (s/replace #"<"  "&lt;")
        (s/replace #">"  "&gt;")
        (s/replace #"\"" "&quot;")
        (s/replace #"'" (if (= *html-mode* :sgml) "&#39;" "&apos;")))
      text) state])

The code above tries to escape any HTML that was not generated by markdown-clj, and it succeeds. However, it also escapes HTML that has been placed inside a code block. Since it is important to sanitise HTML from queries while still producing readable code blocks, I think it would be useful to new users to show an example of how to do this.
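One possible reason the code-block check above never takes effect, assuming the `state` passed to a transformer is a map of flags (the hypothetical value `{:codeblock true}` below is only an illustration): comparing a keyword to a map with `not=` is always true, so the text is escaped every time. A minimal sketch of the difference:

```clojure
;; Assumption: the transformer state is a map; this value is made up for illustration.
(def sample-state {:codeblock true})

;; A keyword is never equal to a map, so this check is always true
;; and escaping would always be applied.
(def compared-to-keyword (not= :codeblock sample-state))

;; Looking the flag up in the map reads the value that was actually set.
(def looked-up-flag (:codeblock sample-state))
```

Under that assumption, looking up `(:code state)` and `(:codeblock state)` is the check that distinguishes code from ordinary text.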

Ashe commented

As described in #36, I tried this:

(defn- escape-html
  "Change special characters into HTML character entities."
  [text state]
  (let [sanitized-text 
          (clojure.string/escape text 
             {\& "&amp;" 
              \< "&lt;" 
              \> "&gt;" 
              \" "&quot;"
              \' "&#39;"})]
    [(if (not (or (:code state) (:codeblock state)))
      sanitized-text text) state]))

This works for the most part, although single-line code blocks still get escaped. This will have to do for now, though.

Hi,

Yeah, avoiding escaping within inline code is trickier, so this is a reasonable approach. As a note, it's probably better to check for a code block before sanitizing, since that avoids the work in cases where it's not needed. I'll add this as an example in the docs to help others running into this problem.
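A minimal sketch of that reordering, assuming the transformer receives `[text state]` and returns a `[text state]` pair as in the snippets above, and that `:code` and `:codeblock` are flags in the state map. The name `html-escapes` is introduced here only for illustration; inline code is not handled any differently than before.

```clojure
(require '[clojure.string :as string])

;; Character-to-entity map, taken from the snippet earlier in this thread.
(def html-escapes
  {\& "&amp;"
   \< "&lt;"
   \> "&gt;"
   \" "&quot;"
   \' "&#39;"})

(defn escape-html
  "Escapes HTML special characters in text unless the parser state says
  we are inside code. Returns a [text state] pair."
  [text state]
  (if (or (:code state) (:codeblock state))
    ;; Inside code: return the text untouched and skip the escaping work.
    [text state]
    ;; Outside code: escape only now, when it is actually needed.
    [(string/escape text html-escapes) state]))
```

It can then be placed in front of the default transformers as in the original post, for example `(cons escape-html mdt/transformer-vector)`.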