toddsundsted/ktistec

charset/encoding issue

felixkrohn opened this issue · 12 comments

Sorry, it's me again :-/

Since updating (from v2.0.0-8, I believe) to v2.0.0-11, I see broken encoding on other people's posts that contain "special" characters (öéüèäà and so on), which are displayed as ü, ä, ö and so on.

What I checked and tried so far without success:

  • checked that the rendered HTML contains <meta charset="UTF-8"> ✔️
  • when I publish a post myself, all umlauts, accents and emojis are rendered correctly
  • tried the dist branch instead of the v2.0.0-11 git tag
  • verified that other pages exposed by the same reverse proxy (traefik) don't show this issue
  • added the packages icu-dev, icu-libs and icu-data-full to the build-time and runtime docker images
  • checked whether the issue is only on display or deeper:
sqlite> select * from objects where id = '294062';
294062|2024-07-10 13:53:09.506|2024-07-10 13:53:09.506|ActivityPub::Object::Note|https://ard.social/users/tagesschau/statuses/112762525013845636|[...]|<p>Dinosaurier-Skelett in Südengland entdeckt</p> [...]

-> It looks like it was already ingested incorrectly into the DB. (I shared the corresponding post on my ktistec instance.)

  • The DB setting seems good to me however:
sqlite> PRAGMA encoding;
UTF-8
  • Posts that were received/ingested/shared before the update are displayed correctly in the DB as well as in the browser.
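
One way to tell whether text is already garbled in the database, rather than only on display, is to look at the stored bytes with SQLite's hex() function. The following Python sketch uses a throwaway in-memory table (the `demo` table and its `content` column are made up for illustration and are not ktistec's schema) and assumes the common failure mode in which UTF-8 bytes get decoded as Latin-1 somewhere before storage:

```python
import sqlite3

# Hypothetical demo table; NOT the real ktistec schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, content TEXT)")

good = "Südengland"
# Assumption: simulate the damage by decoding the UTF-8 bytes as Latin-1.
bad = good.encode("utf-8").decode("latin-1")

conn.executemany("INSERT INTO demo (content) VALUES (?)", [(good,), (bad,)])

# hex() shows the bytes SQLite actually stored for each TEXT value.
rows = conn.execute("SELECT content, hex(content) FROM demo ORDER BY id").fetchall()
for content, stored_hex in rows:
    print(repr(content), stored_hex)

# Correctly stored "ü" is the two bytes C3 BC.
# A double-encoded "ü" shows up as the four bytes C3 83 C2 BC.
```

If a row shows the longer byte sequence, the text was damaged before it was written, and PRAGMA encoding reporting UTF-8 does not rule that out.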

Am I the only one with that issue?

a few questions. is it all posts with special characters, or just some of them? can you tell if they are from a specific server? also, what version of sqlite are you running?

i was able to search for and fetch that post into a locally running instance and the characters looked okay (macos and firefox). and into epiktistes.com (linux and chrome). looking back through posts i see a decent number in German, and at least those look reasonable.

one possibility is to see if you can fetch posts (via Search) that have special characters. that would at least narrow down the problem to the inbox handling pipeline (vs. the outbox or just raw fetching)

the thing i'm momentarily hung up on is why your own posts aren't affected...

SQLite3 version 3.45.3
As far as I can see, all such characters since updating are deformed; at least, I haven't found any correctly displayed special characters in the "Everything" stream so far.

> all such characters since updating are deformed

except your own posts, correct?

and what happens if you find a post that your instance hasn't received and you search for it (which fetches it and adds it to your database)? alternatively, can you pick a hashtag and follow that hashtag?

what i'm interested in understanding is, is it only posts that another server pushes to your instance via ActivityPub that are affected, or is it every post regardless of how it is added (direct retrieval).

> except your own posts, correct?
yup

> what i'm interested in understanding is, is it only posts that another server pushes to your instance via ActivityPub that are affected, or is it every post regardless of how it is added (direct retrieval).

Sorry, now I understand. Yes, I pulled in a few posts that were not yet in the received set, and they show the same behaviour.

@felixkrohn any chance this could be the issue? crystal-lang/crystal#14803

it would explain why it happens to content coming in in ActivityPub format (json) but not your own content
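
To illustrate why only incoming JSON would be affected, here is a minimal Python sketch. It does not reproduce the linked Crystal issue; it only assumes, for illustration, that the bytes of a received payload are decoded with the wrong charset (Latin-1 instead of UTF-8) before the JSON is parsed. Locally written posts never pass through that decoding step, so they would stay intact:

```python
import json

# A stand-in for an ActivityPub payload as it arrives on the wire (UTF-8 bytes).
payload = json.dumps(
    {"content": "<p>Dinosaurier-Skelett in Südengland entdeckt</p>"},
    ensure_ascii=False,
).encode("utf-8")

# Decoding with the right charset keeps the umlaut.
correct = json.loads(payload.decode("utf-8"))["content"]

# Assumed failure mode: the same bytes decoded as Latin-1 before parsing.
garbled = json.loads(payload.decode("latin-1"))["content"]

print(correct)
print(garbled)
```

The garbled variant is still valid JSON and a valid string, which is why nothing fails loudly and the damaged text ends up in the database.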

Hey @toddsundsted, that was spot-on. I just rebuilt ktistec v2.0.0-11 using docker.io/crystallang/crystal:1.12-alpine instead of :latest, and the issue seems to be gone for new posts.

Do you want a PR for the Dockerfile, or do you think that will be resolved upstream soon?
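
Since the rebuild only helps new posts, content ingested by the faulty build presumably stays garbled. A possible repair is sketched below in Python, under the assumption that the damage is the classic "UTF-8 bytes decoded as Latin-1" pattern. The helper name `repair_mojibake` is hypothetical, and nothing here touches a real ktistec database:

```python
def repair_mojibake(text: str) -> str:
    """Try to undo 'UTF-8 decoded as Latin-1' damage; return text unchanged otherwise."""
    try:
        # Turn each character back into the byte it came from, then decode properly.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # the text was most likely fine (or damaged in some other way).
        return text


print(repair_mojibake("SÃ¼dengland"))  # damaged input
print(repair_mojibake("Südengland"))   # already correct input is left alone
```

This is a heuristic: text that legitimately contains a sequence such as "Ã¼" would be altered too, so it would be wise to test it on a copy of the data first.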

I can now confirm that using the newest crystal alpine image v1.13.1 fixes the bug:

-FROM crystallang/crystal:latest-alpine AS builder
+#FROM docker.io/crystallang/crystal:latest-alpine AS builder
+FROM docker.io/crystallang/crystal:1.13.1-alpine AS builder

(It has in the meantime also been tagged as latest, so it's not necessary to change the Dockerfile anymore as long as your build process makes sure to not use a different cached version.)

> I can now confirm that using the newest crystal alpine image v1.13.1 fixes the bug

great news! thanks for confirming this!