Zatvobor/tirexs

get unicode strings

markschultz opened this issue · 6 comments

When I do a get on a specific ID, some fields come back as unprintable yet valid binary strings. When I do a search for the same object, the same fields are printable. I've narrowed it down to fields that contain "’" (0x2019) or "—" (0x2014).

Could you provide a bit more detail? It would be nice to have a way to reproduce it (a few iex> lines would help; take a look at issue #224 for an example).

Thanks.

the elasticsearch object:

{
  "MeetingTitle": "Planner— A",
  "Id": 1
}

the method i'm using to search:

elasticquery = search [index: "v1", size: 1] do
  query do
    bool do
      filter do
        term "Id", "1" # in this case the object id is the same as the elasticsearch _id.
      end
    end
  end
end
results = Tirexs.Query.create_resource(elasticquery)
# inspect [:_source][:MeetingTitle], see printable, valid, string

method for get:

results2 = get("v1/meetings/1")
# inspect [:_source][:MeetingTitle], see unprintable, valid, string

I'm able to do the same thing with curl, and both strings appear to be the same, which leads me to believe the issue is not with elasticsearch or my data.

let me know if you need more details please.

here is what I have:

➜  tirexs git:(master) ✗ iex -S mix
Erlang/OTP 18 [erts-7.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Interactive Elixir (1.2.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Tirexs.HTTP.put("/index/type/1", [MeetingTitle: "Planner— A", Id: 1])
{:ok, 201,
 %{_id: "1", _index: "index", _shards: %{failed: 0, successful: 1, total: 2},
   _type: "type", _version: 1, created: true}}

iex(2)> {:ok, 200, %{_source: %{MeetingTitle: meeting_title}}} = Tirexs.HTTP.get("/index/type/1")
{:ok, 200,
 %{_id: "1", _index: "index",
   _source: %{Id: 1,
     MeetingTitle: <<80, 108, 97, 110, 110, 101, 114, 195, 162, 194, 128, 194, 148, 32, 65>>},
   _type: "type", _version: 1, found: true}}

iex(3)> String.valid?(meeting_title)
true

iex(36)> Kernel.is_bitstring(meeting_title)
true

Hope this is helpful for you :)

Yes, this is exactly what I'm seeing. If you try String.printable?(meeting_title), I think you'll get false. When I try to print that byte string I get "Planner� A". When I get that meeting via search (see the search above), inspecting the MeetingTitle in iex gives:

   _source: %{Id: 1,
     MeetingTitle: "Planner— A"}
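For anyone following along, here is a small sketch of why String.valid?/1 and String.printable?/1 disagree on the bytes returned by the get. It only uses the byte sequence pasted in the iex session above; the variable names are mine. The binary is well-formed UTF-8, but it decodes to U+00E2, U+0080, U+0094 where "—" (U+2014) should be, and those three numbers happen to be the UTF-8 bytes of "—".

```elixir
# Bytes copied from the Tirexs.HTTP.get/1 result earlier in this thread.
title = <<80, 108, 97, 110, 110, 101, 114, 195, 162, 194, 128, 194, 148, 32, 65>>

# The binary is well-formed UTF-8, so this is true.
valid = String.valid?(title)

# Decode to code points. The dash position holds 226, 128, 148
# (U+00E2, U+0080, U+0094) instead of the single code point 8212 (U+2014).
codepoints = :unicode.characters_to_list(title)

# U+0080 and U+0094 are C1 control characters, so this is false.
printable = String.printable?(title)

# The raw UTF-8 bytes of "—" are exactly those three numbers.
dash_bytes = :binary.bin_to_list("—")

IO.inspect({valid, printable})
IO.inspect(dash_bytes)
```

So each byte of the original UTF-8 string appears to have been treated as a separate character and encoded a second time.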

I see. I'll try to play with it. It looks like an elastic issue. Will ping you back.

Looks like a latin1/utf8 encoding issue? Maybe here:

JSX.decode!(to_string(json), opts)
or
:httpc.request(method, request, http_options, options)
?
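A sketch that seems to support the to_string suspicion, assuming json is the list-of-bytes body that :httpc returns by default (I have not confirmed that this is what reaches that line in tirexs). to_string/1 on a list treats every integer as a Unicode code point, so UTF-8 bytes above 127 get encoded a second time. The module and function names below are hypothetical helpers for illustration, not part of tirexs.

```elixir
defmodule EncodingSketch do
  # Hypothetical helpers for illustration only; not tirexs code.

  # Treats each integer in the list as a code point (what to_string/1 does).
  def as_codepoints(body) when is_list(body), do: to_string(body)

  # Treats each integer in the list as a raw byte, preserving the UTF-8.
  def as_bytes(body) when is_list(body), do: :erlang.list_to_binary(body)

  # Undoes the double encoding by mapping each code point back to one byte.
  # Only meaningful when every code point in the input is below 256.
  def repair(bin) when is_binary(bin),
    do: :unicode.characters_to_binary(bin, :utf8, :latin1)
end

# Simulate a response body delivered as a list of bytes.
body = :erlang.binary_to_list("Planner— A")

mangled = EncodingSketch.as_codepoints(body)
intact = EncodingSketch.as_bytes(body)
repaired = EncodingSketch.repair(mangled)

IO.inspect(mangled == intact)
IO.inspect(repaired == "Planner— A")
```

Here mangled comes out as the same bytes shown in the get result above, while intact is the original string. If that is the cause, converting the body with :erlang.list_to_binary/1 instead of to_string/1, or asking :httpc for a binary body via the body_format: :binary option, would be the places I'd look first.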