dstogov/php-ffi

Add unicode string

testAccountDeltas opened this issue · 14 comments

$libc = new FFI("int MessageBoxW(
uint32_t hWnd,
void * lpText,
void * lpCaption,
uint32_t uType);", "User32.dll");

$libc->MessageBoxW(0, 'Hello Test', 'Caption', 2);

Result: hieroglyphs, not English:

`---------------------------
慃瑰潩n�

效汬敔瑳

Прервать Повтор Пропустить

`

In short, I call the Unicode MessageBoxW and get hieroglyphs in the output.

Not having Unicode support is a very unpleasant situation.

For example, Unicode support is useful for working with the clipboard on Windows.

At the moment there is no way to use CF_UNICODETEXT.

Unicode text format. Each line ends with a carriage return/linefeed (CR-LF) combination. A null character signals the end of the data.
https://docs.microsoft.com/en-us/windows/desktop/dataxchg/standard-clipboard-formats
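
For readers following along: the garbled dialog text above is consistent with MessageBoxW reading the single-byte PHP strings as UTF-16 code units, two bytes at a time. Below is a minimal C sketch of that reinterpretation. The helper name bytes_as_utf16le is hypothetical, and little-endian byte order is assumed, as on Windows.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine consecutive byte pairs into little-endian 16-bit code units,
 * which is how a UTF-16 API would read a plain byte string.
 * Writes at most max_units values to out and returns how many were written.
 * A trailing odd byte is ignored in this sketch. */
static size_t bytes_as_utf16le(const char *bytes, size_t len,
                               uint16_t *out, size_t max_units)
{
    size_t n = 0;

    for (size_t i = 0; i + 1 < len && n < max_units; i += 2) {
        uint16_t lo = (unsigned char)bytes[i];
        uint16_t hi = (unsigned char)bytes[i + 1];
        out[n++] = (uint16_t)(lo | (hi << 8));
    }
    return n;
}
```

For the ten bytes of "Hello Test" this yields 0x6548, 0x6C6C, 0x206F, 0x6554, 0x7473, that is 效, 汬, a non-printing format character, 敔 and 瑳, which matches the message text shown in the dialog.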

Hey, I'm surprised this works on Windows at all. I didn't even try to build this on Windows.
I agree, we will need special support for Unicode strings.

@weltling any ideas how to fix this? In general, we could convert strings to Unicode if the argument is wchar_t*, but doing that for "void*" doesn't make sense. Maybe this should be handled in user space, using ext/mbstring or ext/intl functions?

@dstogov, yeah, most likely the conversion is a user space task. Cross-platform it would be mbstring; on Windows only, there's a conversion API used for UTF-8 path support. I'll check that and come back.

Thanks.
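
The user space conversion discussed here comes down to turning UTF-8 bytes into 16-bit code units before the call. Below is a minimal C sketch of that step, assuming well-formed UTF-8 input. utf8_to_utf16 is a hypothetical helper written for illustration; it is not the mbstring code or the Windows conversion API mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert a UTF-8 byte string to a NUL-terminated UTF-16 buffer.
 * Assumption: the input is well-formed UTF-8; malformed input is not detected.
 * The caller releases the result with free(). Returns NULL if allocation fails. */
static uint16_t *utf8_to_utf16(const char *in, size_t in_len, size_t *out_units)
{
    /* Each input byte yields at most one UTF-16 unit, plus the terminator. */
    uint16_t *out = malloc((in_len + 1) * sizeof *out);
    size_t i = 0, n = 0;

    if (out == NULL) {
        return NULL;
    }

    while (i < in_len) {
        unsigned char b = (unsigned char)in[i++];
        uint32_t cp;
        size_t extra;

        /* The lead byte tells how many continuation bytes follow. */
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }

        /* Fold in the continuation bytes, six payload bits each. */
        for (; extra > 0 && i < in_len; extra--) {
            cp = (cp << 6) | ((unsigned char)in[i++] & 0x3F);
        }

        if (cp >= 0x10000) {
            /* Code points beyond the BMP become a surrogate pair. */
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[n++] = (uint16_t)cp;
        }
    }

    out[n] = 0;
    if (out_units != NULL) {
        *out_units = n;
    }
    return out;
}
```

The returned buffer is what a uint16_t* parameter such as lpText would expect; production code would more likely rely on an existing, validating converter.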

I came up with this snippet after playing around a bit:

<?php

$php = FFI::cdef("uint16_t *php_win32_cp_conv_utf8_to_w(const char* in,
                        size_t in_len,
                        size_t *out_len);",
                "php7_debug.dll");

$user32 = FFI::cdef("int MessageBoxW(uint32_t hWnd,
                        uint16_t* lpText,
                        uint16_t* lpCaption,
                        uint32_t uType);",
                "User32.dll");

/* This should be size_t* with value ((size_t*)-1),
but ideally just the value could be passed. */
$ign = FFI::new("size_t");
$ign = 0;

$txt = "Kalte Füße";
$txt = $php->php_win32_cp_conv_utf8_to_w($txt, strlen($txt), FFI::addr($ign));
$cap = "Весёлые истории";
$cap = $php->php_win32_cp_conv_utf8_to_w($cap, strlen($cap), FFI::addr($ign));
$user32->MessageBoxW(0, $txt, $cap, 2);
unset($ign);
// free $txt and $cap ???

@testAccountDeltas be sure to save this as UTF-8. For Windows-only string conversion this might be useful: https://github.com/php/php-src/blob/master/win32/codepage.h#L64, it also allows conversions from/to any ANSI codepage. It doesn't look like wchar_t or C++11 char16_t is recognized, so using uint16_t is the only viable option and seems OK. With this, I don't see any issue using UNICODE APIs in this particular case. In general, it is the same as having any other pointer type in libffi. wchar_t support probably depends on libffi; some APIs might have an issue, but let's wait for those to show up :)

@dstogov two things did pop up while I was at it.

  • instead of $ign I'd actually pass ((size_t*)-1), but somehow I couldn't find a way to do that. Based on that, in other cases one might also be unable to pass limit values such as SIZE_MAX, etc.
  • the return value of php_win32_cp_conv_utf8_to_w should be explicitly freed, but I currently get a corrupted heap when trying to use FFI::free($txt). The README says these pointers shouldn't be owned, but stepping through ffi.c:3401 shows cdata->flags as zero. I should check more on this, as the C function uses normal malloc and the data itself is unlikely to be copied using ZMM.

Thanks.

@weltling

  • I didn't get what is wrong with ((size_t*)-1)
$f = FFI::cdef("int printf(const char *s, size_t x);");
$ign = FFI::new("size_t");
$ign = -1;
$f->printf("hello 0x%x!\n", $ign); // prints "hello 0xffffffff!"
  • good catch! Pointers returned by C functions are NOT owned (README is correct https://github.com/dstogov/php-ffi#owned-and-not-owned-cdata), but there is a mess with the PHP vs SYSTEM heap. Currently the pointer is allocated by malloc() and released by efree(), which leads to the problem. I'll think about how to improve this.
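
The heap mix-up described in the last point (memory obtained from malloc() but released through a different allocator) can be avoided by recording which heap a block came from and releasing it through the matching routine. A rough C sketch of that idea follows. Every name in it (heap_kind, tracked_ptr, engine_alloc, engine_free, system_free, tracked_release) is hypothetical, the "engine" heap is only a stand-in built on malloc(), and nothing here is taken from the actual php-ffi fix.

```c
#include <stddef.h>
#include <stdlib.h>

/* Which allocator produced a block. HEAP_ENGINE stands in for a
 * runtime-managed heap; it is hypothetical in this sketch. */
enum heap_kind { HEAP_SYSTEM, HEAP_ENGINE };

struct tracked_ptr {
    void *ptr;
    enum heap_kind heap;
};

/* Counters so the effect of each release path can be observed. */
static int system_frees = 0;
static int engine_frees = 0;

/* Stand-in engine allocator; a real one would manage its own heap. */
static void *engine_alloc(size_t n) { return malloc(n); }
static void engine_free(void *p) { engine_frees++; free(p); }

static void system_free(void *p) { system_frees++; free(p); }

/* Release a block with the allocator that created it, exactly once. */
static void tracked_release(struct tracked_ptr *t)
{
    if (t->ptr == NULL) {
        return;
    }
    if (t->heap == HEAP_SYSTEM) {
        system_free(t->ptr);
    } else {
        engine_free(t->ptr);
    }
    t->ptr = NULL;
}
```

A pointer returned by a C function would be tagged HEAP_SYSTEM here, so an explicit free goes to free() rather than to the engine's allocator.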

@dstogov thanks for checking. The issue with $ign = FFI::new("size_t"); in this case is that it's not a pointer. Consider a simpler hypothetical function:

char *gen_str(size_t *n) {
        size_t len;

        if (((size_t*)-1) == n) {
                len = 42;
        } else {
                len = *n;
        }

        return (char *)malloc(len*sizeof(char));
}

In plain C, calling gen_str(((size_t*)-1)) would pass the pointer 0xffffffffffffffff. But with a variable set up as $ign = FFI::new("size_t"); $ign = -1; it's still not a pointer. I've also tried adding $ign = FFI::cast("size_t*", $ign); which would actually work in plain C, but then another error comes up: Object of class FFI\CData could not be converted to int. Sure, it's a tricky case.

In the original case, wchar_t *php_win32_cp_conv_utf8_to_w(const char* in, size_t in_len, size_t *out_len), passing 0xffffffffffffffff as the last argument designates a special case, where the function is instructed not to set the output length. That allows creating shortcut macros like the one here: https://github.com/php/php-src/blob/master/win32/codepage.h#L66. That's the actual use case.
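
To make the sentinel pattern concrete, here is a small C sketch of a function with an optional output length plus a shortcut macro built on top of it. The names IGNORE_LEN_P, dup_bytes and dup_cstr are hypothetical and only mirror the shape of the use case described above; they are not the definitions from codepage.h.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel meaning "the caller does not want the length back".
 * It is compared by value and never dereferenced. */
#define IGNORE_LEN_P ((size_t *)-1)

/* Duplicate in_len bytes into a NUL-terminated buffer. The length is stored
 * in *out_len unless the caller passed the sentinel. NULL is not treated
 * specially in this sketch. The caller releases the result with free(). */
static char *dup_bytes(const char *in, size_t in_len, size_t *out_len)
{
    char *out = malloc(in_len + 1);

    if (out == NULL) {
        return NULL;
    }
    memcpy(out, in, in_len);
    out[in_len] = '\0';

    if (out_len != IGNORE_LEN_P) {
        *out_len = in_len;
    }
    return out;
}

/* Shortcut for the common case where only the buffer is needed. */
#define dup_cstr(s) dup_bytes((s), strlen(s), IGNORE_LEN_P)
```

Calling dup_cstr("hello") hands the function the sentinel pointer itself, which is exactly the value that a plain size_t variable holding -1 cannot express from the PHP side.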

Thanks.

@weltling, I got it. I'll check if FFI::cast() may be extended to support int->ptr conversion.

@weltling, The heap corruption problem should be fixed now.

@weltling, FFI::cast("size_t*", -1) should work now.

@dstogov thanks for the fixes. I can confirm FFI::free() works correctly now.

With FFI::cast(), the actual example above now works like this (partial snippet):

$ign = FFI::new("size_t");
$ign = -1;
$txt = $php->php_win32_cp_conv_utf8_to_w("hello", strlen("hello"), FFI::cast("size_t*", $ign));

but otherwise, please consider also the notes below for possible corrections.

One issue with a missing return here https://github.com/dstogov/php-ffi/blob/master/ffi.c#L3448 causes a snippet like this var_dump(FFI::cast("size_t*", -1)); to crash.

Seems there's also an issue with getting properties, e.g. this snippet works as expected

$ign = FFI::new("size_t");
$ign = 0;
var_dump($ign, FFI::cast("size_t*", $ign));

but as soon as the value is -1, there's a crash

$ign = FFI::new("size_t");
$ign = -1;
var_dump($ign, FFI::cast("size_t*", $ign));

Yet another case, which might be considered semantically correct in PHP, but is invalid in plain C

$ign = FFI::new("size_t");
$ign = 42;
$ign = FFI::cast("size_t*", $ign);

In PHP the RHS expression can actually be of an arbitrary type. Just FYI, in case this should be supported too. This currently still throws the error Object of class FFI\CData could not be converted to int.

Thanks.

@weltling

One issue with a missing return here https://github.com/dstogov/php-ffi/blob/master/ffi.c#L3448 causes a snippet like this var_dump(FFI::cast("size_t*", -1)); to crash.

This is a var_dump() problem that tries to dereference the invalid pointer. I'm not sure what can be done with this.

Seems there's also an issue with getting properties, e.g. this snippet works as expected

$ign = FFI::new("size_t");
$ign = 0;
var_dump($ign, FFI::cast("size_t*", $ign));

NULL is a special pointer value. var_dump() doesn't try to dereference it, so it works.

but as soon as the value is -1, there's a crash

$ign = FFI::new("size_t");
$ign = -1;
var_dump($ign, FFI::cast("size_t*", $ign));

Right, this is the same var_dump() problem.

Yet another case, which might be considered semantically correct in PHP, but is invalid in plain C

$ign = FFI::new("size_t");
$ign = 42;
$ign = FFI::cast("size_t*", $ign);

In PHP the RHS expression can actually be of an arbitrary type. Just FYI, in case this should be supported too. This currently still throws the error Object of class FFI\CData could not be converted to int.

The exception is thrown when you assign "size_t*" to $ign of type "size_t". The message might be more specific, but the error seems correct to me.

Thanks.

@dstogov yep, the last case is invalid in C. It's probably OK to leave it as is in FFI, too. Otherwise it would be too much of a complication.

As far as I'm concerned, this issue is resolved. UNICODE turned out to be usable anyway, and the other questions are clarified as well.

Thanks.