Does it really work with extended ASCII used for Cyrillic symbols?
dmikushin opened this issue · 6 comments
The Cyrillic support added by this project raises an interesting problem.
If the source files are saved in Unicode (which is usual for Linux), the '«' and '»' symbols are represented in UTF-8. Char codes for these symbols are outside of the 0..255 range of 1-byte char, which is strongly pointed out by Clang:
talk-llama-fast.cpp:1248:55: error: character too large for enclosing character literal type
1248 | text_heard = RemoveTrailingCharacters(text_heard, '«');
| ^
talk-llama-fast.cpp:1249:55: error: character too large for enclosing character literal type
1249 | text_heard = RemoveTrailingCharacters(text_heard, '»');
I think what you meant to have was actually the same symbols of extended ASCII table given by 0xab and 0xbb codes:
--- a/src/talk-llama-fast.cpp
+++ b/src/talk-llama-fast.cpp
@@ -1245,8 +1245,8 @@ int run(int argc, const char ** argv) {
text_heard = RemoveTrailingCharacters(text_heard, '!');
text_heard = RemoveTrailingCharacters(text_heard, ',');
text_heard = RemoveTrailingCharacters(text_heard, '.');
- text_heard = RemoveTrailingCharacters(text_heard, '«');
- text_heard = RemoveTrailingCharacters(text_heard, '»');
+ text_heard = RemoveTrailingCharacters(text_heard, 0xab); // '«'
+ text_heard = RemoveTrailingCharacters(text_heard, 0xbb); // '»'
if (text_heard[0] == '.') text_heard.erase(0, 1);
if (text_heard[0] == '!') text_heard.erase(0, 1);
trim(text_heard);
I'm afraid this is not all, yet. The other Cyrillic symbols in the source file also use Unicode. Therefore, they all are likely to be mishandled in the same way. On the other hand, does LLAMA really output extended ASCII, not Unicode or both?? This code clearly assumes that it is always extended ASCII.
Source files are in utf8 without bom as far as I know.
I am compiling using cmake. No errors for me.
Does llama output in utf8? I don't remember, will take a look later, but most likely - yes.
If the source files are in UTF-8, then text_heard = RemoveTrailingCharacters(text_heard, '«');
will never work as intended, because the Unicode symbol as a char literal does not have a valid representation. As I understand, UTF incorporates the ASCII range, that's why all common symbols like '!' or ',' will work this way, but not '«' and '»', because they are outside of the ASCII range.
These '«' and '»' symbols are called lapki (лапки) and are specific to Russian typographic standard.
Does llama output in utf8? I don't remember, will take a look later, but most likely - yes.
I don't know either, I'm new to LLAMA. Best is to ensure that UTF output is always guaranteed. If it is, RemoveTrailingCharacters
should accept std::string (not char) as the second arg, and remove trailing substrings.
Yes, llama outputs UTF-8.
Yes, RemoveTrailingCharacters was not working with '«' and '»'. I changed it now to RemoveTrailingCharactersUtf8.
Ugh, one line has been lost in translation:
text_heard = RemoveTrailingCharactersUtf8(text_heard, U"«");
Anyway, I cleaned this up on your behalf in this commit: dmikushin@d7f49fa
Kindly please explain what the rest of the patch is doing:
From ed7ea378177fcad6dfac11163b884dfc7d24c94c Mon Sep 17 00:00:00 2001
From: Mozer <mo3rnt@gmail.com>
Date: Sun, 14 Apr 2024 20:32:28 +0300
Subject: [PATCH] What is this patch about?
---
examples/talk-llama/talk-llama.cpp | 45 +++++++++++++++++++-----------
1 file changed, 28 insertions(+), 17 deletions(-)
diff --git a/examples/talk-llama/talk-llama.cpp b/examples/talk-llama/talk-llama.cpp
index a731ffcd8c..4adf461c2b 100644
--- a/examples/talk-llama/talk-llama.cpp
+++ b/examples/talk-llama/talk-llama.cpp
@@ -880,7 +892,7 @@ int run(int argc, const char ** argv) {
int reply_part = 0;
std::string text_to_speak_arr[150];
int reply_part_arr[150];
- bool last_output_has_username = false;
+ bool last_output_has_username = false;
if (whisper_params_parse(argc, argv, params) == false) {
return 1;
@@ -892,17 +904,17 @@ int run(int argc, const char ** argv) {
exit(0);
}
- const std::string fileName{params.xtts_control_path};
- std::ifstream readStream{fileName};
- if(!readStream.good()){
- printf("Warning: %s file not found, xtts wont stop on user speech without it\n", params.xtts_control_path.c_str());
- readStream.close();
- }
- else // control file is ok
- {
- readStream.close();
- allow_xtts_file(params.xtts_control_path, 1); // xtts can play
- }
+ //const std::string fileName{params.xtts_control_path};
+ //std::ifstream readStream{fileName};
+ //if(!readStream.good()){
+ // printf("Warning: %s file not found, xtts wont stop on user speech without it\n", params.xtts_control_path.c_str());
+ // readStream.close();
+ //}
+ //else // control file is ok
+ //{
+ // readStream.close();
+ // allow_xtts_file(params.xtts_control_path, 1); // xtts can play
+ //}
// whisper init
xtts_control_path is now shouldn't be passed in params. xtts_play_allowed.txt is now stored in OS /temp dir. The file check at start-up is no longer needed.
Thanks, landed in dmikushin@403a924