HOST-Oman/scribus

Default_Ignorable_Code_Point in spell checking and hyphenation?

Closed this issue · 19 comments

Some characters in Unicode have the property Default_Ignorable_Code_Point. This includes characters like the soft hyphen U+00AD and the ZWNJ zero width non-joiner U+200C. Most format characters have this property. With CTL and OpenType support available in Scribus, the usage of these characters will increase.

The question is how to treat these characters in spell checking and hyphenation.

Spell checking

Currently, the spell checking code seems to read words that contain a Default_Ignorable_Code_Point, but it applies the correction only to the characters before the first Default_Ignorable_Code_Point. As a result, the second part of the word is duplicated. This is always wrong.

Possible solutions:

  • It might be possible to ignore all words that contain a Default_Ignorable_Code_Point. Disadvantage: this is not what the user would expect.
  • A better solution might be to strip all Default_Ignorable_Code_Point characters before passing a word to the spell checking engine. To avoid losing these characters when the user accepts a correction proposal, all correction proposals should be hidden when the original word contained a Default_Ignorable_Code_Point.
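
The second option could be sketched roughly as follows. This is a minimal Python illustration, not Scribus code: `IGNORABLE_RANGES` (only a few entries of the Unicode list), `strip_ignorables`, and `check_word` are hypothetical names, and a plain set stands in for the spell checking engine.

```python
# Hypothetical sketch: strip Default_Ignorable_Code_Point characters
# before handing a word to a spell checker. Not Scribus code.

# Illustrative subset of the property (assumption: a real implementation
# would use the complete Unicode list).
IGNORABLE_RANGES = [
    (0x00AD, 0x00AD),  # SOFT HYPHEN
    (0x200B, 0x200F),  # ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK (includes ZWNJ)
    (0x2060, 0x2064),  # WORD JOINER..INVISIBLE PLUS
]


def is_ignorable(ch):
    """Return True if ch falls into one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in IGNORABLE_RANGES)


def strip_ignorables(word):
    """Return word without ignorable characters."""
    return "".join(ch for ch in word if not is_ignorable(ch))


def check_word(word, dictionary):
    """Return (is_correct, may_offer_suggestions).

    Suggestions are withheld when the original word contained ignorable
    characters, so that accepting one cannot silently drop them.
    """
    stripped = strip_ignorables(word)
    return stripped in dictionary, stripped == word
```

With this sketch, `check_word("Auf\u200clage", {"Auflage"})` reports the word as correct but does not allow suggestions, because the original contained a ZWNJ.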

Hyphenation

Currently, the automatic hyphenation does not work as expected when characters like ZWNJ are present. Example: The German word “Auflage” should be hyphenated “Auf-la-ge”, and this works correctly in Scribus when no ZWNJ is there. But when there is a ZWNJ between “Auf” and “lage” then the first hyphenation point is not found. It’s only “Aufla-ge”.

Possible solutions:

  • Do nothing. (At least the current situation does not cause much harm.)
  • Support Default_Ignorable_Code_Point characters as expected. This might add overhead (mapping of character positions) but might be the best solution.
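
The position mapping mentioned in the second option could look roughly like this. It is a minimal Python sketch under stated assumptions, not Scribus code: `strip_with_map`, `hyphenation_points`, and `toy_hyphenate` are hypothetical names, and `toy_hyphenate` is a stand-in for the real hyphenation engine that only knows the example word.

```python
# Hypothetical sketch of position mapping for hyphenation. Not Scribus code.

IGNORABLES = {"\u00ad", "\u200c"}  # SOFT HYPHEN, ZWNJ (illustrative subset)


def strip_with_map(word):
    """Return (stripped, index_map), where index_map[i] is the index in
    word of the i-th character of stripped."""
    stripped = []
    index_map = []
    for i, ch in enumerate(word):
        if ch not in IGNORABLES:
            stripped.append(ch)
            index_map.append(i)
    return "".join(stripped), index_map


def hyphenation_points(word, hyphenate):
    """Map break positions found in the stripped word back to word.

    hyphenate takes a string and returns indices k meaning
    "a break is allowed before character k".
    """
    stripped, index_map = strip_with_map(word)
    return [index_map[k] for k in hyphenate(stripped)]


def toy_hyphenate(text):
    """Toy engine: knows only "Auflage" -> "Auf-la-ge"."""
    return [3, 5] if text == "Auflage" else []
```

For "Auf" + ZWNJ + "lage" this yields break positions before "l" and before "g" in the original string, so the first hyphenation point is no longer lost. In this sketch the break lands after the ZWNJ; whether it should go before or after an ignorable character is a design decision the sketch does not settle.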

Hi @sommerluk, could you please provide a sample file for the problem?

The spell checking part should be fixed by 56b2521 & 7db0ad3. @sommerluk please test.

@sommerluk if things are working fine for you, then we are done here, because I prefer the first solution until somebody is brave enough to implement #145.

I am closing this now. If you have a problem, please file a new bug report.

Sorry for answering late. I did not forget it, but building scribus-ctl from source did not work for me and I could not figure out how to run the AppImage (openSUSE Leap within VirtualBox) either:

realPath called with a relative path './share/pixmaps/', please fix
realPath called with a relative path './share/icons/', please fix
pci id for fd 11: 80ee:beef, driver (null)
libGL error: core dri or dri2 extension not found
libGL error: failed to load driver: vboxvideo
pathForIcon: Unable to load icon ././/share/scribus/icons/1_5_1/AppIcon.png: File not found
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
ImportError: No module named site
Scribus Crash
-------------
Scribus crashes due to Signal #11
Speicherzugriffsfehler

So, I could not actually test it. If there is an easy way to get this working, I can do some testing.

However, two observations about the patch:

  1. The list of the two characters (ZWNJ and SOFT HYPHEN) is duplicated at two different places. This might be dangerous because in the future, somebody who does not know that these two lists must be kept synchronized, could change only one of these two lists while leaving the other one unchanged, and this would lead to unexpected results.
  2. The list contains only two characters. I’m confident that this is enough for German typesetting (at least 99.9% of cases) and probably for all Latin scripts. I’m not so sure about other scripts.
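
The first point could be addressed by keeping the list in exactly one place. A minimal Python sketch, assuming two call sites comparable to those in the patch; all names are hypothetical and this is not the actual Scribus code:

```python
# Hypothetical sketch: one shared definition used by every call site.
# Not the actual Scribus patch.

# The only place where the character list lives.
IGNORABLE_CHARS = frozenset({"\u00ad", "\u200c"})  # SOFT HYPHEN, ZWNJ


def is_ignorable(ch):
    """Single predicate that all callers share."""
    return ch in IGNORABLE_CHARS


def word_for_checking(word):
    """Call site 1: the text passed to the spell checker."""
    return "".join(ch for ch in word if not is_ignorable(ch))


def count_ignorables(word):
    """Call site 2: how many characters of word were skipped."""
    return sum(1 for ch in word if is_ignorable(ch))
```

Because both functions go through `is_ignorable`, extending `IGNORABLE_CHARS` changes both call sites at once and they cannot drift apart.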

The complete list of Default_Ignorable_Code_Point in Unicode 9 from http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt is:

# ================================================

# Derived Property: Default_Ignorable_Code_Point
#  Generated from
#    Other_Default_Ignorable_Code_Point
#  + Cf (Format characters)
#  + Variation_Selector
#  - White_Space
#  - FFF9..FFFB (Annotation Characters)
#  - 0600..0605, 06DD, 070F, 08E2, 110BD (exceptional Cf characters that should be visible)

00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
034F          ; Default_Ignorable_Code_Point # Mn       COMBINING GRAPHEME JOINER
061C          ; Default_Ignorable_Code_Point # Cf       ARABIC LETTER MARK
115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
17B4..17B5    ; Default_Ignorable_Code_Point # Mn   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA
180B..180D    ; Default_Ignorable_Code_Point # Mn   [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
180E          ; Default_Ignorable_Code_Point # Cf       MONGOLIAN VOWEL SEPARATOR
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
202A..202E    ; Default_Ignorable_Code_Point # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064    ; Default_Ignorable_Code_Point # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>
2066..206F    ; Default_Ignorable_Code_Point # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER
FE00..FE0F    ; Default_Ignorable_Code_Point # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
FEFF          ; Default_Ignorable_Code_Point # Cf       ZERO WIDTH NO-BREAK SPACE
FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER
FFF0..FFF8    ; Default_Ignorable_Code_Point # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
1BCA0..1BCA3  ; Default_Ignorable_Code_Point # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A  ; Default_Ignorable_Code_Point # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000         ; Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Default_Ignorable_Code_Point # Cf       LANGUAGE TAG
E0002..E001F  ; Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Default_Ignorable_Code_Point # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
E0100..E01EF  ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
E01F0..E0FFF  ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

# Total code points: 4173

# ================================================

I do not have the knowledge to tell, but maybe ZWJ and the script specific characters (Arabic, Hangul, Khmer, Mongolian) might be interesting. Would it be overkill to simply use the whole list?
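
If the whole list were used, it could be read from the data file instead of being typed by hand. A minimal Python sketch that parses lines in the format shown above; `parse_ranges` and `is_default_ignorable` are hypothetical names, and `SAMPLE` contains only three of the lines from the list:

```python
import re

# Matches data lines such as
#   "200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ..."
# Comment lines starting with "#" do not match.
LINE_RE = re.compile(
    r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*Default_Ignorable_Code_Point\b"
)


def parse_ranges(text):
    """Return a sorted list of (first, last) code point pairs."""
    ranges = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            first = int(match.group(1), 16)
            last = int(match.group(2), 16) if match.group(2) else first
            ranges.append((first, last))
    return sorted(ranges)


def is_default_ignorable(ch, ranges):
    """Return True if ch lies in one of the parsed ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ranges)


# Three lines copied from the list above, for illustration only.
SAMPLE = """\
# Derived Property: Default_Ignorable_Code_Point
00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
E0100..E01EF  ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
"""
```

Storing ranges rather than single code points keeps the table small even though the full property covers several thousand code points.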

@sommerluk could you try this solution for the AppImage?

I reopen this for now.

It does not work. But the error message has changed:

:~> ./scribus-git217b3eb-glibc2.14-x86-64.appimage
realPath called with a relative path './share/pixmaps/', please fix
realPath called with a relative path './share/icons/', please fix
pci id for fd 11: 80ee:beef, driver (null)
libGL error: core dri or dri2 extension not found
libGL error: failed to load driver: vboxvideo
pathForIcon: Unable to load icon ././/share/scribus/icons/1_5_1/AppIcon.png: File not found
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site.py", line 564, in <module>
    main()
  File "/usr/lib64/python2.7/site.py", line 546, in main
    known_paths = addusersitepackages(known_paths)
  File "/usr/lib64/python2.7/site.py", line 276, in addusersitepackages
    user_site = getusersitepackages(kind)
  File "/usr/lib64/python2.7/site.py", line 244, in getusersitepackages
    user_base = getuserbase() # this will also set USER_BASE
  File "/usr/lib64/python2.7/site.py", line 230, in getuserbase
    from sysconfig import get_config_var
  File "/usr/lib64/python2.7/sysconfig.py", line 10, in <module>
    'stdlib': '{base}/'+sys.lib+'/python{py_version_short}',
AttributeError: 'module' object has no attribute 'lib'
Scribus Crash
-------------
Scribus crashes due to Signal #11
Speicherzugriffsfehler

@probonopd any help with the above problem?

@sommerluk what is the problem with building the CTL branch?

Looks like it is having trouble finding a path.

A real solution might be to change the upstream code to be fully relocatable, i.e., never use absolute paths that are fixed at compile time. See https://github.com/limbahq/binreloc for more information.

@sommerluk what is the problem with building the CTL branch?

cmake .
-- Shared Library Flags: 
-- Scribus 1.5.3.svn will be built and installed into /usr/local
-- Machine: x86_64-suse-linux, void pointer size: 8
-- Found target X86_64
-- Building for target x86_64-suse-linux
-- Using standard ApplicationDataDir. You can change it with -DAPPLICATION_DATA_DIR
-- ----- USE QT 5-----
-- ----- USE QT Widgets-----
-- ----- USE Qt5Gui -----
-- ----- USE QT 5 XML -----
-- ----- USE Qt5Network -----
-- ----- USE Qt5OpenGL -----
-- ----- USE Qt5LinguistTools -----
-- ----- USE Qt5Quick -----
-- ----- USE Qt5PrintSupport -----
-- Qt VERSION: 5.5.1
ZLIB Library Found OK
No OSG found, building without 3D Extension
JPEG Library Found OK
TIFF Library Found OK
Python Library Found OK
-- FreeType2 Library Found OK
CAIRO Library Found OK
CUPS Library Found OK
LIBXML2 Library Found OK
LCMS 2 ReleaseLibrary: /usr/lib64/liblcms2.so
LCMS 2 Debug Library: LCMS2_LIBRARY_DEBUG-NOTFOUND
LCMS 2 Library: /usr/lib64/liblcms2.so
LittleCMS-2 Library Found OK
FontConfig Found OK
-- Could NOT find HUNSPELL (missing:  HUNSPELL_LIBRARIES HUNSPELL_INCLUDE_DIR) 
Hunspell or its developer libraries NOT found - Disabling support for spell checking
PoDoFo NOT found - Disabling support for PDF embedded in AI
-- Boost version: 1.54.0
Boost Library Found OK
Building without GraphicksMagick (use -DWANT_GRAPHICSMAGICK=1 to enable)
-- Found poppler
-- Found poppler libs: /usr/lib64/libpoppler.so
-- Found poppler includes: /usr/include/poppler
-- checking for module 'librevenge-0.0'
--   package 'librevenge-0.0' not found
RPATH: lib/scribus/plugins/;
-- Qt5::CoreQt5::WidgetsQt5::GuiQt5::XmlQt5::NetworkQt5::OpenGL/usr/lib64/libxml2.so/usr/lib64/libz.so
-- checking for module 'libwpg-0.2'
--   package 'libwpg-0.2' not found
-- checking for module 'libmspub-0.0<=0.1'
--   package 'libmspub-0.0<=0.1' not found
-- checking for module 'libwpg-0.2'
--   package 'libwpg-0.2' not found
-- Building with Scripter 1
-- No source header files will be installed
-- /home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/resources/translations
-- The following GUI languages will be installed: 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy

works fine.

But then:

> make
[  0%] Built target scribus_zip_lib
[  1%] Built target scribus_colormgmt_lib
[  2%] Built target scribus_desaxe_lib
[  2%] Built target scribus_fonts_lib
[  3%] Built target scribus_styles_lib
[  3%] Building CXX object scribus/text/CMakeFiles/scribus_text_lib.dir/index.cpp.o
/home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/scribus/text/index.cpp: In member function ‘uint RunIndex::search(int) const’:
/home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/scribus/text/index.cpp:17:66: error: ‘const class std::vector<unsigned int>’ has no member named ‘cbegin’
  std::vector<uint>::const_iterator it = std::upper_bound(runEnds.cbegin(), runEnds.cend(), pos);
                                                                  ^
/home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/scribus/text/index.cpp:17:84: error: ‘const class std::vector<unsigned int>’ has no member named ‘cend’
  std::vector<uint>::const_iterator it = std::upper_bound(runEnds.cbegin(), runEnds.cend(), pos);
                                                                                    ^
scribus/text/CMakeFiles/scribus_text_lib.dir/build.make:69: recipe for target 'scribus/text/CMakeFiles/scribus_text_lib.dir/index.cpp.o' failed
make[2]: *** [scribus/text/CMakeFiles/scribus_text_lib.dir/index.cpp.o] Error 1
CMakeFiles/Makefile2:448: recipe for target 'scribus/text/CMakeFiles/scribus_text_lib.dir/all' failed
make[1]: *** [scribus/text/CMakeFiles/scribus_text_lib.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

@sommerluk it seems that you are using a compiler that does not enable C++11 by default. Add CXX='g++ -std=c++11' before the cmake command, as follows:
CXX='g++ -std=c++11' cmake . -DCMAKE_INSTALL_PREFIX=/usr -DWANT_DEBUG=1

Or use Qt 5.7, which requires C++11.

@Fahad-Alsaidi Thanks. Compiling works now.

Tested the commit. Spell checking works fine for German test cases.

Nevertheless, I would like to hear what you think about the two points that I mentioned in a previous comment:

  1. The list of the two characters (ZWNJ and SOFT HYPHEN) is duplicated at two different places. This might be dangerous because in the future, somebody who does not know that these two lists must be kept synchronized, could change only one of these two lists while leaving the other one unchanged, and this would lead to unexpected results.

  2. The list contains only two characters. I’m confident that this is enough for German typesetting (at least 99.9% of cases) and probably for all Latin scripts. I’m not so sure about other scripts.

@sommerluk I agree with you. Please look at 930000e; if it looks fine, please close this bug.

930000e looks good to me. I’ve compiled it and tested it, and it works fine.

Thanks a lot for the effort!

Closing this issue. Later I’ll open a new one for the hyphenation part only…