Emacs相关中文问题以及解决方案

本文尝试收集整理 Emacs 使用过程中的中文相关问题以及解决方案. 欢迎 Emacs 爱好者加 hick 微信 3715397 邀请进只聊 Emacs 话题的微信大群.

changelog :

2019-02-01 增加中文相关网站和聚集地 emacs-china
2018-02-25 Writing GNU Emacs Extensions 中文翻译
2016-09-09 增加清华和 Emacs-China 的 elpa 镜像
2016-01-01 安装包镜像/安装源**大陆服务器
2015-12-27 文件编码问题整理

Emacs 中文基础

Emacs 的各种字符集

Emacs 中的文件编码问题跟其他编辑器不大一样, M-x describe-current-coding-system 能查看 Emacs 当前编码设置的情况, M-x describe-coding-system 能看到所有可选择编码, 除了常见的 utf-8 这样的, dos/unix 这样的换行符也作为了编码的一部分. 作为比较通用的方式, 一般建议没有特殊需要都使用 utf-8 编码, unix 还是 dos 看个人需要.

默认配置下, mode line 最左会显示当前文件编码以及换行符格式, 比如 U(Unix) 标示 UTF-8 编码, UNIX 格式换行符,鼠标点击可以查看更多详情.

Unicad 是一个自动检测文件编码的 mode , 有看到反馈说有时候不能成功检测: Unicad is short for Universal Charset Auto Detector. It is an Emacs-Lisp port of Mozilla Universal Charset Detector.

不同版本的 Emacs 可能设置方法不一样, 推荐下面的设置:

(set-language-environment "UTF-8")
(set-default-coding-systems 'utf-8)
(set-buffer-file-coding-system 'utf-8-unix)
(set-clipboard-coding-system 'utf-8-unix)
(set-file-name-coding-system 'utf-8-unix)
(set-keyboard-coding-system 'utf-8-unix)
(set-next-selection-coding-system 'utf-8-unix)
(set-selection-coding-system 'utf-8-unix)
(set-terminal-coding-system 'utf-8-unix)
(setq locale-coding-system 'utf-8)
(prefer-coding-system 'utf-8)

如果一个非 UTF-8 编码, 比如 GBK 编码的文件打开, 可能 Emacs 会乱码, 这时候 M-x revert-buffer-with-coding-system 选择 gbk 即可.

MS-Windows 环境的 UTF-8 配置

UTF-8 编码在 MS-Windows 环境下是「二等公民」, 特别是与命令行工具进行交互作时, 如果设置了使用 UTF-8 编码会导致 Windows 自带程序输出乱码. Emacs的默认设置已经尝试对此类情况进行了处理, 下面的代码在默认设置的基础上设置 cmdproxy.exe 使用GBK编码, 进一步减少设置 UTF-8 编码的副作用:

 (when (eq system-type 'windows-nt)
   (set-default 'process-coding-system-alist
     '(("[pP][lL][iI][nN][kK]" gbk-dos . gbk-dos)
	("[cC][mM][dD][pP][rR][oO][xX][yY]" gbk-dos . gbk-dos))))

参考 issue#4 jacobsun 的分享, 有可能需要如下设置, 否则可能出现乱码, 暂时未获知具体版本等相关信息, hick 个人的 windows 下设置的只有下面说的一行:

;; jacobsun 分享
(when (eq system-type 'windows-nt)
  (set-next-selection-coding-system 'utf-16-le)
  (set-selection-coding-system 'utf-16-le)
  (set-clipboard-coding-system 'utf-16-le))
;; hick 个人的设置:
(set-default-coding-systems 'utf-8)

中文断行

如果开启了 auto-fill 或者使用 refill-mode 等, Emacs 会根据 fill-column 的大小,在合适的位置插入换行符. 这时候再导出成 html 等格式, 可能导致被自动断行的中文之间多一个空格(中文和英文以及英文与英文之间的空格还是自然的), 一种建议是关掉或者不是用自动断行, 采用 truncate-lines 的方式让 Emacs 自动根据 window 大小处理成显示层的断行.

另外一种比较常见的做法是在导出的时候, 通过导出的过滤器机制, 导出之前去掉中文之间的换行的模式, 具体参考有 zwz 和 tumashu 两种类似的实现:

;;; 下面一段是 Feng Shu 的
(defun eh-org-clean-space (text backend info)
  "在export为HTML时,删除中文之间不必要的空格"
  (when (org-export-derived-backend-p backend 'html)
    (let ((regexp "[[:multibyte:]]")
          (string text))
      ;; org默认将一个换行符转换为空格,但中文不需要这个空格,删除.
      (setq string
            (replace-regexp-in-string
             (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
             "\\1\\2" string))
      ;; 删除粗体之前的空格
      (setq string
            (replace-regexp-in-string
             (format "\\(%s\\) +\\(<\\)" regexp)
             "\\1\\2" string))
      ;; 删除粗体之后的空格
      (setq string
            (replace-regexp-in-string
             (format "\\(>\\) +\\(%s\\)" regexp)
             "\\1\\2" string))
      string)))
(add-to-list 'org-export-filter-paragraph-functions
             'eh-org-clean-space)


;;; 下面一段是 zwz 的, 作者声明只适应 org-mode 8.0 以及以上版本
(defun clear-single-linebreak-in-cjk-string (string)
  "clear single line-break between cjk characters that is usually soft line-breaks"
  (let* ((regexp "\\([\u4E00-\u9FA5]\\)\n\\([\u4E00-\u9FA5]\\)")
         (start (string-match regexp string)))
    (while start
      (setq string (replace-match "\\1\\2" nil nil string)
            start (string-match regexp string start))))
  string)

(defun ox-html-clear-single-linebreak-for-cjk (string backend info)
  (when (org-export-derived-backend-p backend 'html)
    (clear-single-linebreak-in-cjk-string string)))

(add-to-list 'org-export-filter-final-output-functions
             'ox-html-clear-single-linebreak-for-cjk)

中文字体设置

为了保证显示效果, 一般使用中英文等宽字体(一个中文字显示宽度等于俩个英文字母显示宽度), 推荐字体:

文泉驿等宽微米黑优化版.ttf
雅黑mono.ttf
DroidSansFallback.ttf

以上字体都包含中文字体, 可以点百度网盘下载.

日历设置

Emacs 中有日历, 而且可以称之为一个系统, 因为其中除了最常用的日历之外, 还有其他的近十种历法, 其中有日记、约会提醒、纪念日提示以及节假日提示等等. 其中的历法包括**的农历、希伯来历、伊斯兰历、法国革命历、中美玛雅历等等,可以根据经纬度告知你的所在的每天日出日落的时间等等.

Emacs 自带 calc-china.el , 以下为设置中文里的 ‘celestial-stem’ (天干) 和 ‘terrestrial-branch’ (地支):

(setq chinese-calendar-celestial-stem
          ["甲" "乙" "丙" "丁" "戊" "己" "庚" "辛" "壬" "癸"]
          chinese-calendar-terrestrial-branch
          ["子" "丑" "寅" "卯" "辰" "巳" "午" "未" "申" "酉" "戌" "亥"])

设置阳历节日和阴历节日(参考 fog_proxy ):

;;; 补充用法: holiday-float m w n 浮动阳历节日, m 月的第 n 个星期 w%7
(setq general-holidays '((holiday-fixed 1 1   "元旦")
                         (holiday-fixed 2 14  "情人节")
                         (holiday-fixed 4 1   "愚人节")
                         (holiday-fixed 12 25 "圣诞节")
                         (holiday-fixed 10 1  "国庆节")
                         (holiday-float 5 0 2 "母亲节")   ;5月的第二个星期天
                         (holiday-float 6 0 3 "父亲节")
                         ))
(setq local-holidays '((holiday-chinese 1 15  "元宵节 (正月十五)")
                       (holiday-chinese 5 5   "端午节 (五月初五)")
                       (holiday-chinese 9 9   "重阳节 (九月初九)")
                       (holiday-chinese 8 15  "中秋节 (八月十五)")
                       ;; 生日
                       (birthday-fixed 9 28  "爸爸生日(1950)")
                       (birthday-fixed 10 1  "妈妈生日(1953)")
                       (holiday-chinese 5 29 "老婆生日")           ;阴历生日

                       (holiday-lunar 1 1 "春节" 0)
                       ))

另外一种中文阴历节日的 holiday-lunar 的写法参考自: 在emacs Calendar中定制**农历节日

更强大的中文日历工具:

中文输入

外部中文输入法

个人用搜狗中文输入法的还可以

内置的输入法

默认情况下 toggle-input-method 命令切换输入法.

输入法相关 mode

eim-py: An Emacs Input Method extension for smart pinyin
Emacs 中使用五笔输入法: Chinese Wubi (五笔) input method for Emacs based on quail package.
chinese-pyim chinese-pyim是从eim拼音输入法进化来的, 个人感觉比eim拼音输入法好用
Make fcitx better in Emacs.
chinese-remote-input 在emacs中, 通过智能手机输入法（比如：android语音输入法）远程输入中文.
scel2pyim 一个个将搜狗输入法 scel 细胞词库转换为 chinese-pyim 文本词库的小工具.
Gat, Chinese Input Method, works in Emacs

org 的中文问题

table 中英文对齐等

因为 Emacs 处理字体的方式的问题, 即使设置字体为等宽字体(一个中文相当于两个英文宽度), org 中的 table 出现中文经常都无法工整的对齐. 需要分别对中英文字体设置合适的大小. 处理该问题有现成的方案: https://github.com/tumashu/chinese-fonts-setup . 其中默认定义了各个系统平台常见的字体以及中英文字体搭配, 使得 org table 里的出现中文也能很好的对齐. 如果安装好以后显示的字体过大, 可以通过 cfs-increase-fontsize/cfs-decrease-fontsize 调整选择合适的大小.

更多参考资料:

狠狠地折腾了一把Emacs中文字体 BY BAO HAOJUN
折腾 Emacs BY zhuoqiang

org 导出中文 html 的空格问题

严格来说跟 org 没什么关系, 参见上文的中文断行

org 以及 LaTex 等导出中文 pdf

导出中文也分直接转 LaTeX 再转 pdf 以及先转 html 再转 pdf 等各种方式, 中间方案的可以参考这个中文支持不错的pdf工具rst2pdf

arthur@微信群分享的 TeX 解决方案, 用 XeTeX 或者 xetex-tutorial .

中文用户聚集地

Emacs 爱好者微信群

分一个主群和一个闲聊群, 截止 2019-02-01 分别 400 多人和 200 多人.

微信加 3715397 注明 Emacs 可以邀请入群. 特别注意主群约定之聊 Emacs 相关话题, 无关话题转闲聊群.

Emacs-China

主要由子龙山人创建的 , Emacs 讨论社区创建以后比较活跃, 讨论比较有序.

水木社区 Emacs 版

源自水木清华的 telnet BBS 的社区, 目前同时有 web 版和 telnet 版

telegram emacs_zh ***

2020-03 左右上去看过也还比较活跃 telegram emacs_zh

其他中文相关工具

Writing GNU Emacs Extensions 中文翻译

微信群 Emacsist 技术群 slegetank 翻译的 Writing GNU Emacs Extensions 中文版

安装包镜像/安装源**大陆服务器

由于大陆地区特殊的网络条件, 直连国外的 ELPA 服务器可能特别慢甚至有时候安装不成功, 有以下几个国内镜像推荐大家使用.

“清华大学 TUNA 协会原名清华大学学生网管会，注册名清华大学学生网络与开源软件协会” 搭建的镜像, 因为有清华的网络后盾, 比较推荐这个, 其中也包括 marmalade 和 org 等几个其他 Emacs 包的镜像:

https://mirrors.tuna.tsinghua.edu.cn/elpa/

国内的 Emacs 爱好者搭建的 Emacs-China (作为 Emacs 专属社区也是一个很好的地方) 的镜像也跟上面的类似, 还包括简单的使用方法说明:

http://elpa.emacs-china.org/

@aborn 有搭建的镜像, ELPA 的 EmacsWiki 上也有相关说明 :

(add-to-list 'package-archives
          '("aborn" . "http://elpa.popkit.org/packages/"))

顺带提一句, 如果 M-x package-install 出现找不到包的 url , 可能是本地缓存的包地址已经升级变换, M-x package-refresh-contents 可能就可以了.

ace-pinyin

https://github.com/cute-jumper/ace-pinyin

Jump to Chinese characters using ace-jump-char-mode or avy-goto-char : input the first letter of the pinyin of the Chinese character, then use ace-jump-char-mode or avy-goto-char to jump to it.

cedict

https://github.com/danmey/cedict.el

Emacs interface to Chinese-English dictionary in CEDICT format.

chinese-word-at-point

https://github.com/xuchunyang/chinese-word-at-point.el

Get (most likely) Chinese word under the cursor in Emacs

中文分词跟英文可以时候完全不是一回事, 徐春阳同学弄的这个, 依赖外部分词的命令行: 可以用结巴分词或者 SCWS (简易中文分词系统).

chinese-fonts-setup

原地址 https://github.com/tumashu/chinese-fonts-setup 已经更新为 https://github.com/tumashu/cnfonts

emacs中文字体配置工具. 可以快速方便的的实现中文字体和英文字体等宽（也就是常说的中英文对齐）

chinese-pyim-bigdict

https://github.com/tumashu/chinese-pyim-bigdict

这个文件是一个 Chinese-pyim 拼音词库文件, 词量超过100万, 词库大于20M, 这个词库仅供个人使用.

Chinese in Spacemacs

子龙山人给 Spacemacs 贡献了一个中文 layer

另外还有 et2010 也有一个稍有差别的中文处理 lay: https://github.com/et2010/Chinese

emacs-hz2py

https://github.com/kawabata/emacs-hz2py

Hanzi to Pinyin converter for Emacs

find-by-pinyin-dired

https://github.com/redguardtoo/find-by-pinyin-dired

Find file by first Pinyin characters of Chinese Hanzi. 输入拼音首字母定位对应的中文目录/文件

ido-pinyin

https://github.com/pengpengxp/ido-pinyin

Make ido support chinese pinyin

magit 的中文问题

按照前面的设置好编码一般不会有问题了. 有收到一种情况是 linux 下的终端的问题, 有网友这样尝试解决了:

vi /etc/profile
# 添加
export LESSCHARSET=utf-8
# 填完以后执行
source /etc/profile

pangu-spacing

https://github.com/coldnew/pangu-spacing

emacs minor-mode to add space between Chinese and English characters.

看演示 gif 挺好玩.

pinyin-search

https://github.com/xuchunyang/pinyin-search.el

Search Chinese by the first letter of Chinese pinyin.

pyim-hanzi2pinyin

是一个汉字转拼音得函数, 包含在chinese-pyim中, 主要用于生成词库 @tushuma 天然二呆

ta.el 同音字处理

**的 kuanyui 写的处理同音字的 mode , https://github.com/kuanyui/ta.el

中文分词工具

emacs-chinese-word-segmentation : 基于 cjieba 的中文分词工具。实现了以中文词语为单位的移动和编辑。

欢迎补充

参考和鸣谢

本文档由 hick 初始整理, 主要是在 Emacs 微信群中 @求其 @arthur @子龙山人 @peng 等讨论中文 org 中 table 中英文混排对齐的时候, 发现有各种做法, 引发整理中文问题的想法.

特别鸣谢:

zklhp 补充 windows 环境的处理
本文目录使用 toc-org 自动生成, 安装好以后每次保存会自动刷新目录

欢迎提议和补充条目.

hick/emacs-chinese