Hexsed

A stream editor that substitutes or deletes byte patterns using hex values

Primary language: C. License: GNU General Public License v3.0 (GPL-3.0).

README for hexsed.
Hexsed is a stream editor that takes hex values for the find string and,
where one is needed, for the replacement string.
There are 4 commands: 's' substitute, 'd' delete, 'a' append, and 'i'
insert. The only permissible field separator is '/'.
Each hex value must be input as a 2 char string, eg 0A, 01, not A or 1.
It operates on a named file and writes to stdout.
Options are provided to generate the hex symbols for ASCII chars,
escape sequences, decimal digits and octal digits. You may also use a
NUL-terminated string, which may optionally include escape sequences.
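
The 2 char rule means a pattern such as E280A8 is read strictly in
pairs. Below is a minimal C sketch of how such a string could be decoded
into bytes. It is not taken from the hexsed source: the names
`hex_digit_value` and `parse_hex_pairs` are made up for illustration, and
accepting lower case digits is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the hexsed source.
 * Value of one hex digit, or -1 if c is not a hex digit.
 * Accepting lower case a-f is an assumption. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a string of 2 char hex pairs ("E280A8") into out[].
 * Returns the number of bytes written, or -1 if the string has odd
 * length, contains a non-hex character, or needs more than outsize
 * bytes. */
long parse_hex_pairs(const char *hex, unsigned char *out, size_t outsize)
{
    size_t len = strlen(hex);
    if (len % 2 != 0) return -1;        /* a lone "A" or "1" is rejected */
    if (len / 2 > outsize) return -1;
    for (size_t i = 0; i < len; i += 2) {
        int hi = hex_digit_value(hex[i]);
        int lo = hex_digit_value(hex[i + 1]);
        if (hi < 0 || lo < 0) return -1;
        out[i / 2] = (unsigned char)(hi * 16 + lo);
    }
    return (long)(len / 2);
}
```

With this sketch, "0A" decodes to the single byte 0x0A, while "A" is
refused because it is not a complete pair.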

Why I wrote it.
I was supplied with a PDF in A4 landscape format that was prodigiously
wasteful of space, so I decided to copy the text off the PDF reader
display and paste it into LibreOffice Calc, with the intention of making
an A4 portrait PDF from it.

However, in LO all spaces between words using Latin chars got lost.
Pasting into the editor Geany got me nothing sensible, but in Gvim it
was quite usable. Though the display was fine, Gvim did not allow me to
do any global replacements, so I next tried Ghex. That was good for
showing me the offending hex chars, but I found it very cumbersome to
use as an editor. In any case I prefer a stream editor for such a job.

NB, using Geany version 1.27 I find that it can handle <U+2028>, which
it now displays as 'LS', just fine, including doing global replacements.
However, hexsed is still my preferred tool for this kind of job.

Below is how a small file written by Gvim shows in `less`.
Hello<U+2028>
Good-bye<U+2028>
My<U+2028>name<U+2028>is...<U+2028>
What<U+2028>is<U+2028>your<U+2028>name?<U+2028>
สวัสดี ค่ะ /ครับ <U+2028>
ลาก่อน นะ ค่ะ /ครับ <U+2028>
ดิฉัน /ผม ชื่อ ... ค่ะ /ครับ <U+2028>
คุณ ชื่อ อะไร คะ/ครับ <U+2028>
sa-wàt-dee<U+2028>kâ/kráp<U+2028>
laa<U+2028>gòn<U+2028>ná<U+2028>kâ/kráp<U+2028>
dì-chăn<U+2028>chêu...<U+2028>kâ/kráp<U+2028>
koon<U+2028>chêu<U+2028>a-rai<U+2028><U+2028>ká/kráp<U+2028>

As it happens, the Unicode character <U+2028> (LINE SEPARATOR) is
encoded in UTF-8 as the 3 bytes E2 80 A8.
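
Those three bytes follow from the UTF-8 rules for code points between
U+0800 and U+FFFF. The C sketch below shows the arithmetic. It is only
an illustration: the name `utf8_encode_3byte` is made up here, and the
function deliberately handles nothing outside the three-byte range.

```c
#include <stddef.h>

/* Illustrative only: encode a code point in the range U+0800..U+FFFF
 * as three UTF-8 bytes. Returns 3 on success, or 0 if cp is outside
 * that range or is a UTF-16 surrogate (U+D800..U+DFFF). */
size_t utf8_encode_3byte(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x0800 || cp > 0xFFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
    return 3;
}
```

For cp = 0x2028 the top 4 bits are 2, the middle 6 bits are 0 and the
low 6 bits are 0x28, which gives E2, 80 and A8 respectively.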
So running hexsed as below dealt with all these artifacts.
`hexsed /E280A80A/0A/s tmpfil > x`	# drop <U+2028> where it precedes \n
`mv x tmpfil`
`hexsed /E280A8/20/s tmpfil > x`	# replace the remaining ones with ' '
`mv x tmpfil`
After reloading Gvim as prompted, the text it showed was now usable in
LO Calc.
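
The two passes above amount to a search and replace on raw bytes. The C
sketch below shows one way that could be done in memory. It is not
hexsed's code: `replace_bytes` is a hypothetical name, matches are taken
left to right without overlapping, replaced output is not rescanned, and
the caller is assumed to supply the output buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not hexsed's implementation.
 * Copy in[0..inlen) to out, replacing each non-overlapping occurrence
 * of find[0..findlen) with repl[0..repllen). A repllen of 0 deletes
 * the pattern. Returns the number of bytes written, or (size_t)-1 if
 * findlen is 0 or out is too small. */
size_t replace_bytes(const unsigned char *in, size_t inlen,
                     const unsigned char *find, size_t findlen,
                     const unsigned char *repl, size_t repllen,
                     unsigned char *out, size_t outsize)
{
    size_t i = 0, o = 0;
    if (findlen == 0) return (size_t)-1;
    while (i < inlen) {
        if (inlen - i >= findlen && memcmp(in + i, find, findlen) == 0) {
            /* Pattern found at offset i: emit the replacement instead. */
            if (outsize - o < repllen) return (size_t)-1;
            if (repllen > 0) memcpy(out + o, repl, repllen);
            o += repllen;
            i += findlen;
        } else {
            /* No match here: copy one byte through unchanged. */
            if (o >= outsize) return (size_t)-1;
            out[o++] = in[i++];
        }
    }
    return o;
}
```

Calling it once with find = E2 80 A8 0A and repl = 0A, then again on the
result with find = E2 80 A8 and repl = 20, mirrors the two hexsed runs.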

As well as doing the stream editing, hexsed has options that will
generate the string of hex for you from input chars, escape sequences
and strings. The strings may contain escape sequences.
Eg `hexsed -s '</p><p>'` yields
3C2F703E3C703E
and `hexsed -s '</p>\n<p>'` yields
3C2F703E0A3C703E
so then,
`hexsed /3C2F703E3C703E/3C2F703E0A3C703E/s some.html`
will put a linefeed between all such tag pairs in some.html.
See man 1 hexsed.
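
The -s output shown above is each byte of the string printed as two
upper case hex digits, with escape sequences resolved first. A rough C
sketch of that conversion follows. It is not the hexsed source: the name
`string_to_hex` is invented, and the set of escapes it recognises
(\n, \t, \r, \0 and \\) is an assumption that may differ from what
hexsed actually accepts.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative sketch, not the hexsed source.
 * Write the hex form of s into out (capacity outsize, including the
 * terminating NUL). Recognises \n, \t, \r, \0 and \\ only; that list
 * is an assumption. Returns the length of the hex string, or -1 on an
 * unknown escape, a trailing backslash, or insufficient space. */
long string_to_hex(const char *s, char *out, size_t outsize)
{
    size_t o = 0;
    while (*s != '\0') {
        unsigned char byte = (unsigned char)*s++;
        if (byte == '\\') {
            switch (*s++) {
            case 'n':  byte = 0x0A; break;
            case 't':  byte = 0x09; break;
            case 'r':  byte = 0x0D; break;
            case '0':  byte = 0x00; break;
            case '\\': byte = 0x5C; break;
            default:   return -1;   /* unknown escape or trailing '\' */
            }
        }
        if (outsize - o < 3) return -1;   /* need 2 digits plus NUL */
        snprintf(out + o, 3, "%02X", (unsigned int)byte);
        o += 2;
    }
    if (outsize == 0) return -1;
    out[o] = '\0';
    return (long)o;
}
```

Given the characters </p>\n<p> (with a literal backslash and n, as the
shell passes them inside single quotes), this sketch produces
3C2F703E0A3C703E, matching the output quoted above.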