Handling binary documents with ASCII-compatible markup

Python 3 has bytes, which are sequences of 8-bit integers, and str, sequences of Unicode code points. To go from one to the other, you need to encode or decode, giving an explicit encoding. There are many protocols where the markup is ASCII, even if the payload is in some other encoding that you don't know. If you know that other encoding is ASCII-compatible, it is useful to be able to parse, split, etc. the markup, while passing the payload through untouched.
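
As a (made-up) example of what I mean: in the header line below, the markup (the field name, the ": " separator, the CRLF) is pure ASCII, while the field value is in some unknown, hopefully ASCII-compatible, encoding; the goal is to parse the former while passing the latter through unchanged:

>>> line = b"Subject: caf\xe9 menu\r\n"
>>> name, _, value = line.rstrip(b"\r\n").partition(b": ")
>>> name
b'Subject'
>>> value
b'caf\xe9 menu'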

An initial search on the Internet brought up an article by Eric S. Raymond that touches on this, and suggests decoding the data as ISO-8859-1 and handling it as str; the payload can then be losslessly recovered by re-encoding it. The first 256 code points of Unicode are exactly the ISO-8859-1 code points (a bit more on that further down). As a result, the following round-trips losslessly:

>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")
>>> by == by2
True

A few days later I came across an article by Alyssa Coghlan, which mentions the same idea, and also the existence of the errors="surrogateescape" (PEP 383) error handler (see also codecs in the Python documentation), which is designed to allow exactly what I needed:

>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")
>>> by == by2
True
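
Under the hood (this is the behaviour specified by PEP 383), each byte above 0x7f is smuggled into the str as a lone surrogate in the U+DC80–U+DCFF range, and encoding with the same error handler turns it back into the original byte:

>>> str(b"caf\xe9", encoding="ascii", errors="surrogateescape")
'caf\udce9'
>>> bytes('caf\udce9', encoding="ascii", errors="surrogateescape")
b'caf\xe9'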

Alyssa Coghlan has some discussion about the merits of each approach; functionally, I can't really say they have any meaningful difference. As she points out, if you do anything but re-encode the non-ASCII parts using the same codec, you risk getting mojibake (or, if you're lucky, a UnicodeEncodeError rather than silently producing garbage).
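
For instance, take a Latin-1 payload and wrongly re-encode it as UTF-8 on the way out: the bytes are silently altered, and a downstream reader sees the classic mojibake:

>>> by = b"caf\xe9"
>>> bytes(str(by, encoding="iso-8859-1"), encoding="utf-8")
b'caf\xc3\xa9'
>>> str(b"caf\xc3\xa9", encoding="iso-8859-1")
'cafÃ©'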

Performance-wise, let's see:

>>> import timeit
>>> timeit.timeit("""bytes(str(by, encoding="ISO-8859-1"), encoding="ISO-8859-1")""", setup="import random; by = random.randbytes(10_000)")
0.8885893229962676
>>> timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", setup="import random; by = random.randbytes(10_000)")
125.00223343299876

That's… a very large difference. ESR's article points out that ISO-8859-1 has some properties that make it efficient: it maps bytes 0x80–0xff to Unicode code points of the same numeric value, so there is no translation cost, and the in-memory representation is more compact. Trying increasing sizes:

>>> import random
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
... 
10	0.0650910490003298
100	0.1047916579991579
1000	0.5472217770002317
2000	1.5103355319952243
5000	5.779067411000142
10000	12.497241530996689
20000	25.78209423399676

That seems to grow faster than O(n); the duration seems to be ×3 when the size goes ×2. Is it growing like O(n^1.5)?

In contrast, the ISO-8859-1 method seems to have O(n) complexity:

>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
... 
10	0.05453772499458864
100	0.037617702000716235
1000	0.05454556500626495
2000	0.05654650100041181
5000	0.06352802200126462
10000	0.0898260960020707
20000	0.23981017799815163
50000	0.4997737009980483
100000	0.9646763860000647

By design of Unicode, the first 256 code points are the same as those of ISO 8859-1, and the conversion between Unicode and ISO 8859-1 is implemented in C in CPython. That's why the technique described above works losslessly, and is so much faster than going through surrogate escapes (even if I still don't understand why that latter approach is quite so slow).
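
The representation difference is easy to observe with sys.getsizeof: since PEP 393, a str whose code points all fit in the first 256 is stored at one byte per character, while the U+DCxx surrogates force (at least) two bytes per character. The exact sizes vary between Python versions, so I only compare them:

>>> import sys
>>> latin = str(b"\xe9" * 1000, encoding="iso-8859-1")
>>> surro = str(b"\xe9" * 1000, encoding="ascii", errors="surrogateescape")
>>> sys.getsizeof(latin) < sys.getsizeof(surro)
True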

I recently learned (between $WORK making me investigate character encoding issues, and making a character table explorer as a JavaScript learning exercise) about the C1 control characters. Wikipedia has a write-up; in short, in 1973, during their work on 8-bit encodings for languages beyond English, ECMA and ISO came up with an extra series of control characters in the 0x80–0x9F range, which would allow things like terminal control (the 0x9B Control Sequence Introducer for ANSI escape sequences), switching to a different encoding (e.g. SS2/SS3, used by EUC-JP), etc. However, it would still be necessary to represent those in a 7-bit environment; so each of those can also be written as 0x1B ␛ followed by an ASCII character (e.g. the Control Sequence Introducer can be spelled 0x1B [). Both of the following print HI in reverse video with xfce4-terminal and libvte-2.91 v0.70.6, but the second example doesn't work in XTerm version 379. I guess XTerm never supported (or dropped support for) the C1 spelling of the ANSI escape sequences.

$ printf "\\u001b[7m HI \\u001b[m Normal\n"
HI  Normal
$ printf "\\u009B7m HI \\u009Bm Normal\n"
HI  Normal
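
The correspondence between the two spellings is just a fixed offset: the C1 control 0x80+n can also be written as ESC followed by the ASCII character 0x40+n. As a quick sketch (c1_to_7bit is a throwaway helper, not anything standard):

>>> def c1_to_7bit(byte):
...     """Spell a C1 control (0x80-0x9f) as ESC plus an ASCII character."""
...     assert 0x80 <= byte <= 0x9f
...     return bytes([0x1b, byte - 0x40])
... 
>>> c1_to_7bit(0x9b)  # the Control Sequence Introducer becomes ESC [
b'\x1b['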

Anyway, those C1 controls were almost never used, but are reserved in the ISO 8859-X encodings. Windows-1252, presumably taking note of that lack of use, assigns printable glyphs to those positions. HTML5 aliases ISO 8859-1 to Windows-1252, I guess because it wouldn't make sense to use control characters in an HTML document: if those bytes appear, it must be because the author actually meant Windows-1252 (and if they don't appear, then ISO 8859-1 and Windows-1252 decode identically, since they only differ in the range of the C1 controls).
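
Python keeps the two codecs distinct (there is no HTML5-style aliasing), which makes the difference easy to see, for instance with the Windows-1252 "smart quotes" at 0x93 and 0x94:

>>> str(b"\x93hi\x94", encoding="windows-1252")
'“hi”'
>>> str(b"\x93hi\x94", encoding="iso-8859-1")
'\x93hi\x94'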

Author: Frédéric Perrin

Date: Friday 9 August 2024, modified Monday 21 October 2024

Unless otherwise stated, the texts on this site are under a Creative Commons BY-SA licence.

This site is produced with 100% grain-fed, free-range free software.