datalad_next.itertools.decode_bytes
- datalad_next.itertools.decode_bytes(iterable: Iterable[bytes], encoding: str = 'utf-8', backslash_replace: bool = True) -> Generator[str, None, None]
Decode bytes in an iterable into strings

This function decodes bytes or bytearray into str objects, using the specified encoding. Importantly, the decoding input can be spread across multiple chunks of heterogeneous sizes, for example output read from a process or pieces of a download.

Multi-byte encodings that are spread over multiple byte chunks are supported, and chunks are joined as necessary. For example, the utf-8 encoding for ö is b'\xc3\xb6'. If the encoding is split in the middle because a chunk ends with b'\xc3' and the next chunk starts with b'\xb6', a naive decoding approach like the following would fail:

>>> [chunk.decode() for chunk in [b'\xc3', b'\xb6']]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Compared to:

>>> from datalad_next.itertools import decode_bytes
>>> tuple(decode_bytes([b'\xc3', b'\xb6']))
('ö',)

Input chunks are joined only if that is necessary to decode the bytes properly:

>>> from datalad_next.itertools import decode_bytes
>>> tuple(decode_bytes([b'\xc3', b'\xb6', b'a']))
('ö', 'a')

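The chunk joining described above can be approximated with the standard library alone. The following is a minimal sketch, not the implementation used by decode_bytes: the name decode_chunks_sketch is hypothetical, and the sketch relies on Python's incremental decoders from the codecs module, which buffer an incomplete multi-byte sequence until the next chunk arrives. Its chunk boundaries and error messages may differ from those of decode_bytes.

```python
import codecs
from typing import Generator, Iterable


def decode_chunks_sketch(
    chunks: Iterable[bytes],
    encoding: str = "utf-8",
    backslash_replace: bool = True,
) -> Generator[str, None, None]:
    """Hypothetical stand-in that decodes byte chunks incrementally."""
    # "backslashreplace" and "strict" are standard codec error handlers.
    errors = "backslashreplace" if backslash_replace else "strict"
    decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
    for chunk in chunks:
        # An incomplete trailing sequence is kept inside the decoder,
        # so this call may return an empty string.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is still buffered; with "strict" this raises
    # UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


print(tuple(decode_chunks_sketch([b"\xc3", b"\xb6", b"a"])))
```

With the split input from the example above, the sketch yields 'ö' only once both bytes have been seen, and then yields 'a' separately.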
If backslash_replace is True, undecodable bytes will be replaced with a backslash-substitution. Otherwise, undecodable bytes will raise a UnicodeDecodeError:

>>> tuple(decode_bytes([b'\xc3']))
('\\xc3',)
>>> tuple(decode_bytes([b'\xc3'], backslash_replace=False))
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte

Backslash-replacement of undecodable bytes is an ambiguous mapping, because the substituted text can already be present in the input. For example, the input b'\\xc3' (a literal backslash followed by the characters xc3) decodes to the same string as the undecodable byte b'\xc3'.

- Parameters:
  iterable (Iterable[bytes]) -- Iterable that yields bytes that should be decoded
  encoding (str (default: 'utf-8')) -- Encoding to be used for decoding.
  backslash_replace (bool (default: True)) -- If True, backslash-escapes are used for undecodable bytes. If False, a UnicodeDecodeError is raised if a byte sequence cannot be decoded.
- Yields:
  str -- Decoded strings that are generated by decoding the data yielded by iterable with the specified encoding
- Raises:
  UnicodeDecodeError -- If backslash_replace is False and the data yielded by iterable cannot be decoded with the specified encoding
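
The ambiguity of backslash replacement noted in the description can be reproduced without datalad_next. This short sketch uses only bytes.decode with the standard "backslashreplace" error handler; it assumes that the substitution performed by decode_bytes has the same form, which is consistent with the ('\\xc3',) output shown in the example above.

```python
# An undecodable byte, rendered via the standard "backslashreplace" handler.
undecodable = b"\xc3".decode("utf-8", errors="backslashreplace")

# Four valid ASCII bytes that literally spell out the same escape sequence.
literal = b"\\xc3".decode("utf-8")

# Both inputs end up as the same four-character string, so the original
# bytes cannot be recovered from the decoded text alone.
print(undecodable == literal)  # True
```

If the distinction matters, passing backslash_replace=False makes undecodable input raise a UnicodeDecodeError instead of being substituted.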