How to replace the ' \\ ' with ' \ ' in a bytes file?

November 2018

1

Here is my problem: I have an encrypted bytes file like this:

[w\x84\[email protected]\xc6\xab\xc8

I want to decrypt it with PyCrypto, but I ran into a confusing error:

Code here:

from Crypto.Cipher import DES
key = 'jsfghutp'
cipher = DES.new(key, DES.MODE_ECB)
s = open('d:/Thu-Aug-2018.bin','rb').read()

cipher.decrypt(s)

If I run this, it will throw an error:

ValueError                                Traceback (most recent call last)
<ipython-input-3-4fcf0e8076ca> in <module>()
----> 1 cipher.decrypt(s)

D:\Python\anaconda\lib\site-packages\Crypto\Cipher\blockalgo.py in decrypt(self, ciphertext)
    293             return res
    294 
--> 295            return self._cipher.decrypt(ciphertext)
    296 

ValueError: Input strings must be a multiple of 8 in length

Printing the value of s gives:

s =  b'[w\\x84\\[email protected]\\xc6\\xab\\xc8'

However, this is not what I want. What I need is the following result:

>>> cipher.decrypt(b'[w\x84\[email protected]\xc6\xab\xc8')
b'test aaa'

That is, I think I must replace each \\ with \ in the bytes file, but I could not find a correct way to do it. Does anyone know how to solve this?

2 answers

1

This happens because the file contains a textual representation of a bytes string ([w\x84\[email protected]\xc6\xab\xc8) rather than the bytes themselves. You can either write the file properly:

with open('/tmp/file', 'wb') as f:
    f.write(b'[w\x84\[email protected]\xc6\xab\xc8')

Then you won't have issues with reading it:

>>> with open('/tmp/file', 'rb') as f: f.read()
<<< b'[w\x84\[email protected]\xc6\xab\xc8'

Or you can interpret the representation saved in your file via ast.literal_eval, though this is really inadvisable in this case.
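For reference, a minimal sketch of the ast.literal_eval route. It assumes the file holds only the escaped text, without the surrounding b'...' (as in the question), and it would break if the content contained an unescaped quote character. The sample bytes are hypothetical, not the asker's data.

```python
import ast

# Hypothetical file content: the escaped text of a bytes repr, read in binary mode.
escaped = rb'[w\x84\xc6\xab\xc8'

# Rebuild a bytes literal around the escaped text and let Python parse it.
literal = "b'" + escaped.decode('ascii') + "'"
recovered = ast.literal_eval(literal)

print(len(escaped))    # 18: each \xNN escape is four separate characters
print(len(recovered))  # 6: each \xNN escape became a single byte
```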

The bottom line is: always know what types you're operating with - strings (unicode) or bytes, and keep in mind that when you print bytes in the console, you see the representation (this \xa0-like stuff), not the bytes themselves, because some bytes don't have a printable form.

3

There is no double-backslash in your file. When you look at the repr of a bytes object, it shows all backslashes escaped, to avoid confusion between, e.g., \n (a newline) and \\n (a backslash followed by an n).

For example:

>>> s = rb'\x84'
>>> s
b'\\x84'
>>> s[0]
92
>>> chr(s[0])
'\\'

So, the problem you're asking about doesn't exist. There are only single backslashes in your file.


The actual problem is that you didn't want the four bytes backslash, x, 8, and 4; you wanted the single byte b'\x84', aka bytes([0x84]). But the four bytes are what you have in your file.
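This is also most likely why decrypt complains about the length. A small sketch, using a hypothetical 8-byte block (not the asker's actual data) and mimicking the escaped text that ended up in the file:

```python
# Hypothetical 8-byte ciphertext block (an assumption for illustration).
real = b'[w\x84\x00\xc6\xab\xc8!'

# Mimic what the file holds: the repr text without the b'...' wrapper.
escaped = repr(real)[2:-1].encode('ascii')

print(len(real), len(real) % 8)        # 8 0
print(len(escaped), len(escaped) % 8)  # 23 7
```

The real block is a multiple of 8 bytes long, but its escaped form is not, so a block cipher such as DES rejects it.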

So your bug is in whatever code you used to create this file. Somehow, instead of dumping the bytes to the file, you dumped a backslash-escaped string representation of those bytes. The right place to fix it is in the code that created the file. Not writing corrupt data is always better than writing corrupt data, and then trying to figure out how to uncorrupt it.

But if it's too late for that—e.g., if you've used that broken code to encrypt a bunch of plaintext that you no longer have access to, and now you need to try to recover it—then this transformation happens to be reversible. You just have to do it in two steps.


First, you decode the bytes with a backslash-escape or more general unicode-escape codec:

>>> s=rb'[w\x84\[email protected]\xc6\xab\xc8'
>>> s
b'[w\\x84\\[email protected]\\xc6\\xab\\xc8'
>>> s.decode('unicode-escape')
'[w\x84\[email protected]Æ«È'

Then you turn each Unicode character into the byte matching the same number, either explicitly:

>>> bytes(map(ord, s.decode('unicode-escape')))
b'[w\x84\[email protected]\xc6\xab\xc8'

… or, somewhat hackily, by relying on Python's interpretation of Latin-1 [1]:

>>> s.decode('unicode-escape').encode('latin-1')
b'[w\x84\[email protected]\xc6\xab\xc8'

Again, those backslashes aren't actually in the resulting bytes; that's just how Python represents a bytes object. For example, if you store that result in b, hex(b[2]) is '0x84' for the byte \x84, not '0x5c' for the backslash character.
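The two steps can be wrapped in a small helper. This is only a sketch: the name unescape_bytes and the sample data are hypothetical, and the encode step would raise UnicodeEncodeError if the text contained an escape above U+00FF.

```python
def unescape_bytes(escaped: bytes) -> bytes:
    # Step 1: interpret the backslash escapes; each \xNN becomes code point U+00NN.
    text = escaped.decode('unicode_escape')
    # Step 2: map code points 0-255 back to single bytes.
    return text.encode('latin-1')


sample = rb'[w\x84\xc6\xab\xc8'  # hypothetical escaped file content
restored = unescape_bytes(sample)

print(restored == b'[w\x84\xc6\xab\xc8')  # True
print(hex(restored[2]))                   # 0x84
```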


Your creation code is the real problem:

with open(file,'a') as f:

    f.write(str(encrypt_text).split("b'")[1].split("'")[0])
    f.close()

You’re converting the bytes to their string representation (with the b prefix, the quotes around it, and backslash escaping for every byte that isn’t printable ASCII), then stripping off the b and the quotes, then encoding the whole thing as UTF-8 by writing it to a text-mode file.

What you want to do is just open the file in binary mode and write the bytes to it:

with open(file, 'ab') as f:
    f.write(encrypt_text)

(Also, you don't want to call f.close(); the with statement already takes care of that.)

Then you can read the file in binary mode and just decrypt the bytes as-is.
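A minimal round-trip sketch of that write and read, assuming a hypothetical 8-byte ciphertext and a throwaway file in a temporary directory:

```python
import os
import tempfile

encrypt_text = b'[w\x84\x00\xc6\xab\xc8!'  # hypothetical 8-byte ciphertext

# Hypothetical file location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'cipher.bin')

# Write the raw bytes; binary mode does no escaping and no text encoding.
with open(path, 'ab') as f:
    f.write(encrypt_text)

# Read them back unchanged, ready to pass to cipher.decrypt().
with open(path, 'rb') as f:
    s = f.read()

print(s == encrypt_text)  # True
print(len(s) % 8)         # 0
```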

(Or, if you really want the file to be human-editable or something, you want to pick a format that’s designed to be human-editable and easily reversible, like hex (binascii.hexlify) or base64, not “whatever Python does to represent bytes objects for debugging”.)
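A short sketch of both text-safe formats, using only the standard library and a hypothetical byte string. Either one round-trips exactly, unlike the stripped repr:

```python
import base64
import binascii

data = b'[w\x84\x00\xc6\xab\xc8!'  # hypothetical ciphertext

# Encode to printable ASCII that is safe to store in a text file.
hex_text = binascii.hexlify(data)
b64_text = base64.b64encode(data)

print(hex_text)  # b'5b778400c6abc821'

# Decode back to the exact original bytes.
from_hex = binascii.unhexlify(hex_text)
from_b64 = base64.b64decode(b64_text)

print(from_hex == data and from_b64 == data)  # True
```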


1. Unicode is guaranteed to line up with Latin-1 for all characters in Latin-1. Python interprets that to mean that Latin-1 should encode every byte from 0-255 as code point 0-255, instead of just the ones actually defined in ISO-8859-1. Which is valid, since ISO-8859-1 doesn't say what to do with bytes that it doesn't define, but not every tool will agree with Python.