Don't Pickle Your Data

Pretty much every Python programmer out there has broken down at one point and used the ‘pickle’ module for writing objects out to disk.

The advantage of using pickle is that it can serialize pretty much any Python object without having to add any extra code. It's also smart in that it will only write out any single object once, making it effective for storing recursive structures like graphs. For these reasons pickle is usually the default serialization mechanism in Python, used in modules like python-memcached.
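As a small illustration of the "write each object once" behaviour, here is a minimal sketch, assuming Python 3 and only the standard library. The dictionaries and variable names are made up for the example.

```python
import pickle

# A tiny cyclic "graph": two nodes that point at each other.
a = {"name": "a"}
b = {"name": "b", "peer": a}
a["peer"] = b

# One list object referenced twice from the same container.
shared = [1, 2, 3]
container = {"first": shared, "second": shared}

restored_a = pickle.loads(pickle.dumps(a))
restored = pickle.loads(pickle.dumps(container))

# The cycle survives the round trip ...
assert restored_a["peer"]["peer"] is restored_a
# ... and the shared list comes back as one object, not two copies.
assert restored["first"] is restored["second"]
```

By contrast, json.dumps(a) raises a ValueError for the cyclic structure, and a JSON round trip of the container would produce two independent lists.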

However, using pickle is still a terrible idea that should be avoided whenever possible.

Pickle is slow

Pickle is slower than most of the alternatives and also produces larger serialized values.

To illustrate this, I put together a simple benchmark comparing pickle to the built-in JSON module, the Apache Thrift library, and MessagePack. This benchmark measures the number of objects per second each of these libraries can read and write. The data being serialized here are just randomly generated fake ‘Tweet’ objects, each containing just four fields.

[Chart: Serialization Rate]

[Chart: Packed Size]

Pickle is the clear underperformer here. Even the ‘cPickle’ extension that’s written in C has a serialization rate that’s about a quarter that of JSON or Thrift. Pickle also produces serialized values that are around double the size of Thrift or MessagePack.

I’ve put the code for this benchmark up on GitHub for those who are interested.

Update Feb 13:

I've changed the graphs above to address a couple of issues people have brought up. The default pickle protocol is slow, so I've added a faster version. I've also changed the JSON/MessagePack benchmark to operate on exactly the same data as Pickle. The effects aren't nearly as strong, but even with both changes Pickle isn't a great option.
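For readers who want to try a comparison like this themselves, the following is a rough sketch, not the benchmark used for the graphs. It assumes Python 3 and the standard library only, so Thrift and MessagePack are left out, and the four field names are hypothetical stand-ins for the fake tweet data.

```python
import json
import pickle
import random
import string
import timeit


def fake_tweet(rng):
    # Hypothetical four-field record; the field names are invented for this sketch.
    return {
        "id": rng.randrange(10 ** 12),
        "user": "".join(rng.choice(string.ascii_lowercase) for _ in range(10)),
        "text": "".join(rng.choice(string.ascii_letters + " ") for _ in range(100)),
        "retweets": rng.randrange(1000),
    }


rng = random.Random(0)
tweets = [fake_tweet(rng) for _ in range(1000)]

# Each encoder turns one object into bytes so the sizes are comparable.
encoders = {
    "pickle protocol 0": lambda obj: pickle.dumps(obj, protocol=0),
    "pickle highest protocol": lambda obj: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
    "json": lambda obj: json.dumps(obj).encode("utf-8"),
}

REPEATS = 5
results = {}
for name, encode in encoders.items():
    seconds = timeit.timeit(lambda: [encode(t) for t in tweets], number=REPEATS)
    size = sum(len(encode(t)) for t in tweets)
    results[name] = (seconds, size)
    rate = REPEATS * len(tweets) / seconds
    print("%-24s %10.0f objects/s %8d bytes" % (name, rate, size))
```

The numbers depend heavily on the Python version, the machine, and the shape of the data, so treat any single run as indicative only.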

Pickle is a security risk

Another reason not to use pickle is that unpickling malicious data can cause security issues, including arbitrary code execution.

An example that the brave and foolish can try is below. Unpickling the data there will open a shell prompt that will delete all the files in your home directory:

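import pickle

# WARNING: do not run this unless you understand what it does.
# Unpickling this protocol 0 payload calls os.system("rm -ri ~").
data = b"""cos
system
(S'rm -ri ~'
tR.
"""

pickle.loads(data)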

Thankfully this command will prompt you before deleting each file, but it's a single-character change to the data to make it delete all your files without prompting (change -ri to -rf). Even worse, an attacker could use pickle to get remote shell access to your computer.
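To see the mechanism without risking any files, here is a harmless sketch, assuming Python 3. The class name Innocent is made up, and os.getcwd stands in for whatever callable an attacker would actually choose. An object's __reduce__ method tells pickle which callable to invoke when rebuilding it, and the unpickler simply makes that call.

```python
import os
import pickle


class Innocent(object):
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call os.getcwd()".
        # An attacker would return something like (os.system, ("...",)) instead.
        return (os.getcwd, ())


payload = pickle.dumps(Innocent())

# Loading the payload calls os.getcwd() and returns its result;
# no Innocent instance is ever created on the reading side.
result = pickle.loads(payload)
print(result)
```

The reading process does not need the Innocent class at all, since the payload only names the callable and its arguments.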

It's not like this is an unknown issue. The pickle module even comes with a big warning about this right in the documentation:

Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

But it's not always clear that your data hasn’t been altered since you wrote it. Say, for instance, that an attacker gains access to your network but can’t yet run any code on your servers. If you are using the default python-memcached bindings, all the attacker has to do is make a network call to your memcache server to set a carefully chosen pickle value, and wait for it to be read back in. Once your Python process reads in the data, whatever code the attacker wants will be running on your server.

Just use JSON

For most common tasks, just use JSON for serializing your data. It's fast enough, human readable, doesn’t cause security issues, and can be parsed in all programming languages that are worth knowing. MessagePack is also a good alternative; I was surprised by how well it performed in the benchmark I put together.

Pickle, on the other hand, is slow, insecure, and can only be parsed in Python. The only real advantage to pickle is that it can serialize arbitrary Python objects, whereas both JSON and MessagePack have limits on the type of data they can write out. Given the downsides though, it's worth writing the little bit of code necessary to convert your objects to a JSON-able form if your code is ever going to be used by people other than yourself.
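That "little bit of code" might look something like the sketch below, assuming Python 3. The Tweet class, its four fields, and the to_dict/from_dict method names are all hypothetical and chosen only for illustration.

```python
import json


class Tweet(object):
    # Hypothetical record type; the fields are invented for this example.
    def __init__(self, tweet_id, user, text, retweets):
        self.tweet_id = tweet_id
        self.user = user
        self.text = text
        self.retweets = retweets

    def to_dict(self):
        # Only plain types (int, str) go into the dict, so json can handle it.
        return {
            "tweet_id": self.tweet_id,
            "user": self.user,
            "text": self.text,
            "retweets": self.retweets,
        }

    @classmethod
    def from_dict(cls, data):
        # Explicit field access means unexpected input fails loudly
        # instead of executing anything.
        return cls(data["tweet_id"], data["user"], data["text"], data["retweets"])


tweet = Tweet(1, "alice", "hello world", 3)
encoded = json.dumps(tweet.to_dict())
decoded = Tweet.from_dict(json.loads(encoded))
```

The encoded value is an ordinary JSON object that any other language can read, and loading it only ever builds dicts, lists, strings, and numbers.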