Finding secrets by decompiling Python bytecode in public repositories

tl;dr: Cache rules everything around me. pyc files can contain secrets and should not be checked in to source control. Use the standard Python .gitignore.

This post has a Russian translation.

When you import a Python file for the first time, the Python interpreter will compile it and cache the resulting bytecode in a .pyc file so that subsequent imports don’t have to deal with the overhead of parsing or compiling the code again.

It’s also common practice for Python projects to store configuration, keys, and passwords (collectively referred to as “secrets”) in a gitignored Python file named something like secrets.py, config.py, or settings.py, which other parts of the project import. This provides a nice separation between secrets and source code that gets checked in, and for the most part, this kind of setup works well. And because it reuses the language’s import mechanism, these projects don’t have to fuss around with file I/O or formats like JSON.

But for the same reason that this pattern is fast and convenient, it is also potentially insecure. Because it reuses the language’s import mechanism, which has a habit of creating and caching .pyc files, those secrets also live in the compiled bytecode! Some initial research using the GitHub API reveals that thousands of GitHub repositories contain secrets hidden inside their bytecode.

Existing tools for finding secrets in repositories (my favorite is trufflehog) skip over binary files like .pyc files, and instead only scan plain text files such as source code or configuration files.

Consider donating to a local community bail fund.

Your money will pay for legal aid and bail for protestors who have been arrested for standing up to police brutality, institutional racism, and the murder of Black men and women like George Floyd, Breonna Taylor, Ahmaud Arbery, and Nina Pop.

In the tech community, we talk a lot about inclusivity and diversity. Now is the time to take concrete action.

https://www.communityjusticeexchange.org/nbfn-directory

A crash-course on cached source

Earlier versions of Python stored .pyc files right next to the original source files, but beginning with Python 3.2, they live in a folder called __pycache__ in the same directory as the imported module's source file.
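To see where the cache ends up, here is a minimal sketch that compiles a throwaway file with the standard library's py_compile module instead of a real import. The helper name compile_and_locate and the temporary directory are illustrative assumptions, not part of the setup described in this post.

```python
import pathlib
import py_compile
import tempfile


def compile_and_locate(source_text):
    """Write source_text to a throwaway secrets.py, compile it, and return the .pyc path."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    src = workdir / "secrets.py"
    src.write_text(source_text)
    # With no explicit output path, py_compile.compile writes the bytecode
    # to the default cache location and returns that path.
    return pathlib.Path(py_compile.compile(str(src), doraise=True))


pyc_path = compile_and_locate('SECRET_KEY = "Green eggs and ham"\n')

# On a default setup this is a path ending in something like
# __pycache__/secrets.cpython-38.pyc (the tag depends on the interpreter).
print(pyc_path)

# The string literal is sitting in the file as plain bytes.
print(b"Green eggs and ham" in pyc_path.read_bytes())
```

The last line is the uncomfortable part: the secret is readable in the raw bytes of the cache file, before any disassembler or decompiler gets involved.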

Suppose we had a Python file containing this secret password:

SECRET_KEY = "Green eggs and ham"

The bytecode corresponding to that line of code looks like this:

0 LOAD_CONST               0 ('Green eggs and ham')
2 STORE_NAME               0 (SECRET_KEY)

Note that the variable name and string are reproduced in their entirety! Further, it turns out that Python bytecode often contains enough information to recover the original structure of the code. Tools like uncompyle6 can, most of the time, translate .pyc files back into something very close to their original source.

$ uncompyle6 secrets.cpython-38.pyc

# uncompyle6 version 3.6.7
# Python bytecode 3.8 (3413)
# Decompiled from: Python 3.8.2 (default, Apr  8 2020, 14:31:25) 
# [GCC 9.3.0]
# Embedded file name: secrets.py
# Compiled at: 2020-05-12 17:16:29
# Size of source mod 2**32: 34 bytes
SECRET_KEY = 'Green eggs and ham'
# okay decompiling secrets.cpython-38.pyc
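You don't strictly need a decompiler to pull strings out of a .pyc file. The sketch below uses only the standard library: it skips the file header, loads the code object with marshal, and walks its constants. It assumes the file was written by CPython 3.7 or newer (a 16-byte header) and by the same Python version that is reading it, since the marshal format changes between versions. The names string_constants and strings_in_pyc are made up for this example. Note that marshal is not designed to be safe against malicious input, so don't run this on untrusted files outside a sandbox.

```python
import marshal
import pathlib
import py_compile
import tempfile
import types

# Assumption: the .pyc was written by CPython 3.7 or newer, where the header
# is 16 bytes. Older versions use a shorter header.
PYC_HEADER_SIZE = 16


def string_constants(code):
    """Yield every string constant in a code object, recursing into nested code."""
    for const in code.co_consts:
        if isinstance(const, str):
            yield const
        elif isinstance(const, tuple):
            # A tuple of literals is stored as a single constant.
            yield from (c for c in const if isinstance(c, str))
        elif isinstance(const, types.CodeType):
            # Functions and classes carry their own constants.
            yield from string_constants(const)


def strings_in_pyc(path):
    """Return the string constants stored in the .pyc file at path."""
    data = pathlib.Path(path).read_bytes()
    code = marshal.loads(data[PYC_HEADER_SIZE:])
    return list(string_constants(code))


# Demo on a throwaway file compiled by the running interpreter.
workdir = pathlib.Path(tempfile.mkdtemp())
src = workdir / "secrets.py"
src.write_text(
    'SECRET_KEY = "Green eggs and ham"\n'
    "\n"
    "def token():\n"
    '    return "inner token"\n'
)
demo_pyc = py_compile.compile(str(src), doraise=True)
found = strings_in_pyc(demo_pyc)
print(found)
```

Unlike a decompiler, this loses the variable names that go with each string, but it is often enough to spot a key or password.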

Caching out

To investigate just how widespread this problem was, I wrote a short script to search GitHub for .pyc files and decompile them to look for secrets. I ended up finding thousands of Twitter keys, Stripe tokens, AWS credentials, and social media passwords. I alerted any organizations whose keys I found this way.

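import base64
import io
import os
import tempfile

import uncompyle6
from github import Github

# GitHub's code search API requires authentication, so read a personal
# access token from the environment.
GITHUB_KEY = os.environ.get("GITHUB_KEY")

g = Github(GITHUB_KEY)
# Search all of GitHub for checked-in files named secrets.pyc.
items = g.search_code("filename:secrets.pyc")
for item in items:
    print(f"DECOMPILING REPO https://github.com/{item.repository.full_name}")
    print(f"OWNER TYPE: {item.repository.owner.type}")
    try:
        # The API returns file contents base64-encoded.
        contents = base64.b64decode(item.content)
        # decompile_file is given a filename here, so write the bytecode to a temp file.
        with tempfile.NamedTemporaryFile(suffix=".pyc") as f:
            f.write(contents)
            f.flush()  # make sure the bytes are on disk before decompiling

            out = io.StringIO()
            uncompyle6.decompile_file(f.name, out)
            print(out.getvalue())
    except Exception as e:
        # Not every .pyc decompiles cleanly; report it and move on.
        print(e)
        print(f"COULD NOT DECOMPILE REPO https://github.com/{item.repository.full_name}")
        continue
    print("\n\n\n")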

Try this out yourself!

This post comes with a small capture-the-flag style lab for you to try out this style of attack yourself.

You can find it at https://github.com/veggiedefender/pyc-secret-lab/

Takeaways

Cached bytecode is a low-level internal performance optimization, which is the kind of thing Python was supposed to free us from having to think about! The contents of .pyc files are inscrutable without special tools like a disassembler or decompiler. And when these files are buried inside __pycache__ (the double underscores signal “keep out; internal use only”), they’re easy to overlook. Many text editors and IDEs hide these folders and files from the source tree to avoid cluttering up the screen, making it easy to forget that they even exist.

That is to say, it is very easy for an experienced programmer to accidentally commit their secrets, and all but guaranteed that a beginner will make this mistake. Avoiding this requires either getting lucky with a good gitignore, or intermediate knowledge of git and Python internals.

Things you can do:

  • Look through your repositories for loose .pyc files, and delete them
  • If any committed .pyc files contain secrets, revoke and rotate those secrets; deleting the files does not remove them from the git history
  • Use a standard gitignore to prevent checking in .pyc files
  • Use JSON files or environment variables for configuration
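
As a starting point for the first item, here is a minimal sketch that lists every .pyc file under a directory. The function name find_pyc_files and the demo directory are assumptions for illustration. It only looks at the working tree, so files that were committed earlier and deleted since will not show up, even though they are still in the git history.

```python
import pathlib
import tempfile


def find_pyc_files(root):
    """Return every .pyc file under root, sorted by path."""
    return sorted(p for p in pathlib.Path(root).rglob("*.pyc") if p.is_file())


# Demo on a throwaway directory laid out like a small project.
demo_root = pathlib.Path(tempfile.mkdtemp())
(demo_root / "__pycache__").mkdir()
(demo_root / "__pycache__" / "secrets.cpython-38.pyc").write_bytes(b"")
(demo_root / "app.py").write_text("import secrets\n")

stray = find_pyc_files(demo_root)
for path in stray:
    print(path.relative_to(demo_root))
```

To check a real project, call find_pyc_files(".") from the repository root and compare the results against what git is actually tracking.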