Mining for Dependencies in Python

Alex McKenzie
7 min, 1330 words

In this blog post I'll outline some techniques for figuring out what dependencies a Python project has when they aren't specified anywhere, and the difficulties involved. It ends with my recommendation. (TL;DR: use pigar.)

What are good use cases for this process? Perhaps you have inherited a legacy Python script, and you would like to work on it but can't quite get it to run. Perhaps it doesn't have a requirements.txt, Pipfile, setup.py, setup.cfg or pyproject.toml (this list makes me sad), or perhaps it does but it's broken. Perhaps you wrote it two years ago and the required packages are some subset of pip list. Perhaps once upon a time it ran on your computer but now, several pip upgrades later, it simply won't start. These are all legitimate reasons to try to figure out the dependencies of some Python code.

What isn't a good use case? Starting a new project. As we will explore below, Python's dynamic nature and introspection capabilities mean that it is impossible in theory, and impractical in practice, to find all the packages imported by an arbitrary Python script without actually running it. That approach is especially hard to justify given that there are several better options for starting a Python project in 2019. Here's just one method you might choose:

  1. Install poetry with pip install --user poetry
  2. Create the folder my_library using poetry new my_library
  3. If you need the package numpy while working on my_library, run poetry add numpy
  4. Run a script my_file.py with poetry run python my_file.py

Poetry, a super tool, will take care of all dependencies and save them in the standardised pyproject.toml, recording the specific versions it installed in poetry.lock. There's lots of cool stuff to discover in the documentation. If you're not a fan of Poetry, there are other tools out there (primarily Pipenv, but plenty of people still use virtualenv + virtualenvwrapper). All Python IDEs support virtualenvs (either natively or via plugins). Whatever approach you take, it'll be a million times better than having to rely on something like this blog post!

A Naive Approach

OK, so we have our legacy_project folder, and we wish to resolve its dependencies. Surely the easiest thing to do would be to search the code for import <thing> or from <package> import <thing>, then just try pip install each of the matched imports. Something like the following:

import re
import subprocess
import sys

from pathlib import Path

def naive_build_package_list(dir_path):
    python_files = Path(dir_path).glob('**/*.py')
    return {package for path in python_files for package in find_imports(path)}

def find_imports(file_path):
    with file_path.open() as f:
        for line in f:
            # Matches both `import <thing>` and `from <package> import <thing>`,
            # capturing the first (top-level) name in either case.
            match = re.search(r'^\s*(?:import|from)\s+(\w+)', line)
            if match:
                yield match.group(1)

def pip_install(package):
    # Install into the interpreter running this script, rather than
    # whichever pip happens to be first on PATH.
    subprocess.call([sys.executable, '-m', 'pip', 'install', package])

if __name__ == '__main__':
    for package in naive_build_package_list(sys.argv[1]):
        pip_install(package)

For a simple script, this might work. In the real world, though, things quickly become considerably more complicated.

Problem 1: Local imports might conflict with PyPI packages

Suppose we have a directory structure like this:

my_package
    │
    ├ legacy_utils.py
    └ main.py

If main.py contains the line import legacy_utils, our naive script will try and fail to pip-install legacy_utils. That's perhaps inefficient but harmless. However, if we instead have a structure like:

my_package
    │
    ├ pendulum.py # perhaps this is a "swinging pendulum" simulation
    └ main.py

...then the line import pendulum in main.py will result in the PyPI package pendulum being installed by the script, completely unnecessarily, since the local import takes precedence.

This is not a major problem: even with the unnecessary packages installed, main.py will still run, assuming the rest of its dependencies are correctly resolved. It could, however, lead to a lot of garbage being installed, since many common script names (utils, config, helpers, etc.) have corresponding PyPI entries.

We could fix this problem by checking the local directory for a matching module before trying to pip install anything, as in the sketch below.
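
Here's a minimal sketch of that filter, reusing naive_build_package_list from the naive script above (the helper names are mine, and it only checks the top level of the project, not nested sub-packages):

from pathlib import Path

def is_local_module(name, dir_path):
    # A name counts as "local" if the project contains a matching module or package.
    root = Path(dir_path)
    return (root / f'{name}.py').exists() or (root / name / '__init__.py').exists()

def build_package_list(dir_path):
    return {name for name in naive_build_package_list(dir_path)
            if not is_local_module(name, dir_path)}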

Problem 2: The PyPI name and the import name can be different

For some packages, e.g. wheel-inspect (which I will use in the script below), the name the package is listed under on PyPI is not the same as the name it makes available for import. In the case of wheel-inspect, we must import wheel_inspect, with an underscore. In fact, there is no requirement for the import names and the package name to have anything to do with one another at all! Sometimes there is a good reason for this pattern: a Python client for some API might prefix its PyPI package name with "py" in order to distinguish it on search engines from implementations in other languages, while inside a Python script the prefix would only add noise.

The upshot is that our naive script will not work for e.g. import wheel_inspect, as it will try pip install wheel_inspect rather than pip install wheel-inspect.

The simplest way of solving this problem that I can see is to maintain a mapping from import names to possible PyPI package names. We could build this mapping by downloading the wheel for each of the top 1000 or so PyPI packages and looking at what the metadata tells us its import names ought to be. Here's a simple implementation of that idea. (This script only works with packages distributed as wheels, but it would be easy to extend it to other distribution formats.)

from subprocess import call
from pathlib import Path
from collections import defaultdict
import os
import json

import requests
from wheel_inspect import inspect_wheel

def get_top_packages(n=5000):
    url = 'https://hugovk.github.io/top-pypi-packages/top-pypi-packages-365-days.json'
    data = requests.get(url).json()
    return [row['project'] for row in data['rows'][:n]]

def build_module_mapping(package_names):
    # Download each package (without dependencies) into a scratch directory
    # and read the importable top-level names out of its wheel metadata.
    temp_dir = 'temp_dir'
    try:
        os.mkdir(temp_dir)
    except FileExistsError:
        pass
    os.chdir(temp_dir)
    package_to_import_name = {}
    for package in package_names:
        call(['pip', 'download', package, '--no-deps'])
        try:
            wheel_path = next(Path('.').glob('*.whl'))
        except StopIteration: # package is not distributed as a wheel
            for leftover in Path('.').glob('*'):
                leftover.unlink() # clean up the sdist, if anything was downloaded
            continue
        wheel_info = inspect_wheel(wheel_path.name)
        # top_level.txt lists the importable names; fall back to the project
        # name itself if the wheel doesn't include that metadata.
        package_to_import_name[package] = wheel_info['dist_info'].get(
            'top_level', [package])
        wheel_path.unlink()

    # Invert the mapping: one import name may correspond to several packages.
    return reverse_dict(package_to_import_name)

def reverse_dict(d):
    reversed_dict = defaultdict(list)
    for (key, vals) in d.items():
        for val in vals:
            reversed_dict[val].append(key)
    return reversed_dict

if __name__ == '__main__':
    packages = get_top_packages(n=50)
    module_mapping = build_module_mapping(packages)
    # Note: build_module_mapping chdir'd into temp_dir, so the JSON lands there too.
    with open('module_mapping.json', 'w') as f:
        json.dump(module_mapping, f)
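
To tie the two scripts together, the naive installer's pip_install step could consult this mapping before guessing. A rough sketch (the fallback behaviour, assuming that an unknown import name matches its PyPI name, is my own choice rather than anything the scripts above do):

import json
import subprocess
import sys

with open('module_mapping.json') as f:
    import_name_to_package = json.load(f)

def pip_install(import_name):
    # Try every PyPI package known to provide this import name; if the name
    # isn't in the mapping, fall back to assuming the two names coincide.
    for package in import_name_to_package.get(import_name, [import_name]):
        subprocess.call([sys.executable, '-m', 'pip', 'install', package])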

Problem 3: Imports can be non-standard

Since Python is dynamic and has powerful introspection capabilities, the code that imports a library might not look like a straightforward import <package> or from <package> import <object> statement. To drive the point home, the following is a silly but valid Python import:

import codecs
exec(codecs.encode('vzbeg ahzcl', 'rot13')) # decodes to `import numpy`

Of course, no reasonable script would import this way unless it was intentionally trying to obfuscate, but there are certainly other non-standard import mechanisms in real-world use, such as importlib.import_module or __import__, which load modules programmatically or for convenience.
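
For instance, something like the following (the plugin scenario is invented for illustration, and it assumes numpy is installed) pulls in numpy without the literal text import numpy ever appearing in the source, so our regex has no chance:

import importlib

# Imagine this name arrives from a config file or a command-line flag.
plugin_name = 'numpy'

plugin = importlib.import_module(plugin_name)  # equivalent to `import numpy`
legacy = __import__('numpy')                   # the low-level hook behind the import statement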

As an example, a codebase I once worked on had a "magic data loader" which augmented the Python import system so that, when running from loader import data, it would search a particular directory for a file matching the name data, deserialize it based on its file extension, and return the result as the value of data. Certainly clever, but not particularly easy to follow!

Some non-standard imports can be found by parsing the AST of a script, rather than crudely grepping through the source code as I have done here, but for pathological imports (such as if the loader example above had been implemented as import data rather than from loader import data) there's no getting around reading and/or running the code.
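
For completeness, here's a minimal sketch of the AST approach, which also copes with aliased and multi-name imports that a line-based regex can mangle (it still only sees literal import statements, of course):

import ast
import sys

def find_imports_ast(source):
    # Walk the syntax tree and collect the top-level name from every
    # `import ...` and absolute `from ... import ...` statement.
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split('.')[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split('.')[0])
    return found

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        print(find_imports_ast(f.read()))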

Conclusion - and recommendation

There is no silver bullet for resolving all these issues perfectly. Having said that, if you're actually looking to solve this problem, use pigar. It's a well-designed tool which will work for 99% of use cases. I haven't battle-tested it extensively, but until further notice it will be my official recommendation. Here's hoping we don't find any more un-runnable scripts lying around!