Task

Secure the Pandas library and its dependencies in a Python environment, ensuring safe operation particularly in restricted or "jailed" setups. Identify potential security risks involving unsafe system operations and mitigate them.

Pandas 2.2.2 dependencies

pandas==2.2.2 ├── numpy [1.22.4] ├── python-dateutil [2.9.0.post0] │ └── six [1.16.0] ├── pytz [2024.1] └── tzdata [2024.1]

In Pandas, operations that could potentially lead to unauthorized system access are typically related to:

  1. File I/O Operations: These involve reading from or writing to files on the filesystem. If these functions are used improperly, they can expose or alter sensitive files.
  2. External Command Execution: While Pandas itself does not generally execute system commands, it might call libraries or extensions that do.
  3. Network Operations: Pandas does not directly support many network operations, but it can interact with URLs or network data sources through its data I/O functions.

Steps to install the modified library

  1. Create a new python environment with version 3.10.14

  2. Run pip install -r requirements.txt

  3. cd into the numpy directory and run pip install .

  4. cd into the pandas directory and run pip install .

Code Analysis

Analyzing pandas source code for unsafe operations

  1. Mitigating XSS vulnerabilities

    Issue: By default, jinja2 sets autoescape to False.

    Solution : By setting autoescape=True, you mitigate XSS vulnerabilities by automatically escaping variables in your templates, which helps prevent malicious injection of code into your HTML output.

The snippets below are the modified snippets

Location: ./pandas/io/formats/style.py:3615:18

3614	        class MyStyler(cls):  # type: ignore[valid-type,misc]
3615	            env = jinja2.Environment(loader=loader, autoescape=True)
3616	            if html_table:

Location: ./pandas/io/formats/style_render.py:73:10

72	    loader = jinja2.PackageLoader("pandas", "io/formats/templates")
73	    env = jinja2.Environment(loader=loader, trim_blocks=True, autoescape=True)
74	    template_html = env.get_template("html.tpl")

Location: ./pandas/tests/io/formats/style/test_html.py:22:10

21	    loader = jinja2.PackageLoader("pandas", "io/formats/templates")
22	    env = jinja2.Environment(loader=loader, trim_blocks=True, autoescape=True)
23	    return env

Location: ./web/pandas_web.py:453:16

452	    templates_path = os.path.join(source_path, context["main"]["templates_path"])
453	    jinja_env = jinja2.Environment(loader=jinja2.FileSystemLoader(templates_path), autoescape=True)
454
  1. Using sanitized URLs

Issue: Usage of urllib.request.urlopen, which can potentially allow opening URLs with file scheme (file://) or custom schemes. Allowing these schemes can lead to unexpected behavior and security vulnerabilities.

Solution : To mitigate this issue, you should review the usage of urllib.request.urlopen and ensure that it only opens URLs with permitted schemes. In our case, I restricted it to http:// and https:// schemes for accessing remote resources.

Location: ./pandas/io/common.py:288:11

(Old Snippet)

def urlopen(*args: Any, **kwargs: Any) -> Any:
    """
    Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
    the stdlib.
    """
    import urllib.request
    return urllib.request.urlopen(*args, **kwargs)  # noqa: TID251

(New Snippet)

def urlopen(*args: Any, **kwargs: Any) -> Any:
    """
    Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
    the stdlib.
    """
    import urllib.request

    """
    Safe wrapper for urllib.request.urlopen that restricts permitted schemes.
    """
    permitted_schemes = ('http', 'https')
    url = args[0] if args else kwargs.get('url', None)
    if not url:
        raise ValueError("URL not provided.")

    parsed_url = urllib.parse.urlparse(url)
    if parsed_url.scheme not in permitted_schemes:
        raise ValueError(f"Unsupported URL scheme: {parsed_url.scheme}")

    try:
        return urllib.request.urlopen(*args, **kwargs)
    except urllib.request.URLError as e:
        print(f"Failed to open URL: {url}. Error: {e}")
        return None

There is no need to make any modifications to IO utilities such as pandas.read_csv(), DataFrame.to_csv() and other related functions as when you study the call graph any access to a file is eventually handled by builtin open(), which appropriately handles any kind of unauthorized access to a file gracefully.

Analyzing numpy source code for unsafe operations

  1. Mitigating shell injection attacks

Issue : getstatusoutput from the subprocess module, starts a process with a shell. This can lead to security vulnerabilities, such as shell injection attacks if the command includes unsanitized input.

Solution: To mitigate this risk, we need to avoid getstatusoutput and instead use the subprocess.run function.

Location: ./numpy/distutils/cpuinfo.py:29:25

(old snippet)

28	    try:
29	        status, output = getstatusoutput(cmd)
30	    except OSError as e:

(new snippet)

28          try:
29              args = cmd.split()
30              result = run(args, capture_output=True, text=True)
31              status = result.returncode
32              output = result.stdout
33          except OSError as e:

Issue: Using shell=True in subprocess.Popen is that it can introduce security vulnerabilities, particularly command injection attacks, if the input is not properly sanitized

Solution : To mitigate this risk, you should avoid using shell=True and instead provide the command as a list of arguments.

Location: ./numpy/distutils/exec_command.py:283:15

(old snippets)

282	        # it encounters an invalid character; rather, we simply replace it
283	        proc = subprocess.Popen(command, shell=use_shell, env=env, text=False,
284	                                stdout=subprocess.PIPE,
285	                                stderr=subprocess.STDOUT)
286	    except OSError:

(new snippets)

284        if isinstance(command, str):
285                    command = command.split()
286                proc = subprocess.Popen(command, env=env, text=False,
287                                        stdout=subprocess.PIPE,
288                                        stderr=subprocess.STDOUT)
  1. Unsanitized URLs

Issue: Using urlopen without validating the URL can allow the use of potentially dangerous schemes like file:// or custom schemes, which can introduce security vulnerabilities.

Solution : To mitigate this risk, you should restrict the URL schemes to only those that are necessary and safe (e.g., http and https).

Location: ./numpy/lib/_datasource.py:336:17

(old snippet)

def _isurl(self, path):
        """Test if path is a net location.  Tests the scheme and netloc."""

        # We do this here to reduce the 'import numpy' initial import time.
        from urllib.parse import urlparse

        # BUG : URLs require a scheme string ('http://') to be used.
        #       www.google.com will fail.
        #       Should we prepend the scheme for those that don't have it and
        #       test that also?  Similar to the way we append .gz and test for
        #       for compressed versions of files.

        scheme, netloc, upath, uparams, uquery, ufrag = urlparse(path)
        return bool(scheme and netloc)

(new snippet)

def _isurl(self, path):
        """Test if path is a net location.  Tests the scheme and netloc."""

        # We do this here to reduce the 'import numpy' initial import time.
        from urllib.parse import urlparse

        parsed_url = urlparse(path)

        # Check if both scheme and netloc are present
        if parsed_url.scheme and parsed_url.netloc:
            return True

        # If scheme or netloc is missing, prepend 'http://' to the path and re-parse
        if not parsed_url.scheme:
            parsed_url = urlparse('http://' + path)
            return bool(parsed_url.scheme and parsed_url.netloc)

        return False

Next Steps

  1. The first next natural step is to the extend on our work, find more of such security flaws in pandas/numpy source code, or may be identify some new ones.

  2. Although I scouted through some other dependencies to look for security flaws, I might have just missed those. So we also need to closely examine the security flaws in python-dateutil, pytz, tzdata and fix them.

  3. On any further development of pandas or any of its dependencies, practice secure coding, don't expose any vulnerabilities.