10 Unknown Security Pitfalls for Python

Dennis Brinkrolf

Security Researcher

November 16, 2021

In this blog post, we share 10 security pitfalls for Python developers that we encountered in real-world projects.

Python developers trust their applications to have a solid security state due to the use of standard libraries and common frameworks. However, within Python, just like in any other programming language, there are certain features that can be misleading or misused by developers. Often it is only a very minor subtlety or detail that can make developers slip and add a severe security vulnerability to the code base.

In this blog post, we share 10 security pitfalls we encountered in real-world Python projects. We chose pitfalls that we believe are less known in the developer community. By explaining each issue and its impact we hope to raise awareness and sharpen your security mindset. If you are using any of these features, make sure to check your Python code!

1. Optimized Asserts

Python offers the ability to execute code in an optimized way, for example via the -O command-line option. This allows the code to run faster and with less memory. It is especially effective when the application is used on a large scale or when there are few resources available. Some pre-packaged Python applications are provided with optimized bytecode. However, when code is optimized, all assert statements are ignored. Developers sometimes use them to verify certain conditions within the code. If an assert is used, for example, as part of an authentication check, this can lead to a security bypass.

def superuser_action(request, user):
    assert user.is_super_user
    # execute action as super user

view.py

In this example, the assert statement in line 2 would be ignored and every non-super user could reach the next lines of code. It is not recommended to use assert statements for security-related checks but we do see them in real-world applications.
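As a minimal sketch of the alternative, the check can be written as an explicit condition that raises an exception, which is not removed in optimized mode. The User class and the use of PermissionError are assumptions made for illustration; a Django view would typically raise its own permission exception instead.

```python
class User:
    """Hypothetical stand-in for the application's user model."""

    def __init__(self, is_super_user):
        self.is_super_user = is_super_user


def superuser_action(user):
    # An explicit check is still executed when Python runs with -O or -OO.
    if not user.is_super_user:
        raise PermissionError("super user required")
    # execute action as super user
    return "action executed"
```

Calling superuser_action(User(False)) raises PermissionError regardless of the optimization level.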

2. MakeDirs Permissions

The function os.makedirs creates one or more folders in the file system. Its second parameter mode is used to specify the default permission of the created folders. In line 2 of the following code snippet, the folders A/B/C are created with rwx------ (0o700) permission. This implies that only the current user (owner) has read, write and execute rights for these folders.

def init_directories(request):
    os.makedirs("A/B/C", mode=0o700)
    return HttpResponse("Done!")

view.py

In Python < 3.7, the folders A, B and C are each created with permission 700. However, in Python >= 3.7, only the last folder C has permission 700 and the other folders A and B are created with the default permission (typically 755, depending on the umask). So, with Python >= 3.7, the function os.makedirs has the same properties as the Linux command: mkdir -m 700 -p A/B/C. Some developers are unaware of the difference between the versions and it has already led to a permission escalation vulnerability in Django (CVE-2020-24583) and, in a very similar way, to a hardening bypass in WordPress.
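One way to get the same permission on every level is to create the folders one by one. The following is only a sketch under the assumption of a POSIX file system; the helper name makedirs_private is hypothetical and not part of the standard library.

```python
import os


def makedirs_private(path, mode=0o700):
    """Create every missing directory in `path` with the same mode."""
    current = os.sep if os.path.isabs(path) else ""
    for part in os.path.normpath(path).split(os.sep):
        if not part:
            continue
        current = os.path.join(current, part)
        # Directories that already exist are left untouched.
        if not os.path.isdir(current):
            os.mkdir(current, mode)
            # chmod makes the result independent of the process umask.
            os.chmod(current, mode)
```

With this helper, makedirs_private("A/B/C") leaves A, B and C with rwx------ on any Python 3 version.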

3. Absolute Path Joins

The os.path.join(path, *paths) function is used to join multiple file path components into a combined file path. The first parameter usually contains the basepath while each further parameter is appended to the basepath as a component. However, the function has a peculiarity that some developers are not aware of. If one of the appended components starts with a /, all previous components including the basepath are removed and this component is treated as an absolute path. The following example shows this possible pitfall for developers.

def read_file(request):
    filename = request.POST['filename']
    file_path = os.path.join("var", "lib", filename)
    if file_path.find(".") != -1:
        return HttpResponse("Failed!")
    with open(file_path) as f:
        return HttpResponse(f.read(), content_type='text/plain')

view.py

In line 3, the resulting path is constructed from the user-controlled input filename using the os.path.join function. In line 4, the resulting path is checked to see if it contains a . to prevent a path traversal vulnerability. However, if the attacker passes the filename parameter /a/b/c.txt then the resulting variable file_path in line 3 is an absolute file path. The var/lib components including the basepath are now ignored by os.path.join and an attacker can read any file without using a single . character. Although this behavior is described in the os.path.join documentation it has led to numerous vulnerabilities in the past (Cuckoo Sandbox Evasion, CVE-2020-35736).
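A common mitigation is to resolve the joined path and verify that it is still located below the base directory. The sketch below assumes a POSIX system; BASE_DIR and resolve_inside_base are hypothetical names chosen for this example.

```python
import os

# Hypothetical base directory, resolved once to an absolute path.
BASE_DIR = os.path.realpath(os.path.join("var", "lib"))


def resolve_inside_base(filename, base_dir=BASE_DIR):
    """Return the resolved path, or None if it would leave base_dir."""
    candidate = os.path.realpath(os.path.join(base_dir, filename))
    # An absolute or "../" filename resolves outside base_dir and is rejected.
    if os.path.commonpath([base_dir, candidate]) != base_dir:
        return None
    return candidate
```

Both /a/b/c.txt and ../../etc/passwd are rejected here, because the check runs on the final resolved path instead of on individual characters.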

4. Arbitrary Temp Files

The tempfile.NamedTemporaryFile function is used to create temporary files with a specific name. However, the prefix and suffix parameters are vulnerable to a path traversal attack (Issue 35278). If an attacker controls one of these parameters, they can create a temporary file at an arbitrary location in the file system. The following example shows a possible pitfall for developers.

def touch_tmp_file(request):
    id = request.GET["id"]
    tmp_file = tempfile.NamedTemporaryFile(prefix=id)
    return HttpResponse(f"tmp file: {tmp_file} created!", content_type="text/plain")

view.py

In line 3, the user input id is used as a prefix for the temporary file. If an attacker passes the payload /../var/www/test as the id parameter, the following tmp file is created: /var/www/test_zdllj17. This may sound harmless at first glance, but it provides an attacker a basis for exploiting more complex vulnerabilities.
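If a user-supplied prefix is really needed, it can be restricted to a small set of harmless characters before it reaches the tempfile module. This is a sketch only; the allowed character set, the length limit and the name SAFE_PREFIX are assumptions for illustration.

```python
import re
import tempfile

# Hypothetical policy: letters, digits, "_" and "-" only, at most 32 characters.
SAFE_PREFIX = re.compile(r"[A-Za-z0-9_-]{1,32}")


def touch_tmp_file(user_prefix):
    """Create a temporary file whose prefix cannot contain path separators."""
    if not SAFE_PREFIX.fullmatch(user_prefix):
        raise ValueError("invalid prefix")
    # delete=False keeps the file so the caller decides when to remove it.
    tmp_file = tempfile.NamedTemporaryFile(prefix=user_prefix, delete=False)
    tmp_file.close()
    return tmp_file.name
```

The payload /../var/www/test is rejected with a ValueError because it contains / and . characters.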

5. Extended Zip Slip

Extracting uploaded file archives is a common feature in web applications. In Python, the functions TarFile.extractall and TarFile.extract are known to be vulnerable to a Zip Slip attack. That's when an attacker tampers with the file names inside an archive so that they contain path traversal (../) characters. That's why archive entries should always be considered as untrusted sources. The zipfile.extractall and zipfile.extract functions sanitize zip entries and thus prevent such path traversal vulnerabilities. However, this does not mean that a path traversal vulnerability can’t occur within the ZipFile library. The following example shows code for extracting zip files.

def extract_html(request):
    filename = request.FILES["filename"]
    zf = zipfile.ZipFile(filename.temporary_file_path(), "r")
    for entry in zf.namelist():
        if entry.endswith(".html"):
            file_content = zf.read(entry)
            with open(entry, "wb") as fp:
                fp.write(file_content)
    zf.close()
    return HttpResponse("HTML files extracted!")

view.py

In line 3, a ZipFile handler is created from the temporary path of the uploaded user file. In lines 4 - 8, all zip entries ending with .html are extracted. The function zf.namelist in line 4 returns the names of the entries within the zip file. Note that only the zipfile.extract and zipfile.extractall functions sanitize the entries, not any of the other functions. In this case, an attacker can create an entry whose name contains path traversal sequences, e.g. one that points into ../../../var/www/html, with arbitrary content. The contents of the malicious file are read in line 6 and written to the attacker-controlled path in lines 7-8. As a result, an attacker is able to create arbitrary HTML files anywhere on the server that the application can write to.

As mentioned above, entries inside an archive should be considered untrusted. If you don’t use zipfile.extractall or zipfile.extract you should always sanitize the names of the zip entries e.g. by using os.path.basename. Otherwise it could lead to a critical security vulnerability like the one found in NLTK Downloader (CVE-2019-14751).
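The os.path.basename approach could look roughly like the following sketch. The function signature and the target_dir parameter are assumptions for this example, and the code assumes a POSIX system where / is the only path separator.

```python
import os
import zipfile


def extract_html(archive, target_dir):
    """Write the .html entries of `archive` directly into target_dir."""
    written = []
    with zipfile.ZipFile(archive) as zf:
        for entry in zf.namelist():
            # basename drops "../" sequences and any other directory part.
            name = os.path.basename(entry)
            if not name.endswith(".html"):
                continue
            destination = os.path.join(target_dir, name)
            with open(destination, "wb") as fp:
                fp.write(zf.read(entry))
            written.append(destination)
    return written
```

An entry named ../../evil.html ends up as evil.html inside target_dir. Note that this flattens the directory structure of the archive, which may or may not be acceptable for a given application.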

6. Incomplete Regex Match

Regular expressions (regex) are an integral part of most web applications. We commonly see them used by custom Web Application Firewalls (WAF) for input validation, e.g. to detect malicious strings. In Python, there is a subtle difference between re.match and re.search that we would like to demonstrate in the following code snippet.

def is_sql_injection(request):
    pattern = re.compile(r".*(union)|(select).*")
    name_to_test = request.GET["name"]
    if re.match(pattern, name_to_test):
        return True
    return False

view.py

In line 2, a pattern is defined that matches a union or select to detect a possible SQL Injection. This is a terrible idea, as you can often bypass these deny lists, but we’ve seen it in real-world applications. In line 4, the function re.match is used with the previously defined pattern to check if the user input name from line 3 contains any of these malicious values. However, unlike the re.search function, the re.match function only matches at the beginning of the string, and the . in the pattern does not match newline characters by default. For example, if an attacker submitted the value aaaaaa \n union select, the user input would not match the regex. As a result, the check can be bypassed and does not provide any protection. Overall, we do not recommend using a regex deny list for any security checks.
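The difference between the two functions can be reproduced in isolation with the pattern and payload from this section. The variable names are chosen for this illustration only.

```python
import re

PATTERN = re.compile(r".*(union)|(select).*")
payload = "aaaaaa \n union select"

# re.match only tries the pattern at position 0, and "." stops at the newline,
# so neither alternative can reach the keywords on the second line.
match_result = PATTERN.match(payload)

# re.search tries every position and finds "union" on the second line.
search_result = PATTERN.search(payload)
```

Here match_result is None while search_result is a match object, so a check built on re.match lets the payload through.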

7. Unicode Sanitizer Bypass

Unicode allows characters to be used in multiple representations and maps these characters to codepoints. In the Unicode standard, four normalizations are defined for different Unicode characters. An application can use these normalizations to store data, such as a user name, in a uniform way independent of the human language. However, an attacker can exploit these normalizations, and that has already led to a vulnerability in Python's urllib (CVE-2019-9636). The following code snippet demonstrates a Cross-Site Scripting (XSS) vulnerability based on the NFKC normalization.

import unicodedata
from django.shortcuts import render
from django.utils.html import escape

def render_input(request):
    user_input = escape(request.GET["p"])
    normalized_user_input = unicodedata.normalize("NFKC", user_input)
    context = {"my_input": normalized_user_input}
    return render(request, "test.html", context)

view.py

In line 6, the user input is sanitized by Django's escape function to prevent an XSS vulnerability. In line 7, the sanitized input is normalized via the NFKC algorithm so that it is correctly rendered in lines 8-9 through the test.html template.

<!DOCTYPE html>
<html lang="en">
    <body>
        {{ my_input | safe }}
    </body>
</html>

templates/test.html

Within the template test.html, the variable my_input in line 4 is marked as safe because the developer expects special characters and assumes that the variable has already been sanitized by the escape function. By using the keyword safe the variable is not sanitized additionally by Django. However, due to normalization in line 7 (view.py), the character %EF%B9%A4 is transformed to < and %EF%B9%A5 is transformed to >. This allows an attacker to inject arbitrary HTML tags and to trigger an XSS vulnerability. To prevent this vulnerability, user input should always be sanitized at the very last step, after it has been normalized.
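The effect of the ordering can be shown without Django. In this sketch, html.escape from the standard library stands in for Django's escape function, which is an assumption made to keep the example self-contained; the function name sanitize is hypothetical.

```python
import html
import unicodedata


def sanitize(user_input):
    """Normalize first, then escape, so escaping sees the final characters."""
    normalized = unicodedata.normalize("NFKC", user_input)
    return html.escape(normalized)


# U+FE64 and U+FE65 are the characters behind %EF%B9%A4 and %EF%B9%A5.
payload = "\ufe64script\ufe65"

# Vulnerable order: escaping leaves the payload untouched, NFKC then creates tags.
escaped_then_normalized = unicodedata.normalize("NFKC", html.escape(payload))
```

With the vulnerable order the result contains a literal <script> tag, whereas sanitize(payload) returns the escaped form.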

8. Unicode Case Collision

As mentioned above, Unicode characters are mapped to codepoints. However, there are many different human languages and Unicode tries to unify them. This also means that there is a high probability that different characters have the same "layout". For example, the lowercase Turkish ı (without a dot) character is I in uppercase.

In Latin-based alphabets, the character i is also I in uppercase. In Unicode terms, the two different characters are mapped to the same codepoint in uppercase. This behavior is exploitable and has already led to a critical vulnerability in Django (CVE-2019-19844). Let’s have a look at the following code example of a password reset feature.

from django.core.mail import send_mail
from django.http import HttpResponse
from vuln.models import User

def reset_pw(request):
    email = request.GET["email"]
    result = User.objects.filter(email__exact=email.upper()).first()
    if not result:
        return HttpResponse("User not found!")
    send_mail(
        "Reset Password",
        "Your new pw: 123456.",
        "from@example.com",
        [email],
        fail_silently=False,
    )
    return HttpResponse("Password reset email sent!")

view.py

In line 6 the user input email is provided and in lines 7-9 the provided email value is checked to see if a user with this given email exists. If the user exists, an email is sent to the user in line 10 by using the user-supplied email address from line 6. It is important to mention that the check of the email address in lines 7-9 is performed case-insensitively by using the upper function first. For the attack, we assume that a user with the email foo@mix.com exists in the database. An attacker can now simply pass foo@mıx.com as the email in line 6 where the i is replaced with the Turkish ı. In line 7 the email is then transformed to uppercase which results in FOO@MIX.COM. This means that a user has been found and a password reset email is sent. However, the email is sent to the untransformed email address from line 6 and therefore still contains the Turkish ı. In other words, the password of another user is sent to the attacker-controlled email address. To prevent this vulnerability, the recipient passed to send_mail in line 14 should be the user's email from the database instead of the user-supplied value. Even if a collision occurs, an attacker has no benefit from it in this context.
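The collision and the suggested fix can be sketched without a database. The USERS dictionary is a hypothetical stand-in for the user table, keyed by the uppercased email address.

```python
# Hypothetical user store: uppercased email -> email as stored in the database.
USERS = {"FOO@MIX.COM": "foo@mix.com"}


def reset_recipient(submitted_email):
    """Return the address the reset mail should go to, or None if unknown."""
    # The lookup may collide, but the returned value is always the stored
    # address, never the attacker-supplied one.
    return USERS.get(submitted_email.upper())


attacker_input = "foo@m\u0131x.com"  # contains the dotless Turkish ı
```

Even though attacker_input.upper() collides with FOO@MIX.COM, reset_recipient returns foo@mix.com, so the mail goes to the legitimate account owner.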

9. IP Address Normalisation

In Python < 3.8, IP addresses are normalized by the ipaddress library so that leading zeros are removed. This behavior might look harmless at first glance, but it has already led to a high-severity vulnerability in Django (CVE-2021-33571). An attacker can exploit the normalization to bypass potential validators for Server-Side Request Forgery (SSRF) attacks. The following code snippet shows how such a validator can be bypassed.

import requests
import ipaddress

def send_request(request):
    ip = request.GET["ip"]
    try:
        if ip in ["127.0.0.1", "0.0.0.0"]:
            return HttpResponse("Not allowed!")
        ip = str(ipaddress.IPv4Address(ip))
    except ipaddress.AddressValueError:
        return HttpResponse("Error at validation!")
    requests.get("https://" + ip)
    return HttpResponse("Request sent!")

view.py

In line 5, an IP address is given by a user, and in line 7, a denylist is used to check if the IP is a local address in order to prevent a possible SSRF vulnerability. The denylist is not complete and is only used as an example. In line 9 the code checks whether the provided IP is an IPv4 address and at the same time the IP is normalized. The actual request to the provided IP is performed on line 12 after all validations. However, an attacker could pass 127.0.00.1 as the IP address, which is not found in the denylist in line 7. Afterward, in line 9, the IP is normalized to 127.0.0.1 using ipaddress.IPv4Address. As a consequence, the attacker is able to bypass the SSRF validator and send requests to local network addresses.
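A more robust ordering is to parse the address first and run the checks on the parsed object rather than on the raw string. The following is a sketch, not a complete SSRF defense; the function name is_allowed_target and the selection of address properties are assumptions for this example.

```python
import ipaddress


def is_allowed_target(raw_ip):
    """Return True only for valid IPv4 addresses that are not internal."""
    try:
        address = ipaddress.IPv4Address(raw_ip)
    except ipaddress.AddressValueError:
        return False
    # The checks run on the parsed address, so alternative spellings of the
    # same address cannot slip past a string comparison.
    return not (
        address.is_loopback
        or address.is_private
        or address.is_unspecified
        or address.is_link_local
    )
```

The input 127.0.00.1 is rejected either way: depending on the Python version it is normalized to the loopback address or refused as invalid.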

10. URL Query Parsing

In Python versions without the fix for CVE-2021-23336 (the default behavior before Python 3.10, apart from patched security releases of older branches), the function urllib.parse.parse_qsl allows the use of both the ; and & characters as separators for URL query variables. What's interesting here is that the ; character is not recognized as a separator by other languages. In the following example, we would like to show why this behavior could lead to a vulnerability. Let's assume that we are running an infrastructure where the frontend is a PHP application and there is another internal Python application.

An attacker sends the following GET request to the PHP frontend:

GET https://victim.com/?a=1;b=2

The PHP frontend recognizes only one query variable: a with the content 1;b=2. PHP does not treat ; characters as separators for query variables. Now the frontend forwards the attacker's request to an internal Python application with the query variable a:

GET https://internal.backend/?a=1;b=2

If urllib.parse.parse_qsl is used, the Python application processes two query variables: a=1 and b=2. This difference in the parsing of query variables can lead to fatal security vulnerabilities, like the web cache poisoning vulnerability in Django (CVE-2021-23336).
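When upgrading is not immediately possible, the backend can parse the query string in the same way as the frontend by splitting on & only. The helper below is a hypothetical sketch that behaves the same on every Python 3 version; it is not a drop-in replacement for every feature of parse_qsl.

```python
from urllib.parse import unquote_plus


def strict_parse_query(query):
    """Split a query string on "&" only, treating ";" as ordinary data."""
    pairs = []
    for field in query.split("&"):
        if not field:
            continue
        key, _, value = field.partition("=")
        pairs.append((unquote_plus(key), unquote_plus(value)))
    return pairs
```

For the request above, strict_parse_query("a=1;b=2") yields a single variable a with the value 1;b=2, which matches what the PHP frontend sees.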

Summary

In this blog post, we introduced 10 Python security pitfalls that we believe are less known among developers. Each subtle pitfall can be easily overlooked and has led to security vulnerabilities in real-world applications in the past.

We have seen that pitfalls can occur in all kinds of operations, from processing files, directories, archives, URLs, and IPs to simple strings. A common pattern is the use of library functions which can have unexpected behavior. This reminds us to always upgrade to the latest version and to carefully read the documentation. At SonarSource, we are researching these pitfalls to continuously improve our code analyzers.
