Python developers often trust that using the standard library and common frameworks gives their applications a solid security posture. However, Python, like any other programming language, has features that can be misleading or easily misused. Often a very minor subtlety or detail is enough for a developer to slip and introduce a severe security vulnerability into the code base.
In this blog post, we share 10 security pitfalls we encountered in real-world Python projects. We chose pitfalls that we believe are less known in the developer community. By explaining each issue and its impact we hope to raise awareness and sharpen your security mindset. If you are using any of these features, make sure to check your Python code!
1. Optimized Asserts
Python can execute code in an optimized mode, for example when the interpreter is started with the -O flag. This allows the code to run faster and with less memory, which is especially useful when the application runs at large scale or when few resources are available. Some pre-packaged Python applications ship with optimized bytecode. However, when code is optimized, all assert statements are ignored. Developers sometimes use them to check certain conditions within the code. If an assert is used, for example, as part of an authentication check, this can lead to a security bypass.
In this example, the assert statement in line 2 would be ignored and any non-superuser could reach the next lines of code. Using assert statements for security-related checks is not recommended, but we do see them in real-world applications.
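The effect can be reproduced without restarting the interpreter, because the built-in compile function accepts an optimize argument. The following is a minimal sketch; the source string, the is_superuser flag and the helper names are hypothetical and only serve the illustration.

```python
# Hypothetical code under test: the assert acts as the only access check.
SOURCE = """
assert is_superuser, "access denied"
result = "privileged code reached"
"""


def run(optimize: int) -> str:
    """Compile SOURCE at the given optimization level and run it as a non-superuser."""
    namespace = {"is_superuser": False}
    code = compile(SOURCE, "<demo>", "exec", optimize=optimize)
    try:
        exec(code, namespace)
    except AssertionError:
        return "blocked by assert"
    return namespace["result"]


def require_superuser(is_superuser: bool) -> None:
    """Explicit check that survives optimization, unlike an assert."""
    if not is_superuser:
        raise PermissionError("access denied")


print(run(0))  # asserts are kept: "blocked by assert"
print(run(1))  # asserts are stripped, as with -O: "privileged code reached"
```

An explicit if statement that raises an exception, as in require_superuser, is not affected by the optimization level.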
2. MakeDirs Permissions
The function os.makedirs creates one or more folders in the file system. Its second parameter, mode, specifies the permissions of the created folders. In line 2 of the following code snippet, the folders A/B/C are created with rwx------ (0o700) permission. This means that only the current user (owner) has read, write and execute rights for these folders.
In Python 3.6 and earlier, the folders A, B and C are each created with permission 700. However, starting with Python 3.7, only the last folder C has permission 700, while the intermediate folders A and B are created with the default permission (typically 755, depending on the umask). So, from Python 3.7 on, the function os.makedirs has the same properties as the Linux command mkdir -m 700 -p A/B/C. Some developers are unaware of the difference between the versions, and it has already led to a permission escalation vulnerability in Django (CVE-2020-24583) and, in a very similar way, to a hardening bypass in WordPress.
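One way to avoid depending on the version-specific behavior is to create each folder level explicitly and set its mode. The following is a minimal sketch for POSIX systems; makedirs_private and folder_mode are hypothetical helpers, and the relative path is assumed to be trusted.

```python
import os
import stat
import tempfile
from pathlib import Path


def makedirs_private(base: str, relative: str, mode: int = 0o700) -> Path:
    """Create every missing folder of `relative` below `base` with `mode`.

    `relative` is assumed to be a trusted, relative path.
    """
    current = Path(base)
    for part in Path(relative).parts:
        current = current / part
        try:
            os.mkdir(current, mode)
        except FileExistsError:
            if not current.is_dir():
                raise
            continue  # pre-existing folder: its permissions are left untouched
        # os.mkdir applies the umask, so set the requested mode explicitly.
        os.chmod(current, mode)
    return current


def folder_mode(path: Path) -> int:
    """Return only the permission bits of a path."""
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as base:
    leaf = makedirs_private(base, "A/B/C")
    for folder in (leaf.parent.parent, leaf.parent, leaf):
        print(folder.name, oct(folder_mode(folder)))
```

Folders that already exist are deliberately not modified, so the caller still has to verify their permissions if they matter.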
3. Absolute Path Joins
The os.path.join(path, *paths) function is used to join multiple file path components into a combined file path. The first parameter usually contains the base path, while each further parameter is appended to the base path as a component. However, the function has a peculiarity that some developers are not aware of: if one of the appended components starts with a /, all previous components, including the base path, are discarded and this component is treated as an absolute path. The following example shows this possible pitfall for developers.
In line 3, the resulting path is constructed from the user-controlled input filename using the os.path.join function. In line 4, the resulting path is checked to see if it contains a . character to prevent a path traversal vulnerability. However, if the attacker passes an absolute path such as /etc/passwd as the filename parameter, the resulting variable file_path in line 3 is an absolute file path. The var and lib components, including the base path, are now ignored by os.path.join, and an attacker can read any file whose path does not contain a single . character. Although this behavior is described in the os.path.join documentation, it has led to numerous vulnerabilities in the past (Cuckoo Sandbox Evasion, CVE-2020-35736).
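The behavior, and one possible containment check, can be sketched as follows. posixpath is used for the demonstration so that the result does not depend on the operating system; safe_join is a hypothetical helper and not part of the standard library.

```python
import os
import posixpath
import tempfile

# An absolute component discards everything that came before it.
print(posixpath.join("var", "lib", "/a/b/c.txt"))  # /a/b/c.txt


def safe_join(base: str, user_path: str) -> str:
    """Join user_path onto base and refuse results that leave base."""
    base_real = os.path.realpath(base)
    candidate = os.path.realpath(os.path.join(base_real, user_path))
    if os.path.commonpath([base_real, candidate]) != base_real:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate


with tempfile.TemporaryDirectory() as base:
    print(safe_join(base, "notes.txt"))
    for attempt in ("/etc/passwd", "../outside.txt"):
        try:
            safe_join(base, attempt)
        except ValueError as error:
            print("rejected:", error)
```

Resolving the path first and then comparing it against the resolved base catches absolute components and ../ sequences alike, which a substring check on the joined path does not.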
4. Arbitrary Temp Files
The tempfile.NamedTemporaryFile function creates a temporary file that has a visible name in the file system. However, the prefix and suffix parameters are vulnerable to a path traversal attack (Issue 35278). If an attacker controls one of these parameters, they can create a temporary file at an arbitrary location in the file system. The following example shows a possible pitfall for developers.
In line 3, the user input id is used as a prefix for the temporary file. If an attacker passes the payload /../var/www/test as the id parameter, the following temporary file is created: /var/www/test_zdllj17. This may sound harmless at first glance, but it gives an attacker a basis for exploiting more complex vulnerabilities.
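A defensive option is to validate the user-supplied value against a strict allow list before it is used as a prefix. The following is a minimal sketch; create_user_tempfile and the allowed character set are assumptions made for this illustration.

```python
import os
import re
import tempfile

# Hypothetical policy: short identifiers without any path separators or dots.
_ALLOWED_PREFIX = re.compile(r"[A-Za-z0-9_-]{1,32}")


def create_user_tempfile(user_id: str, directory: str):
    """Create a named temporary file in `directory`, prefixed with a validated id."""
    if not _ALLOWED_PREFIX.fullmatch(user_id):
        raise ValueError(f"invalid identifier: {user_id!r}")
    return tempfile.NamedTemporaryFile(prefix=user_id + "_", dir=directory)


with tempfile.TemporaryDirectory() as workdir:
    with create_user_tempfile("user42", workdir) as handle:
        print(os.path.basename(handle.name).startswith("user42_"))  # True
    try:
        create_user_tempfile("/../var/www/test", workdir)
    except ValueError as error:
        print("rejected:", error)
```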
5. Extended Zip Slip
Extracting uploaded file archives is a common feature in web applications. In Python, the functions TarFile.extractall and TarFile.extract are known to be vulnerable to a Zip Slip attack. That's when an attacker tampers with the file names inside an archive so that they contain path traversal (../) characters. That's why archive entries should always be considered as untrusted sources. The ZipFile.extractall and ZipFile.extract functions sanitize zip entries and thus prevent such path traversal vulnerabilities. However, this does not mean that a path traversal vulnerability can't occur when using the zipfile library. The following example shows code for extracting zip files.
In line 3, a ZipFile handler is created from the temporary path of the uploaded user file. In lines 4-8, all zip entries ending with .html are extracted. The entry names returned by zf.namelist are used unchanged as the output file name in line 7. Note that only the ZipFile.extract and ZipFile.extractall functions sanitize the entries, not any of the other functions. In this case an attacker can create an entry with a file name such as ../../../var/www/html/index.html and arbitrary content. The contents of the malicious file are read in line 6 and written to the attacker-controlled path in lines 7-8. As a result, an attacker can create arbitrary HTML files anywhere on the server.
As mentioned above, entries inside an archive should be considered untrusted. If you don't use ZipFile.extractall or ZipFile.extract, you should always sanitize the names of the zip entries, e.g. by using os.path.basename. Otherwise it could lead to a critical security vulnerability like the one found in NLTK Downloader (CVE-2019-14751).
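The os.path.basename approach can be sketched as follows. extract_html is a hypothetical function written for this illustration, and the archive is built in memory with a deliberately malicious entry name.

```python
import io
import os
import tempfile
import zipfile


def extract_html(archive, dest_dir: str) -> list:
    """Write all .html entries of the archive into dest_dir, ignoring entry paths."""
    written = []
    with zipfile.ZipFile(archive) as zf:
        for entry in zf.namelist():
            # Keep only the final path component of the untrusted entry name.
            name = os.path.basename(entry)
            if not name.endswith(".html"):
                continue
            target = os.path.join(dest_dir, name)
            with open(target, "wb") as fp:
                fp.write(zf.read(entry))
            written.append(target)
    return written


# Build a test archive whose entry name tries to escape the target folder.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("../../evil.html", "<p>evil</p>")
    zf.writestr("docs/readme.txt", "ignored")
buffer.seek(0)

with tempfile.TemporaryDirectory() as dest:
    extract_html(buffer, dest)
    print(os.listdir(dest))  # ['evil.html']
```

Dropping the directory part means that two entries with the same base name overwrite each other, so this sketch only fits cases where a flat output folder is acceptable.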
6. Incomplete Regex Match
Regular expressions (regex) are an integral part of most web applications. We commonly see them used by custom Web Application Firewalls (WAF) for input validation, e.g. to detect malicious strings. In Python, there is a subtle difference between re.match and re.search that we would like to demonstrate in the following code snippet.
In line 2, a pattern is defined that matches a union or select to detect a possible SQL Injection. This is a terrible idea, as such deny lists can often be bypassed, but we've seen it in real-world applications. In line 4, the function re.match is used with the previously defined pattern to check if the user input name from line 3 contains any of these malicious values. However, unlike the re.search function, which scans the whole string, the re.match function only matches at the beginning of the string, and by default the . metacharacter does not match newline characters. For example, if an attacker submitted the value aaaaaa \n union select, the user input would not match the regex. As a result, the check can be bypassed and does not provide any protection. Overall, we do not recommend using a regex deny list for any security checks.
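The difference can be reproduced with a few lines. The pattern below is an assumption made for this illustration and is not necessarily the one used in the application described above.

```python
import re

# Hypothetical deny-list pattern; deny lists are not a recommended defense.
PATTERN = re.compile(r".*(union|select)", re.IGNORECASE)

payload = "aaaaaa \n union select"

# re.match anchors at the start of the string, and "." stops at the newline,
# so the keywords on the second line are never reached.
print(PATTERN.match(payload))  # None

# re.search tries every position and therefore finds the keywords.
print(PATTERN.search(payload) is not None)  # True
```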
7. Unicode Sanitizer Bypass
Unicode allows characters to be used in multiple representations and maps these characters to codepoints. The Unicode standard defines four normalization forms (NFC, NFD, NFKC and NFKD) that map equivalent representations to a single one. An application can use these normalizations to store data, such as a user name, in a uniform way independent of the human language. However, an attacker can exploit these normalizations, and that has already led to a vulnerability in Python's urllib (CVE-2019-9636). The following code snippet demonstrates a Cross-Site Scripting (XSS) vulnerability based on the NFKC normalization.
In line 6, the user input is sanitized by Django's escape function to prevent an XSS vulnerability. In line 7, the sanitized input is normalized via the NFKC algorithm so that it is correctly rendered in lines 8-9 through the test.html template.
Within the template test.html, the variable my_input in line 4 is marked as safe because the developer expects special characters and assumes that the variable has already been sanitized by the escape function. Because of the keyword safe, the variable is not sanitized again by Django. However, due to the normalization in line 7 (view.py), the URL-encoded character %EF%B9%A4 (﹤, U+FE64) is transformed to < and %EF%B9%A5 (﹥, U+FE65) is transformed to >. This allows an attacker to inject arbitrary HTML tags and to trigger an XSS vulnerability. To prevent this vulnerability, user input should always be sanitized at the very last step, after it has been normalized.
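The order of the two steps can be illustrated with the standard library alone. In this sketch html.escape stands in for Django's escape function, and the function names are hypothetical.

```python
import html
import unicodedata


def sanitize_then_normalize(value: str) -> str:
    """Vulnerable order: escaping happens before NFKC creates < and >."""
    return unicodedata.normalize("NFKC", html.escape(value))


def normalize_then_sanitize(value: str) -> str:
    """Safer order: normalize first, escape as the very last step."""
    return html.escape(unicodedata.normalize("NFKC", value))


# U+FE64 and U+FE65 are the small less-than and greater-than signs.
payload = "\ufe64script\ufe65alert(1)\ufe64/script\ufe65"

print(sanitize_then_normalize(payload))  # <script>alert(1)</script>
print(normalize_then_sanitize(payload))  # &lt;script&gt;alert(1)&lt;/script&gt;
```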
8. Unicode Case Collision
As mentioned above, Unicode characters are mapped to codepoints. However, there are many different human languages and Unicode tries to unify them. This also means that different characters can end up with the same representation after a transformation such as case conversion. For example, the lowercase Turkish ı (without a dot) character is I in uppercase. In Latin-based alphabets, the character i is also I in uppercase. In Unicode terms, the two different characters are mapped to the same codepoint in uppercase. This behavior is exploitable and has already led to a critical vulnerability in Django (CVE-2019-19844). Let's have a look at the following code example of a password reset feature.
In line 6, the user input email is provided, and in lines 7-9 the provided email value is checked to see if a user with this email exists. If the user exists, an email is sent to the user in line 10 by using the user-supplied email address from line 6. It is important to mention that the check of the email address in lines 7-9 is performed case-insensitively by applying the upper function first. For the attack, we assume that a user with the email foo@mix.com exists in the database. An attacker can now simply pass foo@mıx.com as the email in line 6, where the i is replaced with the Turkish ı. In line 7 the email is then transformed to uppercase, which results in FOO@MIX.COM. This means that a user has been found and a password reset email is sent. However, the email is sent to the untransformed email address from line 6, which still contains the Turkish ı. In other words, the password reset email for another user's account is sent to the attacker-controlled email address. To prevent this vulnerability, the recipient in line 10 can be replaced with the user's email from the database. Even if a collision occurs, an attacker gains nothing from it in this context.
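The collision and the suggested fix can be sketched with a dictionary in place of a real user database. USERS and reset_recipient are hypothetical names used only for this illustration.

```python
# Hypothetical user store: uppercased email -> email stored in the database.
USERS = {"FOO@MIX.COM": "foo@mix.com"}


def reset_recipient(supplied_email: str):
    """Return the address a reset mail should go to, or None if no user matches.

    The stored address is returned, never the user-supplied one.
    """
    return USERS.get(supplied_email.upper())


attacker_input = "foo@m\u0131x.com"  # contains the dotless Turkish ı (U+0131)

print(attacker_input.upper())           # FOO@MIX.COM
print(reset_recipient(attacker_input))  # foo@mix.com
```

The lookup still collides, but because the mail goes to the stored address, the attacker-controlled address never receives it.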
9. IP Address Normalization
In Python versions before 3.9.5 (and before 3.8.12 on the 3.8 branch), the ipaddress library accepts IPv4 addresses with leading zeros in at least some cases and normalizes them so that the leading zeros are removed. This behavior might look harmless at first glance, but it has already led to a high-severity vulnerability in Django (CVE-2021-33571). An attacker can exploit the normalization to bypass potential validators for Server-Side Request Forgery (SSRF) attacks. The following code snippet shows how such a validator can be bypassed.
In line 5, an IP address is given by a user, and in line 7, a denylist is used to check if the IP is a local address in order to prevent a possible SSRF vulnerability. The denylist is not complete and is only used as an example. In line 9, the code checks whether the provided IP is an IPv4 address and, at the same time, the IP is normalized. The actual request to the provided IP is performed in line 12 after all validations. However, an attacker could pass 127.0.00.1 as the IP address, which is not found in the denylist in line 7. Afterward, in line 9, the IP is normalized to 127.0.0.1 using ipaddress.IPv4Address. As a consequence, the attacker is able to bypass the SSRF validator and send requests to local network addresses.
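A more robust order is to parse the address first and to run the checks on the parsed object instead of the raw string. The following is a minimal sketch; is_safe_ipv4_target is a hypothetical helper, and the set of checked properties is an assumption that may be too narrow for a real deployment.

```python
import ipaddress


def is_safe_ipv4_target(raw_ip: str) -> bool:
    """Return True only for syntactically valid, non-local IPv4 addresses."""
    try:
        address = ipaddress.IPv4Address(raw_ip)
    except ValueError:
        # Newer Python versions reject leading zeros here.
        return False
    # The checks run on the parsed address, so a string that older versions
    # normalize to 127.0.0.1 is still recognized as loopback.
    return not (
        address.is_loopback
        or address.is_private
        or address.is_link_local
        or address.is_unspecified
    )


for candidate in ("127.0.0.1", "127.0.00.1", "10.0.0.5", "8.8.8.8"):
    print(candidate, is_safe_ipv4_target(candidate))
```

Whether the interpreter rejects 127.0.00.1 or normalizes it, the function returns False for it, so the result does not depend on the Python version.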
10. URL Query Parsing
In Python versions before 3.10 (and before the security releases of older branches that backported the fix for CVE-2021-23336), the function urllib.parse.parse_qsl accepts both the ; and & characters as separators for URL query variables. What's interesting here is that the ; character is not recognized as a separator by some other languages and frameworks, such as PHP. In the following example, we would like to show why this behavior could lead to a vulnerability. Let's assume that we are running an infrastructure where the frontend is a PHP application and there is another internal Python application.
An attacker sends the following GET request to the PHP frontend:
The PHP frontend recognizes only one query variable: a with the content 1;b=2. PHP does not treat ; characters as separators for query variables. Now the frontend forwards the attacker's request to an internal Python application with the query variable a:
If urllib.parse.parse_qsl is used, the Python application processes two query variables: a=1 and b=2. This difference in the parsing of query variables can lead to fatal security vulnerabilities, like the web cache poisoning vulnerability in Django (CVE-2021-23336).
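The parsing difference can be emulated independently of the installed Python version. The split_pairs function below is a hypothetical, simplified stand-in for a query parser (it does no percent-decoding) and is not the implementation used by urllib or PHP.

```python
import re


def split_pairs(query: str, separators: str) -> list:
    """Split a query string into (name, value) pairs on any of the separators."""
    pattern = "[" + re.escape(separators) + "]"
    pairs = []
    for field in re.split(pattern, query):
        if not field:
            continue
        name, _, value = field.partition("=")
        pairs.append((name, value))
    return pairs


query = "a=1;b=2"

# A parser that only knows "&" sees a single variable.
print(split_pairs(query, "&"))   # [('a', '1;b=2')]

# A parser that also accepts ";" sees two variables.
print(split_pairs(query, "&;"))  # [('a', '1'), ('b', '2')]
```

When two components in a request chain disagree like this, one of them may validate or cache a request under a different set of parameters than the one the other component acts on.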
Summary
In this blog post, we introduced 10 Python security pitfalls that we believe are less known among developers. Each subtle pitfall can be easily overlooked and has led to security vulnerabilities in real-world applications in the past.
We have seen that pitfalls can occur in all kinds of operations, from processing files, directories, archives, URLs, and IPs to simple strings. A common pattern is the use of library functions that can behave unexpectedly. This reminds us to always upgrade to the latest version and to read the documentation carefully. At SonarSource, we research these pitfalls to continuously improve our code analyzers.