Over the past few years, we have continuously improved our SAST (Static Application Security Testing) capabilities through different approaches. This year, after considering various options, we decided to use benchmarks to track our progress. Before diving into a blog series discussing our chosen benchmarks and scores, let me provide some background here.
When you consider a SAST solution, vendors often claim high accuracy, strong detection rates, and few or no false positives without concrete data to support those claims. At Sonar, we're confident in the quality of our security analyzers, but we've always been cautious about bragging when engaging with potential customers.
In the past, we've stated that we have a detection rate of 80% and a false-positive rate of no more than 20%. However, for specific cases such as the OWASP Benchmark project, we can be more precise: when we entered the security market in 2019, we took the time to evaluate our coverage of this Java benchmark.
At that time, with SonarQube 7.9 Developer Edition, our True-Positive Rate (TPR) was at 85%, and our False-Detection Rate (FDR) at 1%.
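As a quick refresher on how such rates are typically computed, here is a minimal Python sketch. We read "False-Detection Rate" here as the share of reported issues that are not real vulnerabilities (FP / (FP + TP)); the exact counting rules vary per benchmark, and the numbers below are purely illustrative, not the actual OWASP Benchmark figures.

```python
# Minimal sketch of the usual definitions behind these two rates.
# How each benchmark counts TP/FP/FN can differ; the numbers below are
# illustrative only and are NOT the actual OWASP Benchmark counts.

def true_positive_rate(tp: int, fn: int) -> float:
    """Share of real vulnerabilities that the analyzer reports: TP / (TP + FN)."""
    return tp / (tp + fn)

def false_detection_rate(fp: int, tp: int) -> float:
    """Share of reported issues that are not real vulnerabilities: FP / (FP + TP)."""
    return fp / (fp + tp)

print(true_positive_rate(tp=850, fn=150))   # 0.85  -> 85% TPR
print(false_detection_rate(fp=9, tp=850))   # ~0.01 -> ~1% FDR
```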
At Sonar, our focus has always been on delivering value to developers through our products, and that remains unchanged. For us, scoring well on a given benchmark was never a goal in itself but rather a positive side effect of all the work we were doing to raise accurate and actionable issues that are easy to understand and well documented, helping developers fix the vulnerabilities.
At the same time, starting in 2022, we began to receive more and more feedback from prospects such as this one:
Our first reaction was always the same, and our reply was: “This is normal; the issues we don’t raise are the false positives raised by the others.”
However, on a more serious note, these requests made us realize one thing: not all companies have the time or resources to run a thorough assessment of the quality of SAST solutions. As a result, they resort to randomly selected GitHub projects to assess the maturity of SAST solutions. In January, we decided we should help companies make the right choice by providing three things:
- A list of the top 3 SAST benchmarks by language
- The list of issues that should be detected in these projects, which we call the Ground Truth
- Sonar’s results on these benchmarks
To establish the list of candidate benchmarks, we looked at dozens of projects and applied the following criteria to our selection:
- we can select deliberately vulnerable applications, even if they were not originally designed as SAST benchmarks, because these are usually what users pick
- the list of potential benchmarks should be ordered by popularity (downloads, GitHub stars, activity, requests received by our sales engineering team)
- we want projects that are selected by users to assess the maturity of SAST engines
- we want benchmarks that are not linked to a specific vendor to avoid bias
- the benchmarks should illustrate test cases corresponding to real problems that are in the code and can be detected by a SAST engine.
- we want benchmarks for the main languages used on the market to build Web/API applications (Java, C#, Python, PHP, JavaScript/TypeScript)
We considered a total of 109 projects and selected the top 3 for each language. Then we started the not-so-easy work of carefully reviewing them to build the Ground Truth for each project. Throughout this process, we ensured that every true vulnerability was accurately identified. When we disagreed with the benchmark’s own claims, the test case was considered “not a problem to find” and added to the list of unexpected issues.
This Ground Truth for each project is made of:
- The list of all the locations (file, line, type of problem) where an issue should be detected and considered as a True-Positive (TP).
- The list of all the locations where no issue is expected (True-Negative / TN). This includes locations where the benchmark itself says SAST products should report an issue but where we disagree with that claim (a minimal sketch of such entries follows this list).
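To make this more concrete, here is a minimal, hypothetical sketch of what a Ground Truth entry and a naive scoring pass could look like. The field names, file paths, and format are illustrative assumptions for this post, not the actual files we will publish:

```python
# Hypothetical sketch of a Ground Truth entry and a naive scoring pass.
# Field names, paths, and format are illustrative assumptions, not the
# actual format Sonar will publish with the benchmarks.

from dataclasses import dataclass

@dataclass(frozen=True)
class GroundTruthEntry:
    file: str        # relative path in the benchmark project
    line: int        # line where the issue is (or is not) expected
    issue_type: str  # e.g. "SQL Injection"
    expected: bool   # True = a TP location, False = a TN location

ground_truth = [
    GroundTruthEntry("src/main/java/Login.java", 42, "SQL Injection", True),
    GroundTruthEntry("src/main/java/Render.java", 17, "XSS", False),
]

# Findings produced by a SAST scan, reduced to (file, line, issue_type) triples.
findings = {("src/main/java/Login.java", 42, "SQL Injection")}

tp = sum(1 for e in ground_truth if e.expected and (e.file, e.line, e.issue_type) in findings)
fn = sum(1 for e in ground_truth if e.expected and (e.file, e.line, e.issue_type) not in findings)
fp = sum(1 for e in ground_truth if not e.expected and (e.file, e.line, e.issue_type) in findings)

print(f"TP={tp} FN={fn} FP={fp}")  # here: TP=1 FN=0 FP=0
```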
One fun fact from building the Ground Truth is that most projects used as benchmarks don’t publish the list of expected/not-expected issues. The OWASP Benchmark stands out as an exception: it does provide this information, even if some of its test cases are debatable.
Now that you have the full context, you’ll have a better understanding of the upcoming blog posts. We will share the list of benchmarks, the corresponding Ground Truth, and Sonar’s results for these benchmarks.
Sign up using the simple form below to be notified about the next post in our series, which will cover Java benchmarks.
Alex