Managing Risks of AI-generated Code
Modern software is created by combining pre-existing software packages into a software product. This approach is enabled by the growing popularity of the Open-Source paradigm, where the source code of software packages is made available under licenses that allow reuse. This approach speeds up software development and brings significant economic benefits, but it also creates the risk of inadvertently importing vulnerable code into critical software tools. The risk is further compounded by the increasing use of AI tools for code generation in Open-Source development. These tools must be trained on enormous amounts of data, which is not always rigorously reviewed, and thus they may learn to generate vulnerable code. To make matters worse, malicious parties may actively inject malicious code into these training sets. Unfortunately, all these issues are still poorly understood.

This project aims to measure and mitigate the risks that AI-generated code introduces into the software supply chain. It investigates how prevalent the use of AI tools is and characterizes the security risks they entail. Our current work focuses on determining whether the use of AI for coding can be reliably detected within codebases. Our SCORED 2023 paper presents a preliminary investigation of this problem, while our DIMVA 2025 paper rigorously evaluates the performance of existing AI code detectors.
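
To give a concrete sense of how such detection can work, many existing detectors score a code snippet by how predictable it is under a language model, on the intuition that machine-generated code tends to be more predictable than human-written code. The Python sketch below illustrates this signal only; the choice of model ("gpt2") and the fixed threshold are illustrative assumptions and do not correspond to the detectors studied in the papers above.

    # Minimal sketch of a likelihood-based signal that several AI-code detectors build on.
    # The model name ("gpt2") and the decision threshold are illustrative assumptions,
    # not the detectors evaluated in our papers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def perplexity(code: str) -> float:
        # Perplexity of the snippet under the language model:
        # exp of the mean per-token negative log-likelihood.
        ids = tokenizer(code, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return torch.exp(loss).item()

    snippet = "def add(a, b):\n    return a + b\n"
    ppl = perplexity(snippet)
    # Heuristic: code the model finds highly predictable (low perplexity) is
    # flagged as likely machine-generated.
    print(f"perplexity={ppl:.1f}", "likely AI-generated" if ppl < 20 else "likely human-written")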