Software Heritage Statement on Large Language Models for Code – Software Heritage
Discovered: Oct 19, 2023 08:59

QUOTE: Principles:

1. Knowledge derived from the Software Heritage archive must be given back to humanity, rather than monopolized for private gain. The resulting machine learning models must be made available under a suitable open license, together with the documentation and toolings needed to use them.

2. The initial training data extracted from the Software Heritage archive must be fully and precisely identified by, for example, publishing the corresponding SWHID identifiers (note that, in the context of Software Heritage, public availability of the initial training data is a given: anyone can obtain it from the archive). This will enable use cases such as: studying biases (fairness), verifying if a code of interest was present in the training data (transparency), and providing appropriate attribution when generated code bears resemblance to training data (credit), among others.

3. Mechanisms should be established, where possible, for authors to exclude their archived code from the training inputs before model training begins.
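
Note to self: the transparency use case in principle 2 (checking whether a given piece of code was in the training data) can be sketched as below. This is only an illustration, not anything from the statement itself. It relies on the fact that a SWHID for file content (`swh:1:cnt:...`) uses the same SHA-1 as a Git blob hash. The function names and the `published_swhids` set are hypothetical; a real model release would publish its identifier list in whatever format it chooses.

```python
from __future__ import annotations

import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file's raw content (a 'cnt' object).

    The intrinsic hash is the Git blob hash: SHA-1 over the header
    b"blob <length>\\0" followed by the content bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def was_in_training_data(data: bytes, published_swhids: set[str]) -> bool:
    """Check a file against a (hypothetical) published list of training SWHIDs."""
    return content_swhid(data) in published_swhids


if __name__ == "__main__":
    # Hypothetical published list containing only the empty file.
    published_swhids = {content_swhid(b"")}
    print(was_in_training_data(b"", published_swhids))
    print(was_in_training_data(b"print('hi')\n", published_swhids))
```

This only detects byte-for-byte identical files. Spotting generated code that merely resembles training data (the credit use case) would need some form of similarity search, which this sketch does not attempt.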