
PragmaticCode

Introduction

This repository hosts the official data artifact for the PragmaticCode dataset from the paper "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context", appearing at NeurIPS 2023 (titled "Guiding Language Models of Code with Global Context using Monitors" on arXiv). The full code and data artifact, along with detailed instructions, is available in the official repository at https://github.com/microsoft/monitors4codegen.

The work introduces Monitor-Guided Decoding (MGD) for code generation with language models, in which a monitor uses static analysis to guide the decoding.

PragmaticCode is a dataset of real-world open-source Java projects, complete with their development environments and dependencies (through their respective build systems). The authors tried to ensure that all the repositories in PragmaticCode were released publicly only after the determined training dataset cutoff date (31 March 2022) for the CodeGen, SantaCoder and text-davinci-003 family of models, which were used to evaluate MGD.

The list of repositories, along with their respective licenses and the path to the zipped repository content, is available in PragmaticCode/repos.csv. The zipped contents of the full repositories are available under PragmaticCode/github. The contents of the files required for inference for each repository are available in PragmaticCode/fileContentsByRepo.json.

DotPrompts

For evaluation of Language Models of Code, the authors curate a set of 10,000+ examples spanning 1400+ methods from PragmaticCode, such that each example consists of a prompt to a dereference location (a code location having the "." operator in Java). This set can be used to benchmark Language Models of Code on their ability to utilize repository-level context to generate code for method-level completion tasks. The task for the models is to complete a partially written Java method, utilizing the full repository available from PragmaticCode.
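To make the notion of a dereference location concrete, the following is a minimal Python sketch of how a Java method could be split into prompt and ground-truth pairs at each "." operator. It is not the authors' extraction pipeline: the helper `dereference_prompts`, the regular expression, and the sample Java method are all hypothetical and written for illustration only. It also assumes that a prompt ends immediately after the "." and relies on a naive textual scan that does not account for string literals or comments.

```python
import re

# Hypothetical helper, not part of the PragmaticCode or monitors4codegen tooling.
# A "." counts as a dereference location here when it follows an identifier
# character, ")" or "]" and is followed by the start of an identifier.
# This is a naive textual heuristic: it ignores string literals and comments,
# which a real Java parser would handle.
DEREFERENCE = re.compile(r"(?<=[\w)\]])\.(?=[A-Za-z_$])")


def dereference_prompts(method_source: str) -> list[tuple[str, str]]:
    """Return (prompt, remainder) pairs, one per dereference location.

    Each prompt ends with the "." operator (an assumption of this sketch);
    the remainder is the ground-truth text a model would be asked to generate.
    """
    pairs = []
    for match in DEREFERENCE.finditer(method_source):
        cut = match.end()  # keep the "." at the end of the prompt
        pairs.append((method_source[:cut], method_source[cut:]))
    return pairs


# Illustrative Java method, written for this example only.
JAVA_METHOD = """public int total(List<Item> items) {
    int sum = 0;
    for (Item item : items) {
        sum += item.getPrice();
    }
    return Math.max(sum, 0);
}"""

if __name__ == "__main__":
    for prompt, remainder in dereference_prompts(JAVA_METHOD):
        # Show the tail of each prompt and the head of its ground truth.
        print(repr(prompt[-12:]), "->", repr(remainder[:12]))
```

In this sketch every pair reassembles to the original method, so the remainder can serve as the reference completion when comparing against a model's output. The actual DotPrompts examples should be taken from the official repository rather than regenerated this way.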
Since all the repositories in PragmaticCode are buildable, DotPrompts supports Compilation Rate as an evaluation metric for generated code, in addition to standard ground-truth match metrics such as Next-Identifier Match, Identifier Sequence Match and Prefix Match. Further details on DotPrompts and its usage are available at https://github.com/microsoft/monitors4codegen#dotprompts.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
code repositories, llm4code, ai4code, program synthesis, software engineering
