
Why Code Generating Models Got Good

machine learning·21 min read

A literature review examining how code-generating language models improved as their architectures captured more context. We see how a theoretical basis borrowed from natural language processing spurred the application of increasingly sophisticated statistical methods to this problem.

1 Introduction

Late in 2022, OpenAI1 publicly released ChatGPT2, a language model that generates text from natural language prompts, and it became a recurring feature of the news cycle. ChatGPT impressed many readers with how naturally it produced text from simple prompts. It can also generate code from those prompts, demonstrating the practical utility of modern language models. Systems of this kind have the potential to reduce development time through code completion and to support a range of other tasks that involve generating code.

The focus of this review is the link between a language model’s ability to generate code and its ability to capture context within code languages. We first consider the theoretical justification for applying natural language processing techniques to code, and then track how model architectures have evolved to exploit more and more context. The aim is to see how far the increase in context-capturing capacity accounts for the gains in performance. The literature surveyed is drawn mostly from peer-reviewed journals and conferences. Where preprints are cited, it is because they have accumulated many citations and have been influential within the field of code-generating language models, so their results have been widely adopted and validated by empirical success. The literature also points to areas for further work, both in immediate extensions of current models and in the limitations that future research will need to address.

The review is broadly chronological, since the amount of context captured by language models has grown over time. Section 2.1 examines the foundation of the field, and how the idea of naturalness was justified for artificial, mathematically well-formed code languages, licensing the use of natural language processing techniques on them. Section 2.2 outlines the metrics used to evaluate language models. Section 2.3 discusses early successes in capturing short-range context in code. Section 2.4 reviews the gains deep learning has brought in capturing more context. Section 3 summarises the findings, draws conclusions, and notes possible directions for future research.

2 Literature Review

2.1 The naturalness of code

The idea that lets statistical techniques from natural language processing be applied to code is the recognition that code itself is natural. The case is made compellingly by Hindle et al. in an IEEE conference paper on software engineering. Their central claim is that real code, i.e. code written by people, carries the hallmarks of natural language, including the fact that everyday utterances are usually simple and repetitive (Hindle et al. 2012). Recognising that natural language has these properties was a sharp break from the dictionary and grammar-based approaches associated with researchers such as Noam Chomsky (Chomsky 1956). Statistical methods applied to natural language from the 1980s onwards have produced empirically successful technologies, including machine translation and speech recognition. Hindle et al. argue that if code languages can be shown to share the statistical character of natural languages, then the same NLP techniques can legitimately be applied in the artificial domain of code.

Hindle et al. argue that everyday human communication evolved to be quick and to function in noisy environments. Simplicity, and the recognition of patterns through repetition, is therefore central to how real natural language is understood. They claim the same holds for code languages, which are often as much about communication between humans as between human and machine (Hindle et al. 2012). Most programs repeat snippets and stick to preferred forms; other forms are possible, but the conventional ones are easy for humans to read and so become dominant. This is backed by earlier work from Gabel and Su, presented at an ACM symposium on the foundations of software engineering, in which they analysed roughly 430 million lines of source code for uniqueness and found a distinct lack of it. Their corpus showed a high level of repetition at a granularity of 1 to 7 lines of code (Gabel and Su 2010).

Hindle et al. tested the claim by building an n-gram language model and checking whether it captured regularities in software. Section 2.3 covers n-gram language models and their successes in detail. The relevant question is whether the regularities the model captures come from code syntax or from genuine naturalness. They found that the model captured a high level of local regularity, more than in English, and that this regularity was not an artefact of syntax, supporting the existence of a naturalness of code (Hindle et al. 2012). Having seen that their language model captures regularities in code, section 2.2 turns to how code language models are assessed, and what more is needed to make comparisons between them valid.

This paper by Hindle et al. has been foundational for probabilistic language modelling of code; the citation count (in the thousands) and the dominance of language models in the decade since publication are evidence enough. No papers were found that explicitly argue against the use of statistical NLP methods for code languages, likely because the empirical success of these methods in the domain is hard to dispute. The rest of this review takes for granted that researchers are licensed to use NLP techniques to build language models for code generation, in forms ranging from rudimentary n-grams to increasingly elaborate deep learning architectures.

2.2 Metrics

Researchers use a range of metrics to gauge their models on code generation. Intrinsic evaluation metrics may not capture every facet of a model’s ability, but they are reasonably standardised and used widely. Extrinsic evaluation metrics are also used, to give a fuller picture of how models perform on the tasks they are likely to encounter in practice. Extrinsic metrics, however, vary more, and there is still no standard across the literature reviewed here. This section covers the most common intrinsic metrics and one recent extrinsic metric, developed by commercial researchers and since adopted in a peer-reviewed paper.

2.2.1 Intrinsic metrics

The most common intrinsic metric in the papers reviewed here is perplexity. It is derived from cross-entropy, a metric that treats the language model as a compression algorithm: the model predicts the output, in effect decompressing it from its input. Cross-entropy measures the average number of additional bits per code token that the model needs to recover the complete and correct output. Perplexity is two raised to the cross-entropy and is often reported instead. Lower values of either metric mean the model needs fewer bits and is closer to an ideal model that decompresses perfectly with no added information (Allamanis et al. 2019).
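
To make the relationship between the two metrics concrete, the minimal sketch below computes cross-entropy and perplexity from the probabilities a model assigned to each correct next token; the probabilities themselves are invented for illustration.

```python
import math

def cross_entropy_and_perplexity(token_probs):
    """Cross-entropy is the average number of bits per token the model needs
    to recover the correct output; perplexity is two raised to that value."""
    cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return cross_entropy, 2 ** cross_entropy

# Hypothetical probabilities a model assigned to the true next tokens.
ce, pp = cross_entropy_and_perplexity([0.5, 0.25, 0.8, 0.1])
print(f"cross-entropy: {ce:.2f} bits/token, perplexity: {pp:.2f}")
```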

2.2.2 Extrinsic metrics

A notable addition to the extrinsic toolkit is the HumanEval test developed by Chen et al. The dataset contains 164 human-written programming tasks built specifically to test a language model’s code generation ability (Chen et al. 2021). The same paper describes the development of Codex3, the language model behind the commercial GitHub Copilot4; the exact details of the production version of Codex have not been published, though it is presumably similar to the version reported by Chen et al. Xu et al. adopt HumanEval in their 2022 paper to evaluate several different code-generating language models: because the tasks were written by hand specifically for evaluation, they can be reasonably confident that the comparison tests are not contaminated with data the models were trained on (Xu et al. 2022). The focus of Xu et al.’s paper is a systematic evaluation of existing large language models on code generation. They note that training state-of-the-art models has become so expensive that it is predominantly done by large companies, which rarely release the resulting models as open source, and that this restricts the research possible at less well-resourced organisations (Xu et al. 2022). As a consequence, the extrinsic metrics adopted by the research community may increasingly come from commercial actors, as in the case of HumanEval.
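
The core idea behind HumanEval is functional correctness: a completion counts as a solution only if the task’s unit tests pass. The sketch below illustrates that idea with a made-up task; the real benchmark has richer task records and sandboxes the execution of generated code.

```python
# A HumanEval-style task pairs a function signature and docstring (the prompt)
# with hidden unit tests; the task below is invented for illustration.
task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0",
}
completion = "    return a + b\n"  # what a model might generate

def functionally_correct(task, completion):
    """True if the generated completion satisfies the task's unit tests."""
    namespace = {}
    try:
        exec(task["prompt"] + completion, namespace)  # define the function
        exec(task["test"], namespace)                 # run the unit tests
        return True
    except Exception:
        return False

print(functionally_correct(task, completion))  # True
```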

2.3 Short Context Success

This section discusses a short-context statistical technique that has proved practical for capturing simple, local dependencies in code sequences: the n-gram language model, of the kind used to support Hindle et al.’s account of the naturalness of code (Hindle et al. 2012). The discussion below outlines how n-gram models work, their limitations, and how researchers have adapted and improved them.

N-gram models are sequence models that, for code, predict which token follows next in a sequence. They are Markov models: they assume that only the previous \(n-1\) tokens are needed to model the \(n\)th token (Allamanis et al. 2019; Hindle et al. 2012; Nguyen et al. 2013; Hellendoorn and Devanbu 2017). For a 3-gram (trigram) model, the probability of the \(i\)th token is therefore approximated as in Eq. 1.

\[\begin{equation} p(a_i|a_1 ... a_{i-1}) \approx p(a_i|a_{i-2} a_{i-1}) \end{equation}\]

The \(n\)th token is modelled by maximum-likelihood estimation from simple frequency counts in a training corpus.
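
As a minimal sketch of that estimation, the trigram model below simply counts how often each token follows each pair of preceding tokens in a toy corpus; real models are trained on far larger corpora and richer tokenisation.

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Maximum-likelihood trigram model: count how often each token follows
    each pair of preceding tokens in the training corpus."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def prob(counts, context, token):
    """p(token | previous two tokens), estimated from frequency counts."""
    following = counts[context]
    total = sum(following.values())
    return following[token] / total if total else 0.0

tokens = "for i in range ( n ) : total += i".split()
model = train_trigram(tokens)
print(prob(model, ("in", "range"), "("))  # 1.0 in this tiny corpus
```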

It is also worth noting that n-grams are left-to-right code generators (Allamanis et al. 2019; Nguyen et al. 2013). As anyone who has written code knows, that is not really how code is produced; parts of a program frequently depend on other parts that sit some distance away. This looks like a serious conceptual limitation, but every language model discussed in this review is also left-to-right and yet still performs competently on code generation, which makes the simplest and the most complex models comparable on the same axis.

The final broad technical point about n-grams is that any corpus is bound to omit some possible n-grams. Reducing \(n\) shrinks the space of combinations, but even then some n-grams will be absent from the training data. Smoothing is used to assign non-zero probabilities to these unseen n-grams in a principled way. Hindle et al. found Kneser-Ney smoothing to be effective for their experiments (Hindle et al. 2012).
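
Kneser-Ney itself is more involved; as a simpler stand-in that only illustrates the idea, the sketch below interpolates the trigram estimate with a unigram estimate, building on the counts from the previous sketch, so that unseen trigrams still receive non-zero probability.

```python
from collections import Counter

def smoothed_prob(trigram_counts, unigram_counts, context, token, lam=0.7):
    """Interpolation smoothing (far simpler than Kneser-Ney): mix the trigram
    estimate with a unigram estimate so unseen trigrams are not assigned
    zero probability."""
    following = trigram_counts.get(context, Counter())
    total = sum(following.values())
    p_trigram = following[token] / total if total else 0.0
    p_unigram = unigram_counts[token] / sum(unigram_counts.values())
    return lam * p_trigram + (1 - lam) * p_unigram

# Building on the previous sketch: an unseen context, but a known token.
unigrams = Counter(tokens)
print(smoothed_prob(model, unigrams, ("never", "seen"), "total"))  # ≈ 0.03, not 0
```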

The effect of data quantity and quality on n-grams for natural language is well established; more data is a reliable way to improve performance. Allamanis and Sutton take exactly this route. Their 2013 IEEE paper assembles a source code corpus 100 times larger than the one used by Hindle et al. (Allamanis and Sutton 2013). It is the first code corpus to contain more than a billion tokens, and they describe the resulting model as a giga-token language model. Like Hindle et al., they use a trigram (3-gram) model, though the exact smoothing technique is not stated. Allamanis and Sutton report perplexity an order of magnitude lower than Hindle et al.’s earlier work (Allamanis and Sutton 2013). The exact smoothing scheme would be useful to know, but the gain can plausibly be attributed to corpus size: as more n-grams are observed, the model’s handling of unseen ones matters less. A key observation comes from their “collapsed model”, where the tokenizer maps all identifiers to a single identifier token and all literals to a single literal token, leaving a corpus of keywords and symbols that captures the structural side of the code (Allamanis and Sutton 2013). They report that the collapsed model has learnt everything it can about the corpus’s syntax after about 100,000 lines of code (Allamanis and Sutton 2013). That is a tiny fraction of the 352 million lines in the corpus, and supports the view that n-grams are highly local and unable to handle long-range dependencies.
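
To illustrate what collapsing does, the sketch below applies the idea to a tokenised Python snippet; Allamanis and Sutton worked with Java, and the real tokenizer is more careful, so the keyword list and regular expressions here are purely illustrative.

```python
import keyword
import re

def collapse(tokens):
    """Illustrative version of the collapsed tokenizer: keep keywords and
    punctuation, map every identifier to IDENT and every literal to LIT,
    so that only the structural skeleton of the code remains."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d.*|['\"].*", tok):
            out.append("LIT")      # numeric or string literal
        elif tok in keyword.kwlist or not tok[0].isalnum():
            out.append(tok)        # keyword, operator or punctuation
        else:
            out.append("IDENT")    # identifier: variable, function, class name
    return out

print(collapse("for i in range ( 10 ) : print ( 'hi' )".split()))
# ['for', 'IDENT', 'in', 'IDENT', '(', 'LIT', ')', ':', 'IDENT', '(', 'LIT', ')']
```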

Allamanis and Sutton’s 2013 paper has been quite influential in the n-gram analysis of code, and is cited hundreds of times. The same authors continued to work with n-grams and their large corpus, and in 2014 produced another highly cited paper applying an n-gram model to a real task. Their 2014 model suggests changes to a programmer’s code that improve stylistic consistency and adherence to a project’s coding conventions (Allamanis et al. 2014). Allamanis et al. point to an extrinsic measure of success: the system generated 18 patches for open source projects, 14 of which were accepted (Allamanis et al. 2014). The way the language model is used is novel; the n-gram acts as a scoring function that ranks the most “natural” option among alternative snippets produced by a proposer with some filtering logic. The fact that early n-gram work, extended with a better corpus, can plug directly into a real code generation application shows how useful language models can be even when their long-range context modelling is weak.
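
The scoring role is easy to sketch: rank candidate snippets, produced elsewhere by a syntax-aware proposer, by their average log-probability under the language model and suggest the most natural one. The `prob` argument below stands in for any of the trigram probability functions sketched earlier, and the candidate list is hypothetical.

```python
import math

def naturalness(prob, tokens):
    """Average log-probability of a snippet under an n-gram model; higher
    means the model finds the token sequence more natural."""
    logps = [math.log(prob((a, b), c) + 1e-12)
             for a, b, c in zip(tokens, tokens[1:], tokens[2:])]
    return sum(logps) / max(len(logps), 1)

# Hypothetical usage: rank syntactically valid alternatives from a proposer
# and recommend the highest-scoring one.
# best = max(candidates, key=lambda snippet: naturalness(prob, snippet))
```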

Work by Nguyen et al., published in the proceedings of an ACM conference, takes a different route to improving short-context n-gram models for code generation. The aim is to widen the context these models can use when generating code, by folding semantic information into the code tokens themselves (Nguyen et al. 2013). This goes beyond the purely lexical and highly local information used by the earlier n-gram models of Hindle et al., Allamanis and Sutton, and Allamanis et al. Nguyen et al. argue that code has well-defined semantics that previous lexical n-gram models did not exploit (Hindle et al. 2012; Allamanis and Sutton 2013; Allamanis et al. 2014). They attach semantic annotations to each lexical code token, including its ID, role, data type, scope, structural and data dependencies, and what the authors call its sememe, a structured annotation of its semantic value (Nguyen et al. 2013). They train the semantic n-gram model on the same corpus as Hindle et al. and evaluate it on open source projects, where it improves code suggestion accuracy over earlier n-gram models by 18 to 68 per cent (Nguyen et al. 2013). The model is still an n-gram capturing local statistical dependencies, but it now combines those with the global, technical structure of a semantically well-defined code language (Nguyen et al. 2013).
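
A hypothetical shape for such an annotated token is sketched below; the real annotations in Nguyen et al. are richer and are derived from static program analysis, so the fields here are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class SemanticToken:
    """Hypothetical shape for an annotated token in the spirit of Nguyen et al.;
    the real annotations are richer and derived from program analysis."""
    lexeme: str       # the surface text of the token
    role: str         # e.g. "variable", "method call", "literal"
    data_type: str    # e.g. "int", "String"
    scope: str        # e.g. "local", "field", "global"

print(SemanticToken(lexeme="count", role="variable", data_type="int", scope="local"))
```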

This section has shown two promising routes to better code-generating language models. The first is simply to scale the training corpus (Allamanis and Sutton 2013), an approach made possible by the availability of big data for code, often called big code (Allamanis et al. 2019). As in nearly every scientific discipline, the amount of available data has grown sharply over the past several years, and the trend is especially pronounced for real utterances of natural language and code, driven entirely by the internet. This has been a tremendous resource for statistical language models, and big code now features in almost every approach to code synthesis, so the use of ever larger corpora in language modelling is unsurprising. The second route, with a clearer narrative for the gains in language models, is to feed the model more information about longer-range statistical dependencies, allowing it to generate code from more context than n-grams can supply. The next section looks at increasingly powerful models that exploit longer context for more accurate code completion and generation, and examines the trend towards larger and more expensive systems.

2.4 More Context, More Power

This section traces the move to larger language models that can capture longer-range dependencies than n-gram models. It looks at the advances deep learning has brought by feeding more context into code generation, following the path from recurrent neural networks (RNNs) and long short-term memory (LSTM) models to the state-of-the-art high-context transformers.

The papers analysed in this section are among the most cited and influential covered by this review, with citation counts of the order of 10,000 between them. They have reshaped the landscape of both code and natural language generation over the last five years, and they have done so largely by exploiting more and more context.

2.4.1 RNNs and LSTM Networks

The first model to consider is the RNN applied to code generation. The starting point is the foundational 2015 paper by White et al., published by IEEE, which presents an RNN for code generation and compares it against a baseline n-gram model (White et al. 2015). White et al. first consider the difference in expressive power between n-gram models and feed-forward networks. They argue that projecting a token into a feature space before applying an activation function produces highly expressive models (White et al. 2015), far more expressive than n-grams; so expressive, in fact, that overfitting becomes a real concern given the network’s capacity to learn complex representations (White et al. 2015). Even so, feed-forward networks are essentially higher-capacity n-grams; they do not reliably learn long-range dependencies and so do not bring in much more context. White et al. therefore set them aside in favour of a model better suited to sequential data, such as written code.

RNNs add short-term memory by feeding a hidden state back to an earlier layer. The hidden state vector carries the previous state forward, giving the model more context for its current prediction (White et al. 2015). RNNs can also process sequences with a variable number of computational steps, which is intuitively useful for tasks like code generation in a way that feed-forward networks are not. The recurrence lets an RNN learn beyond first-order temporal dependencies, and in theory it can use an arbitrary depth of context (White et al. 2015). The cost is that RNNs are inherently hard to train, owing to unstable gradients.
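
A minimal vanilla RNN step, sketched in NumPy with made-up dimensions, shows how the hidden state carries context forward from one token to the next.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state mixes the current
    token embedding with the previous hidden state, carrying context forward."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 8-dimensional token embeddings, 16-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), np.zeros(16)

h = np.zeros(16)
for x_t in rng.normal(size=(5, 8)):          # a sequence of 5 token embeddings
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # h accumulates context over the sequence
```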

White et al. compare their RNN to an 8-gram model and find perplexity for the RNN to be lower by more than 2, at around PP\(\approx10\), a substantial improvement (White et al. 2015). RNNs also support online learning, performing back-propagation as test documents are presented, which drives perplexity down to PP\(\approx3.6\). The authors suggest that, for code completion or suggestion, a committee of static and dynamic RNNs could be used together (White et al. 2015). Adding a short-term memory mechanism, and with it the ability to learn deeper levels of context from code corpora, clearly improves the performance of code language models.

One of the most popular and successful RNN variants is the Long Short-Term Memory (LSTM) network. LSTMs were originally designed to mitigate the vanishing gradient problem that vanilla RNNs are notorious for (Bengio et al. 1994). In addition to the hidden state vector of an RNN, an LSTM maintains a memory state vector (Karpathy et al. 2015). The memory state vector lets gradients on the memory cells flow back through time over long ranges, unless interrupted by an active forget gate that causes the memory to partially discard old and irrelevant information (Karpathy et al. 2015; Choetkiertikul et al. 2018). For character-level natural language models, Karpathy et al.’s 2015 paper found that the success of LSTMs comes from the improved ability of a deep network to learn long-range structural dependencies (Karpathy et al. 2015). LSTMs are therefore better than RNNs at forming representations from longer context. Given the similarity between natural language and code, established in section 2.1, the same advantage is expected to carry over to code generation.
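
The sketch below, again with toy NumPy dimensions, shows a single LSTM step: the additional memory vector is updated through forget and input gates, which is what lets information persist over long ranges.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x_t, h_prev] to the four gate pre-activations;
    the memory cell c carries information over long ranges unless the forget
    gate discards old, irrelevant content."""
    z = np.concatenate([x_t, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g        # forget old memory, write the new candidate
    h = o * np.tanh(c)            # hidden state exposed to the next layer
    return h, c

# Toy dimensions: 8-dimensional inputs, 16-dimensional hidden and memory states.
d_x, d_h = 8, 16
rng = np.random.default_rng(1)
W, b = rng.normal(size=(d_x + d_h, 4 * d_h)) * 0.1, np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):
    h, c = lstm_step(x_t, h, c, W, b)
```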

Dam et al. implemented an LSTM network in a 2016 preprint that has been highly influential in analysing how the architecture behaves on code (Dam et al. 2016). They compare their LSTM directly to the vanilla RNN of White et al. and train on the same Java corpus used by Hindle et al. (Hindle et al. 2012; Dam et al. 2016), which gives strong support to their conclusion that a longer-context model outperforms a vanilla RNN. The LSTM consistently beats the baseline RNN they train. A particularly striking finding is that the LSTM’s advantage over the RNN grows with sentence length for a fixed embedding dimensionality (Dam et al. 2016). At the largest sentence length and embedding dimensionality the LSTM achieves a perplexity 37.9 per cent better than the RNN. Both networks are trained with the same adaptive stochastic gradient descent method (RMSprop). The improvement is attributed to the LSTM’s better handling of long-range dependencies, since it is far less exposed to the vanishing and exploding gradient problems that plague vanilla RNNs (Bengio et al. 1994; Dam et al. 2016). By allowing effective training over longer sequences than RNNs can manage, LSTMs make use of more context than any of the methods discussed so far.

Although the paper is a preprint, it is widely cited and treated as important in IEEE and ACM surveys on code generation (Allamanis et al. 2019; Hellendoorn and Devanbu 2017; Karampatsis et al. 2020). Two peer-reviewed papers in particular, written by teams that include some of the authors of Dam et al., build on those results to develop code generation tools aimed at speeding up software development (Choetkiertikul et al. 2018; Dam et al. 2018). In recent years LSTM networks have been adopted for a wide range of code generation and analysis tasks, including predicting vulnerabilities in source code (Dam et al. 2018), estimating effort for software projects to aid design and planning (Choetkiertikul et al. 2018), and correcting syntax errors (Santos et al. 2018).

The limitations of LSTM networks come from the fact that they encode and decode tokens sequentially, which caps the context they can use for accurate and useful code generation and analysis. The next section looks at how researchers extended the use of context further, leading to state-of-the-art language models such as GPT-3 that have become widely known of late.

2.4.2 Transformers

The state-of-the-art in language modelling is an architecture based entirely on an attention mechanism, with recurrence and convolutions removed from the network altogether. Recent transformer development has enabled several high-profile, powerful code-generating networks. The literature on code generation is discussed below, but first the transformer itself is briefly described.

Transformers were first introduced in the highly cited paper “Attention is all you need” by Vaswani et al., researchers at Google Brain5. The transformer is an encoder-decoder architecture in which the encoder produces an encoding that captures contextual information about how different parts of the input relate to one another (Vaswani et al. 2017). The work is done by a multi-headed attention block, made up of several self-attention layers and a fully connected feed-forward network (Vaswani et al. 2017). The multi-headed attention layer is so called because it combines several attention mechanisms, each encoding a different aspect of relevance within the input (Vaswani et al. 2017). In the code generation domain, one self-attention layer might encode a short-range dependency such as the requirement of an indent immediately after a Python function definition, while another encodes a longer-range dependency, such as the eventual need to end the Python function with, say, a “return Y”. Because these aspects are handled by different self-attention layers, the architecture parallelises extremely well, far better than an LSTM, and training is correspondingly faster (Vaswani et al. 2017). Using attention rather than recurrence also lets the model process the whole input at once, so context from a sequential input is immediately available. This avoids the training difficulties that affect RNNs and LSTMs, and lets short-range and long-range context be processed simultaneously (Vaswani et al. 2017). The decoder also uses self-attention, but with masks, so that predictions for position \(n\) only use outputs at positions less than \(n\) (Vaswani et al. 2017).
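
A minimal sketch of scaled dot-product self-attention in NumPy is given below; the learned query, key and value projections and the multiple heads of the full architecture are omitted, and the causal flag reproduces the decoder-style masking described above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, causal=False):
    """Scaled dot-product self-attention over the whole sequence at once.
    With causal=True, each position attends only to itself and earlier
    positions, as in the decoder's masked attention."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # pairwise relevance between tokens
    if causal:
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    return softmax(scores) @ X             # context-weighted mix of the inputs

tokens = np.random.default_rng(2).normal(size=(6, 32))   # 6 tokens, 32-dim embeddings
out = self_attention(tokens, causal=True)
```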

Although the original work focused on natural language, transformer-based models have since become the standard approach for code language models and code tasks. The efficiency gains from parallelisation both extend the usable context and allow training on larger corpora than RNNs and their variants, and both should translate into better performance on code generation. The next paragraphs cover some of the most cited applications of transformers to code generation tasks.

Svyatkovskiy et al. demonstrated the power of the transformer architecture in their 2020 paper, training a generative transformer on 1.2 billion lines of source code in Python, C# and TypeScript (Svyatkovskiy et al. 2020). The corpus size, together with the transformer’s contextual range and parallelisability, could be exploited at scale. For Python, their best model reached a perplexity of 1.82, and on a Python code completion task, 93 per cent of suggestions were syntactically correct (Svyatkovskiy et al. 2020).

The best-known code-generating transformers are those built on top of OpenAI’s GPT-3. GPT-3 (Generative Pre-trained Transformer 3) is a large language model trained on vast corpora, predominantly the natural language CommonCrawl corpus6. The researchers filtered this corpus and added other high-quality corpora (Brown et al. 2020), although what counts as high-quality is left unclear. Brown et al. note that a bug in their filtering left some overlap between the training data and the benchmarks used for evaluation. A model with the network capacity of GPT-3 (175 billion parameters) can learn the specificities of its training data to a high degree. Brown et al. state that GPT-3 is too expensive to retrain, which is itself a clear limitation of models at this scale.

Despite this, GPT-3 has proven a capable few-shot learner, adapting to new tasks from a handful of examples supplied in its prompt without any further training (Brown et al. 2020). Fine-tuning the natural-language GPT-3 on source code produces the code generation model Codex (Chen et al. 2021). Codex is presented in a preprint by OpenAI researchers, who note that a distinct production version of Codex powers the commercial code generation product GitHub Copilot (Chen et al. 2021). The differences between the version reported by Chen et al. and the commercial system are not publicly known, though the training procedure is presumably similar. Chen et al. evaluate their network only on Python tasks, using their HumanEval benchmark of 164 human-written programming tasks, each with unit tests. They find that Codex solves these tasks 28.8 per cent of the time, compared to 0 per cent for GPT-3. This is not surprising, since Codex is fine-tuned directly on GitHub source code while GPT-3 is a natural language model (Chen et al. 2021). It also exposes a limitation of generative transformer models: they are strongly biased towards reproducing things they have seen in training (Chen et al. 2021), even when the resulting output does not solve the task. Chen et al. do not report a perplexity score, so an intrinsic comparison with other models is not possible. They do report that on HumanEval, Codex solves the tasks 70.2 per cent of the time when sampled 100 times (Chen et al. 2021). That is impressive set against the difficulties the shortest-context models, the n-grams, had when generating anything beyond a few tokens.
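
Those pass rates are estimated per task from n generated samples, of which c pass the unit tests, using the unbiased pass@k estimator described by Chen et al.; the sketch below implements that formula with made-up numbers.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: the probability that at least one of k samples, drawn
    without replacement from n generated solutions of which c are correct,
    passes the unit tests (estimator described by Chen et al. 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task with 100 samples, 30 of which pass the tests.
print(pass_at_k(n=100, c=30, k=1))    # ≈ 0.30
print(pass_at_k(n=100, c=30, k=100))  # 1.0: at least one correct sample is certain
```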

Intrinsic measurements such as perplexity do not capture everything that high-context language models can do, but they remain a simple measure that should at least correlate with a model’s ability to capture statistical dependencies in code. Their omission may be a side-effect of the rising cost and commercial character of both code and natural language models, with researchers at private companies being less forthcoming. As discussed in section 2.2, the research community would benefit from agreeing on a standard extrinsic metric for code generation, to enable simpler and more valid comparisons between models.

This issue is brought into focus by Xu et al.’s comparison of a transformer language model trained solely on code. Their model, PolyCoder, was trained on a mix of code languages, and on the HumanEval test (Chen et al. 2021) performed worse than similarly sized models trained on the mixed code and natural language dataset known as the Pile7 (Xu et al. 2022; Gao et al. 2020). Xu et al. point out that HumanEval is conducted in Python, only one of the languages PolyCoder was trained on, and that PolyCoder has seen fewer Python tokens than the Pile contains (Xu et al. 2022).

The picture so far is that increasingly powerful and robust code language models have tracked the models’ ability to capture short and long-range context in their training data. Chen et al. observe, however, that Codex tends to solve problems incorrectly by suggesting code that looks right but is mainly close to something the network saw during training (Chen et al. 2021). They argue that this misalignment problem is likely to worsen as training datasets and network capacity grow (Chen et al. 2021). The model’s ability to memorise its training data is, in a sense, too high; it can map new problems onto what it has seen before too aggressively, even when it is capable of producing the correct code (Chen et al. 2021). Mitigating misalignment between user intent and model output is therefore a promising area for future research, and may require breaking with the current strategy of pushing context understanding ever higher, since that strategy is what has produced the problem in the first place.

3 Summary & Conclusion

Across this review, the broad picture is that a machine’s ability to generate code correlates strongly with its ability to capture context at many different levels. The growth in context capacity has been driven both by the availability of training data, i.e. big code, and by architectural innovations that exploit those large corpora. Language models have grown so capable, and so expensive, that we now run into the opposite problem: they capture and memorise too much of their training data. That issue suggests several lines of future work. Reducing the largest models’ tendency to produce answers too close to their training data is one. A related but distinct direction is to make these models smaller and cheaper; smaller models may be inherently less exposed to the misalignment problem while also being commercially attractive on cost. The current state of the art is produced by fine-tuning natural language models on code, so research into dedicated large code language models is one route to slimmer architectures. There has been some early work in this direction, such as Xu et al.’s 2022 PolyCoder, which was trained only on code corpora (Xu et al. 2022). PolyCoder did not match networks of similar size trained on both code and natural language, but it had also seen fewer tokens in the test language, Python, than the other models (Xu et al. 2022). Research into larger code-only datasets is therefore a concrete and immediate route for future work.

The lack of standard metrics was another recurring theme. Many authors built their own extrinsic metrics to evaluate their models, and some omitted intrinsic metrics such as perplexity altogether. This makes comparing and evaluating language models difficult. Continued research into evaluation metrics is therefore needed, as Chen et al. demonstrated with the HumanEval benchmark (Chen et al. 2021), later reused by Xu et al. (Xu et al. 2022). As noted above, HumanEval only tests a model’s ability in Python, producing the comparison issues seen in Xu et al.. Further research into multilingual metrics would help improve transparency in this domain.

The case for using probabilistic language models on code rests on the naturalness of code: even though code languages are mathematically well-formed, especially compared with natural language, the code humans actually write carries much of the same naturalness. Further work on the statistical structure of code languages may suggest changes to model architectures that capture context in written code more effectively. Code generation research at the moment borrows mostly from techniques developed for natural language. Developing code-specific techniques that exploit the mathematically well-formed nature of code and its highly reliable structural dependencies is a promising direction, and offers a possible middle path between the statistical and more traditional grammar-based research programmes. There is precedent for this. As section 2.3 noted, in Allamanis et al.’s 2014 paper an n-gram language model was paired with an algorithm that proposed only syntactically correct code snippets, with the language model used to score naturalness and stylistic consistency and to recommend code changes (Allamanis et al. 2014). A further precedent is Nguyen et al.’s semantic n-gram model (Nguyen et al. 2013), which incorporates global semantics into the model. These approaches worked well with short-context n-grams, and the same idea is an opportunity for today’s more expressive, longer-context models.

To close, the capability of language models for code generation has tracked their ability to capture context. That ability rests on the naturalness of code and on the rising expressiveness of the models themselves, which lets them capture the statistically driven structural dependencies that code carries.

References

Allamanis, Miltiadis, Earl T Barr, Christian Bird, and Charles Sutton. 2014. “Learning Natural Coding Conventions.” Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 281–93.
Allamanis, Miltiadis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2019. “A Survey of Machine Learning for Big Code and Naturalness.” ACM Computing Surveys 51 (4): 1–37. https://doi.org/10.1145/3212695.
Allamanis, Miltiadis, and Charles Sutton. 2013. “Mining Source Code Repositories at Massive Scale Using Language Modeling.” 2013 10th Working Conference on Mining Software Repositories (MSR), 207–16.
Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks 5 (2): 157–66.
Brown, Tom, Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–901.
Chen, Mark, Jerry Tworek, Heewoo Jun, et al. 2021. “Evaluating Large Language Models Trained on Code.” arXiv Preprint arXiv:2107.03374.
Choetkiertikul, Morakot, Hoa Khanh Dam, Truyen Tran, Trang Pham, Aditya Ghose, and Tim Menzies. 2018. “A Deep Learning Model for Estimating Story Points.” IEEE Transactions on Software Engineering 45 (7): 637–56.
Chomsky, Noam. 1956. “Three Models for the Description of Language.” IEEE Transactions on Information Theory 2 (3): 113–24. https://doi.org/10.1109/tit.1956.1056813.
Dam, Hoa Khanh, Truyen Tran, and Trang Pham. 2016. “A Deep Language Model for Software Code.” arXiv Preprint arXiv:1608.02715.
Dam, Hoa Khanh, Truyen Tran, Trang Pham, Shien Wee Ng, John Grundy, and Aditya Ghose. 2018. “Automatic Feature Learning for Predicting Vulnerable Software Components.” IEEE Transactions on Software Engineering 47 (1): 67–85.
Gabel, Mark, and Zhendong Su. 2010. “A Study of the Uniqueness of Source Code.” Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, 147–56.
Gao, Leo, Stella Biderman, Sid Black, et al. 2020. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv Preprint arXiv:2101.00027.
Hellendoorn, Vincent J, and Premkumar Devanbu. 2017. “Are Deep Neural Networks the Best Choice for Modeling Source Code?” Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 763–73.
Hindle, Abram, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. “On the Naturalness of Software.” Proceedings of the 34th International Conference on Software Engineering (Zurich, Switzerland), ICSE ’12, 837–47.
Karampatsis, Rafael-Michael, Hlib Babii, Romain Robbes, Charles Sutton, and Andrea Janes. 2020. “Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code.” 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), 1073–85.
Karpathy, Andrej, Justin Johnson, and Li Fei-Fei. 2015. “Visualizing and Understanding Recurrent Networks.” arXiv Preprint arXiv:1506.02078.
Nguyen, Tung Thanh, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N Nguyen. 2013. “A Statistical Semantic Language Model for Source Code.” Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, 532–42.
Santos, Eddie Antonio, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral. 2018. “Syntax and Sensibility: Using Language Models to Detect and Correct Syntax Errors.” 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 311–22.
Svyatkovskiy, Alexey, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. “Intellicode Compose: Code Generation Using Transformer.” Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1433–43.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.
White, Martin, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. 2015. “Toward Deep Learning Software Repositories.” 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, 334–45.
Xu, Frank F, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. “A Systematic Evaluation of Large Language Models of Code.” Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 1–10.

  1. https://openai.com/

  2. https://openai.com/blog/chatgpt/

  3. https://openai.com/blog/openai-codex/

  4. https://github.com/features/copilot

  5. https://research.google.com/teams/brain/?hl=pl

  6. https://commoncrawl.org/the-data/

  7. https://pile.eleuther.ai/