
Naturalness, Context, and Code: The Rise of Code Generating Language Models

machine learning

A review of how advances in context modeling, from n‑grams to transformers, enabled modern language models to generate code.

1 Introduction

Towards the end of 2022, the public release of a powerful language model capable of generating text from natural language prompts was a notable feature of the news cycle. This language model, called ChatGPT [2], was developed by OpenAI [1]. ChatGPT impressed many with its ability to generate convincingly natural text from simple natural language prompts. It can also generate code from natural language prompts, demonstrating the practical utility of modern language models. State-of-the-art language models such as ChatGPT have the potential to reduce code development times through code completion and to perform many other useful tasks involving code generation.

The predominant focus of this review is to identify how a language model’s ability to capture context within code languages improves its ability to generate code. To do this, we will first examine the theoretical justification for deploying natural language processing techniques in a code language domain and then track language model architectures’ capabilities as they leverage more and more context to generate code. We will survey the literature in this domain and see to what degree increased context capturing is evidence of greater performance. The literature is predominantly taken from peer-reviewed journals and conference proceedings. Where preprint articles are cited, it is because they are highly cited and have been influential within the field of code-generating language models; their results have been widely used and justified by empirical success. Additionally, the literature should indicate opportunities for improvement in this research domain, either through immediate routes to further code-generating language models or through limitations in existing research that need to be addressed.

The structure of this review will follow a broadly chronological order, as the amount of context captured by language models has increased over time. In section 2.1, we will examine the foundation of this field of research and how the concept of naturalness was justified to exist in artificial and mathematically well-formed code languages, thus allowing the use of natural language processing techniques on code languages. In section 2.2, we shall outline the metrics used to evaluate language models. In section 2.3, we shall discuss early language models’ success in capturing short-range context in code. In section 2.4, we will review how deep learning’s improvements in capturing context have produced powerful code-generating language models. Finally, in section 3 we will summarise our findings, present our conclusions and note possible routes for future research.

2 Literature Review

2.1 The naturalness of code

The underlying concept that allows statistical techniques from natural language processing to be applied to code is recognising and accepting code as natural. This point is compellingly made by Hindle et al. in the proceedings of an IEEE conference on software engineering. The paper’s central claim is that actual code, i.e. code written by people, bears hallmarks of natural language, such as actual utterances generally being simple and repetitive (Hindle et al. 2012). Recognition that natural language possessed these qualities represented a dramatic shift from the dictionary- and grammar-based approaches of researchers such as Noam Chomsky (Chomsky 1956). Statistical methods applied to natural language from the 1980s onwards have yielded many empirically tested and successful technologies, such as language translation and speech recognition. What Hindle et al. claim is that if code languages can be shown to possess a similar statistical nature to natural languages, then we are licensed to use empirically successful NLP techniques in the domain of artificial code languages.

Hindle et al. argue that everyday human communication evolved to be quick and effective in noisy environments. Therefore, simplicity and gaining understanding from identifying patterns, i.e., repetition, are key to understanding actual instances of natural language. Hindle et al. claim that this is also true for code languages, as code is often as much about communication with other humans as with the computer (Hindle et al. 2012). Most written programs repeat code snippets and favour particular formats that, whilst other formats are possible, are easy for humans to read and interpret, and so become extremely common. This claim is supported by the earlier work of Gabel and Su presented at an ACM symposium on the foundations of software engineering, in which they analysed approximately 430 million lines of source code for uniqueness and found a distinct lack of it. They found a high level of repetition in their code corpus at granularities of between one and seven lines of code (Gabel and Su 2010).

Hindle et al. tested their claim that a statistical approach should be appropriate by creating an n-gram language model and examining whether it captured regularities in software. We will discuss n-gram language models and their implementation and successes in section 2.3. Are these regularities due to code language syntax or “naturalness”? Hindle et al. found the language model to capture a high level of local regularity, more so than in English. They also found that the local regularity is not an artefact of language syntax, evidencing the existence of the so-called naturalness of code (Hindle et al. 2012). We have discussed the success their language model has had at capturing regularities in code languages; in section 2.2 we will illustrate how code language models are assessed and what more needs to be done to ensure they are being validly compared to one another.

This work by Hindle et al. has proven foundational to the field of probabilistic language models for code, as evidenced by the thousands of citations of their work and the dominance of language models in the decade since its publication. We could find no papers that explicitly argue against the validity of using statistical NLP methods for code languages; we believe this to be due to the proliferation and undeniable success of these methods in this domain. The remainder of this paper is predicated on this license to use NLP techniques to develop language models for code generation in several different ways, from rudimentary n-grams to increasingly advanced deep learning techniques.

2.2 Metrics

Throughout this paper, researchers use different metrics to gauge their models’ abilities on code-generation tasks. One such type is intrinsic evaluation metrics; these may not capture all of a model’s abilities but are relatively standardised and used regularly. Researchers also use extrinsic evaluation methods to give a more rounded picture of their models’ abilities on realistic tasks the models are likely to encounter. However, extrinsic metrics vary more, and there is still no standard across the literature reviewed in this paper. In this section, we will discuss the most used intrinsic metrics and a recently created extrinsic metric that was developed by commercial researchers and has been adopted in a recent peer-reviewed paper.

2.2.1 Intrinsic metrics

The most frequent intrinsic metric in the papers covered by this review is perplexity. It is derived from the cross-entropy, an intrinsic metric in which we view the language model as a compression algorithm. The language model predicts the output; in a sense, it attempts to decompress the output from its input. The cross-entropy measures the average number of added bits of information per code token that the language model requires to decompress the complete and correct output. The perplexity is 2 raised to the power of the cross-entropy and is commonly used in its place. Lower values for both of these related intrinsic metrics mean that the language model required fewer bits of information and is closer to an ideal model, which decompresses perfectly with no added information (Allamanis et al. 2019).
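
Concretely, if a language model assigns probability \(p(a_i | a_1 ... a_{i-1})\) to each token of a held-out sequence of \(N\) tokens, the two metrics are related as follows (a standard formulation, stated here for reference rather than drawn from any single cited paper):

\[\begin{equation} H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 p(a_i \mid a_1 \ldots a_{i-1}), \qquad \mathrm{PP} = 2^{H} \end{equation}\]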

2.2.2 Extrinsic metrics

One particularly noteworthy addition to extrinsic metrics is the HumanEval test developed by Chen et al. This dataset contains 164 human-written programming tasks developed specifically for testing the ability of a language model to perform code generation (Chen et al. 2021). Chen et al.’s paper describes the development of Codex [3], a language model that has been used to power the commercial GitHub Copilot [4]; the exact development of the commercial Codex has not been publicly published, although we infer it is similar to the version reported by Chen et al. Xu et al. use this in their 2022 paper as justification for using HumanEval to evaluate several different code-generating language models. Using HumanEval as their extrinsic metric, they can guarantee that their comparison tests are not contaminated with data on which the models have been trained (Xu et al. 2022). The focus of Xu et al.’s paper is to systematically evaluate existing large language models for code generation tasks. They note that training state-of-the-art models has become so expensive that it is predominantly done by large companies that do not release open-source models, which limits research by lower-resourced organisations (Xu et al. 2022). Because of this, we may see a trend of commercial entities developing the extrinsic metrics that are used by the research community, as in the case of HumanEval.
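
To make the style of evaluation concrete, the sketch below shows a minimal HumanEval-style functional-correctness check: a model-generated completion is appended to the task prompt, executed, and run against hand-written unit tests. The task, function name and tests are hypothetical examples rather than items from the actual benchmark, and a real harness would sandbox the untrusted code.

```python
# Minimal sketch of HumanEval-style functional-correctness checking.
# The task below is hypothetical; the real benchmark contains 164 tasks,
# each with a prompt, a canonical solution and a set of unit tests.
task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
}

candidate_completion = "    return a + b\n"  # what a model might generate

def passes(task: dict, completion: str) -> bool:
    """Append the completion to the prompt, then run the task's unit tests."""
    program = task["prompt"] + completion + "\n" + task["test"]
    namespace: dict = {}
    try:
        exec(program, namespace)  # NOTE: a real harness must sandbox untrusted code
        return True
    except Exception:
        return False

print(passes(task, candidate_completion))  # True for this correct completion
```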

2.3 Short Context Success

This section discusses a short-context statistical technique that has proven very practical for capturing simple and local dependencies in code sequences: the n-gram language model, such as the one used to support Hindle et al.’s theory of the naturalness of code (Hindle et al. 2012). We will first outline how n-gram models function, their limitations and how researchers have adapted and improved upon them.

N-gram models are sequence-based models that, in the case of code, statistically predict which token follows next in a sequence. N-grams are Markov models: they assume a Markov property in which only the previous \(n-1\) tokens are used to statistically model the \(n\)th token (Allamanis et al. 2019; Hindle et al. 2012; Nguyen et al. 2013; Hellendoorn and Devanbu 2017). Thus, for a 3-gram (a.k.a. trigram) model, the probability of the \(i\)th token is approximated as in Eq. 1.

\[\begin{equation} p(a_i \mid a_1 \ldots a_{i-1}) \approx p(a_i \mid a_{i-2}\, a_{i-1}) \tag{1} \end{equation}\]

The \(i\)th token is modelled by maximum-likelihood estimation, i.e., simple frequency counting within a training corpus.
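
For the trigram case, this maximum-likelihood estimate is simply a ratio of observed counts (a standard formulation, stated here for clarity):

\[\begin{equation} \hat{p}(a_i \mid a_{i-2}\, a_{i-1}) = \frac{\mathrm{count}(a_{i-2}\, a_{i-1}\, a_i)}{\mathrm{count}(a_{i-2}\, a_{i-1})} \end{equation}\]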

It is also important to note that n-grams are left-to-right code generators (Allamanis et al. 2019; Nguyen et al. 2013). As anyone who has ever programmed will realise, that is not how code is typically written: some parts of code utilise other parts that are ordered and positioned quite a distance away. Whilst this may seem a sizeable conceptual problem with n-gram language models, all the language models discussed in this paper are also left-to-right generators and have still proven competent at code generation. This will allow us to compare our simplest and most complex language models.

The final broad technical point about n-grams that we should note is that there will surely be cases in which a corpus does not include examples of all possible n-grams. Of course, to reduce the possible number of combinations, we can reduce the value of \(n\); even so, there will still be possible n-grams that are not in the training corpus. Because maximum-likelihood estimation assigns unseen n-grams zero probability, smoothing is used to assign them small non-zero probabilities. Hindle et al. found that for their experiment, Kneser-Ney smoothing was effective (Hindle et al. 2012).
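
A minimal trigram model illustrating these three points, frequency counting, left-to-right prediction and smoothing, is sketched below. It uses crude add-one (Laplace) smoothing purely for illustration; Hindle et al.’s Kneser-Ney smoothing is considerably more sophisticated, and the toy corpus here is hypothetical.

```python
from collections import Counter

class TrigramModel:
    """Minimal trigram language model over code tokens (illustrative only)."""

    def __init__(self, smoothing: float = 1.0):
        self.smoothing = smoothing   # add-one (Laplace) smoothing constant
        self.trigrams = Counter()    # counts of (t1, t2, t3)
        self.bigrams = Counter()     # counts of (t1, t2) contexts
        self.vocab = set()

    def train(self, token_sequences):
        """Maximum-likelihood training by simple frequency counting."""
        for tokens in token_sequences:
            padded = ["<s>", "<s>"] + tokens + ["</s>"]
            self.vocab.update(padded)
            for a, b, c in zip(padded, padded[1:], padded[2:]):
                self.trigrams[(a, b, c)] += 1
                self.bigrams[(a, b)] += 1

    def prob(self, context, token):
        """p(token | last two context tokens), smoothed so unseen trigrams are non-zero."""
        a, b = context[-2], context[-1]
        numerator = self.trigrams[(a, b, token)] + self.smoothing
        denominator = self.bigrams[(a, b)] + self.smoothing * len(self.vocab)
        return numerator / denominator

    def next_token(self, context):
        """Greedy left-to-right prediction of the most probable next token."""
        return max(self.vocab, key=lambda t: self.prob(context, t))

# Tiny hypothetical corpus of tokenised code lines.
corpus = [["for", "i", "in", "range", "(", "n", ")", ":"],
          ["for", "j", "in", "range", "(", "m", ")", ":"]]
model = TrigramModel()
model.train(corpus)
print(model.next_token(["for", "i"]))  # most likely continuation: "in"
```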

The impact of the quantity and quality of data on n-grams in natural language is well known; more data is a reliable way to improve the performance of n-grams. This is precisely the approach taken by Allamanis and Sutton: in their 2013 IEEE paper, they compose a source code corpus that is 100 times larger than the one used by Hindle et al. (Allamanis and Sutton 2013). Their corpus is the first for code that contains over a billion tokens; they term their resulting model a giga-token language model. Like Hindle et al., Allamanis and Sutton use a trigram (3-gram) model; however, the exact smoothing technique they use is not specified. Allamanis and Sutton report a decrease in perplexity by an order of magnitude compared to Hindle et al.’s earlier work (Allamanis and Sutton 2013). Whilst we would like to know the exact smoothing technique employed in this work, we can infer that the training corpus’s size has led to the performance gain, because the influence of how the language model handles unseen instances decreases as the training corpus grows and more n-grams are observed. A key observation is made by Allamanis and Sutton using a “collapsed model”, in which the tokenizer converts all identifiers to a special identifier token and all literals to a special literal token. Thus, the tokenized corpus contains only keywords and symbols, which represent the more structural aspects of the code (Allamanis and Sutton 2013). The authors report that their collapsed model has learnt all it can about the syntax of code in the corpus after approximately 100,000 lines of code (Allamanis and Sutton 2013). This is a tiny fraction of the total 352 million lines of code in the corpus and provides supporting evidence for the fact that n-grams are very local and cannot handle long-range dependencies.
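
A collapsed tokenizer of this kind might look like the sketch below. It uses Python’s tokenize module for illustration, whereas Allamanis and Sutton worked with a Java corpus, and the placeholder tokens are assumptions rather than their exact scheme.

```python
import io
import keyword
import tokenize

def collapse_tokens(source: str) -> list[str]:
    """Collapse identifiers and literals to placeholder tokens, keeping only
    keywords and symbols (an illustrative sketch of a 'collapsed' tokenizer)."""
    collapsed = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keep language keywords, collapse user-chosen identifiers.
            collapsed.append(tok.string if keyword.iskeyword(tok.string) else "<ID>")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            collapsed.append("<LIT>")
        elif tok.type == tokenize.OP:
            collapsed.append(tok.string)
    return collapsed

print(collapse_tokens("total = price * 3 + tax\n"))
# ['<ID>', '=', '<ID>', '*', '<LIT>', '+', '<ID>']
```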

Allamanis and Sutton’s 2013 paper has been quite influential within the n-gram analysis of code languages and has received hundreds of citations. These authors continued to work with n-grams and their large training corpus to produce another highly cited work in 2014 that uses an n-gram language model to perform a real-world function. Their 2014 language model suggests changes to a programmer’s code that improve stylistic consistency and adherence to a project’s coding conventions (Allamanis et al. 2014). Allamanis et al. highlight an extrinsic measure of their work’s success: the generation of 18 patches for open source projects, of which 14 were accepted (Allamanis et al. 2014). The way in which the language model is employed is quite novel: essentially, the n-gram model is used as a scoring function to determine the most “natural” option among alternative code snippets suggested by a proposer that contains some filtering logic. The easy extension of early n-gram work with an improved training corpus to a real code generation application demonstrates the usefulness and power of language models, even those with poor long-range context understanding.

Work by Nguyen et al., published in the proceedings of an ACM conference, demonstrates another approach to improving short-context n-gram language models for code generation: increasing the amount of context these language models can leverage when performing code generation tasks (Nguyen et al. 2013). They do this by incorporating semantic information into code tokens (Nguyen et al. 2013), which goes beyond the purely lexical and far more local information used in the n-gram models of Hindle et al., Allamanis and Sutton, and Allamanis et al. Nguyen et al. argue that code has well-defined semantics that go beyond the lexical level of analysis taken by previous n-gram language models (Hindle et al. 2012; Allamanis and Sutton 2013; Allamanis et al. 2014). Nguyen et al. achieve this by adding associated semantic information to each lexical code token, including its ID, role, data type, scope and structural and data dependencies, plus what the authors term its sememe, a structured annotation of its semantic value (Nguyen et al. 2013). Nguyen et al. trained their semantic n-gram model using the same corpus as Hindle et al., and the result is a semantic n-gram language model that, when evaluated on open-source projects, achieves between 18% and 68% higher accuracy for code suggestion than previous n-gram models (Nguyen et al. 2013). It is important to note that the n-gram language model is still capturing local statistical dependencies, but that these are combined with the global technical considerations present in semantically well-defined code languages (Nguyen et al. 2013).
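
An annotated token of the kind Nguyen et al. describe might be represented roughly as below; the field names and values are illustrative assumptions, not their exact schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SemanticToken:
    """A lexical code token augmented with semantic annotations.
    Field names are illustrative assumptions, not Nguyen et al.'s exact schema."""
    lexeme: str                          # raw token text, e.g. "price"
    role: str                            # e.g. "variable", "method_call", "keyword"
    data_type: Optional[str] = None      # e.g. "int", "String"
    scope: Optional[str] = None          # e.g. "local", "field", "global"
    dependencies: list = field(default_factory=list)  # data/structural dependencies

# The n-gram model then conditions on these richer tokens instead of raw text alone.
token = SemanticToken(lexeme="price", role="variable",
                      data_type="int", scope="local",
                      dependencies=["total"])
print(token)
```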

In this section, we have seen two promising methods for improving the ability of code-generating language models. The first is to increase the size of the training corpus (Allamanis and Sutton 2013), an approach fuelled by the availability of big data for code, i.e., big code (Allamanis et al. 2019). As in almost all scientific disciplines, the availability of data has increased dramatically over the past several years; this trend has been particularly pronounced for real utterances of natural language and code language, driven entirely by the internet. This has provided a tremendous resource for statistical language models, and big code has been used significantly in nearly every approach to code synthesis, so the use of increasingly large corpora in the development of language models is not surprising. The second approach, which provides a clear narrative for the improvements in language models, is to include information about longer-range statistical dependencies in the model; in other words, to increase the ability of language models to generate code from more context than n-grams are capable of using. In the next section, we look at increasingly powerful language models that leverage their ability to understand more context to produce more accurate code completion and generation, and examine the trend towards larger and more expensive language models.

2.4 More Context, More Power

In this section, we will examine the transition to larger language models far more capable of capturing longer-range dependencies than n-gram models. We will discuss the developments deep learning has brought by incorporating more context into code generation. We will observe how this process has moved from recurrent neural networks (RNNs) and long short-term memory (LSTM) models to state-of-the-art high-context capturing transformers.

This section analyses the most cited and influential papers treated in this review; with total citations of the order of 10,000, these works have changed the landscape of both code and natural language generation in the last five years. Their authors have done this by utilising more and more context in their models.

2.4.1 RNNs and LSTM Networks

We will first consider the application of RNNs to the code generation task. To do this, we will examine neural networks’ advantages over n-gram models and then examine their successes and subsequent evolution. We will initially use the foundational 2015 paper on RNNs for code generation by White et al. In this IEEE-published paper, the authors present an RNN and compare its performance to a baseline n-gram model for code generation (White et al. 2015). White et al. first consider the difference in expressive power between n-gram models and feed-forward networks; they argue that the projection of a token into feature space before an activation function results in highly expressive models (White et al. 2015): far more expressive than n-grams, so expressive that overfitting can be an issue due to the network’s ability to learn complex representations (White et al. 2015). However, feed-forward networks are simply more expressive models with greater capacity than n-grams; they do not reliably learn long-range dependencies and thus do not integrate more context. Thus, White et al. dismiss them in favour of a model more capable of processing sequential information, such as written code.

RNNs add short-term memory by feeding a hidden state vector back into the network at the next step. This provides more context for the current prediction, as the previous state carries information about earlier tokens (White et al. 2015). It also allows them, unlike feed-forward networks, to process sequences with a non-fixed number of computational steps, a feature that is intuitively useful in tasks such as code generation. This allows the RNN to learn beyond first-order temporal dependencies, and an RNN is theoretically able to utilise an arbitrary number of levels of context (White et al. 2015). However, this comes with some issues, such as being inherently difficult to train due to problems with unstable gradients.
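
In its simplest form, the recurrence can be written as follows (the standard vanilla-RNN formulation, stated here for clarity rather than taken verbatim from White et al.), where the hidden state \(h_{t-1}\) carries context forward from all earlier tokens in the sequence:

\[\begin{equation} h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \qquad y_t = \mathrm{softmax}(W_y h_t + c) \end{equation}\]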

White et al. compared their RNN language model to an 8-gram model and found the perplexity of the RNN to be lower by more than 2, at around PP \(\approx 10\), a significant performance gain (White et al. 2015). RNNs are also capable of online learning, i.e., performing back-propagation as test documents are presented, resulting in the very low perplexity of PP \(\approx 3.6\). The authors suggest that a committee of static and dynamic RNNs could be used for code completion or suggestion tasks (White et al. 2015). Thus we have seen that the addition of a short-term memory mechanism and the ability to learn higher levels of context in code corpora have resulted in improved performance of code language models.

One of the most popular and successful variants of RNNs is the aforementioned Long Short-Term Memory (LSTM) network. These networks were initially designed to mitigate the vanishing gradient problem that RNNs are infamous for suffering from (Bengio et al. 1994). In addition to the hidden state vector present in RNNs, LSTM networks maintain a memory state vector (Karpathy et al. 2015). The memory state vector is designed to allow gradients on the memory cells to flow back through time for long periods unless interrupted by an active forget gate, which causes the memory vector to partially forget old and irrelevant information (Karpathy et al. 2015; Choetkiertikul et al. 2018). In the domain of character-level natural language models, the 2015 paper by Karpathy et al. found that the success of LSTMs lies in the improvements in a deep neural network’s ability to learn long-range structural dependencies (Karpathy et al. 2015). The LSTM network is, therefore, more capable than an RNN of forming representations using longer context. The similarity between natural language and code language, as examined in section 2.1, suggests that the same holds for LSTM networks applied to code generation tasks.
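
The standard LSTM update makes this mechanism concrete (a textbook formulation rather than one specific to the cited papers): gates control what is forgotten from, written to and read from the memory vector \(c_t\):

\[\begin{aligned} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\ i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}\]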

Dam et al. implemented an LSTM network in 2016; their preprint paper has been highly influential in analysing the ability of the LSTM architecture on the same Java corpus as Hindle et al. (Hindle et al. 2012; Dam et al. 2016). Dam et al. explicitly compare their LSTM network to the vanilla RNN developed in the paper by White et al. (Dam et al. 2016). Using the same training corpus as Hindle et al. provides strong support for the conclusions drawn by Dam et al. regarding the superiority of a longer-context model over a vanilla RNN. Dam et al. observe the LSTM network consistently performing better than the RNN they develop as a baseline. One particularly interesting finding is that, for a fixed embedding dimensionality, the LSTM network’s advantage over the RNN grows with sentence length (Dam et al. 2016). At the largest sentence length and embedding dimensionality, the LSTM network’s perplexity was a 37.9% improvement over the RNN’s. Both the RNN and LSTM networks were trained using the same adaptive stochastic gradient descent method (RMSprop). The increased performance of the LSTM network is attributed to its improved ability to learn long-range dependencies, as it does not suffer as severely from the vanishing/exploding gradient issues that afflict vanilla RNNs (Bengio et al. 1994; Dam et al. 2016). By allowing effective training over longer sequences than RNNs, LSTM networks can learn using more context than all previous methods we have discussed.

While this paper is a preprint, it is widely cited and considered important by surveys on code generation from IEEE and ACM venues (Allamanis et al. 2019; Hellendoorn and Devanbu 2017; Karampatsis et al. 2020). Two peer-reviewed papers in particular, developed by teams including some of the researchers behind Dam et al., use its results to further the development of code generation tools that improve the speed of software development (Choetkiertikul et al. 2018; Dam et al. 2018). In recent years we have seen the adoption of LSTM networks for a wide range of code generation and analysis tasks, such as predicting vulnerabilities in written code (Dam et al. 2018), estimating the effort required to complete software projects to aid design and planning (Choetkiertikul et al. 2018) and correcting syntax errors (Santos et al. 2018).

The limitations of LSTM networks stem from how they encode and decode tokens sequentially. This limits the context they can understand and use to produce accurate and helpful code generation and analysis. In the next section, we will examine and discuss the methods used to push the use of context further and create state-of-the-art language models, such as GPT-3, that have become famous as of late.

2.4.2 Transformers

The state-of-the-art in language models is a type of model based entirely upon an attention mechanism, with recurrence and convolutions removed from the network. The recent development of transformers has allowed for the creation of several high-profile and powerful code-generating networks. We will discuss this literature on code generation tasks; first, however, we will examine what a transformer is and how it functions.

Transformers were first developed in the highly cited paper “Attention is all you need” by Vaswani et al., researchers at Google Brain [5]. The transformer is an encoder-decoder architecture in which the encoding block generates an encoding that contains contextual information about how different parts of the input are related to one another (Vaswani et al. 2017). This is done through a multi-headed attention block, which consists of different self-attention layers and a fully connected feed-forward neural network (Vaswani et al. 2017). The multi-headed attention layer is so called because it consists of several different attention mechanisms that encode different aspects of relevance within the input (Vaswani et al. 2017). An example within our code generation domain would be that one self-attention head encodes a short-range dependency, e.g. the requirement of an indent immediately after the start of a function definition in Python, whilst another self-attention head encodes a longer-range dependency, such as the need to end the Python function with, e.g., a “return y” statement. The fact that this is done by different self-attention layers allows for a high degree of parallelization, certainly far more than is possible in an LSTM network, allowing for a significant improvement in training time (Vaswani et al. 2017). Using attention mechanisms instead of recurrence mechanisms also allows the whole input to be processed simultaneously, allowing immediate use of context from sequential inputs. This avoids the inherent difficulties in training RNNs and LSTM networks and means that both short and long-range context are processed simultaneously (Vaswani et al. 2017). The decoder also contains self-attention layers; these layers contain masks so that the decoder only makes predictions for the \(n\)th position using information from outputs at positions before \(n\) (Vaswani et al. 2017).
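
The core operation inside each attention head is scaled dot-product attention, defined by Vaswani et al. as:

\[\begin{equation} \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \end{equation}\]

Here the queries \(Q\), keys \(K\) and values \(V\) are linear projections of the input token embeddings and \(d_k\) is the key dimension; because every token attends to every other token in a single matrix operation, short- and long-range context are captured simultaneously (Vaswani et al. 2017).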

Although originally focused on natural language, transformer-based language models have become the standard approach for code language models and tasks. Their efficiency, resulting from parallelization, allows both the incorporation of more context and training with larger corpora than RNNs and their variants. Both factors should yield improvements in the code generation domain. We now discuss some of the most cited examples of the implementation of transformers in code generation tasks.

Svyatkovskiy et al. demonstrated the power of the transformer architecture in their 2020 paper, in which they trained a generative transformer model on 1.2 billion lines of source code in the Python, C# and TypeScript programming languages (Svyatkovskiy et al. 2020). The size of the training corpus allowed the transformer architecture’s contextual understanding and parallelization abilities to be leveraged significantly. In Python, their best model achieved a perplexity of 1.82. Additionally, they found that when performing a code completion task in Python, 93% of the suggestions were syntactically correct (Svyatkovskiy et al. 2020).

The most famous examples of code-generating transformer language models are those based on the GPT-3 model developed by OpenAI. GPT-3 (Generative Pre-trained Transformer 3) is a large language model trained on vast corpora, predominantly the natural language CommonCrawl corpus [6]. The researchers performed some filtering on this corpus and added some other high-quality corpora (Brown et al. 2020), although what they consider to be high quality is unclear. Brown et al. note that, due to a bug during filtering, some overlaps between the training data and evaluation benchmarks were not removed. A model with as high a network capacity (175 billion parameters) as GPT-3 can learn the specificities of its training data to a high degree. Brown et al. state that GPT-3 is too expensive to retrain, highlighting a limitation of such large models.

Despite this issue, GPT-3 has proven to be a capable few-shot learner (Brown et al. 2020), and fine-tuning it with additional training data is the method employed to turn the natural language GPT-3 into the code generation language model Codex (Chen et al. 2021). Codex is presented in a preprint paper authored by researchers working for OpenAI; they note in the paper that a distinct production version of Codex powers the commercial code generation programme GitHub Copilot (Chen et al. 2021). We must note that the differences between the version of Codex presented in Chen et al. and the commercial version are not publicly known, although we can infer that the way the transformer network is trained will be similar. Chen et al.’s network was evaluated solely on Python code tasks using an evaluation system they developed called HumanEval, consisting of 164 human-written programming tasks, each containing unit tests. The authors found that Codex solves these tasks 28.8% of the time compared to GPT-3’s 0%. This is not surprising, as Codex is fine-tuned directly on GitHub source code, whereas GPT-3 is a natural language model (Chen et al. 2021). This also exposes a limitation of such generative transformer language models: they are highly likely to produce results they have seen in their training data (Chen et al. 2021), even when the resulting output does not perform the task correctly. Chen et al. do not provide a perplexity score for their network, so we cannot compare its performance intrinsically to other language models. We can note that for their HumanEval tests, Codex solved the problems 70.2% of the time when sampled 100 times (Chen et al. 2021). This is quite impressive when compared with the difficulties our shortest-context language models, n-grams, had when attempting to use context beyond a handful of tokens.
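
The sampling-based figures above correspond to the pass@\(k\) metric, which Chen et al. estimate with an unbiased estimator: for each task, \(n \ge k\) samples are generated, of which \(c\) pass the unit tests, giving

\[\begin{equation} \text{pass@}k = \mathbb{E}_{\text{tasks}}\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right] \end{equation}\]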

Whilst intrinsic measurements such as perplexity do not capture all the abilities of high-context language models, perplexity is a simple measure that should still correlate somewhat with a model’s ability to capture statistical dependencies in code languages. Omitting such metrics may be a side-effect of the increasing cost and commercial nature of both code and natural language models, resulting in less transparency from researchers developing these models for private companies. As discussed in section 2.2, it would be advantageous for the research community to accept a standard extrinsic metric for code generation tasks to allow for simpler and more valid comparisons between models.

This issue is highlighted by Xu et al.’s comparison involving a transformer language model trained solely on code data. Their model, called PolyCoder, was trained on a variety of different code languages, yet it performed worse on the HumanEval test (Chen et al. 2021) than models of a similar size trained on the Pile [7], a mixed code and natural language dataset (Xu et al. 2022; Gao et al. 2020). Xu et al. note that the HumanEval test is conducted in Python, only one of the languages PolyCoder is trained on, and that PolyCoder has been trained on fewer Python tokens than are contained in the Pile (Xu et al. 2022).

We have observed that the development of increasingly powerful and robust code language models has followed the ability of these models to capture short and long-range context within their training data. However, Chen et al. note the propensity of Codex to solve problems incorrectly by suggesting code that appears correct but is not, simply because it is more similar to code the network saw during training (Chen et al. 2021). The authors note that this misalignment problem will likely worsen as training datasets get larger and the expressive power of the networks increases (Chen et al. 2021). Arguably, the model’s ability to memorise its training data is too high: it maps new problems too readily onto precisely what it has seen before, even though it is capable of producing the correct code (Chen et al. 2021). Mitigating misalignment between users and language models is certainly an exciting domain for future research, as it may require breaking with the current methods used to increase the models’ context understanding, given that this approach has produced the misalignment issue.

3 Summary & Conclusion

Throughout this review, we have seen that a machine’s ability to perform code generation tasks strongly correlates with the model’s ability to understand context at many disparate levels. The power of language models to understand more and more context has been driven by both the availability of training data, i.e., big code, and innovations in architecture that leverage the vast training corpora. The power and expense of language models have increased so significantly that we now approach the issue that they are too capable of capturing context and memorising their training data. This issue presents a number of opportunities for future research; reducing the propensity of the largest models to give answers too similar to their training data is one such avenue. A related but distinct route may be to make these models smaller and less expensive; smaller and cheaper models may be inherently less exposed to the misalignment problem whilst possessing commercial advantages at reduced expense. The current state-of-the-art models are fine-tuned to code from natural language models; research into dedicated large code language models may be at least one route to diminishing their size. There has been some early research into this with Xu et al.’s 2022 PolyCoder model, which was trained only on code corpora (Xu et al. 2022). Whilst the performance of PolyCoder was not as good as that of similarly sized networks trained on code and natural language, PolyCoder had seen fewer overall tokens in the test language, i.e., Python (Xu et al. 2022). Thus, research into larger code-only datasets is a concrete and immediate route for future work.

The issue of standard metrics was also highlighted in this paper. Many authors created their own extrinsic metrics to assess their models, and some did not include intrinsic metrics such as perplexity. This results in difficulties comparing and evaluating language models. Therefore, continued research into evaluation metrics is required; one such effort is the HumanEval test developed by Chen et al. (Chen et al. 2021), which was later used by Xu et al. (Xu et al. 2022). However, as we have highlighted, HumanEval only assesses a model’s ability in Python, resulting in the comparison issues we saw in Xu et al.’s work. Further research into multilingual metrics is needed to further transparency in this research domain.

The validity of using probabilistic language models in the code domain has been predicated on the naturalness of code: code, whilst mathematically well-formed, especially compared to natural language, possesses much of the same statistical regularity when written by humans. Future research into the statistical structure of code languages may yield results that improve and change model architectures to better capture the context in written code. Currently, code generation research predominantly borrows techniques developed within the field of natural language processing. Developing code-specific techniques that leverage the mathematically well-formed nature of code languages, with their highly reliable structural dependencies, is a promising route for future research, presenting a possible middle path between the statistical and more traditional grammar-based research programmes. There is precedent for such research: as we saw in section 2.3, in Allamanis et al.’s 2014 paper an n-gram language model was used in cooperation with an algorithm that proposed only syntactically correct code snippets. The language model was used to quantify naturalness and stylistic consistency and to recommend code changes on that basis (Allamanis et al. 2014). Another precedent set by n-grams is the semantic n-gram model of Nguyen et al. (Nguyen et al. 2013), which incorporates global semantics into the model. These approaches performed well with short-context n-grams; thus, such an approach presents an opportunity for our state-of-the-art, more expressive, longer-context models.

To end, we should note that the ability and power of language models for code generation have tracked with these models’ capability to capture context. This ability to capture context comes from the naturalness of code and language models’ increasing expressiveness to capture these statistically driven structural dependencies.

References

Allamanis, Miltiadis, Earl T Barr, Christian Bird, and Charles Sutton. 2014. “Learning Natural Coding Conventions.” Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 281–93.
Allamanis, Miltiadis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2019. “A Survey of Machine Learning for Big Code and Naturalness.” ACM Computing Surveys 51 (4): 1–37. https://doi.org/10.1145/3212695.
Allamanis, Miltiadis, and Charles Sutton. 2013. “Mining Source Code Repositories at Massive Scale Using Language Modeling.” 2013 10th Working Conference on Mining Software Repositories (MSR), 207–16.
Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks 5 (2): 157–66.
Brown, Tom, Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–901.
Chen, Mark, Jerry Tworek, Heewoo Jun, et al. 2021. “Evaluating Large Language Models Trained on Code.” arXiv Preprint arXiv:2107.03374.
Choetkiertikul, Morakot, Hoa Khanh Dam, Truyen Tran, Trang Pham, Aditya Ghose, and Tim Menzies. 2018. “A Deep Learning Model for Estimating Story Points.” IEEE Transactions on Software Engineering 45 (7): 637–56.
Chomsky, N. 1956. “Three Models for the Description of Language.” IEEE Transactions on Information Theory 2 (3): 113–24. https://doi.org/10.1109/tit.1956.1056813.
Dam, Hoa Khanh, Truyen Tran, and Trang Pham. 2016. “A Deep Language Model for Software Code.” arXiv Preprint arXiv:1608.02715.
Dam, Hoa Khanh, Truyen Tran, Trang Pham, Shien Wee Ng, John Grundy, and Aditya Ghose. 2018. “Automatic Feature Learning for Predicting Vulnerable Software Components.” IEEE Transactions on Software Engineering 47 (1): 67–85.
Gabel, Mark, and Zhendong Su. 2010. “A Study of the Uniqueness of Source Code.” Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, 147–56.
Gao, Leo, Stella Biderman, Sid Black, et al. 2020. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv Preprint arXiv:2101.00027.
Hellendoorn, Vincent J, and Premkumar Devanbu. 2017. “Are Deep Neural Networks the Best Choice for Modeling Source Code?” Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 763–73.
Hindle, Abram, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. “On the Naturalness of Software.” Proceedings of the 34th International Conference on Software Engineering (Zurich, Switzerland), ICSE ’12, 837–47.
Karampatsis, Rafael-Michael, Hlib Babii, Romain Robbes, Charles Sutton, and Andrea Janes. 2020. “Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code.” 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), 1073–85.
Karpathy, Andrej, Justin Johnson, and Li Fei-Fei. 2015. “Visualizing and Understanding Recurrent Networks.” arXiv Preprint arXiv:1506.02078.
Nguyen, Tung Thanh, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N Nguyen. 2013. “A Statistical Semantic Language Model for Source Code.” Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, 532–42.
Santos, Eddie Antonio, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral. 2018. “Syntax and Sensibility: Using Language Models to Detect and Correct Syntax Errors.” 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 311–22.
Svyatkovskiy, Alexey, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. “Intellicode Compose: Code Generation Using Transformer.” Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1433–43.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.
White, Martin, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. 2015. “Toward Deep Learning Software Repositories.” 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, 334–45.
Xu, Frank F, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. “A Systematic Evaluation of Large Language Models of Code.” Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 1–10.

  1. https://openai.com/
  2. https://openai.com/blog/chatgpt/
  3. https://openai.com/blog/openai-codex/
  4. https://github.com/features/copilot
  5. https://research.google.com/teams/brain/?hl=pl
  6. https://commoncrawl.org/the-data/
  7. https://pile.eleuther.ai/