
What is automated essay scoring?

Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment.  In fact, it has been around far longer than "machine learning" and "artificial intelligence" have been buzzwords among the general public; the field of psychometrics has been doing this kind of groundbreaking work for decades.

So how does AES work, and how can you apply it?

The first and most critical thing to know is that there is no algorithm that "reads" the student essays.  Instead, you need to train an algorithm.  That is, if you are a teacher and don't want to grade your essays, you can't just throw them into an essay scoring system.  You have to actually grade the essays (or at least a large sample of them) and then use that data to fit a machine learning algorithm.  Data scientists use the term "train the model," which sounds complicated, but if you have ever run a simple linear regression, you have experience with training models.

There are three steps for automated essay scoring:

  • Establish your data set (collate student essays and grade them).
  • Determine the features (predictor variables that you want to pick up on).
  • Train the machine learning model.

Here’s an extremely oversimplified example:

  • You have a set of 100 student essays, which you have scored on a scale of 0 to 5 points.
  • The essay is on Napoleon Bonaparte, and you want students to know certain facts, so you want to give them "credit" in the model if they use words like Corsica, Consul, Josephine, Emperor, Waterloo, Austerlitz, or St. Helena.  You might also add other features such as word count, number of grammar errors, and number of spelling errors.
  • You create a map of which students used each of these words, as 0/1 indicator variables.  You can then fit a multiple regression with seven predictor variables (whether they used each of the seven words) and the score on the 5-point scale as your criterion variable.  You can then use this model to predict each student's score from the essay text alone; a minimal code sketch follows this list.
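Here is a minimal sketch of that toy model in Python.  It assumes the essay texts and human scores are already in memory; the keyword list, placeholder data, and variable names are purely illustrative, not an actual scoring engine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy example: 0/1 keyword indicators as predictors of a 0-5 human score.
KEYWORDS = ["corsica", "consul", "josephine", "emperor",
            "waterloo", "austerlitz", "st. helena"]

def keyword_features(essay_text):
    """Return a 0/1 indicator for each keyword."""
    text = essay_text.lower()
    return [1 if kw in text else 0 for kw in KEYWORDS]

# Placeholder training data: essay texts and the grades you assigned.
essays = ["Napoleon was born in Corsica and died on St. Helena.",
          "Napoleon lost at Waterloo after returning from exile.",
          "He crowned himself Emperor and married Josephine."]
human_scores = [5, 3, 4]

X = np.array([keyword_features(e) for e in essays])
y = np.array(human_scores)

model = LinearRegression().fit(X, y)   # this is "training the model"
print(model.predict(X))                # predicted scores from essay text alone
```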

Obviously, this example is too simple to be of real use, but the same general idea underlies massive, complex systems.  The establishment of the core features (predictor variables) can be much more involved, and the models are usually much more complex than multiple regression (neural networks, random forests, support vector machines).

Here's an example of the very start of a data matrix of features, taken from an actual student essay.  Imagine that you also have data on the final scores, 0 to 5 points; you can see how this then becomes a regression situation.
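Since the original screenshot of that matrix is not reproduced here, the following is a purely hypothetical illustration of what the first few rows of such a feature matrix might look like (the column names and values are invented):

```python
import pandas as pd

# Hypothetical feature matrix: one row per essay, one column per feature,
# plus the human score that serves as the criterion variable.
features = pd.DataFrame({
    "corsica":         [1, 0, 1],
    "waterloo":        [1, 1, 0],
    "word_count":      [412, 253, 388],
    "spelling_errors": [2, 7, 1],
    "human_score":     [5, 2, 4],
})

X = features.drop(columns="human_score")   # predictors
y = features["human_score"]                # criterion
```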

How do you score the essay?

If the essays are on paper, then automated essay scoring won't work unless you have extremely good optical character recognition software that converts them into a digital database of text.  Most likely, you delivered the exam as an online assessment and already have the database.  If so, your platform should include functionality to manage the scoring process, including multiple custom rubrics.  An example from our FastTest platform is provided below.

[Screenshot: essay marking in FastTest]

Some rubrics you might use:

  • Supporting arguments
  • Organization
  • Vocabulary / word choice

How do you pick the features?

This is one of the key research problems.  In some cases, it might be something similar to the Napoleon example.  Suppose you had a complex item on accounting, where examinees review reports and spreadsheets and need to summarize a few key points.  You might pull out a few key terms (mortgage amortization) or numbers (2.375%) and treat them as features.  I saw a presentation at Innovations in Testing 2022 that did exactly this.  Think of it as giving the students "points" for using those keywords, though because you are using complex machine learning models, it is not simply awarding a single unit point; each keyword contributes to a regression-like model with a positive slope.

In other cases, you might not know.  Maybe it is an item on an English test being delivered to English language learners, and you ask them to write about what country they want to visit someday.  You have no idea what they will write about.  But what you can do is tell the algorithm to find the words or terms that are used most often, and try to predict the scores with those.  Maybe words like "jetlag" or "edification" show up in essays from students who tend to get high scores, while words like "clubbing" or "someday" tend to be used by students with lower scores.  The AI might also pick up on spelling errors.  I worked as an essay scorer in grad school, and I can't tell you how many times I saw kids write "ludacris" (the name of an American rap artist) instead of "ludicrous" when trying to describe an argument; they had literally never seen the word used or spelled correctly.  Maybe the AI model learns to give that a negative weight.  That's the next section!
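As a rough sketch of that frequency-based approach, you might let a count vectorizer surface the most common terms and then fit a regularized regression to see which terms carry positive or negative weight.  The placeholder texts, model choice, and parameter values here are illustrative assumptions, not a recommendation:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Placeholder responses and human ratings (0-5).
essays = ["Someday I want to visit Japan because of the food.",
          "I would travel to France for edification and history.",
          "Clubbing in Spain would be fun someday."]
scores = [2, 5, 1]

# Keep the most frequent terms as candidate features.
vectorizer = CountVectorizer(max_features=500, lowercase=True)
X = vectorizer.fit_transform(essays)

model = Ridge(alpha=1.0).fit(X, scores)

# Terms with the largest positive or negative weights hint at useful features.
terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(model.coef_)
print("most negative:", terms[order[:5]])
print("most positive:", terms[order[-5:]])
```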

How do you train a model?

Well, if you are familiar with data science, you know there are TONS of models, and many of them have a bunch of parameterization options.  This is where more research is required.  What model works best on your particular essays and doesn't take five days to run on your data set?  That's for you to figure out.  There is a trade-off between simplicity and accuracy: a complex model might be more accurate but take days to run, while a simpler model might take two hours with a 5% drop in accuracy.  It's up to you to evaluate.

If you have experience with Python or R, you know that there are many packages which provide this analysis out of the box; it is a matter of selecting a model that works.
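For instance, a quick way to compare a few candidate models on the same feature matrix is to time a cross-validated fit for each.  This sketch assumes you already have a numeric feature matrix and human scores; the stand-in data and the particular models are only examples:

```python
import time
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

X = np.random.rand(200, 50)            # stand-in feature matrix
y = np.random.randint(0, 6, size=200)  # stand-in human scores (0-5)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "mlp": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
}

for name, model in candidates.items():
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}, "
          f"time = {time.time() - start:.1f}s")
```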

How well does automated essay scoring work?

Well, as psychometricians love to say, "it depends."  You need to do the model-fitting research for each prompt and rubric; it will work better for some than others.  The general consensus in the research is that AES algorithms perform about as well as a second human rater, and therefore serve very well in that role.  But you shouldn't use them as the only score; in many cases, of course, that isn't even possible.
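One common way to check this for your own prompt is to compare the machine scores against the human scores on a held-out set, using correlation and the quadratically weighted kappa typically reported in the AES literature.  A minimal sketch, assuming the two score arrays are already available (the values below are placeholders):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

human_scores = np.array([3, 4, 2, 5, 1, 4])    # placeholder ratings
machine_scores = np.array([3, 4, 3, 5, 1, 3])  # placeholder predictions (rounded)

pearson_r = np.corrcoef(human_scores, machine_scores)[0, 1]
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")

print(f"correlation with human rater: {pearson_r:.2f}")
print(f"quadratic weighted kappa:     {qwk:.2f}")
```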

Here's a graph from some research we did on our algorithm, showing the correlation of human scores with AES scores.  The three lines are for the proportion of the sample used in the training set; we saw decent results from only 10% in this case!  Some of the models correlated above 0.80 with humans, even though this is a small data set.  We also found that the Cubist model took a fraction of the time needed by complex models like neural networks or random forests; in this case it might be sufficiently powerful.

[Graph: automated essay scoring results]

How can I implement automated essay scoring without writing code from scratch?

There are several products on the market.  Some are standalone; some are integrated with a human-based essay scoring platform.  ASC's platform for automated essay scoring is SmartMarq; click here to learn more.  It currently uses a standalone approach like you see below, making it extremely easy to use.  It is also in the process of being integrated into our online assessment platform, alongside human scoring, to provide an efficient and easy way of obtaining a second or third rater for QA purposes.

Want to learn more?  Contact us to request a demonstration.

[Screenshot: SmartMarq automated essay scoring]


Original Research Article

Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value


School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Edmonton, AB, Canada

Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies where models were trained to predict both the holistic and rubric scores of essays, using the Automated Student Assessment Prize’s essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. It has been found that the effect of deep learning can be best viewed when assessing the trustworthiness of explanation models. As more hidden layers were added to the neural network, the descriptive accuracy increased by about 10%. This study shows that faster (up to three orders of magnitude) SHAP implementations are as accurate as the slower model-agnostic one. It leverages the state-of-the-art in natural language processing, applying feature selection on a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. In addition to the list of most globally important features, this study reports (a) a list of features that are important for a specific essay (locally), (b) a range of values for each feature that contribute to higher or lower rubric scores, and (c) a model that allows to quantify the impact of the implementation of formative feedback.

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA) for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; only a few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores ( Kumar et al., 2017 ; Taghipour, 2017 ; Kumar and Boulanger, 2020 ). None has attempted to explain the whole decision process of AES, from holistic scores to rubric scores and from rubric scores to writing feature modeling. Although several algorithms from XAI (explainable artificial intelligence) ( Adadi and Berrada, 2018 ; Murdoch et al., 2019 ) have recently been published (e.g., LIME, SHAP) ( Ribeiro et al., 2016 ; Lundberg and Lee, 2017 ), no research has yet investigated the role that these explanation models (trained on top of predictive models) can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing teachers and students with personalized, formative, and fine-grained feedback during the writing process.

One of the key anticipated benefits of AES is the elimination of human bias such as rater fatigue, rater’s expertise, severity/leniency, scale shrinkage, stereotyping, Halo effect, rater drift, perception difference, and inconsistency ( Taghipour, 2017 ). At its turn, AES may suffer from its own set of biases (e.g., imperfections in training data, spurious correlations, overrepresented minority groups), which has incited the research community to look for ways to make AES more transparent, accountable, fair, unbiased, and consequently trustworthy while remaining accurate. This required changing the perception that AES is merely a machine learning and feature engineering task ( Madnani et al., 2017 ; Madnani and Cahill, 2018 ). Hence, researchers have advocated that AES should be seen as a shared task requiring several methodological design decisions along the way such as curriculum alignment, construction of training corpora, reliable scoring process, and rater performance evaluation, where the goal is to build and deploy fair and unbiased scoring models to be used in large-scale assessments and classroom settings ( Rupp, 2018 ; West-Smith et al., 2018 ; Rupp et al., 2019 ). Unfortunately, although these measures are intended to design reliable and valid AES systems, they may still fail to build trust among users, keeping the AES black box impenetrable for teachers and students.

It has previously been recognized that divergence of opinion among human and machine graders has been investigated only superficially ( Reinertsen, 2018 ). So far, researchers have used qualitative analyses to investigate the characteristics of essays that ended up rejected by AES systems (requiring a human to score them) ( Reinertsen, 2018 ). Others have strived to justify predicted scores by identifying the essay segments that actually caused those scores. Although these justifications hinted at and quantified the importance of these spatial cues, they did not provide any feedback as to how to improve those suboptimal essay segments ( Mizumoto et al., 2019 ).

Related to this study and the work of Kumar and Boulanger (2020) is Revision Assistant, a commercial AES system developed by Turnitin ( Woods et al., 2017 ; West-Smith et al., 2018 ), which in addition to predicting essays’ holistic scores provides formative, rubric-specific, and sentence-level feedback over multiple drafts of a student’s essay. The implementation of Revision Assistant moved away from the traditional approach to AES, which consists in using a limited set of features engineered by human experts representing only high-level characteristics of essays. Like this study, it rather opted for including a large number of low-level writing features, demonstrating that expert-designed features are not required to produce interpretable predictions. Revision Assistant’s performance was reported on two essay datasets, one of which was the Automated Student Assessment Prize (ASAP) 1 dataset. However, performance on the ASAP dataset was reported in terms of quadratic weighted kappa and this for holistic scores only. Models predicting rubric scores were trained only with the other dataset which was hosted on and collected through Revision Assistant itself.

In contrast to feature-based approaches like the one adopted by Revision Assistant, other AES systems are implemented using deep neural networks where features are learned during model training. For example, Taghipour (2017) in his doctoral dissertation leverages a recurrent neural network to improve accuracy in predicting holistic scores, implement rubric scoring (i.e., organization and argument strength), and distinguish between human-written and computer-generated essays. Interestingly, Taghipour compared the performance of his AES system against other AES systems using the ASAP corpora, but he did not use the ASAP corpora when it came to train rubric scoring models although ASAP provides two corpora provisioning rubric scores (#7 and #8). Finally, research was also undertaken to assess the generalizability of rubric-based models by performing experiments across various datasets. It was found that the predictive power of such rubric-based models was related to how much the underlying feature set covered a rubric’s criteria ( Rahimi et al., 2017 ).

Despite their numbers, rubrics (e.g., organization, prompt adherence, argument strength, essay length, conventions, word choices, readability, coherence, sentence fluency, style, audience, ideas) are usually investigated in isolation and not as a whole, with the exception of Revision Assistant which provides feedback at the same time on the following five rubrics: claim, development, audience, cohesion, and conventions. The literature reveals that rubric-specific automated feedback includes numerical rubric scores as well as recommendations on how to improve essay quality and correct errors ( Taghipour, 2017 ). Again, except for Revision Assistant which undertook a holistic approach to AES including holistic and rubric scoring and provision of rubric-specific feedback at the sentence level, AES has generally not been investigated as a whole or as an end-to-end product. Hence, the AES used in this study and developed by Kumar and Boulanger (2020) is unique in that it uses both deep learning (multi-layer perceptron neural network) and a huge pool of linguistic indices (1592), predicts both holistic and rubric scores, explaining holistic scores in terms of rubric scores, and reports which linguistic indices are the most important by rubric. This study, however, goes one step further and showcases how to explain the decision process behind the prediction of a rubric score for a specific essay, one of the main AES limitations identified in the literature ( Taghipour, 2017 ) that this research intends to address, at least partially.

Besides providing explanations of predictions both globally and individually, this study not only goes one step further toward the automated provision of formative feedback but also does so in alignment with the explanation model and the predictive model, allowing to better map feedback to the actual characteristics of an essay. Woods et al. (2017) succeeded in associating sentence-level expert-derived feedback with strong/weak sentences having the greatest influence on a rubric score based on the rubric, essay score, and the sentence characteristics. While Revision Assistant’s feature space consists of counts and binary occurrence indicators of word unigrams, bigrams and trigrams, character four-grams, and part-of-speech bigrams and trigrams, they are mainly textual and locational indices; by nature they are not descriptive or self-explanative. This research fills this gap by proposing feedback based on a set of linguistic indices that can encompass several sentences at a time. However, the proposed approach omits locational hints, leaving the merging of the two approaches as the next step to be addressed by the research community.

Although this paper proposes to extend the automated provision of formative feedback through an interpretable machine learning method, it rather focuses on the feasibility of automating it in the context of AES instead of evaluating the pedagogical quality (such as the informational and communicational value of feedback messages) or impact on students’ writing performance, a topic that will be kept for an upcoming study. Having an AES system that is capable of delivering real-time formative feedback sets the stage to investigate (1) when feedback is effective, (2) the types of feedback that are effective, and (3) whether there exist different kinds of behaviors in terms of seeking and using feedback ( Goldin et al., 2017 ). Finally, this paper omits describing the mapping between the AES model’s linguistic indices and a pedagogical language that is easily understandable by students and teachers, which is beyond its scope.

Methodology

This study showcases the application of the PDR framework ( Murdoch et al., 2019 ), which provides three pillars to describe interpretations in the context of the data science life cycle: Predictive accuracy, Descriptive accuracy, and Relevancy to human audience(s). It is important to note that in a broader sense both terms "explainable artificial intelligence" and "interpretable machine learning" can be used interchangeably with the following meaning ( Murdoch et al., 2019 ): "the use of machine-learning models for the extraction of relevant knowledge about domain relationships contained in data." Here "predictive accuracy" refers to the measurement of a model's ability to fit data; "descriptive accuracy" is the degree to which the relationships learned by a machine learning model can be objectively captured; and "relevant knowledge" implies that a particular audience gets insights into a chosen domain problem that guide its communication, actions, and discovery ( Murdoch et al., 2019 ).

In the context of this article, formative feedback that assesses students’ writing skills and prescribes remedial writing strategies is the relevant knowledge sought for, whose effectiveness on students’ writing performance will be validated in an upcoming study. However, the current study puts forward the tools and evaluates the feasibility to offer this real-time formative feedback. It also measures the predictive and descriptive accuracies of AES and explanation models, two key components to generate trustworthy interpretations ( Murdoch et al., 2019 ). Naturally, the provision of formative feedback is dependent on the speed of training and evaluating new explanation models every time a new essay is ingested by the AES system. That is why this paper investigates the potential of various SHAP implementations for speed optimization without compromising the predictive and descriptive accuracies. This article will show how the insights generated by the explanation model can serve to debug the predictive model and contribute to enhance the feature selection and/or engineering process ( Murdoch et al., 2019 ), laying the foundation for the provision of actionable and impactful pieces of knowledge to educational audiences, whose relevancy will be judged by the human stakeholders and estimated by the magnitude of resulting changes.

Figure 1 overviews all the elements and steps encompassed by the AES system in this study. The following subsections will address each facet of the overall methodology, from hyperparameter optimization to relevancy to both students and teachers.


Figure 1. A flow chart exhibiting the sequence of activities to develop an end-to-end AES system and how the various elements work together to produce relevant knowledge to the intended stakeholders.

Automated Essay Scoring System, Dataset, and Feature Selection

As previously mentioned, this paper reuses the AES system developed by Kumar and Boulanger (2020) . The AES models were trained using the ASAP’s seventh essay corpus. These narrative essays were written by Grade-7 students in the setting of state-wide assessments in the United States and had an average length of 171 words. Students were asked to write a story about patience. Kumar and Boulanger’s work consisted in training a predictive model for each of the four rubrics according to which essays were graded: ideas, organization, style, and conventions. Each essay was scored by two human raters on a 0−3 scale (integer scale). Rubric scores were resolved by adding the rubric scores assigned by the two human raters, producing a resolved rubric score between 0 and 6. This paper is a continuation of Boulanger and Kumar (2018 , 2019 , 2020) and Kumar and Boulanger (2020) where the objective is to open the AES black box to explain the holistic and rubric scores that it predicts. Essentially, the holistic score ( Boulanger and Kumar, 2018 , 2019 ) is determined and justified through its four rubrics. Rubric scores, in turn, are investigated to highlight the writing features that play an important role within each rubric ( Kumar and Boulanger, 2020 ). Finally, beyond global feature importance, it is not only indispensable to identify which writing indices are important for a particular essay (local), but also to discover how they contribute to increase or decrease the predicted rubric score, and which feature values are more/less desirable ( Boulanger and Kumar, 2020 ). This paper is a continuation of these previous works by adding the following link to the AES chain: holistic score, rubric scores, feature importance, explanations, and formative feedback. The objective is to highlight the means for transparent and trustable AES while empowering learning analytics practitioners with the tools to debug these models and equip educational stakeholders with an AI companion that will semi-autonomously generate formative feedback to teachers and students. Specifically, this paper analyzes the AES reasoning underlying its assessment of the “style” rubric, which looks for command of language, including effective and compelling word choice and varied sentence structure, that clearly supports the writer’s purpose and audience.

This research’s approach to AES leverages a feature-based multi-layer perceptron (MLP) deep neural network to predict rubric scores. The AES system is fed by 1592 linguistic indices quantitatively measured by the Suite of Automatic Linguistic Analysis Tools 2 (SALAT), which assess aspects of grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity ( Kumar and Boulanger, 2020 ). The purpose of using such a huge pool of low-level writing features is to let deep learning extract the most important ones; the literature supports this practice since there is evidence that features automatically selected are not less interpretable than those engineered ( Woods et al., 2017 ). However, to facilitate this process, this study opted for a semi-automatic strategy that consisted of both filter and embedded methods. Firstly, the original ASAP’s seventh essay dataset consists of a training set of 1567 essays and a validation and testing sets of 894 essays combined. While the texts of all 2461 essays are still available to the public, only the labels (the rubric scores of two human raters) of the training set have been shared with the public. Yet, this paper reused the unlabeled 894 essays of the validation and testing sets for feature selection, a process that must be carefully carried out by avoiding being informed by essays that will train the predictive model. Secondly, feature data were normalized, and features with variances lower than 0.01 were pruned. Thirdly, the last feature of any pair of features having an absolute Pearson correlation coefficient greater than 0.7 was also pruned (the one that comes last in terms of the column ordering in the datasets). After the application of these filter methods, the number of features was reduced from 1592 to 282. Finally, the Lasso and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied during the training of the rubric scoring models. Lasso is responsible for pruning further features, while Ridge regression is entrusted with eliminating multicollinearity among features.
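A minimal sketch of the two filter steps described above (variance pruning and correlation pruning), assuming the 1592 SALAT indices are already loaded into a pandas DataFrame; the thresholds follow the paper, everything else is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def filter_features(df, var_threshold=0.01, corr_threshold=0.7):
    """Normalize, drop low-variance features, then drop the later feature
    (in column order) of any pair with |Pearson r| above the threshold."""
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

    # 1. Prune features whose variance is below the threshold.
    kept = scaled.loc[:, scaled.var() > var_threshold]

    # 2. Prune the feature that comes last in any highly correlated pair.
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return kept.drop(columns=to_drop)

# features: DataFrame of 1592 linguistic indices (one row per essay)
# reduced = filter_features(features)  # ~282 columns remained in the paper
```

The Lasso and Ridge (ElasticNet) regularization described above is then applied inside the model training itself rather than as a separate preprocessing step.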

Hyperparameter Optimization and Training

To ensure a fair evaluation of the potential of deep learning, it is of utmost importance to minimally describe this study’s exploration of the hyperparameter space, a step that is often found to be missing when reporting the outcomes of AES models’ performance ( Kumar and Boulanger, 2020 ). First, a study should list the hyperparameters it is going to investigate by testing for various values of each hyperparameter. For example, Table 1 lists all hyperparameters explored in this study. Note that L 1 and L 2 are two regularization hyperparameters contributing to feature selection. Second, each study should also report the range of values of each hyperparameter. Finally, the strategy to explore the selected hyperparameter subspace should be clearly defined. For instance, given the availability of high-performance computing resources and the time/cost of training AES models, one might favor performing a grid (a systematic testing of all combinations of hyperparameters and hyperparameter values within a subspace) or a random search (randomly selecting a hyperparameter value from a range of values per hyperparameter) or both by first applying random search to identify a good starting candidate and then grid search to test all possible combinations in the vicinity of the starting candidate’s subspace. Of particular interest to this study is the neural network itself, that is, how many hidden layers should a neural network have and how many neurons should compose each hidden layer and the neural network as a whole. These two variables are directly related to the size of the neural network, with the number of hidden layers being a defining trait of deep learning. A vast swath of literature is silent about the application of interpretable machine learning in AES and even more about measuring its descriptive accuracy, the two components of trustworthiness. Hence, this study pioneers the comprehensive assessment of deep learning impact on AES’s predictive and descriptive accuracies.


Table 1. Hyperparameter subspace investigated in this article along with best hyperparameter values per neural network architecture.

Consequently, the 1567 labeled essays were divided into a training set (80%) and a testing set (20%). No validation set was put aside; 5-fold cross-validation was rather used for hyperparameter optimization. Table 1 delineates the hyperparameter subspace from which 800 different combinations of hyperparameter values were randomly selected out of a subspace of 86,248,800 possible combinations. Since this research proposes to investigate the potential of deep learning to predict rubric scores, several architectures consisting of 2 to 6 hidden layers and ranging from 9,156 to 119,312 parameters were tested. Table 1 shows the best hyperparameter values per depth of neural networks.
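A sketch of that random-search setup, using scikit-learn's ParameterSampler purely for illustration; the subspace below is a toy stand-in for the one in Table 1, and the data are placeholders:

```python
import numpy as np
from sklearn.model_selection import ParameterSampler, cross_val_score
from sklearn.neural_network import MLPRegressor

# Toy subspace; the real one covers roughly 86 million combinations.
param_space = {
    "hidden_layer_sizes": [(64, 64), (128, 64, 32), (128, 128, 64, 64)],
    "alpha": [1e-4, 1e-3, 1e-2],          # L2 penalty (sklearn's MLP has no L1)
    "learning_rate_init": [1e-3, 1e-2],
    "batch_size": [32, 64],
}

X = np.random.rand(300, 282)              # stand-in feature matrix
y = np.random.uniform(0, 6, size=300)     # stand-in resolved rubric scores

best = None
for params in ParameterSampler(param_space, n_iter=20, random_state=0):
    model = MLPRegressor(max_iter=500, random_state=0, **params)
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    if best is None or mse < best[0]:
        best = (mse, params)

print("lowest cross-validated MSE:", best[0], "with", best[1])
```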

Again, the essays of the testing set were never used during the training and cross-validation processes. In order to retrieve the best predictive models during training, every time the validation loss reached a record low, the model was overwritten. Training stopped when no new record low was reached during 100 epochs. Moreover, to avoid reporting the performance of overfit models, each model was trained five times using the same set of best hyperparameter values. Finally, for each resulting predictive model, a corresponding ensemble model (bagging) was also obtained out of the five models trained during cross-validation.
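In Keras-style code, the checkpointing and patience scheme described above might look like the following sketch.  Only the patience of 100 epochs and the save-on-record-low behavior come from the paper; the architecture, file name, and data are placeholders:

```python
import tensorflow as tf

def build_mlp(n_features, hidden=(128, 64)):
    """Small MLP regressor; the actual architectures are listed in Table 1."""
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(n_features,))])
    for units in hidden:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(1))
    model.compile(optimizer="adam", loss="mse")
    return model

callbacks = [
    # Overwrite the saved model whenever validation loss hits a record low.
    tf.keras.callbacks.ModelCheckpoint("best_model.keras",
                                       monitor="val_loss",
                                       save_best_only=True),
    # Stop when no new record low has been reached for 100 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=100,
                                     restore_best_weights=True),
]

# model = build_mlp(n_features=282)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=10_000, callbacks=callbacks)
# The ensemble (bagging) prediction then averages the five models
# trained during cross-validation.
```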

Predictive Models and Predictive Accuracy

Table 2 delineates the performance of predictive models trained previously by Kumar and Boulanger (2020) on the four scoring rubrics. The first row lists the agreement levels between the resolved and predicted rubric scores measured by the quadratic weighted kappa. The second row is the percentage of accurate predictions; the third row reports the percentages of predictions that are either accurate or off by 1; and the fourth row reports the percentages of predictions that are either accurate or at most off by 2. Prediction of holistic scores is done merely by adding up all rubric scores. Since the scale of rubric scores is 0−6 for every rubric, then the scale of holistic scores is 0−24.


Table 2. Rubric scoring models’ performance on testing set.
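A sketch of how the four rows of Table 2 could be computed for a single rubric, assuming arrays of resolved (0-6) and predicted rubric scores are available; the arrays below are placeholders:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

resolved = np.array([4, 3, 5, 2, 4, 6])    # placeholder resolved scores (0-6)
predicted = np.array([4, 4, 5, 1, 3, 6])   # placeholder model predictions

qwk = cohen_kappa_score(resolved, predicted, weights="quadratic")
diff = np.abs(resolved - predicted)
exact = np.mean(diff == 0)       # accurate predictions
within_1 = np.mean(diff <= 1)    # accurate or off by 1
within_2 = np.mean(diff <= 2)    # accurate or at most off by 2

print(f"QWK={qwk:.2f}  exact={exact:.0%}  off<=1={within_1:.0%}  off<=2={within_2:.0%}")

# Holistic score = sum of the four rubric scores (scale 0-24).
```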

While each of these rubric scoring models might suffer from its own systemic bias and hence cancel off each other’s bias by adding up the rubric scores to derive the holistic score, this study (unlike related works) intends to highlight these biases by exposing the decision making process underlying the prediction of rubric scores. Although this paper exclusively focuses on the Style rubric, the methodology put forward to analyze the local and global importance of writing indices and their context-specific contributions to predicted rubric scores is applicable to every rubric and allows to control for these biases one rubric at a time. Comparing and contrasting the role that a specific writing index plays within each rubric context deserves its own investigation, which has been partly addressed in the study led by Kumar and Boulanger (2020) . Moreover, this paper underscores the necessity to measure the predictive accuracy of rubric-based holistic scoring using additional metrics to account for these rubric-specific biases. For example, there exist several combinations of rubric scores to obtain a holistic score of 16 (e.g., 4-4-4-4 vs. 4-3-4-5 vs. 3-5-2-6). Even though the predicted holistic score might be accurate, the rubric scores could all be inaccurate. Similarity or distance metrics (e.g., Manhattan and Euclidean) should then be used to describe the authenticity of the composition of these holistic scores.
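For example, taking two of the rubric-score compositions of a holistic score of 16 mentioned above, a worked distance calculation (not from the paper itself) would be:

$$
\begin{aligned}
r_1 &= (4,4,4,4), \qquad r_2 = (4,3,4,5),\\
d_{\text{Manhattan}}(r_1, r_2) &= |4-4| + |4-3| + |4-4| + |4-5| = 2,\\
d_{\text{Euclidean}}(r_1, r_2) &= \sqrt{(4-4)^2 + (4-3)^2 + (4-4)^2 + (4-5)^2} = \sqrt{2} \approx 1.41.
\end{aligned}
$$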

According to what Kumar and Boulanger (2020) report on the performance of several state-of-the-art AES systems trained on ASAP’s seventh essay dataset, the AES system they developed and which will be reused in this paper proved competitive while being fully and deeply interpretable, which no other AES system does. They also supply further information about the study setting, essay datasets, rubrics, features, natural language processing (NLP) tools, model training, and evaluation against human performance. Again, this paper showcases the application of explainable artificial intelligence in automated essay scoring by focusing on the decision process of the Rubric #3 (Style) scoring model. Remember that the same methodology is applicable to each rubric.

Explanation Model: SHAP

SHapley Additive exPlanations (SHAP) is a theoretically justified XAI framework that can provide both local and global explanations simultaneously ( Molnar, 2020 ); that is, SHAP is able to explain individual predictions while taking into account the uniqueness of each prediction, and to highlight the global factors influencing the overall performance of a predictive model. SHAP is of keen interest because it unifies all algorithms of the class of additive feature attribution methods, adhering to a set of three properties that are desirable in interpretable machine learning: local accuracy, missingness, and consistency ( Lundberg and Lee, 2017 ). A key advantage of SHAP is that feature contributions are all expressed in terms of the outcome variable (e.g., rubric scores), providing a common scale on which to compare the importance of each feature against the others. Local accuracy refers to the fact that, no matter the explanation model, the sum of all feature contributions is always equal to the prediction explained by these features. The missingness property implies that the prediction is never explained by unmeasured factors, which are always assigned a contribution of zero. However, the converse is not true; a contribution of zero does not imply an unobserved factor, as it can also denote a feature irrelevant to explaining the prediction. The consistency property guarantees that a more important feature will always have a greater magnitude than a less important one, no matter how many other features are included in the explanation model. SHAP proves superior to other additive attribution methods such as LIME (Local Interpretable Model-Agnostic Explanations), Shapley values, and DeepLIFT, which never comply with all three properties at once, while SHAP does ( Lundberg and Lee, 2017 ). Moreover, the way SHAP assesses the importance of a feature differs from permutation importance methods (e.g., ELI5), which measure the decrease in model performance (accuracy) as a feature is perturbed, in that it is based on how much a feature contributes to every prediction.

Essentially, a SHAP explanation model (linear regression) is trained on top of a predictive model, which in this case is a complex ensemble deep learning model. Table 3 demonstrates a scale explanation model showing how SHAP values (feature contributions) work. In this example, there are five instances and five features describing each instance (in the context of this paper, an instance is an essay). Predictions are listed in the second to last column, and the base value is the mean of all predictions. The base value constitutes the reference point according to which predictions are explained; in other words, reasons are given to justify the discrepancy between the individual prediction and the mean prediction (the base value). Notice that the table does not contain the actual feature values; these are SHAP values that quantify the contribution of each feature to the predicted score. For example, the prediction of Instance 1 is 2.46, while the base value is 3.76. Adding up the feature contributions of Instance 1 to the base value produces the predicted score:

$$3.76 + \sum_{j=1}^{5} \sigma_1(j) = 2.46$$

Table 3. Array of SHAP values: local and global importance of features and feature coverage per instance.

Hence, the generic equation of the explanation model ( Lundberg and Lee, 2017 ) is:

$$g(x) = \sigma_0 + \sum_{i=1}^{j} \sigma_i x_i$$

where $g(x)$ is the prediction for an individual instance $x$, $\sigma_0$ is the base value, $\sigma_i$ is the contribution of feature $x_i$, $x_i \in \{0,1\}$ denotes whether feature $x_i$ is part of the individual explanation, and $j$ is the total number of features. Furthermore, the global importance $I_j$ of feature $j$ is calculated by adding up the absolute values of its corresponding SHAP values over all instances, where $n$ is the total number of instances and $\sigma_i(j)$ is the feature contribution for instance $i$ ( Lundberg et al., 2018 ):

$$I_j = \sum_{i=1}^{n} \left|\sigma_i(j)\right|$$

Therefore, it can be seen that Feature 3 is the most globally important feature, while Feature 2 is the least important one. Similarly, Feature 5 is Instance 3’s most important feature at the local level, while Feature 2 is the least locally important. The reader should also note that a feature shall not necessarily be assigned any contribution; some of them are just not part of the explanation such as Feature 2 and Feature 3 in Instance 2. These concepts lay the foundation for the explainable AES system presented in this paper. Just imagine that each instance (essay) will be rather summarized by 282 features and that the explanations of all the testing set’s 314 essays will be provided.

Several implementations of SHAP exist: KernelSHAP, DeepSHAP, GradientSHAP, and TreeSHAP, among others. KernelSHAP is model-agnostic and works for any type of predictive model; however, it is very computing-intensive, which makes it undesirable for practical purposes. DeepSHAP and GradientSHAP are two implementations intended for deep learning which take advantage of the known properties of neural networks (i.e., MLP-NN, CNN, or RNN) to accelerate the processing time needed to explain predictions by up to three orders of magnitude ( Chen et al., 2019 ). Finally, TreeSHAP is the most powerful implementation, intended for tree-based models. TreeSHAP is not only fast; it is also accurate. While the three former implementations estimate SHAP values, TreeSHAP computes them exactly. Moreover, TreeSHAP not only measures the contribution of individual features, but also considers interactions between pairs of features and assigns them SHAP values. Since one of the goals of this paper is to assess the potential of deep learning on the performance of both predictive and explanation models, this research tested the former three implementations. TreeSHAP is recommended for future work since the interaction among features is critical information to consider. Moreover, KernelSHAP, DeepSHAP, and GradientSHAP all require access to the whole original dataset to derive the explanation of a new instance, another constraint to which TreeSHAP is not subject.
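As a rough sketch of how these implementations are typically invoked with the shap Python library (the tiny stand-in model and data below are only there to make the snippet self-contained; depending on your TensorFlow and shap versions, DeepExplainer may need adjustments, and the study itself reports such an issue for its 6-layer model):

```python
import numpy as np
import shap
import tensorflow as tf

# Tiny stand-in MLP and data (282 features, as in the study's reduced set).
X_train = np.random.rand(200, 282).astype("float32")
X_test = np.random.rand(10, 282).astype("float32")
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(282,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, np.random.rand(200), epochs=1, verbose=0)

# KernelSHAP: model-agnostic but slow; a k-means summary of the background
# data (e.g., 50 centroids) speeds it up considerably.
background = shap.kmeans(X_train, 50)
kernel_explainer = shap.KernelExplainer(model.predict, background)
kernel_values = kernel_explainer.shap_values(X_test, nsamples=100)

# DeepSHAP and GradientSHAP: specialized for neural networks, much faster.
deep_explainer = shap.DeepExplainer(model, X_train[:100])
grad_explainer = shap.GradientExplainer(model, X_train[:100])
deep_values = deep_explainer.shap_values(X_test)
grad_values = grad_explainer.shap_values(X_test)
```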

Descriptive Accuracy: Trustworthiness of Explanation Models

This paper reuses and adapts the methodology introduced by Ribeiro et al. (2016) . Several explanation models are trained, using different SHAP implementations and configurations, per deep learning predictive model (for each number of hidden layers). The rationale consists of randomly selecting and ignoring 25% of the 282 features feeding the predictive model (e.g., turning them to zero). If this causes the prediction to change beyond a specific threshold (in this study 0.10 and 0.25 were tested), then the explanation model should also reflect the magnitude of this change while ignoring the contributions of these same features. For example, the original predicted rubric score of an essay might be 5; however, when ignoring the information brought in by a subset of 70 randomly selected features (25% of 282), the prediction may turn to 4. If the explanation model also predicts a 4 while ignoring the contributions of the same subset of features, then the explanation is considered trustworthy. This makes it possible to compute the precision, recall, and F1-score of each explanation model (from the numbers of true and false positives and true and false negatives). The process is repeated 500 times for every essay to determine the average precision and recall of every explanation model.
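A simplified sketch of one plausible reading of that procedure for a single essay is given below; the 25% ablation rate, the thresholds, and the 500 repetitions follow the paper, while the predict callable and the way the explanation's "change" is computed are illustrative assumptions:

```python
import numpy as np

def trust_labels(predict, shap_values, x, ablate=0.25,
                 threshold=0.10, n_trials=500, seed=0):
    """Label each trial TP/FP/TN/FN by checking whether the prediction and
    its explanation change together when a random 25% of features is zeroed."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    base_pred = predict(x)
    labels = []
    for _ in range(n_trials):
        ignored = rng.choice(n_features, size=int(ablate * n_features),
                             replace=False)
        x_ablated = x.copy()
        x_ablated[ignored] = 0.0                     # "turn off" these features
        pred_changed = abs(predict(x_ablated) - base_pred) > threshold
        expl_changed = abs(shap_values[ignored].sum()) > threshold
        if pred_changed and expl_changed:
            labels.append("TP")
        elif pred_changed:
            labels.append("FN")
        elif expl_changed:
            labels.append("FP")
        else:
            labels.append("TN")
    return labels

# precision = TP / (TP + FP); recall = TP / (TP + FN), averaged over essays.
```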

Judging Relevancy

So far, the consistency of explanations with predictions has been considered. However, consistent explanations do not imply relevant or meaningful explanations. Put another way, explanations only reflect what predictive models have learned during training. How can the black box of these explanations be opened? Looking directly at the numerical SHAP values of each explanation might seem a daunting task, but there exist tools, mainly visualizations (decision plot, summary plot, and dependence plot), that allow to make sense out of these explanations. However, before visualizing these explanations, another question needs to be addressed: which explanations or essays should be picked for further scrutiny of the AES system? Given the huge number of essays to examine and the tedious task to understand the underpinnings of a single explanation, a small subset of essays should be carefully picked that should represent concisely the state of correctness of the underlying predictive model. Again, this study applies and adapts the methodology in Ribeiro et al. (2016) . A greedy algorithm selects essays whose predictions are explained by as many features of global importance as possible to optimize feature coverage. Ribeiro et al. demonstrated in unrelated studies (i.e., sentiment analysis) that the correctness of a predictive model can be assessed with as few as four or five well-picked explanations.

For example, Table 3 reveals the global importance of five features. The square root of each feature's global importance is also computed and used instead, to limit the influence of a small group of very influential features. The feature coverage of Instance 1 is 100% because all features are engaged in the explanation of its prediction. On the other hand, Instance 2 has a feature coverage of 61.5% because only Features 1, 4, and 5 are part of the prediction's explanation. The feature coverage of an explanation $E$ is calculated by summing the square roots of the global importances of the explanation's features and dividing by the sum of the square roots of all features' global importances:

$$\text{coverage}(E) = \frac{\sum_{j \in E} \sqrt{I_j}}{\sum_{j=1}^{m} \sqrt{I_j}}$$

where $m$ is the total number of features.

Additionally, it can be seen that Instance 4 does not have any zero-valued feature contribution, although its feature coverage is only 84.6%. The algorithm was constrained to discard from the explanation any feature whose contribution (local importance) was too close to zero. In the case of Table 3 's example, any feature whose absolute SHAP value is less than 0.10 is ignored, hence leading to the feature coverage of 84.6% noted above.

In this paper’s study, the real threshold was 0.01. This constraint was actually a requirement for the DeepSHAP and GradientSHAP implementations because they only output non-zero SHAP values contrary to KernelSHAP which generates explanations with a fixed number of features: a non-zero SHAP value indicates that the feature is part of the explanation, while a zero value excludes the feature from the explanation. Without this parameter, all 282 features would be part of the explanation although a huge number only has a trivial (very close to zero) SHAP value. Now, a much smaller but variable subset of features makes up each explanation. This is one way in which Ribeiro et al.’s SP-LIME algorithm (SP stands for Submodular Pick) has been adapted to this study’s needs. In conclusion, notice how Instance 4 would be selected in preference to Instance 5 to explain Table 3 ’s underlying predictive model. Even though both instances have four features explaining their prediction, Instance 4’s features are more globally important than Instance 5’s features, and therefore Instance 4 has greater feature coverage than Instance 5.

Whereas Table 3 ’s example exhibits the feature coverage of one instance at a time, this study computes it for a subset of instances, where the absolute SHAP values are aggregated (summed) per candidate subset. When the sum of absolute SHAP values per feature exceeds the set threshold, the feature is then considered as covered by the selected set of instances. The objective in this study was to optimize the feature coverage while minimizing the number of essays to validate the AES model.
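A minimal sketch of that adapted greedy (submodular-pick-style) selection is shown below, assuming a matrix of absolute SHAP values with one row per essay and one column per feature; the contribution threshold and the square-root weighting follow the description above, while everything else is illustrative:

```python
import numpy as np

def pick_essays(abs_shap, n_pick=4, min_contrib=0.01):
    """Greedily select essays that maximize square-root-weighted coverage
    of the globally important features."""
    weights = np.sqrt(abs_shap.sum(axis=0))            # sqrt of global importance
    total = weights.sum()
    selected = []
    covered = np.zeros(abs_shap.shape[1], dtype=bool)
    for _ in range(n_pick):
        best_gain, best_i = -1.0, None
        for i in range(abs_shap.shape[0]):
            if i in selected:
                continue
            # A feature counts as covered once its aggregated |SHAP| value
            # over the candidate subset of essays exceeds the threshold.
            agg = abs_shap[selected + [i]].sum(axis=0) > min_contrib
            gain = weights[agg | covered].sum() / total
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
        covered |= abs_shap[selected].sum(axis=0) > min_contrib
    return selected, weights[covered].sum() / total    # picked essays, coverage

# Example: essays, coverage = pick_essays(np.abs(shap_values))  # (314, 282) matrix
```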

Research Questions

One of this article’s objectives is to assess the potential of deep learning in automated essay scoring. The literature has often claimed ( Hussein et al., 2019 ) that there are two approaches to AES, feature-based and deep learning, as though these two approaches were mutually exclusive. Yet, the literature also puts forward that feature-based AES models may be more interpretable than deep learning ones ( Amorim et al., 2018 ). This paper embraces the viewpoint that these two approaches can also be complementary by leveraging the state-of-the-art in NLP and automatic linguistic analysis and harnessing one of the richest pools of linguistic indices put forward in the research community ( Crossley et al., 2016 , 2017 , 2019 ; Kyle, 2016 ; Kyle et al., 2018 ) and applying a thorough feature selection process powered by deep learning. Moreover, the ability of deep learning of modeling complex non-linear relationships makes it particularly well-suited for AES given that the importance of a writing feature is highly dependent on its context, that is, its interactions with other writing features. Besides, this study leverages the SHAP interpretation method that is well-suited to interpret very complex models. Hence, this study elected to work with deep learning models and ensembles to test SHAP’s ability to explain these complex models. Previously, the literature has revealed the difficulty to have at the same time both accurate and interpretable models ( Ribeiro et al., 2016 ; Murdoch et al., 2019 ), where favoring one comes at the expense of the other. However, this research shows how XAI makes it now possible to produce both accurate and interpretable models in the area of AES. Since ensembles have been repeatedly shown to boost the accuracy of predictive models, they were included as part of the tested deep learning architectures to maximize generalizability and accuracy, while making these predictive models interpretable and exploring whether deep learning can even enhance their descriptive accuracy further.

This study investigates the trustworthiness of explanation models, and more specifically those explaining deep learning predictive models. For instance, does the depth of an MLP neural network, defined as its number of hidden layers, increase the trustworthiness of its SHAP explanation model? The answer to this question will help determine whether it is possible to have very accurate AES models while also having competitively interpretable/explainable models, the cornerstone for the generation of formative feedback. Remember that formative feedback is defined as "any kind of information provided to students about their actual state of learning or performance in order to modify the learner's thinking or behavior in the direction of the learning standards" and that formative feedback "conveys where the student is, what are the goals to reach, and how to reach the goals" ( Goldin et al., 2017 ). This notion contrasts with summative feedback, which basically is "a justification of the assessment results" ( Hao and Tsikerdekis, 2019 ).

As pointed out in the previous section, multiple SHAP implementations are evaluated in this study. Hence, this paper showcases whether the faster DeepSHAP and GradientSHAP implementations are as reliable as the slower KernelSHAP implementation . The answer to this research question will shed light on the feasibility of providing immediate formative feedback and this multiple times throughout students’ writing processes.

This study also looks at whether a summary of the data produces as trustworthy explanations as those from the original data . This question will be of interest to AES researchers and practitioners because it could allow to significantly decrease the processing time of the computing-intensive and model-agnostic KernelSHAP implementation and test further the potential of customizable explanations.

KernelSHAP allows to specify the total number of features that will shape the explanation of a prediction; for instance, this study experiments with explanations of 16 and 32 features and observes whether there exists a statistically significant difference in the reliability of these explanation models . Knowing this will hint at whether simpler or more complex explanations are more desirable when it comes to optimize their trustworthiness. If there is no statistically significant difference, then AES practitioners are given further flexibility in the selection of SHAP implementations to find the sweet spot between complexity of explanations and speed of processing. For instance, the KernelSHAP implementation allows to customize the number of factors making up an explanation, while the faster DeepSHAP and GradientSHAP do not.

Finally, this paper highlights the means to debug and compare the performance of predictive models through their explanations. Once a model is debugged, the process can be reused to fine-tune feature selection and/or feature engineering to improve predictive models and for the generation of formative feedback to both students and teachers.

The training, validation, and testing sets consist of 1567 essays, each of which has been scored by two human raters, who assigned a score between 0 and 3 per rubric (ideas, organization, style, and conventions). In particular, this article looks at predictive and descriptive accuracy of AES models on the third rubric, style. Note that although each essay has been scored by two human raters, the literature ( Shermis, 2014 ) is not explicit about whether only two or more human raters participated in the scoring of all 1567 essays; given the huge number of essays, it is likely that more than two human raters were involved in the scoring of these essays so that the amount of noise introduced by the various raters’ biases is unknown while probably being at some degree balanced among the two groups of raters. Figure 2 shows the confusion matrices of human raters on Style Rubric. The diagonal elements (dark gray) correspond to exact matches, whereas the light gray squares indicate adjacent matches. Figure 2A delineates the number of essays per pair of ratings, and Figure 2B shows the percentages per pair of ratings. The agreement level between each pair of human raters, measured by the quadratic weighted kappa, is 0.54; the percentage of exact matches is 65.3%; the percentage of adjacent matches is 34.4%; and 0.3% of essays are neither exact nor adjacent matches. Figures 2A,B specify the distributions of 0−3 ratings per group of human raters. Figure 2C exhibits the distribution of resolved scores (a resolved score is the sum of the two human ratings). The mean is 3.99 (with a standard deviation of 1.10), and the median and mode are 4. It is important to note that the levels of predictive accuracy reported in this article are measured on the scale of resolved scores (0−6) and that larger scales tend to slightly inflate quadratic weighted kappa values, which must be taken into account when comparing against the level of agreement between human raters. Comparison of percentages of exact and adjacent matches must also be made with this scoring scale discrepancy in mind.


Figure 2. Summary of the essay dataset (1567 Grade-7 narrative essays) investigated in this study. (A) Number of essays per pair of human ratings; the diagonal (dark gray squares) lists the numbers of exact matches while the light-gray squares list the numbers of adjacent matches; and the bottom row and the rightmost column highlight the distributions of ratings for both groups of human raters. (B) Percentages of essays per pair of human ratings; the diagonal (dark gray squares) lists the percentages of exact matches while the light-gray squares list the percentages of adjacent matches; and the bottom row and the rightmost column highlight the distributions (frequencies) of ratings for both groups of human raters. (C) The distribution of resolved rubric scores; a resolved score is the addition of its two constituent human ratings.

Predictive Accuracy and Descriptive Accuracy

Table 4 compiles the performance outcomes of the 10 predictive models evaluated in this study. The reader should remember that the performance of each model was averaged over five iterations and that two models were trained per number of hidden layers, one non-ensemble and one ensemble. Except for the 6-layer models, there is no clear winner among other models. Even for the 6-layer models, they are superior in terms of exact matches, the primary goal for a reliable AES system, but not according to adjacent matches. Nevertheless, on average ensemble models slightly outperform non-ensemble models. Hence, these ensemble models will be retained for the next analysis step. Moreover, given that five ensemble models were trained per neural network depth, the most accurate model among the five is selected and displayed in Table 4 .


Table 4. Performance of majority classifier and average/maximal performance of trained predictive models.

Next, for each selected ensemble predictive model, several explanation models are trained. Every predictive model is explained by the "Deep," "Grad," and "Random" explainers, except for the 6-layer model, where it was not possible to train a "Deep" explainer, apparently due to a bug in the original SHAP code triggered by either a unique condition in this study's data or the neural network architecture. Fixing and investigating this issue was beyond the scope of this study. As will be demonstrated, no statistically significant difference exists between the accuracy of these explainers.

The “Random” explainer serves as a baseline model for comparison purpose. Remember that to evaluate the reliability of explanation models, the concurrent impact of randomly selecting and ignoring a subset of features on the prediction and explanation of rubric scores is analyzed. If the prediction changes significantly and its corresponding explanation changes (beyond a set threshold) accordingly (a true positive) or if the prediction remains within the threshold as does the explanation (a true negative), then the explanation is deemed as trustworthy. Hence, in the case of the Random explainer, it simulates random explanations by randomly selecting 32 non-zero features from the original set of 282 features. These random explanations consist only of non-zero features because, according to SHAP’s missingness property, a feature with a zero or a missing value never gets assigned any contribution to the prediction. If at least one of these 32 features is also an element of the subset of the ignored features, then the explanation is considered as untrustworthy, no matter the size of a feature’s contribution.

As for the layer-2 model, six different explanation models are evaluated. Recall that layer-2 models generated the least mean squared error (MSE) during hyperparameter optimization (see Table 1 ). Hence, this specific type of architecture was selected to test the reliability of these various explainers. The “Kernel” explainer is the most computing-intensive and took approximately 8 h of processing. It was trained using the full distributions of feature values in the training set and shaped explanations in terms of 32 features; the “Kernel-16” and “Kernel-32” models were trained on a summary (50 k -means centroids) of the training set to accelerate the processing by about one order of magnitude (less than 1 h). Besides, the “Kernel-16” explainer derived explanations in terms of 16 features, while the “Kernel-32” explainer explained predictions through 32 features. Table 5 exhibits the descriptive accuracy of these various explanation models according to a 0.10 and 0.25 threshold; in other words, by ignoring a subset of randomly picked features, it assesses whether or not the prediction and explanation change simultaneously. Note also how each explanation model, no matter the underlying predictive model, outperforms the “Random” model.


Table 5. Precision, recall, and F1 scores of the various explainers tested per type of predictive model.

The first research question addressed in this subsection asks whether there is a statistically significant difference between the “Kernel” explainer, which generates 32-feature explanations and is trained on the whole training set, and the “Kernel-32” explainer, which also generates 32-feature explanations but is trained on a summary of the training set. To determine this, an independent t-test was conducted on the precision, recall, and F1-score distributions (500 iterations) of both explainers. Table 6 reports the p-values of all the tests at the 0.10 and 0.25 thresholds. It reveals no statistically significant difference between the two explainers.
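
The comparison itself amounts to an independent two-sample t-test on the metric distributions. The snippet below is a minimal sketch, assuming arrays f1_kernel and f1_kernel_32 hold the 500-iteration F1 scores of the two explainers; the same test would be repeated for precision and recall and for both thresholds.

```python
from scipy import stats

# Independent t-test on the two explainers' F1-score distributions
# (500 perturbation iterations each; array names are assumptions).
t_stat, p_value = stats.ttest_ind(f1_kernel, f1_kernel_32)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value >= 0.05:
    print("No statistically significant difference between the two explainers.")
```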


Table 6. p-values of independent t-tests comparing the mean precisions, recalls, and F1-scores of the 2-layer explainers, and those of the 2-layer, 4-layer, and 6-layer Gradient explainers.

The next research question tests whether there is a difference in the trustworthiness of explainers producing 16-feature versus 32-feature explanations. Again, t-tests were conducted, and Table 6 lists the resulting p-values. There is no statistically significant difference in the average precisions, recalls, and F1-scores of the two explainers.

This leads to investigating whether the “Kernel,” “Deep,” and “Grad” explainers are equivalent. Table 6 presents the results of the t-tests conducted to verify this and reveals that no explainer performs statistically significantly better than the others.

Armed with this evidence, it is now possible to verify whether deeper MLP neural networks produce more trustworthy explanation models. For this purpose, the performance of the “Grad” explainer is compared across the types of predictive models, using the same methodology as before. Table 6 confirms that the explanation model of the 2-layer predictive model is statistically significantly less trustworthy than that of the 4-layer model, and likewise for the 4-layer versus the 6-layer model. The only exceptions are the differences in average precision between the 2-layer and 4-layer models and between the 4-layer and 6-layer models, which are not statistically significant; however, there is a clear statistically significant difference in precision (as well as recall and F1-score) between the 2-layer and 6-layer models.

The Best Subset of Essays to Judge AES Relevancy

Table 7 lists the four essays that best optimize feature coverage (93.9%), along with their resolved and predicted scores. Notice how the adapted SP-LIME algorithm picked two essays with strong disagreement between the human and machine graders, two with short and trivial text, and two exhibiting perfect agreement between the human and machine graders. Interestingly, each pair of longer and shorter essays exposes both strong agreement and strong disagreement between the human and AI agents, offering an opportunity to debug the model and to evaluate its ability to detect the presence or absence of more basic aspects of narrative essay writing (e.g., a very small number of words, occurrences of sentence fragments) and more advanced ones (e.g., cohesion between adjacent sentences, variety of sentence structures), and to appropriately reward or penalize them.
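
For readers unfamiliar with SP-LIME, the sketch below illustrates the general idea of a greedy submodular pick that selects a small budget of essays maximizing coverage of globally important features (Ribeiro et al., 2016). It is a simplified sketch of the idea, not the authors' exact adaptation, and the importance weighting is an assumption.

```python
import numpy as np

def pick_representative_essays(shap_values, budget=4):
    """Greedy submodular pick in the spirit of SP-LIME (a sketch).

    shap_values -- (n_essays, n_features) matrix of SHAP values
    budget      -- number of essays to select
    Returns indices of essays that together cover as much global
    feature importance as possible.
    """
    # Global importance of each feature across all essays.
    importance = np.sqrt(np.abs(shap_values).sum(axis=0))
    covered = np.zeros(shap_values.shape[1], dtype=bool)
    picked = []
    for _ in range(budget):
        gains = [
            importance[~covered & (np.abs(shap_values[i]) > 0)].sum()
            if i not in picked else -1.0
            for i in range(shap_values.shape[0])
        ]
        best = int(np.argmax(gains))       # essay adding the most new coverage
        picked.append(best)
        covered |= np.abs(shap_values[best]) > 0
    return picked
```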


Table 7. Set of best essays to evaluate the correctness of the 6-layer ensemble AES model.

Local Explanation: The Decision Plot

The decision plot lists writing features by order of importance from top to bottom. The line segments display the contribution (SHAP value) of each feature to the predicted rubric score. Note that an actual decision plot consists of all 282 features and that only its top portion (the 20 most important features) can be displayed (see Figure 3). A decision plot is read from bottom to top: the line starts at the base value and ends at the predicted rubric score. Given that the “Grad” explainer is the only explainer common to all predictive models, it has been selected to derive all explanations. The decision plots in Figure 3 show the explanations of the four essays in Table 7; the dashed line in these plots represents the explanation of the most accurate predictive model, that is, the 6-hidden-layer ensemble model, which also produced the most trustworthy explanation model. The predicted rubric score of each explanation model is listed in the bottom-right legend. The writing features themselves are described in a later subsection.
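
With the SHAP library, such a plot can be produced directly from an essay's SHAP values. The following is a minimal sketch in which model, X_train, shap_values_essay, x_essay, and feature_names are assumed to already exist; the base value is taken as the model's average prediction over the training set.

```python
import shap

# Decision plot for a single essay: the line starts at the base value
# (average training prediction) and ends at the predicted rubric score.
base_value = float(model.predict(X_train).mean())
shap.decision_plot(base_value,
                   shap_values_essay,          # SHAP values for one essay
                   features=x_essay,           # its 282 feature values
                   feature_names=feature_names)
```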


Figure 3. Comparisons of all models’ explanations of the most representative set of four essays: (A) Essay 228, (B) Essay 68, (C) Essay 219, and (D) Essay 124.

Global Explanation: The Summary Plot

It is advantageous to use SHAP to build explanation models because it provides a single framework to discover the writing features that are important to an individual essay (local) or a set of essays (global). While the decision plots list features of local importance, the summary plot in Figure 4 ranks writing features by order of global importance (from top to bottom). All 314 essays of the testing set are represented as dots in the scatterplot of each writing feature. The position of a dot on the horizontal axis corresponds to the importance (SHAP value) of the writing feature for a specific essay, and its color indicates the magnitude of the feature value relative to the range of all 314 feature values. For example, large or small numbers of words within an essay generally increase or decrease rubric scores by up to 1.5 and 1.0, respectively. Decision plots can also be used to find the most important features for a small subset of essays; Figure 5 shows the new ordering of writing indices when the feature contributions of the four essays in Table 7 are aggregated (by summing the absolute SHAP values). Moreover, Figure 5 allows the contributions of a feature to be compared across essays. Note how the orderings in Figures 3−5 can differ from each other, sharing many features of global importance while having their own unique features of local importance.
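
A summary plot of this kind can be generated with a single SHAP call, as sketched below; shap_values_test (shape 314 × 282), X_test, and feature_names are assumed names, and max_display keeps the 32 most important features as in Figure 4.

```python
import shap

# Global feature importance over the 314 testing essays.
shap.summary_plot(shap_values_test, X_test,
                  feature_names=feature_names, max_display=32)
```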


Figure 4. Summary plot listing the 32 most important features globally.


Figure 5. Decision plot delineating the best model’s explanations of Essays 228, 68, 219, and 124 (6-layer ensemble).

Definition of Important Writing Indices

A thorough description of all writing features is beyond the scope of this paper. Nevertheless, the summary and decision plots in Figures 4, 5 make it possible to identify a subset of features that should be examined in order to validate this study’s predictive model. Supplementary Table 1 combines and describes the 38 features appearing in Figures 4, 5.

Dependence Plots

Although the summary plot in Figure 4 is insightful for determining whether small or large feature values are desirable, the dependence plots in Figure 6 prove essential to recommend whether a student should aim at increasing or decreasing the value of a specific writing feature. The dependence plots also reveal whether the student should act directly upon the targeted writing feature or indirectly on other features. The horizontal axis in each of the dependence plots in Figure 6 is the scale of the writing feature, and the vertical axis is the scale of the writing feature’s contributions to the predicted rubric scores. Each dot in a dependence plot represents one of the testing set’s 314 essays, that is, the feature value and SHAP value belonging to the essay. The vertical dispersion of the dots over small intervals of the horizontal axis is indicative of interaction with other features ( Molnar, 2020 ). If the vertical dispersion is widespread (e.g., the [50, 100] horizontal-axis interval in the “word_count” dependence plot), then the contribution of the writing feature is most likely dependent, to some degree, on other writing feature(s).
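
A dependence plot for a single feature can likewise be produced with one SHAP call, as in the sketch below for the word_count feature; by default, SHAP colors the dots by the feature it estimates to interact most strongly with the plotted one. Variable names are assumptions.

```python
import shap

# Dependence plot: x-axis = word_count value, y-axis = its SHAP value
# for each of the 314 testing essays.
shap.dependence_plot("word_count", shap_values_test, X_test,
                     feature_names=feature_names)
```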


Figure 6. Dependence plots: the horizontal axes represent feature values while vertical axes represent feature contributions (SHAP values). Each dot represents one of the 314 essays of the testing set and is colored according to the value of the feature with which it interacts most strongly. (A) word_count. (B) hdd42_aw. (C) ncomp_stdev. (D) dobj_per_cl. (E) grammar. (F) SENTENCE_FRAGMENT. (G) Sv_GI. (H) adjacent_overlap_verb_sent.

The contributions of this paper can be summarized as follows: (1) it proposes a means (SHAP) to explain individual predictions of AES systems and provides flexible guidelines to build powerful predictive models using more complex algorithms such as ensembles and deep learning neural networks; (2) it applies a methodology to quantitatively assess the trustworthiness of explanation models; (3) it tests whether faster SHAP implementations impact the descriptive accuracy of explanation models, giving insight on the applicability of SHAP in real pedagogical contexts such as AES; (4) it offers a toolkit to debug AES models, highlights linguistic intricacies, and underscores the means to offer formative feedback to novice writers; and more importantly, (5) it empowers learning analytics practitioners to make AI pedagogical agents accountable to the human educator, the ultimate problem holder responsible for the decisions and actions of AI ( Abbass, 2019 ). Basically, learning analytics (which encompasses tools such as AES) is characterized as an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that recurrently measures and proactively advances knowledge boundaries in human learning.

To exemplify this, imagine an AES system that supports instructors in the detection of plagiarism and gaming behaviors and in the marking of writing activities. As previously mentioned, essays are marked according to a grid of scoring rubrics: ideas, organization, style, and conventions. While an abundance of data (e.g., the 1592 writing metrics) can be collected by the AES tool, these data might still be insufficient to automate the scoring process for certain rubrics (e.g., ideas). Nevertheless, some scoring subtasks, such as assessing a student’s vocabulary, sentence fluency, and conventions, might still be assigned to AI, since the data types available through existing automatic linguistic analysis tools prove sufficient to reliably alleviate the human marker’s workload. Interestingly, learning analytics is key to the accountability of AI agents to the human problem holder. As the volume of writing data (from a large student population, high-frequency capture of learning episodes, and a variety of big learning data) accumulates in the system, new AI agents (predictive models) may apply for the job of “automarker.” These AI agents can be made quite transparent through XAI ( Arrieta et al., 2020 ) explanation models, and a human instructor may assess the suitability of an agent for the job and hire the candidate agent that comes closest to human performance. Explanations derived from these models could serve as formative feedback to the students.

The AI marker can be assigned to assess writing activities that are similar to those previously scored by the human marker(s) from whom it learns. Dissimilar and unseen essays can be automatically assigned to the human marker for reliable scoring, and the AI agent can learn from this manual scoring. To ensure accountability, students should be allowed to appeal the AI agent’s marking to the human marker. In addition, the human marker should be empowered to monitor and validate the scoring of select writing rubrics scored by the AI marker. If the human marker does not agree with the machine scores, the writing assignments may be flagged as incorrectly scored and re-assigned to a human marker, and these flagged assignments may serve to update the predictive models. Moreover, among the essays assigned to the machine marker, a small subset can be simultaneously assigned to the human marker for continuous quality control, that is, to keep checking whether the agreement level between human and machine markers remains within an acceptable threshold. The human marker should at any time be able to “fire” an AI marker or “hire” a new one from a pool of potential machine markers.

This notion of a human-AI fusion has been observed in previous AES systems, where the human marker’s workload was significantly alleviated, going from scoring several hundred essays to just a few dozen ( Dronen et al., 2015 ; Hellman et al., 2019 ). As the AES technology matures and as learning analytics tools continue to penetrate the education market, this alliance of semi-autonomous human and AI agents will lead to better evidence-based/informed pedagogy ( Nelson and Campbell, 2017 ). Such a human-AI alliance can also be guided to autonomously self-regulate its own hypothesis-authoring and data-acquisition processes for the purpose of measuring and advancing knowledge boundaries in human learning.

Real-Time Formative Pedagogical Feedback

This paper provides evidence that deep learning and SHAP can be used not only to score essays automatically but also to offer explanations in real time. More specifically, the processing time needed to derive the 314 explanations of the testing set’s essays was benchmarked for several types of explainers. The faster DeepSHAP and GradientSHAP implementations, which took only a few seconds of processing, did not produce less accurate explanations than the much slower KernelSHAP, which took approximately 8 h of processing to derive the explanation model of the 2-layer MLP neural network predictive model and 16 h for the 6-layer predictive model.
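
For reference, the two fast explainers can be instantiated as sketched below; model, X_background, and X_test are assumed names, and the timings will obviously vary with hardware and data.

```python
import time
import shap

# Fast SHAP variants for an MLP predictive model: both run in seconds,
# versus hours for KernelSHAP on the same data.
start = time.time()
deep_explainer = shap.DeepExplainer(model, X_background)
deep_values = deep_explainer.shap_values(X_test)
print(f"DeepSHAP: {time.time() - start:.1f} s")

start = time.time()
grad_explainer = shap.GradientExplainer(model, X_background)
grad_values = grad_explainer.shap_values(X_test)
print(f"GradientSHAP: {time.time() - start:.1f} s")
```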

This finding also holds for the various configurations of KernelSHAP, where the number of features shaping the explanation (16 vs. 32, with all other features assigned zero contributions) did not produce a statistically significant difference in the reliability of the explanation models. On average, the models had a precision between 63.9 and 64.1% and a recall between 41.0 and 42.9%. This means that, after perturbation of the predictive and explanation models, on average 64% of the predictions that the explanation model identified as changing did indeed change. On the other hand, only about 42% of all predictions that changed were detected by the various 2-layer explainers. An explanation was considered untrustworthy if the sum of its feature contributions, when added to the average prediction (base value), was not within 0.1 of the perturbed prediction. Similarly, the average precision and recall of the 2-layer explainers for the 0.25 threshold were about 69% and 62%, respectively.

Impact of Deep Learning on Descriptive Accuracy of Explanations

By analyzing the performance of the various predictive models in Table 4, no clear conclusion can be reached as to which model is the most desirable. Although the 6-layer models slightly outperform the other models in terms of accuracy (the percentage of exact matches between the resolved [human] and predicted [machine] scores), they are not the best when it comes to the percentages of adjacent (within 1 and 2) matches. If the selection of the “best” model is based on the quadratic weighted kappas, the decision remains just as nebulous. Moreover, ensuring that machine learning actually learned something meaningful remains paramount, especially in contexts where the performance of a majority classifier is close to the human and machine performance. For example, a majority classifier would get 46.3% of predictions right ( Table 4 ), while the trained predictive models at best produce accurate predictions between 51.9 and 55.1% of the time.

Since the interpretability of a machine learning model should be prioritized over accuracy ( Ribeiro et al., 2016 ; Murdoch et al., 2019 ) for reasons of transparency and trust, this paper investigated whether the impact of the depth of an MLP neural network might be more visible when assessing its interpretability, that is, the trustworthiness of its corresponding SHAP explanation model. The data in Tables 1, 5, 6 support the hypothesis that as the depth of the neural network increases, the precision and recall of the corresponding explanation model improve. This observation is particularly interesting because the 4-layer (Grad) explainer, which has hardly more parameters than the 2-layer explainer, is also more accurate, suggesting that the 6-layer explainer is superior not only because of its greater number of parameters but also because of its number of hidden layers. By increasing the number of hidden layers, the precision and recall of an explanation model rise on average from approximately 64 to 73% and from 42 to 52%, respectively, for the 0.10 threshold, and from 69 to 79% and from 62 to 75%, respectively, for the 0.25 threshold.

These results imply that the descriptive accuracy of an explanation model is evidence of effective machine learning, which may exceed the level of agreement between the human and machine graders. Moreover, when the superiority of a trained predictive model over a majority classifier is not obvious, the consistency of its associated explanation model demonstrates it better. Note that, theoretically, the SHAP explanation model of the majority classifier should assign a zero contribution to each writing feature, since the average prediction of such a model is the most frequent rubric score given by the human raters; hence, the base value is the explanation.

An interesting fact emerges from Figure 3: all explainers (2-layer to 6-layer) are more or less similar and do not appear to contradict each other. More specifically, they all agree on the direction of the contributions of the most important features; in other words, they unanimously determine whether a feature should increase or decrease the predicted score. However, they differ from each other in the magnitude of the feature contributions.

To conclude, this study highlights the need to train predictive models that take the descriptive accuracy of explanations into account. Just as explanation models consider predictions to derive explanations, explanations should be considered when training predictive models. This would not only help train interpretable models from the start but also potentially break the status quo that may exist among similar explainers and thereby produce more powerful models. In addition, this research calls for a mechanism (e.g., causal diagrams) that allows teachers to guide the training process of predictive models. Put another way, as LA practitioners debug predictive models, their insights should be encoded in a language that the machine understands and that guides the training process, so that the same errors are not learned again and training time is reduced.

Accountable AES

Now that the superiority of the 6-layer predictive and explanation models has been demonstrated, some aspects of the relevancy of explanations should be examined more deeply, since an explanation model consistent with its underlying predictive model does not guarantee relevant explanations. Table 7 discloses the set of four essays that optimizes coverage of the most globally important features, used to evaluate the correctness of the best AES model. It is intriguing that two of the four essays are among the 16 essays with a major disagreement (off by 2) between the resolved and predicted rubric scores (1 vs. 3 and 4 vs. 2). The AES tool clearly overrated Essay 228, while it underrated Essay 219. Naturally, these two essays offer an opportunity to understand what is wrong with the model and ultimately to debug it to improve its accuracy and interpretability.

In particular, Essay 228 raises suspicion about the positive contributions of features such as “Ortho_N,” “lemma_mattr,” “all_logical,” “det_pobj_deps_struct,” and “dobj_per_cl.” Moreover, notice how the remaining 262 less important features (not visible in the decision plot in Figure 5) have already inflated the rubric score beyond the base value, more than for any other essay. Given the very short length and very low quality of the essay, whose meaning is seriously undermined by spelling and grammatical errors, it is of utmost importance to verify how some of these features are computed. For example, is the average number of orthographic neighbors (Ortho_N) per token computed for unmeaningful tokens such as “R” and “whe”? Similarly, are these tokens counted as types in the type-token ratio over lemmas (lemma_mattr)? Given the absence of a meaningful grammatical structure conveying a complete idea through well-articulated words, it becomes obvious that the quality of NLP (natural language processing) parsing may become a source of (measurement) bias impacting both the way some writing features are computed and the predicted rubric score. To remedy this, two solutions are proposed: (1) enhancing the dataset with the part-of-speech sequence or the structure of dependency relationships along with associated confidence levels, or (2) augmenting the essay dataset with essays containing various types of nonsensical content to improve the learning of these feature contributions.

Note that all four essays have a text length smaller than the average of 171 words. Notice also how “hdd42_aw” and “hdd42_fw” play a significant role in decreasing the predicted scores of Essays 228 and 68. These metrics require a minimum of 42 tokens to compute a non-zero D index, a measure of lexical diversity explained in Supplementary Table 1. Figure 6B also shows how zero “hdd42_aw” values are heavily penalized. This is further evidence of the strong role the number of words plays in determining these rubric scores, especially for very short essays, where it is one of the few observations that can be reliably recorded.

Two other issues with the best trained AES model were identified. First, in the eyes of the model, the lower the average number of direct objects per clause (dobj_per_cl), as seen in Figure 6D, the better. This appears to contradict one of the requirements of the “Style” rubric, which looks for a variety of sentence structures. Remember that direct objects imply the presence of transitive (action) verbs and that a balanced usage of linking and action verbs, as well as of transitive and intransitive verbs, is key to meeting the requirement of variety of sentence structures. Moreover, note that the writing feature counts the number of direct objects per clause, not per sentence, so only one direct object is possible per clause. A sentence, on the other hand, may contain several clauses, which determines whether it is a simple, compound, or complex sentence; a sentence may therefore have multiple direct objects, and a high ratio of direct objects per clause is indicative of sentence complexity. Too much complexity is also undesirable. Hence, it is fair to conclude that the higher range of feature values has reasonable feature contributions (SHAP values), while the lower range does not capture the requirements of the rubric well; the dependence plot should rather display a positive peak somewhere in the middle. Notice how the poor quality of Essay 228’s single sentence prevented the proper detection of its single direct object, “broke my finger,” and how the supposed absence of direct objects was one of the reasons the predicted rubric score was wrongfully improved.

The model’s second issue is the presence of sentence fragments, a type of grammatical error. Essentially, a sentence fragment is a clause that is missing one of three critical components: a subject, a verb, or a complete idea. Figure 6E shows the contribution model of grammatical errors, all types combined, while Figure 6F shows specifically the contribution model of sentence fragments. It is interesting to see how SHAP penalizes larger numbers of grammatical errors more heavily and takes into account the length of the essay (red dots represent essays with larger numbers of words; blue dots represent essays with smaller numbers of words). For example, except for essays with no identified grammatical errors, longer essays are penalized less than shorter ones; this is particularly obvious when there are 2−4 grammatical errors. The model increases the predicted rubric score only when there is no grammatical error, and it tolerates longer essays with only one grammatical error, which sounds quite reasonable. On the other hand, the model considers high numbers of sentence fragments, a non-trivial type of grammatical error, to be desirable. Even worse, the model decreases the rubric score of essays with no sentence fragment. Although grammatical issues are beyond the scope of the “Style” rubric, the model has probably included these features because of their impact on the quality of assessment of vocabulary usage and sentence fluency. The reader should observe how the very poor quality of an essay can even prevent the detection of such fundamental grammatical errors, as in the case of Essay 228, where the AES tool did not find any grammatical error or sentence fragment. Therefore, there should be a way for AES systems to detect a minimum level of text quality before attempting to score an essay. Note that the objective of this section was not to undertake a thorough debugging of the model, but rather to underscore the effectiveness of SHAP in doing so.

Formative Feedback

Once an AES model is considered reasonably valid, SHAP can be a suitable formalism to empower the machine to provide formative feedback. For instance, the explanation of Essay 124, which was assigned a rubric score of 3 by both the human and machine markers, indicates that the top two factors decreasing the predicted rubric score are (1) the essay length being smaller than average and (2) the average number of verb lemma types occurring at least once in the next sentence (adjacent_overlap_verb_sent). Figures 6A,H give the overall picture in which the realism of the contributions of these two features can be analyzed. More specifically, Essay 124 is one of very few essays ( Figure 6H ) that make redundant use of the same verbs across adjacent sentences. Moreover, the essay displays poor sentence fluency, with everything expressed in only two sentences. To understand more accurately the impact of “adjacent_overlap_verb_sent” on the prediction, a few spelling errors were corrected and the text was divided into four sentences instead of two. Revision 1 in Table 8 exhibits the corrections made to the original essay. The decision plot’s dashed line in Figure 3D represents the original explanation of Essay 124, while Figure 7A shows the new explanation of the revised essay. The “adjacent_overlap_verb_sent” feature is still the second most important feature in the new explanation of Essay 124, with a feature value of 0.429, still considered very poor according to the dependence plot in Figure 6H.


Table 8. Revisions of Essay 124: improvement of sentence splitting, correction of some spelling errors, and elimination of redundant usage of same verbs (bold for emphasis in Essay 124’s original version; corrections in bold for Revisions 1 and 2).


Figure 7. Explanations of the various versions of Essay 124 and evaluation of feature effect for a range of feature values. (A) Explanation of Essay 124’s first revision. (B) Forecasting the effect of changing the ‘adjacent_overlap_verb_sent’ feature on the rubric score. (C) Explanation of Essay 124’s second revision. (D) Comparison of the explanations of all Essay 124’s versions.

To show how SHAP could be leveraged to offer remedial formative feedback, the revised version of Essay 124 is explained again for eight different values of “adjacent_overlap_verb_sent” (0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0), while keeping the values of all other features constant. The set of these eight essays is explained by a newly trained SHAP (Gradient) explainer, producing new SHAP values for each feature and each “revised” essay. The new model, called the feedback model, makes it possible to foresee by how much a novice writer can hope to improve his/her score according to the “Style” rubric. If the student employs different verbs in every sentence, the feedback model estimates that the rubric score could be improved from 3.47 up to 3.65 ( Figure 7B ); the dashed line represents Revision 1, while the other lines each simulate one of the seven other altered essays. Moreover, it is important to note how changing the value of a single feature may influence the contributions that other features have on the predicted score. Again, all explanations look similar in terms of direction, but certain features differ in the magnitude of their contributions. However, the reader should observe how the targeted feature varies not only in magnitude but also in direction, allowing the student to ponder the relevancy of executing the recommended writing strategy.
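
A sketch of how such a feedback model might be simulated is given below: the revised essay is duplicated eight times, only the adjacent_overlap_verb_sent column is varied, and the resulting essays are re-scored and re-explained. Variable names (model, x_revision1, X_background, feature_names) are assumptions, not the authors' code.

```python
import numpy as np
import shap

# Eight simulated revisions of Essay 124 differing only in one feature.
values = np.linspace(0.0, 1.0, 8)                    # 0, 0.143, ..., 1.0
idx = feature_names.index("adjacent_overlap_verb_sent")

simulated = np.tile(x_revision1, (len(values), 1))   # copies of the revised essay
simulated[:, idx] = values                           # vary only the target feature

# Predicted rubric scores and fresh explanations for the simulated essays.
predicted_scores = model.predict(simulated).ravel()
feedback_explainer = shap.GradientExplainer(model, X_background)
feedback_values = feedback_explainer.shap_values(simulated)
```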

Thus, upon receiving this feedback, assume that a student sets the goal of improving the effectiveness of his/her verb choice by eliminating any redundant verb, producing Revision 2 in Table 8. The student submits the essay again to the AES system, which gives a new rubric score of 3.98, a significant improvement over the previous 3.47, allowing the student to get a 4 instead of a 3. Figure 7C exhibits the decision plot of Revision 2. To better observe how the various revisions of the student’s essay changed over time, their respective explanations have been plotted in the same decision plot ( Figure 7D ). Notice this time that the ordering of the features has changed to list the features of common importance to all of the essay’s versions. The feature ordering in Figures 7A−C follows the same ordering as in Figure 3D, the decision plot of the original essay. These figures underscore the importance of tracking the interaction between the various features so that the model understands the impact that changing one feature has on the others. TreeSHAP, an implementation for tree-based models, offers this capability, and its potential for improving the quality of feedback provided to novice writers will be tested in a future version of this AES system.

This paper serves as a proof of concept of the applicability of XAI techniques in automated essay scoring, providing learning analytics practitioners and educators with a methodology for “hiring” AI markers and making them accountable to their human counterparts. In addition to debugging predictive models, SHAP explanation models can serve as a formalism within a broader learning analytics platform, where aspects of prescriptive analytics (the provision of remedial formative feedback) can be added on top of the more pervasive predictive analytics.

However, the main weakness of the approach put forward in this paper is that it omits many types of spatio-temporal data. In other words, it ignores precious information inherent to the writing process, which may prove essential for guessing the intent of the student, especially in contexts of poor sentence structure and high grammatical inaccuracy. Hence, this paper calls for adapting current NLP technologies to educational purposes, where the quality of writing may be suboptimal, contrary to many utopian scenarios where NLP is used for content analysis, opinion mining, topic modeling, or fact extraction trained on corpora of high-quality texts. By capturing the writing process preceding the submission of an essay to an AES tool, other kinds of explanation models can also be trained to offer feedback not only from a linguistic perspective but also from a behavioral one (e.g., composing vs. revising); that is, the AES system could inform novice writers about suboptimal and optimal writing strategies (e.g., planning a revision phase after bursts of writing).

In addition, associating sections of text with suboptimal writing features, those whose contributions lower the predicted score, would be much more informative. This spatial information would make it possible to point out not only what is wrong but also where it is wrong, answering more efficiently the question of why an essay is wrong. This problem could be approached simply through a multiple-input, mixed-data, feature-based (MLP) neural network architecture fed by both linguistic indices and textual data (n-grams), where the SHAP explanation model would assign feature contributions to both types of features and to any potential interaction between them. A more complex approach could address the problem through special types of recurrent neural networks such as Ordered-Neurons LSTMs (long short-term memory), which are well adapted to the parsing of natural language and capture not only the natural sequence of the text but also its hierarchy of constituents ( Shen et al., 2018 ). After all, this paper highlights that the potential of deep learning can reach beyond the training of powerful predictive models and become even more visible in the higher trustworthiness of explanation models. This paper also calls for optimizing the training of predictive models by considering the descriptive accuracy of explanations and the human expert’s qualitative knowledge (e.g., indicating the direction of feature contributions) during the training process.

Data Availability Statement

The datasets and code of this study can be found in these Open Science Framework’s online repositories: https://osf.io/fxvru/ .

Author Contributions

VK architected the concept of an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that measures and advances knowledge boundaries in human learning, which essentially defines the key traits of learning analytics. DB was responsible for its implementation in the area of explainable automated essay scoring and for the training and validation of the predictive and explanation models. Together they offer an XAI-based proof of concept of a prescriptive model that can offer real-time formative remedial feedback to novice writers. Both authors contributed to the article and approved its publication.

Funding

Research reported in this article was supported by the Academic Research Fund (ARF) publication grant of Athabasca University under award number 24087.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2020.572367/full#supplementary-material

  • ^ https://www.kaggle.com/c/asap-aes
  • ^ https://www.linguisticanalysistools.org/

References

Abbass, H. A. (2019). Social integration of artificial intelligence: functions, automation allocation logic and human-autonomy trust. Cogn. Comput. 11, 159–171. doi: 10.1007/s12559-018-9619-0


Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi: 10.1109/ACCESS.2018.2870052

Amorim, E., Cançado, M., and Veloso, A. (2018). “Automated essay scoring in the presence of biased ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , New Orleans, LA, 229–237.


Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., et al. (2020). Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fusion 58, 82–115. doi: 10.1016/j.inffus.2019.12.012

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., et al. (2007). The English lexicon project. Behav. Res. Methods 39, 445–459. doi: 10.3758/BF03193014


Boulanger, D., and Kumar, V. (2018). “Deep learning in automated essay scoring,” in Proceedings of the International Conference of Intelligent Tutoring Systems , eds R. Nkambou, R. Azevedo, and J. Vassileva (Cham: Springer International Publishing), 294–299. doi: 10.1007/978-3-319-91464-0_30

Boulanger, D., and Kumar, V. (2019). “Shedding light on the automated essay scoring process,” in Proceedings of the International Conference on Educational Data Mining , 512–515.

Boulanger, D., and Kumar, V. (2020). “SHAPed automated essay scoring: explaining writing features’ contributions to English writing organization,” in Intelligent Tutoring Systems , eds V. Kumar and C. Troussas (Cham: Springer International Publishing), 68–78. doi: 10.1007/978-3-030-49663-0_10

Chen, H., Lundberg, S., and Lee, S.-I. (2019). Explaining models by propagating Shapley values of local components. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1911.11888 (accessed September 22, 2020).

Crossley, S. A., Bradfield, F., and Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. J. Writ. Res. 11, 251–270. doi: 10.17239/jowr-2019.11.02.01

Crossley, S. A., Kyle, K., and McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): automatic assessment of local, global, and text cohesion. Behav. Res. Methods 48, 1227–1237. doi: 10.3758/s13428-015-0651-7

Crossley, S. A., Kyle, K., and McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social-order analysis. Behav. Res. Methods 49, 803–821. doi: 10.3758/s13428-016-0743-z

Dronen, N., Foltz, P. W., and Habermehl, K. (2015). “Effective sampling for large-scale automated writing evaluation systems,” in Proceedings of the Second (2015) ACM Conference on Learning @ Scale , 3–10.

Goldin, I., Narciss, S., Foltz, P., and Bauer, M. (2017). New directions in formative feedback in interactive learning environments. Int. J. Artif. Intellig. Educ. 27, 385–392. doi: 10.1007/s40593-016-0135-7

Hao, Q., and Tsikerdekis, M. (2019). “How automated feedback is delivered matters: formative feedback and knowledge transfer,” in Proceedings of the 2019 IEEE Frontiers in Education Conference (FIE) , Covington, KY, 1–6.

Hellman, S., Rosenstein, M., Gorman, A., Murray, W., Becker, L., Baikadi, A., et al. (2019). “Scaling up writing in the curriculum: batch mode active learning for automated essay scoring,” in Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale , (New York, NY: Association for Computing Machinery).

Hussein, M. A., Hassan, H., and Nassef, M. (2019). Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 5:e208. doi: 10.7717/peerj-cs.208

Kumar, V., and Boulanger, D. (2020). Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intellig. Educ. doi: 10.1007/s40593-020-00211-5

Kumar, V., Fraser, S. N., and Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. J. Writ. Anal. 1, 176–226.

Kyle, K. (2016). Measuring Syntactic Development In L2 Writing: Fine Grained Indices Of Syntactic Complexity And Usage-Based Indices Of Syntactic Sophistication. Dissertation, Georgia State University, Atlanta, GA.

Kyle, K., Crossley, S., and Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): version 2.0. Behav. Res. Methods 50, 1030–1046. doi: 10.3758/s13428-017-0924-4

Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1802.03888 (accessed September 22, 2020).

Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems , eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, et al. (Red Hook, NY: Curran Associates, Inc), 4765–4774.

Madnani, N., and Cahill, A. (2018). “Automated scoring: beyond natural language processing,” in Proceedings of the 27th International Conference on Computational Linguistics , (Santa Fe: Association for Computational Linguistics), 1099–1109.

Madnani, N., Loukina, A., von Davier, A., Burstein, J., and Cahill, A. (2017). “Building better open-source tools to support fairness in automated scoring,” in Proceedings of the First (ACL) Workshop on Ethics in Natural Language Processing , (Valencia: Association for Computational Linguistics), 41–52.

McCarthy, P. M., and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42, 381–392. doi: 10.3758/brm.42.2.381

Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., et al. (2019). “Analytic score prediction and justification identification in automated short answer scoring,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications , Florence, 316–325.

Molnar, C. (2020). Interpretable Machine Learning . Abu Dhabi: Lulu

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U.S.A. 116, 22071–22080. doi: 10.1073/pnas.1900654116

Nelson, J., and Campbell, C. (2017). Evidence-informed practice in education: meanings and applications. Educ. Res. 59, 127–135. doi: 10.1080/00131881.2017.1314115

Rahimi, Z., Litman, D., Correnti, R., Wang, E., and Matsumura, L. C. (2017). Assessing students’ use of evidence and organization in response-to-text writing: using natural language processing for rubric-based automated scoring. Int. J. Artif. Intellig. Educ. 27, 694–728. doi: 10.1007/s40593-017-0143-2

Reinertsen, N. (2018). Why can’t it mark this one? A qualitative analysis of student writing rejected by an automated essay scoring system. English Austral. 53:52.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should i trust you?”: explaining the predictions of any classifier. CoRR, abs/1602.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1602.04938 (accessed September 22, 2020).

Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl. Meas. Educ. 31, 191–214. doi: 10.1080/08957347.2018.1464448

Rupp, A. A., Casabianca, J. M., Krüger, M., Keller, S., and Köller, O. (2019). Automated essay scoring at scale: a case study in Switzerland and Germany. ETS Res. Rep. Ser. 2019, 1–23. doi: 10.1002/ets2.12249

Shen, Y., Tan, S., Sordoni, A., and Courville, A. C. (2018). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. CoRR, abs/1810.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1810.09536 (accessed September 22, 2020).

Shermis, M. D. (2014). State-of-the-art automated essay scoring: competition, results, and future directions from a United States demonstration. Assess. Writ. 20, 53–76. doi: 10.1016/j.asw.2013.04.001

Taghipour, K. (2017). Robust Trait-Specific Essay Scoring using Neural Networks and Density Estimators. Dissertation, National University of Singapore, Singapore.

West-Smith, P., Butler, S., and Mayfield, E. (2018). “Trustworthy automated essay scoring without explicit construct validity,” in Proceedings of the 2018 AAAI Spring Symposium Series , (New York, NY: ACM).

Woods, B., Adamson, D., Miel, S., and Mayfield, E. (2017). “Formative essay feedback using predictive scoring models,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , (New York, NY: ACM), 2071–2080.

Keywords : explainable artificial intelligence, SHAP, automated essay scoring, deep learning, trust, learning analytics, feedback, rubric

Citation: Kumar V and Boulanger D (2020) Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. Front. Educ. 5:572367. doi: 10.3389/feduc.2020.572367

Received: 14 June 2020; Accepted: 09 September 2020; Published: 06 October 2020.


Copyright © 2020 Kumar and Boulanger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Boulanger, [email protected]

This article is part of the Research Topic “Learning Analytics for Supporting Individualization: Data-informed Adaptation of Learning.”

e-rater ®  Scoring Engine

Evaluates students’ writing proficiency with automatic scoring and feedback


About the e-rater Scoring Engine

The e-rater automated scoring engine uses AI technology and Natural Language Processing (NLP) to evaluate the writing proficiency of student essays by providing automatic scoring and feedback. The engine provides descriptive feedback on the writer’s grammar, mechanics, word use and complexity, style, organization and more.

Who uses the e-rater engine and why?

Companies and institutions use this patented technology to power their custom applications.

The e-rater engine is used within the Criterion® Online Writing Evaluation Service. Students use the e-rater engine's feedback to evaluate their essay-writing skills and to identify areas that need improvement. Teachers use the Criterion service to help their students develop their writing skills independently and receive automated, constructive feedback. The e-rater engine is also used in other low-stakes practice products, including TOEFL® Practice Online and GRE® ScoreItNow!™.

In high-stakes settings, the engine is used in conjunction with human ratings for both the Issue and Argument prompts of the GRE test's Analytical Writing section and the TOEFL iBT ®  test's Independent and Integrated Writing prompts. ETS research has shown that combining automated and human essay scoring demonstrates assessment score reliability and measurement benefits.

For more information about the use of the e-rater engine, read  E-rater as a Quality Control on Human Scores (PDF) .

How does the e-rater engine grade essays?

The e-rater engine provides a holistic score for an essay that has been entered into the computer electronically. It also provides real-time diagnostic feedback about grammar, usage, mechanics, style and organization, and development. This feedback is based on NLP research specifically tailored to the analysis of student responses and is detailed in  ETS's research publications (PDF) .

How does the e-rater engine compare to human raters?

The e-rater engine uses NLP to identify features relevant to writing proficiency in training essays and their relationship with human scores. The resulting scoring model, which assigns weights to each observed feature, is stored offline in a database that can then be used to score new essays according to the same formula.
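
The actual features and weights used by the e-rater engine are proprietary and not described here, but the general idea of a weighted-feature scoring model can be illustrated with a small, entirely synthetic sketch: hypothetical feature values for training essays are regressed on human holistic scores, and the stored weights are then applied to new essays using the same formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for NLP-derived writing features and human scores.
rng = np.random.default_rng(0)
feature_names = ["grammar", "usage", "mechanics", "style", "organization", "development"]
X_train = rng.random((500, len(feature_names)))   # feature values for 500 training essays
human_scores = rng.integers(1, 7, size=500)       # holistic human scores on a 1-6 scale

# Fit the scoring model: each feature receives a weight relating it to human scores.
scoring_model = LinearRegression().fit(X_train, human_scores)

# The stored weights are then applied to score a new essay with the same formula.
new_essay = rng.random((1, len(feature_names)))
print(round(float(scoring_model.predict(new_essay)[0]), 2))
```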

The e-rater engine doesn’t have the ability to read, so it can’t evaluate essays the same way that human raters do. However, the features used in e-rater scoring have been developed to be as substantively meaningful as they can be, given the state of the art in NLP. They have also been developed to demonstrate strong reliability, often greater than that of human raters themselves.

Learn more about  how it works .

About Natural Language Processing

The e-rater engine is an artificial intelligence engine that uses Natural Language Processing (NLP), a field of computer science and linguistics that uses computational methods to analyze characteristics of a text. NLP methods support such burgeoning application areas as machine translation, speech recognition and information retrieval.

Ready to begin? Contact us to learn how the e-rater service can enhance your existing program.



Automated Essay Scoring

24 papers with code • 1 benchmark • 1 dataset

Automated Essay Scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.

Source: A Joint Model for Multimodal Document Quality Assessment


Latest Papers

Autoregressive Score Generation for Multi-trait Essay Scoring


Recently, encoder-only pre-trained models such as BERT have been successfully applied in automated essay scoring (AES) to predict a single overall score.

Can Large Language Models Automatically Score Proficiency of Written Essays?

watheq9/aes-with-llms • 10 Mar 2024

Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness.

From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape

Receiving immediate and personalized feedback is crucial for second-language learners, and Automated Essay Scoring (AES) systems are a vital resource when human instructors are unavailable.

Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability

Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays.

Learning to love diligent trolls: Accounting for rater effects in the dialogue safety task

michaeljohnilagan/aestrollhunting • 30 Oct 2023

However, among users are trolls, who provide training examples with incorrect labels.

Modeling Structural Similarities between Documents for Coherence Assessment with Graph Convolutional Networks

Coherence is an important aspect of text quality, and various approaches have been applied to coherence modeling.

Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring

Thus, predicting various trait scores of unseen-prompt essays (called cross-prompt essay trait scoring) is a remaining challenge of AES.

WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia

ken-ando/wikisqe • 10 May 2023

Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia.

H-AES: Towards Automated Essay Scoring for Hindi

midas-research/hindi-aes • 28 Feb 2023

In this study, we reproduce and compare state-of-the-art methods for AES in the Hindi domain.

On the Use of BERT for Automated Essay Scoring: Joint Learning of Multi-Scale Essay Representation

In recent years, pre-trained models have become dominant in most natural language processing (NLP) tasks.


Improving Automated Essay Scoring by Prompt Prediction and Matching

Tianbao Song, Weiming Peng

1 School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China

2 School of Computer Science and Engineering, Beijing Technology and Business University, Beijing 100048, China

Associated Data

Publicly available datasets were used in this study. These data can be found here: http://hsk.blcu.edu.cn/ (accessed on 6 March 2022).

Abstract

Automated essay scoring aims to evaluate the quality of an essay automatically. It is one of the main educational applications of natural language processing. Recently, pre-training techniques have been used to improve performance on downstream tasks, and many studies have attempted to use pre-training followed by fine-tuning in essay scoring systems. However, obtaining better features, such as prompt information, from the pre-trained encoder is critical but not fully studied. In this paper, we create a prompt feature fusion method that is better suited for fine-tuning. In addition, we use multi-task learning by designing two auxiliary tasks, prompt prediction and prompt matching, to obtain better features. The experimental results show that both auxiliary tasks improve model performance, and the combination of the two auxiliary tasks with the NEZHA pre-trained encoder produces the best results, with Quadratic Weighted Kappa improving 2.5% and Pearson’s Correlation Coefficient improving 2% on average across all results on the HSK dataset.

1. Introduction

Automated essay scoring (AES), which aims to automatically evaluate and score essays, is one typical application of natural language processing (NLP) techniques in the field of education [1]. Earlier studies used a combination of handcrafted features and statistical machine learning [2, 3], and with the development of deep learning, neural network-based approaches gradually became mainstream [4, 5, 6, 7, 8]. Recently, pre-trained language models have become a foundational module of NLP, and the paradigm of pre-training followed by fine-tuning is widely adopted. Pre-training is the most common method for transfer learning, in which a model is trained on a surrogate task and then adapted to the desired downstream task by fine-tuning [9]. Some research has attempted to use pre-training modules in AES tasks [10, 11, 12]. Howard et al. [10] utilize the pre-trained encoder as a feature extraction module to obtain a representation of the input text and update the pre-trained model parameters for the downstream text classification task by adding a linear layer. Rodriguez et al. [11] employ a pre-trained encoder as the essay representation extraction module for the AES task, with inputs at various granularities (sentence, paragraph, whole essay, etc.), and then use regression as the training target of the downstream task to further optimize the representation. In this paper, we fine-tune the pre-trained encoder as a feature extraction module and convert the essay scoring task into regression, as in previous studies [4, 5, 6, 7].

Existing neural methods obtain a generic representation of the text through a hierarchical model that uses convolutional neural networks (CNN) for the word-level representation and long short-term memory (LSTM) for the sentence-level representation [4], which is not specific to different features. To enhance the representation of the essay, some studies have attempted to incorporate features such as prompt [3, 13], organization [14], coherence [2], and discourse structure [15, 16, 17] into the neural model. These features are critical for the AES task because they help the model understand the essay while also making the essay scoring more interpretable. In real scenarios, prompt adherence is an important feature in essay scoring tasks [3]. The hierarchical model is insensitive to changes in the prompt associated with an essay and always assigns the same score to the same essay, regardless of the essay prompt. Persing and Ng [3] propose a feature-rich approach that integrates the prompt adherence dimension. Ref. [18] improves document modeling with topic words. Li et al. [7] utilize a hierarchical structure with an attention mechanism to construct prompt information. However, the above feature fusion methods are unsuitable for fine-tuning.

The two challenges in effectively incorporating pre-trained models into AES feature representation lie in the data dimension and the methodological dimension. On the data dimension, transferring a pre-trained encoder to a downstream task by fine-tuning usually requires sufficient data. Most previous research trains and tests on data from the same target prompt [4, 5], but the data size is relatively small, varying between a few hundred and a few thousand essays, so pre-trained encoders cannot be fine-tuned well. To address this challenge, we use the whole training set, which includes various prompts. On the methodological dimension, we employ the pre-training and multi-task learning (MTL) paradigms, which can learn features that cannot be learned in a single task through joint learning, learning to learn, and learning with auxiliary tasks [19]. MTL methods have been applied to several NLP tasks, such as text classification [20, 21] and semantic analysis [22]. Our method creates two auxiliary tasks that are learned alongside the main task. The main task and the auxiliary tasks can increase each other’s performance by sharing information and complementing each other.

In this paper, we propose an essay scoring model based on fine-tuning that uses multi-task learning to fuse prompt features by designing two auxiliary tasks, prompt prediction and prompt matching, which are better suited to fine-tuning. Our approach can effectively incorporate the prompt feature of an essay and improve the representation and understanding of the essay. The paper is organized as follows. In Section 2, we first review related studies. We describe our method and experiments in Section 3 and Section 4. Section 5 presents the findings and discussion. Finally, in Section 6, we provide the conclusion, future work, and the limitations of the paper.

2. Related Work

Pre-trained language models, such as BERT [23], BERT-WWM [24], RoBERTa [25], and NEZHA [26], have gradually become a fundamental technique of NLP, with great success on both English and Chinese tasks [27]. In our approach, we use BERT and NEZHA as feature extraction layers. BERT (Bidirectional Encoder Representations from Transformers) is based on transformer blocks built with the attention mechanism [28] to extract semantic information. It is trained on large-scale datasets with two unsupervised tasks: masked language modeling (MLM) and next sentence prediction (NSP). NEZHA is a Chinese pre-trained model that, unlike BERT, employs functional relative positional encoding and whole word masking (WWM). The pre-training and then fine-tuning mechanism is widely used in downstream NLP tasks, including AES [11, 12, 15]. Mim et al. [15] propose a pre-training approach for evaluating the organization and argument strength of essays based on modeling coherence. Song et al. [12] present a multi-stage pre-training method for automated Chinese essay scoring that consists of three components: weakly supervised pre-training, supervised cross-prompt fine-tuning, and supervised target-prompt fine-tuning. Rodriguez et al. [11] use BERT and XLNet [29] for representation and fine-tuning on an English corpus.

The essay prompt introduces the topic, offers concepts, and restricts both content and perspective. Some studies have attempted to enhance AES systems by incorporating prompt features in various ways, such as integrating prompt information to determine whether an essay is off-topic [13, 18] or treating prompt adherence as a crucial indicator [3]. Louis and Higgins [13] improve model performance by expanding prompt information with a list of related words and reducing spelling errors. Persing and Ng [3] propose a feature-rich method for incorporating the prompt adherence dimension via manual annotation. Klebanov et al. [18] also improve essay modeling with topic words to quantify the overall relevance of the essay to the prompt, and they discuss the relationship between prompt adherence scores and overall essay quality. The methods described above mostly employ statistical machine learning, in which prompt information is enriched through annotation, the construction of datasets, the construction of word lists, and topic word mining. While all of them make good progress, the approaches they employ are difficult to transfer directly to fine-tuning. Li et al. [7] propose a shared model and an enhanced model (EModel) and utilize a neural hierarchical structure with an attention mechanism to construct features of the essay such as discourse, coherence, relevancy, and prompt. For the representation, that paper employs GloVe [30] rather than a pre-trained model. In the experiment section, we compare our method to the sub-module of EModel (Pro.) that incorporates the prompt feature.

3.1. Motivation

Although previous studies on automated essay scoring models for specific prompts have shown promising results, most research focuses on generic features of essays. Only a few studies have focused on prompt feature extraction, and none has attempted to use a multi-task approach to make the model capture prompt features and become sensitive to prompts automatically. Our approach is motivated by capturing prompt features to make the model aware of the prompt and by using the pre-training and then fine-tuning mechanism for AES. Based on this motivation, we use a multi-task learning approach to obtain features that are more applicable to Essay Scoring (ES) by adding essay prompts to the model input and proposing two auxiliary tasks: Prompt Prediction (PP) and Prompt Matching (PM). The overall architecture of our model is illustrated in Figure 1.

Figure 1. The proposed framework. “一封求职信” is the prompt of the essay; its English translation is “A cover letter”. “主管您好” means “Hello Manager”. The prompt and essay are separated by [SEP].

3.2. Input and Feature Extraction Layer

The input representation for a given essay is built by adding the corresponding token embeddings \(E_{token}\), segment embeddings \(E_{segment}\), and position embeddings \(E_{position}\). To fully exploit the prompt information, we concatenate the prompt in front of the essay. The first token of each input is a special classification token [CLS], and the prompt and essay are separated by [SEP]. The token embedding of the \(j\)-th essay under the \(i\)-th prompt can be expressed as Equation (1); \(E_{segment}\) and \(E_{position}\) are obtained from the tokenizer of the pre-trained encoder.

We utilize BERT and NEZHA as feature extraction layers. The final hidden state corresponding to the [CLS] token is used as the essay representation \(r_e\) for essay scoring and the subtasks.
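To make the input construction concrete, the following is a minimal sketch using the Hugging Face transformers library. The checkpoint name and variable names are illustrative assumptions, not taken from the paper (the authors use the released BERT/NEZHA weights).

```python
import torch
from transformers import BertModel, BertTokenizer

# Illustrative checkpoint; the paper fine-tunes Chinese BERT/NEZHA weights.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

prompt = "一封求职信"   # essay prompt ("A cover letter")
essay = "主管您好……"    # essay body (truncated here)

# Passing two text segments yields "[CLS] prompt [SEP] essay [SEP]";
# segment (token type) and position embeddings are handled by the tokenizer/model.
inputs = tokenizer(prompt, essay, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# The final hidden state of the [CLS] token serves as the essay representation r_e.
r_e = outputs.last_hidden_state[:, 0, :]   # shape: (batch, hidden_size)
```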

3.3. Essay Scoring Layer

We view essay scoring as a regression task. Following existing studies, the real scores are scaled to the range \([0, 1]\) for training and rescaled back during evaluation:

\( s_{ij} = \frac{score_{ij} - minscore_{i}}{maxscore_{i} - minscore_{i}} \)

where \(s_{ij}\) is the scaled score of the \(j\)-th essay under the \(i\)-th prompt, \(score_{ij}\) is its actual score, and \(maxscore_{i}\) and \(minscore_{i}\) are the maximum and minimum real scores for the \(i\)-th prompt. The essay representation \(r_e\) from the pre-trained encoder is fed into a linear layer with a sigmoid activation function:

\( \hat{s} = \sigma (W_{es} \, r_e + b_{es}) \)

where \(\hat{s}\) is the score predicted by the AES system, \(\sigma\) is the sigmoid function, \(W_{es}\) is a trainable weight matrix, and \(b_{es}\) is a bias. The essay scoring (ES) training objective is the mean squared error between the scaled gold scores and the predicted scores:

\( \mathcal{L}_{es} = \frac{1}{N} \sum_{k=1}^{N} \left( \hat{s}_{k} - s_{k} \right)^{2} \)
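A minimal PyTorch sketch of the score scaling and the regression head described above; module and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def scale_score(score, min_score, max_score):
    """Scale a raw score into [0, 1] for training."""
    return (score - min_score) / (max_score - min_score)

def rescale_score(s_hat, min_score, max_score):
    """Invert the scaling at evaluation time."""
    return s_hat * (max_score - min_score) + min_score

class EssayScoringHead(nn.Module):
    """Linear layer with sigmoid activation on top of the [CLS] representation r_e."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, r_e):
        return torch.sigmoid(self.linear(r_e)).squeeze(-1)   # predicted scaled score

# Training objective: mean squared error between scaled gold scores and predictions.
es_loss_fn = nn.MSELoss()
```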

3.4. Subtask 1: Prompt Prediction

Prompt prediction is defined as follows: given an essay, determine which prompt it belongs to. We view prompt prediction as a classification task. The input is the essay representation \(r_e\), which is fed into a linear layer with a softmax function, as given by Equation (5):

\( \hat{u} = \mathrm{softmax}(W_{pp} \, r_e + b_{pp}) \)

where \(\hat{u}\) is the probability distribution over the classification results, \(W_{pp}\) is a parameter matrix, and \(b_{pp}\) is a bias. The loss function is the cross-entropy:

\( \mathcal{L}_{pp} = - \sum_{k} \sum_{c=1}^{C} \mathbb{1}[u_{k} = c] \log p_{pp}^{k,c} \)

where \(u_k\) is the real prompt label of the \(k\)-th sample, \(p_{pp}^{k,c}\) is the probability that the \(k\)-th sample belongs to the \(c\)-th category, and \(C\) denotes the number of prompts, which in this study is ten.
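A sketch of the prompt prediction head and its cross-entropy objective, assuming \(C = 10\) prompts as in this study; class and variable names are illustrative.

```python
import torch.nn as nn

class PromptPredictionHead(nn.Module):
    """Linear layer over the C prompt classes, applied to the essay representation r_e."""
    def __init__(self, hidden_size: int, num_prompts: int = 10):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_prompts)

    def forward(self, r_e):
        return self.linear(r_e)   # logits; softmax is folded into the loss below

# nn.CrossEntropyLoss combines log-softmax and negative log-likelihood,
# matching the softmax + cross-entropy formulation above.
pp_loss_fn = nn.CrossEntropyLoss()
```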

3.5. Subtask 2: Prompt Matching

Prompt matching is defined as follows: given a pair consisting of a prompt and an essay, decide whether the essay and the prompt are compatible. We consider prompt matching to be a classification task:

\( \hat{v} = \mathrm{softmax}(W_{pm} \, r_e + b_{pm}) \)

where \(\hat{v}\) is the probability distribution over the matching results, \(W_{pm}\) is a parameter matrix, and \(b_{pm}\) is a bias. The objective function, shown in Equation (9), is the cross-entropy:

\( \mathcal{L}_{pm} = - \sum_{k} \sum_{m \in \{0, 1\}} \mathbb{1}[v_{k} = m] \log p_{pm}^{k,m} \)

where \(v_k\) indicates whether the input prompt and essay match, and \(p_{pm}^{k,m}\) is the likelihood that the matching degree of the \(k\)-th sample falls into category \(m\); \(m\) denotes the matching degree, 0 for a match and 1 for a mismatch. The distinction between prompt prediction and prompt matching is that, as the number of prompts increases, the difference in classification targets leads to increasingly obvious differences in task difficulty, sample distribution and diversity, and scalability.
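For prompt matching, training pairs can be built by randomly replacing the true prompt for part of the essays (Section 4.3 uses a 50% mismatch probability). A sketch with illustrative names follows; the sampling scheme beyond the stated 50% probability is an assumption.

```python
import random

def build_pm_pairs(essays, prompts, mismatch_prob=0.5, seed=0):
    """Return (prompt, essay, label) triples; label 0 = match, 1 = mismatch."""
    rng = random.Random(seed)
    pairs = []
    for essay_text, true_prompt in essays:              # essays: list of (text, prompt) tuples
        if rng.random() < mismatch_prob:
            wrong = rng.choice([p for p in prompts if p != true_prompt])
            pairs.append((wrong, essay_text, 1))        # mismatched pair
        else:
            pairs.append((true_prompt, essay_text, 0))  # matched pair
    return pairs
```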

3.6. Multi-Task Loss Function

The final loss function for each input is a weighted sum of the loss functions for essay scoring and the two subtasks, prompt prediction and prompt matching:

\( \mathcal{L} = \alpha \mathcal{L}_{es} + \beta \mathcal{L}_{pp} + \gamma \mathcal{L}_{pm} \)

where \(\alpha\), \(\beta\), and \(\gamma\) are non-negative weights assigned in advance to balance the importance of the three tasks. Because the objective of this research is to improve the AES system, the main task is given more weight than the two auxiliary tasks. The optimal parameters in this paper are \(\alpha : \beta = \alpha : \gamma = 100{:}1\), and in Section 5.3 we design experiments to determine the optimal value intervals for \(\alpha\), \(\beta\), and \(\gamma\).
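A sketch of the weighted multi-task objective with the \(\alpha : \beta = \alpha : \gamma = 100{:}1\) ratio used in this paper; the individual loss terms are assumed to be computed as in the sketches above.

```python
def multi_task_loss(loss_es, loss_pp, loss_pm, alpha=100.0, beta=1.0, gamma=1.0):
    """Weighted sum of the essay scoring loss and the two auxiliary losses."""
    return alpha * loss_es + beta * loss_pp + gamma * loss_pm
```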

4. Experiment

4.1. Dataset

We use the HSK Dynamic Composition Corpus (HSK is the acronym of Hanyu Shuiping Kaoshi, the Chinese Pinyin name of the Chinese Proficiency Test; http://hsk.blcu.edu.cn/ (accessed on 6 March 2022)) as our dataset, as in existing studies [31]. HSK is sometimes called the “TOEFL of Chinese”; it is a national standardized test designed to assess the proficiency of non-native speakers of Chinese. The HSK corpus includes 11,569 essays composed by foreigners from more than thirty different nations or regions in response to more than fifty distinct prompts. We eliminate any prompts with fewer than 500 student essays from the HSK dataset to form the experimental data. The statistics of the final filtered dataset are provided in Table 1; it comprises 8878 essays across 10 prompts taken from the actual HSK test. Each essay score ranges from 40 to 95 points. We randomly divide the entire dataset into training, validation, and test sets in the ratio 6:2:2. To alleviate the problem of insufficient data under a single prompt, we use the entire training set, which consists of different prompts, for fine-tuning. During the testing phase, we test each prompt individually as well as the entire test set, and we use the same 5-fold cross-validation procedure as [4, 5]. Finally, we report the average performance.

Table 1. HSK dataset statistics.

4.2. Evaluation Metrics

For the main task, we use the Quadratic Weighted Kappa (QWK) metric, which is widely used in AES [32], to analyze the agreement between predicted scores and the ground truth. QWK is calculated with Equations (11) and (12). First, a weight matrix \(W\) is constructed:

\( W_{i,j} = \frac{(i - j)^{2}}{(N - 1)^{2}} \)

where \(i\) and \(j\) are the gold score given by the human rater and the score given by the AES system, and each essay has \(N\) possible ratings. Second, the QWK score is calculated with Equation (12):

\( \kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} Z_{i,j}} \)

where \(O_{i,j}\) denotes the number of essays that receive rating \(i\) from the human rater and rating \(j\) from the AES system. The expected rating matrix \(Z\) is the outer product of the histogram vectors of the gold ratings and the AES system ratings, normalized so that the sum of its elements equals the sum of the elements of \(O\). We also utilize Pearson’s Correlation Coefficient (PCC) to measure association, as in previous studies [3, 32, 33]; it quantifies the degree of linear dependency between two variables and describes the level of covariation. In contrast to the QWK metric, which evaluates the agreement between the model output and the gold standard, we use PCC to assess whether the AES system ranks essays similarly to the gold standard, indicating the capacity of the AES system to rank essays appropriately, i.e., high scores ahead of low scores. For the auxiliary tasks, we treat prompt prediction and prompt matching as classification problems and use the macro-F1 score (F1) and accuracy (Acc.) as evaluation metrics.
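Both evaluation metrics are available in standard libraries; a sketch using scikit-learn and SciPy, assuming integer gold and predicted scores.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

def evaluate(gold_scores, pred_scores):
    """Quadratic Weighted Kappa and Pearson's Correlation Coefficient."""
    qwk = cohen_kappa_score(gold_scores, pred_scores, weights="quadratic")
    pcc, _ = pearsonr(gold_scores, pred_scores)
    return qwk, pcc

# Example with made-up numbers: evaluate([65, 80, 95, 40], [70, 80, 90, 45])
```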

4.3. Comparisons

Our model is compared to the baseline models listed below. The first three are existing neural AES methods, for which we experiment with both character-level and word-level input during training for comparison. The fourth method fine-tunes the pre-trained model, and the rest are variations of our proposed method.

CNN-LSTM [4]: This method builds a document representation using CNN for the word-level representation and LSTM for the sentence-level representation, with a pooling layer added to obtain the text representation. Finally, the score is obtained by applying a linear layer with a sigmoid function.

CNN-LSTM-att [ 5 ]: This method incorporates an attention mechanism into both the word-level and sentence-level representations of CNN-LSTM.

EModel (Pro.): This method concatenates the prompt information in the input layer of CNN-LSTM-att, which is a sub-module of [ 7 ].

BERT/NEZHA-FT: This method fine-tunes the pre-trained model. To obtain the essay representation, we directly feed an essay into the pre-trained encoder as the input. We choose the [CLS] embedding as the essay representation and feed it into a linear layer with a sigmoid function for scoring.

BERT/NEZHA-concat: The difference between this method and BERT/NEZHA-FT is that the input representation concatenates the prompt in front of the essay in the token embedding, as in Figure 1.

BERT/NEZHA-PP: This model incorporates prompt prediction as an auxiliary task, with the same input as the concat model and the output using [CLS] as the essay representation. A linear layer with the sigmoid function is used for essay scoring, and a linear layer with the softmax function is used for prompt prediction.

BERT/NEZHA-PM: This model includes prompt matching as an auxiliary task. In the input stage of constructing the training data, there is a 50% probability that the prompt and the essay are mismatched. [CLS] embedding is used to represent the essay. A linear layer with the sigmoid function is used for essay scoring, and a linear layer with the softmax function is used for prompt matching.

BERT/NEZHA-PP&PM: This model utilizes two auxiliary tasks, prompt prediction, and prompt matching, with the same inputs and outputs as the PM model. The output layer of the auxiliary tasks is the same as above.

4.4. Parameter Settings

We use BERT ( https://github.com/google-research/bert (accessed on 11 March 2022)) and NEZHA ( https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA-TensorFlow (accessed on 11 March 2022)) as pre-trained encoders. To obtain tokens and token embeddings, we employ the tokenizer and vocabulary of the pre-trained encoder. The parameters of the pre-trained encoder are learnable during both the fine-tuning and training phases. The maximum input length is set to 512; Table 2 lists the remaining parameters. The baseline models, CNN-LSTM and CNN-LSTM-att, are trained from scratch, and their parameters are also shown in Table 2. Our experiments are carried out on NVIDIA Tesla V100 32 GB GPUs.

Table 2. Parameter settings.

5. Results and Discussions

5.1. Main Results and Analysis

We report our experimental results in Table 3 and Table A1 (due to space limitations, the latter is included in Appendix A). Table A1 lists the average QWK and PCC for each prompt. Table 3 shows QWK and PCC across the entire test set and the average results over each prompt test set. As shown in Table 3, the proposed auxiliary tasks (PP, PM, and PP&PM; lines 8–10 and 13–15) outperform the other contrast models on both QWK and PCC, and the PP&PM models with either pre-trained encoder (BERT or NEZHA) outperform PP and PM on QWK. In terms of the PCC metric, the PM models exceed the other two models except for the average result with the NEZHA encoder. These findings indicate that both of our proposed auxiliary tasks are effective.

Table 3. QWK and PCC for the total test set and average QWK and PCC over each prompt test set; † denotes character input; ‡ denotes word input. The best results are in bold.

On the Total test set, our best results, a pre-trained encoder with PP and PM, are higher than those of the fine-tuning method and EModel (Pro.), exceed the strong concat baseline by 1.8% with BERT and 2.3% with NEZHA on QWK, and achieve a generally consistent correlation. Table 3 shows that our proposed models yield similar gains on the Average test set: the PP&PM models improve QWK by 1.6% (BERT) and 2% (NEZHA) over the concat model and by 2% (BERT) and 2.5% (NEZHA) over the fine-tuning model, with competitive results on the PCC metric. Comparing the multi-task learning approach with plain fine-tuning, our approach outperforms the baseline system on both QWK and PCC, indicating that a better essay representation can be obtained through multi-task learning. Furthermore, compared with the concat model with fused prompt representation, our approach achieves higher QWK scores, although the Total-set PCC values in lines 10 and 15 of Table 3 are lower than the baseline by less than 1%. This demonstrates that our proposed auxiliary tasks are effective in representing the essay prompt.

We train the hierarchical models (lines 1–4) using character and word input, respectively, and the results show that character input is generally better, although the best Total and Average results are still more than 4% lower than those of the pre-training methods. The results indicate that using pre-trained encoders, both BERT and NEZHA, for feature extraction works well on the HSK dataset. The comparison of the pre-trained models reveals that BERT and NEZHA are competitive, with NEZHA delivering the best results.

The results for each prompt with BERT and NEZHA are displayed in Figure 2. Our proposed models (PP, PM, and PP&PM) make positive progress on several prompts. Among them, the results of PP&PM exceed the two baselines, fine-tuning and concat, on all prompts except prompt 1 and prompt 5. The results indicate that our proposed auxiliary tasks for incorporating the prompt are generic and can be employed across a range of genres and prompts. The primary cause of the suboptimal results on individual prompts is that the loss-function hyperparameters \(\alpha\), \(\beta\), and \(\gamma\) are not adjusted specifically for each prompt; we analyze the reasons for this further in Section 5.3.

Figure 2. (a) Results for each prompt with the BERT pre-trained encoder on QWK; (b) results for each prompt with the NEZHA pre-trained encoder on QWK.

5.2. Result and Effect of Auxiliary Tasks

Table 4 reports the results of the auxiliary tasks (PP and PM) on the validation set. Accuracy and F1 are both greater than 85% with BERT and greater than 90% with NEZHA, showing that the model is well trained on the auxiliary tasks. Comparing the two pre-trained models, NEZHA produces better results on the auxiliary tasks and thus performs better as a feature extraction module.

Table 4. Accuracy and F1 for PP and PM on the validation set.

Comparing the contributions of PP and PM, as shown in Table A1, Table 3, and Figure 3, the contribution of PM is higher and more effective. Figure 3a,b illustrate radar graphs of the PP and PM variants with different pre-trained encoders across the 10 prompts using the QWK metric. Figure 3a shows that, with the BERT encoder, the QWK value of PM is higher than that of PP on all prompts except prompt 9, and Figure 3b shows that, with the NEZHA encoder, the results of PM are better than those of PP on 60% of the prompts, implying that PM is also superior to PP for specific prompts. The PM and PP comparison results for the Total and Average sets are provided in Figure 3c,d. Except for the PM model with the NEZHA pre-trained encoder, which has a slightly lower QWK than the PP model, all models that use PM as the single auxiliary task perform better, further demonstrating the superiority of prompt matching in representing and incorporating the prompt.

Figure 3. (a) Radar graph of BERT-PP and BERT-PM; (b) radar graph of NEZHA-PP and NEZHA-PM; (c) results of PP and PM on QWK; (d) results of PP and PM on PCC.

5.3. Effect of Loss Weight

We examine how the ratio of the loss weight parameters \(\beta\) and \(\gamma\) affects the model. Figure 4a shows that the model works best when the ratio is 1:1 on both the QWK and PCC metrics. Figure A1 depicts the QWK results for various \(\beta\) and \(\gamma\) ratios and reveals that the model produces the best results at around 1:1 for most prompts, except prompts 1, 5, and 6; the same holds for the average results. Concerning the issue of our model being suboptimal for individual prompts, Figure A1 illustrates that the best results for prompts 1, 5, and 6 are not achieved at 1:1, suggesting that this parameter setting is not appropriate for these prompts. Because we shuffle the entire training set and fix the \(\beta\) and \(\gamma\) ratio before testing each prompt independently, the parameters cannot be dynamically adjusted for different prompts within a single training procedure. We do this to address the lack of data and to focus on the average performance of the model, which also prevents overfitting to specific prompts. Compared with the results in Table A1, NEZHA-PP and NEZHA-PM both outperform the baselines and the PP&PM model on prompt 1, indicating that both PP and PM can enhance the results when employed separately. For prompt 5, NEZHA-PP performs better than NEZHA-PM, showing that PP plays the greater role. For prompt 6, the PP&PM model already gives the best result even though the 1:1 ratio is not optimal in Figure A1, demonstrating that there is still potential for improvement. These observations reveal that different prompts present varying degrees of difficulty for the joint training and parameter optimization of the main and auxiliary tasks, and that the two auxiliary tasks we present have different conditions of applicability.

Figure 4. (a) The effect of PP&PM with different \(\beta / \gamma\) ratios on QWK and PCC on the Total set (\(\alpha\) is fixed in this experiment); (b) the smoothed training losses of all tasks; (c) the results of different \(\alpha : \beta\) (PP), \(\alpha : \gamma\) (PM), and \(\alpha : \beta : \gamma\) (PP&PM) ratios on QWK.

We also measure the effect of \(\alpha\) on the model, keeping the \(\beta / \gamma\) ratio fixed at 1:1. Figure 4c demonstrates that the PP, PM, and PP&PM models are all optimal at \(\alpha : \beta = \alpha : \gamma = 100{:}1\), with the best QWK values obtained by PP&PM, indicating that our suggested method of combining two auxiliary tasks for joint training is effective. Observing ratios in the range [1, 100] shows that when the ratio is small, the main task cannot be trained well: the two auxiliary tasks together have a negative impact on the main task, whereas a single auxiliary task has less impact, indicating that multiple auxiliary tasks are more difficult to train concurrently than a single one. Future research should therefore consider how to dynamically optimize the parameters of multiple tasks.

The training losses for ES, PP, and PM are shown in Figure 4b. The loss of the main task decreases rapidly in the early stage, and the model converges at around 6000 steps. PM converges faster because it is a binary classification task, whereas PP is a ten-class classification; additionally, among the ten prompts, prompt 6 (“A letter to parents”) and prompt 9 (“Parents are children’s first teachers”) are quite similar, making PP more difficult. As a result, further research into how to select appropriate weight ratios and design better-matched auxiliary tasks is required.

6. Conclusions and Future Work

This paper presents a pre-training and then fine-tuning model for automated essay scoring. The model incorporates the essay prompt into the model input and obtains better features, more applicable to essay scoring, through multi-task learning with two auxiliary tasks, prompt prediction and prompt matching. Experiments demonstrate that the model outperforms the baselines on QWK and PCC on average across all results on the HSK dataset, indicating that our model is substantially better in terms of agreement and association. The experimental results also show that both auxiliary tasks can effectively improve model performance, and the combination of the two auxiliary tasks with the NEZHA pre-trained encoder yields the best results, with QWK improving by 2.5% and PCC improving by 2% compared to the strong baseline, the concat model, on average across all results on the HSK dataset. Compared with existing neural essay scoring methods, the experimental results show that QWK improves by 7.2% and PCC improves by 8% on average across all results.

Although our work enhances the effectiveness of the AES system, there are still limitations. On the data dimension, this research primarily investigates fusing prompt features in Chinese; other languages are not examined extensively. Nevertheless, our method is easier to migrate than manual annotation approaches and can be transferred directly to other languages. Furthermore, other features in different languages can use our method to create similar auxiliary tasks for information fusion. Moreover, as the number of prompts grows, the difficulty of training prompt prediction increases; we will consider combining prompts with genre and other information to design auxiliary tasks suitable for more prompts, as well as attempting to find a balance between the number of essays and the number of prompts to make prompt prediction more efficient. On the methodological dimension, the parameters of the loss function are currently defined empirically, which does not scale well to additional auxiliary tasks. In future work, we will optimize the parameter selection scheme and build dynamic parameter optimization techniques to accommodate variable numbers of auxiliary tasks. In terms of application, our approach focuses on fusing the textual information in prompts and does not cover all prompt forms; our system currently requires additional modules for chart and picture prompts. In future research, we will experiment with multimodal prompt data to broaden the application scenarios of the AES system.


Appendix A

Table A1. QWK and PCC for each prompt on the HSK dataset; † denotes character input; ‡ denotes word input. The best results are in bold.

Figure A1. The effect of PP&PM with different \(\beta / \gamma\) ratios on QWK across all prompts and the whole dataset; \(\alpha\) is fixed in this experiment.

Funding Statement

This research was funded by the National Natural Science Foundation of China (Grant No.62007004), the Major Program of the National Social Science Foundation of China (Grant No.18ZDA295), and the Doctoral Interdisciplinary Foundation Project of Beijing Normal University (Grant No.BNUXKJC2020).

Author Contributions

Conceptualization and methodology, J.S. (Jingbo Sun); writing—original draft preparation, J.S. (Jingbo Sun) and T.S.; writing—review and editing, T.S., J.S. (Jihua Song) and W.P. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Data Availability Statement and Conflicts of Interest

The authors declare no conflict of interest.


A review of deep-neural automated essay scoring models

  • Review Paper
  • Open access
  • Published: 20 July 2021
  • Volume 48 , pages 459–484, ( 2021 )


  • Masaki Uto, ORCID: orcid.org/0000-0002-9330-5158


Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by humans. Although traditional AES models typically rely on manually designed features, deep neural network (DNN)-based AES models that obviate the need for feature engineering have recently attracted increased attention. Various DNN-AES models with different characteristics have been proposed over the past few years. To our knowledge, however, no study has provided a comprehensive review of DNN-AES models while introducing each model in detail. Therefore, this review presents a comprehensive survey of DNN-AES models, describing the main idea and detailed architecture of each model. We classify the AES task into four types and introduce existing DNN-AES models according to this classification.


1 Introduction

Essay-writing tests have attracted much attention as a means of measuring practical and higher-order abilities such as logical thinking, critical reasoning, and creative thinking in various assessment fields (Abosalem 2016 ; Bernardin et al. 2016 ; Liu et al. 2014 ; Rosen and Tager 2014 ; Schendel and Tolmie 2017 ). In essay-writing tests, examinees write an essay about a given topic, and human raters grade those essays. However, essay grading is an expensive and time-consuming process, especially when there are many examinees (Hussein et al. 2019 ; Ke and Ng 2019 ). In addition, grading by human raters is not always consistent among and within raters (Eckes 2015 ; Hua and Wind 2019 ; Kassim 2011 ; Myford and Wolfe 2003 ; Rahman et al. 2017 ; Uto and Ueno 2018a ). One approach to resolving this problem is automated essay scoring (AES), which utilizes natural language processing (NLP) and machine learning techniques to automatically grade essays.

Many AES models have been developed over recent decades, and these can generally be classified as feature-engineering or automatic feature extraction approaches (Hussein et al. 2019 ; Ke and Ng 2019 ).

AES models based on the feature-engineering approach predict scores using textual features that are manually designed by human experts (e.g., Dascalu et al. 2017 ; Mark and Shermis 2016 ; Nguyen and Litman 2018 ). Typical features include essay length and the number of grammatical and spelling errors. The AES model first calculates these types of textual features from a target essay, then inputs the feature vector into a regression or classification model and outputs a score. Various models based on this approach have long been proposed (e.g., Nguyen and Litman 2018 ; Attali and Burstein 2006 ; Phandi et al. 2015 ; Beigman Klebanov et al. 2016 ; Cozma et al. 2018 ). For example, e-rater (Attali and Burstein 2006 ) is a representative model that was developed and has been used by the Educational Testing Service. Another recent popular model is the Enhanced AI Scoring Engine (Phandi et al. 2015 ), which achieved high performance in the Automated Student Assessment Prize (ASAP) competition run by Kaggle.

The advantages of feature-engineering approach models include interpretability and explainability. However, this approach generally requires extensive effort in engineering and tuning features to achieve high scoring accuracy for a target collection of essays. To obviate the need for feature engineering, automatic feature extraction approach models based on deep neural networks (DNNs) have recently attracted attention. Many DNN-AES models have been proposed over the last five years and have achieved state-of-the-art accuracy (e.g., Alikaniotis et al. 2016 ; Taghipour and Ng 2016 ; Dasgupta et al. 2018 ; Farag et al. 2018 ; Jin et al. 2018 ; Mesgar and Strube 2018 ; Wang et al. 2018 ; Mim et al. 2019 ; Nadeem et al. 2019 ; Uto et al. 2020 ; Ridley et al. 2021 ). The purpose of this paper is to review these DNN-AES models.

Several recent studies have reviewed AES models (Ke and Ng 2019 ; Hussein et al. 2019 ; Borade and Netak 2021 ). For example, Ke and Ng ( 2019 ) reviewed various AES models, including both feature-engineering approach models and DNN-AES models. However, because the purpose of their study was to present an overview of major milestones reached in AES research since its inception, they provided only a short summary of each DNN-AES model. Another review (Hussein et al. 2019 ) explained some DNN-AES models in detail, but only a few models were introduced. Borade and Netak ( 2021 ) also reviewed AES models, but they focused on feature-engineering approach models.

To our knowledge, no study has provided a comprehensive review of DNN-AES models while introducing each model in detail. Therefore, this review presents a comprehensive survey of DNN-AES models, describing the main idea and detailed architecture of each model. We classify AES tasks into four types according to recent findings (Li et al. 2020 ; Ridley et al. 2021 ), and introduce existing DNN-AES models according to this classification.

2 Automated essay scoring tasks

AES tasks are generally classified into the following four types (Li et al. 2020 ; Ridley et al. 2021 ).

Prompt-specific holistic scoring This is the most common AES task type, whereby an AES model is trained using rated essays that have holistic scores and have been written for a prompt. This trained model is used to predict the scores of essays written for the same prompt. Note that a prompt refers to an essay topic or a writing task that generally consists of reading materials and a task instruction.

Prompt-specific trait scoring This task involves predicting multiple trait-specific scores for each essay in a prompt-specific setting in which essays used for model training and unrated target essays are written for the same prompt. Such scoring is often required when an analytic rubric is used to provide more detailed feedback for educational purposes.

Cross-prompt holistic scoring In this task, an AES model is trained using rated essays with holistic scores written for non-target prompts and the trained model is transferred to a target prompt. This task has recently attracted attention because it is difficult to obtain a sufficient number of rated essays written for a target prompt in practice. This task includes a zero-shot setting in which rated essays written for a target prompt do not exist, and another setting in which a relatively small number of rated essays written for a target prompt can be used. The cross-prompt AES task relates to domain adaptation and transfer learning tasks, which are widely studied in machine learning fields.

Cross-prompt trait scoring This task involves predicting multiple trait-specific scores for each essay in a cross-prompt setting in which essays written for non-target prompts are used to train an AES model.

In the following section, we review representative DNN-AES models for each task type. Table  1 summarizes the models introduced in this paper.

3 Prompt-specific holistic scoring

This section introduces DNN-AES models for prompt-specific holistic scoring.

3.1 RNN-based model

One of the first DNN-AES models was a recurrent neural network (RNN)-based model proposed by Taghipour and Ng ( 2016 ). This model predicts a score for a given essay, defined as a sequence of words, using the multi-layered neural network whose architecture is shown in Fig.  1 .

Figure 1: Architecture of RNN-based model

Lookup table layer This layer transforms each word in a given essay into a G -dimensional word-embedding representation. Word-embedding representation is a real-valued fixed-length vector of a word, in which words with similar meaning have similar vectors. Suppose \({{\mathcal {V}}}\) is a vocabulary list for essay collection, \(\varvec{w}_{t}\) represents a \(|{{\mathcal {V}}}|\) -dimensional one-hot representation of t -th word \(w_{t}\) in a given essay, and \(\varvec{A}\) represents a \(G \times |{{\mathcal {V}}}|\) -dimensional trainable embeddings matrix. Then, the embedding representation \(\tilde{\varvec{w}}_t\) corresponding to \(w_{t}\) is calculable as a dot product \(\tilde{\varvec{w}}_t = \varvec{A}\cdot \varvec{w}_{t}\) .

Convolution layer This layer captures local textual dependencies using convolutional neural networks (CNNs) applied to the sequence of word-embedding vectors. Given an input sequence \(\{\tilde{\varvec{w}}_1,\tilde{\varvec{w}}_2, \ldots , \tilde{\varvec{w}}_L\}\) (where L is the number of words in a given essay), this layer is applied to a window of c words to capture local textual dependencies among c -gram words. Concretely, the t -th output of this layer is calculable as follows.

\( \varvec{c}_{t} = \mathbf {W_c} \cdot [\tilde{\varvec{w}}_{t}, \tilde{\varvec{w}}_{t+1}, \ldots , \tilde{\varvec{w}}_{t+c-1}] + b_c, \)

where \(\mathbf {W_c}\) and \(b_c\) are trainable weight and bias parameters, and \([\cdot , \cdot ]\) means the concatenation of the given elements. Zero padding is applied to outputs from this layer to preserve the input and output sequence lengths. This is an optional layer that has often been omitted in recent studies.

Recurrent layer This layer generally uses a long short-term memory (LSTM) network, a representative RNN, that outputs a vector at each timestep while capturing time series dependencies in an input sequence. A single-layer unidirectional LSTM is generally used, but bidirectional or multilayered LSTMs are also often used.

Pooling layer This layer transforms the output hidden vector sequence of the recurrent layer \(\{ \varvec{h}_{1}, \varvec{h}_{2}, \ldots ,\varvec{h}_{L}\}\) (where \(\varvec{h}_{t}\) represents the hidden vector of the t -th output of the recurrent layer) into an aggregated fixed-length hidden vector. Mean-over-time pooling, which calculates an average vector

\( \tilde{\varvec{h}} = \frac{1}{L} \sum _{t=1}^{L} \varvec{h}_{t}, \)

is generally used because it tends to provide stable accuracy. Other frequently used pooling methods include the last pool (Alikaniotis et al. 2016 ), which uses the last output of the recurrent layer \(\varvec{h}_{L}\) , and an attention pooling layer (Dong et al. 2017 ), which we explain later in the present study.

Linear layer with sigmoid activation This layer projects a pooling layer output onto a scalar value in the range [0, 1] by utilizing the sigmoid function as

\( {\hat{y}} = \sigma (\mathbf{W}_o \tilde{\varvec{h}} + b_o), \)

where \(\mathbf{W}_o\) is a weight matrix, \(b_o\) represents bias parameters, and \(\sigma ()\) represents the sigmoid function.

For model training, the mean-squared error (MSE) between predicted and gold-standard scores is generally used as the loss function. Specifically, letting \(y_{n}\) be the gold-standard score for n -th essay and letting \({\hat{y}}_{n}\) be the predicted score, the MSE loss function is defined as

\( L_{MSE} = \frac{1}{N} \sum _{n=1}^{N} \left( y_{n} - {\hat{y}}_{n} \right) ^2, \)

where N is the number of essays. Note that the model training is conducted after normalizing gold standard scores to [0, 1], but the predicted scores are linearly rescaled to the original score range in the prediction phase.
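A compact PyTorch sketch of this architecture (lookup table, optional convolution, LSTM, mean-over-time pooling, and a sigmoid output layer); the hyperparameter values and names are illustrative, not taken from Taghipour and Ng (2016).

```python
import torch
import torch.nn as nn

class RnnAes(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, conv_window=3, hidden_dim=300):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb_dim)             # lookup table layer
        self.conv = nn.Conv1d(emb_dim, emb_dim, conv_window,
                              padding=conv_window // 2)             # optional convolution layer
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # recurrent layer
        self.output = nn.Linear(hidden_dim, 1)                      # linear output layer

    def forward(self, word_ids):                                    # word_ids: (batch, L)
        x = self.lookup(word_ids)                                   # (batch, L, emb_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)            # local n-gram features
        h, _ = self.lstm(x)                                         # (batch, L, hidden_dim)
        pooled = h.mean(dim=1)                                      # mean-over-time pooling
        return torch.sigmoid(self.output(pooled)).squeeze(-1)       # score in [0, 1]

# Training uses MSE against scores normalized to [0, 1], as described above.
```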

3.2 RNN-based model with score-specific word embedding

Alikaniotis et al. ( 2016 ) also proposed a similar RNN-based model consisting of three layers, namely, a lookup table layer, a recurrent layer, and a pooling layer. The model uses a bidirectional LSTM for the recurrent layer and the last pooling for the pooling layer. The unique feature of this model is the use of score-specific word embedding (SSWE), which is an extension of Collobert & Weston (C&W) word-embedding (Collobert and Weston 2008 ), in the lookup table layer.

Suppose we train a representation for a target word \(w_t\) within a sequence of one-hot encoded words \(\varvec{S}=\{ \varvec{w}_1, \ldots , \varvec{w}_t, \ldots \varvec{w}_L\}\) . To derive this representation, the C&W word-embedding model learns to distinguish between the original sequence \(\varvec{S}\) and an artificially created noisy sequence \(\varvec{S}'\) in which the target word is substituted for a randomly selected word. Given a trainable embedding matrix \(\varvec{A}\) , the model concatenates the embedding representation vectors of the words in the sequence, that is, \(\tilde{\varvec{S}} = [\varvec{A} \cdot \varvec{w}_1, \varvec{A} \cdot \varvec{w}_2, \ldots , \varvec{A} \cdot \varvec{w}_L]\) . Using this vector, the C&W word-embedding model predicts whether the given word sequence \(\varvec{S}\) is the original sequence or a noisy one based on the following function.

\( f(\varvec{S}) = \mathbf{W}_2 \cdot \mathrm {htanh}(\mathbf{W}_1 \cdot \tilde{\varvec{S}} + b_1) + b_2, \)

where \(\mathbf{W}_1\) , \(\mathbf{W}_2\) , \(b_1\) , and \(b_2\) are the trainable parameters, and \(\mathrm {htanh}()\) is the hard hyperbolic tangent function.

The SSWE model extends the C&W word-embedding model by adding another output layer that predicts essay scores as follows.

\( f_{score}(\varvec{S}) = \mathbf{W}'_2 \cdot \mathrm {htanh}(\mathbf{W}'_1 \cdot \tilde{\varvec{S}} + b'_1) + b'_2, \)

where \(\mathbf{W}'_1\) , \(\mathbf{W}'_2\) , \(b'_1\) , and \(b'_2\) are the trainable parameters. The SSWE model is trained while minimizing a weighted linear combination of two error loss functions, namely, a classification loss function based on Eq. ( 5 ) and a scoring error loss function based on Eq. ( 6 ).

The SSWE model provides a more effective word-embedding representation to distinguish essay qualities than does the C&W word-embedding model. Thus, Alikaniotis et al. ( 2016 ) proposed using the embedding matrix \(\varvec{A}\) trained by the SSWE model in the lookup table layer.

3.3 Hierarchical representation models

Figure 2: Architecture of hierarchical representation model

The models introduced above handle an essay as a linear sequence of words. Dong and Zhang ( 2016 ), however, proposed modeling the hierarchical structure of a text. Concretely, they assumed that an essay is constructed as a sequence of sentences defined as word sequences. Accordingly, they introduced a two-level hierarchical representation model consisting of a word-level CNN and a sentence-level CNN, as shown in Fig.  2 . Each CNN works as explained below.

Word-level CNN The sequence of words in each sentence is processed and an aggregated vector is output, which can be taken as an embedding representation of a sentence. Suppose an essay consists of I sentences \(\{\varvec{s}_1, \ldots , \varvec{s}_I\}\) , and each sentence is defined as a sequence of words as \(\varvec{s}_i = \{w_{i1}, \ldots , w_{iL_i}\}\) (where \(w_{it}\) is the t -th word in i -th sentence, and \(L_i\) is the number of words in i -th sentence). For each sentence \(\varvec{s}_i\) , the lookup table layer transforms each word into an embedding representation, and then the word-level CNN processes the sequence of word-embedding vectors. The operation of the word-level CNN is the same as that of the convolution layer explained in Subsection  3.1 . The output sequence of the word-level CNN is transformed into an aggregated fixed-length hidden vector \(\tilde{\varvec{h}}_{s_i}\) through a pooling layer.

Sentence-level CNN This CNN takes the sequence of sentence vectors \(\{\tilde{\varvec{h}}_{s_1}, \ldots , \tilde{\varvec{h}}_{s_I}\}\) as input and extracts n-gram level features over the sentence sequence. Then, a pooling layer transforms the CNN output sequence into an aggregated fixed-length hidden vector \(\tilde{\varvec{h}}\) . Finally, the linear layer with sigmoid activation maps vector \(\tilde{\varvec{h}}\) to a score.

Dong et al. ( 2017 ) proposed another hierarchical representation model that extends the above model by using an attention mechanism (Bahdanau et al. 2014 ) to automatically identify important words and sentences. The attention mechanism is a neural architecture that enables the model to dynamically focus on relevant regions of the input data to make predictions. The main idea of the attention mechanism is to compute a weight distribution on the input data, assigning higher values to more relevant regions. Dong et al. ( 2017 ) use attention-based pooling in the pooling layers. Letting the input sequence for the pooling layer be \(\{\varvec{x}_1, \ldots , \varvec{x}_J\}\) , where J indicates the sequence length, the attention mechanism aggregates the input sequence into a fixed-length vector \(\tilde{\varvec{x}}\) by performing the following operations.

\( \tilde{\varvec{x}}_j = \tanh (\mathbf{W}_{a_1} \cdot \varvec{x}_j + b), \quad a_j = \frac{\exp (\mathbf{W}_{a_2} \cdot \tilde{\varvec{x}}_j)}{\sum _{j'=1}^{J} \exp (\mathbf{W}_{a_2} \cdot \tilde{\varvec{x}}_{j'})}, \quad \tilde{\varvec{x}} = \sum _{j=1}^{J} a_j \varvec{x}_j. \)

In these equations, \(\mathbf{W}_{a_1}\) , \(\mathbf{W}_{a_2}\) , and b are trainable parameters. \(\tilde{\varvec{x}}_j\) and \(a_j\) are called an attention vector and an attention weight for j -th input, respectively.

In addition to the incorporation of the attention mechanism, Dong et al. ( 2017 ) proposed adding a character-level CNN before the word-level CNN and using LSTM as an alternative to the sentence-level CNN.
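A sketch of the attention-based pooling operation defined by the equations above; tensor shapes, layer names, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregates a sequence (batch, J, dim) into a single vector (batch, dim)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)              # plays the role of W_{a_1} and b
        self.score = nn.Linear(dim, 1, bias=False)   # plays the role of W_{a_2}

    def forward(self, x):                            # x: (batch, J, dim)
        attn_vec = torch.tanh(self.proj(x))                   # attention vectors
        weights = torch.softmax(self.score(attn_vec), dim=1)  # attention weights a_j
        return (weights * x).sum(dim=1)                       # weighted sum over the sequence
```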

3.4 Coherence modeling

Coherence is an important criterion for evaluating the quality of essays. However, the RNN-based models introduced above are known to have difficulty capturing the relationships between multiple regions in an essay because they compress a word sequence within a fixed-length hidden vector in the order they are inputted. To resolve this difficulty, several DNN-AES models that consider coherence features have been proposed (Tay et al. 2018 ; Li et al. 2018 ; Farag et al. 2018 ; Mesgar and Strube 2018 ; Yang and Zhong 2021 ). This subsection introduces two representative models.

3.4.1 SKIPFLOW model

figure 3

Architecture of SKIPFLOW model

Tay et al. ( 2018 ) proposed SKIPFLOW, which learns coherence features explicitly using a neural network architecture. The model is the RNN-based model with a neural tensor layer, as shown in Fig.  3 . The neural tensor layer takes two positional outputs of the recurrent layer collected from different time steps as input and computes the similarity between each of these pairs of positional outputs. Concretely, for a recurrent layer output sequence \(\{ \varvec{h}_{1}, \varvec{h}_{2}, \ldots ,\varvec{h}_{L}\}\) , the model first selects pairs of sequential outputs of width \(\delta\) , that is, \(\{(\varvec{h}_1,\varvec{h}_\delta ), (\varvec{h}_{\delta +1},\varvec{h}_{2\delta }), \ldots , (\varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta }), \ldots \}\) . Then, each pair of hidden vectors \((\varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta })\) is input into the following neural tensor layer to return a similarity score as

\( sim\left( \varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta } \right) = \sigma \left( \mathbf{W}_u \cdot \left[ \varvec{h}_{t\delta +1}^{\top } \mathbf{M} \, \varvec{h}_{(t+1)\delta } + \mathbf{V} \left[ \varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta } \right] + \mathbf{b}_u \right] \right) , \)

where \(\mathbf{W}_u\) , \(\mathbf{V}\) , and \(\mathbf{b}_u\) are the weight and bias vectors and \(\mathbf{M}\) is a three-dimensional tensor. These are trainable parameters.

The similarity scores for all the pairs are concatenated with the pooling layer output vector \(\tilde{\varvec{h}}\) , and the resulting vector is mapped to a score through a fully connected neural network layer and a linear layer with sigmoid activation.

3.4.2 Self-attention-based model

Li et al. ( 2018 ) proposed another model using a self-attention mechanism to capture relationships between multiple points in an essay. Self-attention mechanisms have been shown to be able to capture long-distance relationships between words in a sequence and have recently been used in various NLP tasks.

Figure 4: Architecture of self-attention-based model

Figure  4 shows the model architecture. This model first transforms each word into an embedding representation through a lookup table layer with a position encoding, and then inputs the sequence into a multi-head self-attention model that combines multiple self-attention models in parallel. See Vaswani et al. ( 2017 ) for details of the lookup table layer with the position encoding and the multi-head self-attention architecture. The self-attention output sequence is input into a recurrent layer, a pooling layer, and a linear layer with sigmoid activation to produce an essay score.

3.5 BERT-based models

Bidirectional encoder representations from transformers (BERT), a pre-trained language model released by the Google AI Language team in 2018, has achieved state-of-the-art results in various NLP tasks (Devlin et al. 2019 ). Since then, BERT has also been applied to automated text scoring tasks, including AES (Nadeem et al. 2019 ; Uto et al. 2020 ; Rodriguez et al. 2019 ; Yang et al. 2020 ; Mayfield and Black 2020 ) and automated short-answer grading (Liu et al. 2019 ; Lun et al. 2020 ; Sung et al. 2019 ), and has shown good performance.

BERT is defined as a multilayer bidirectional transformer network (Vaswani et al. 2017 ). Transformers are a neural network architecture designed to handle ordered sequences of data using an attention mechanism. Specifically, transformers consist of multiple layers (called transformer blocks), each containing a multi-head self-attention network and a position-wise fully connected feed-forward network. See (Vaswani et al. 2017 ) for details of this architecture.

Figure 5: Architecture of BERT-based model

Figure 6: Architecture of BERT-based model with ranking task

BERT is trained in pre-training and fine-tuning steps. Pre-training is conducted on huge amounts of unlabeled text data over two tasks, namely, masked language modeling and next-sentence prediction. Masked language modeling is the task of predicting the identities of words that have been masked out of the input text. Next-sentence prediction is the task of predicting whether two given sentences are adjacent.

Using pre-trained BERT for a target NLP task, such as AES, requires fine-tuning (retraining), which is conducted on a task-specific supervised dataset after initializing the model parameters to their pre-trained values. When using BERT for AES, input essays require preprocessing, namely, adding a special token (“CLS”) to the beginning of each input. The BERT output corresponding to this token is used as the aggregate hidden representation for a given essay (Devlin et al. 2019 ). We can thus score an essay by inputting its representation into a linear layer with sigmoid activation, as illustrated in Fig.  5 .

Furthermore, Yang et al. ( 2020 ) proposed fine-tuning the BERT model so that the essay scoring task and an essay ranking task are jointly resolved. As shown in Fig.  6 , the proposed model is formulated as a BERT-based AES model with an additional output layer that predicts essay ranks. The model uses ListNet (Cao et al. 2007 ) for predicting the ranking list. This model is fine-tuned by minimizing a combination of the scoring MSE loss function and a ranking error loss function based on ListNet.
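The following is a hedged sketch of combining a regression loss with a ListNet-style (top-one) ranking loss over a batch of essays, in the spirit of Yang et al. (2020); the batching scheme and the weighting scalar are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def listnet_loss(pred_scores, gold_scores):
    """Top-one ListNet loss: cross-entropy between softmax distributions over the list."""
    p_gold = F.softmax(gold_scores, dim=0)
    log_p_pred = F.log_softmax(pred_scores, dim=0)
    return -(p_gold * log_p_pred).sum()

def joint_loss(pred_scores, gold_scores, ranking_weight=0.5):
    """Weighted combination of a scoring (MSE) loss and a ranking (ListNet) loss."""
    mse = F.mse_loss(pred_scores, gold_scores)
    return (1 - ranking_weight) * mse + ranking_weight * listnet_loss(pred_scores, gold_scores)
```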

3.6 Hybrid models

Figure 7: Architecture of hybrid model with additional RNN for sentence-level features

Figure 8: Architecture of DNN-AES with handcrafted essay-level features

The feature-engineering approach and the DNN-AES approach can be viewed as complementary rather than competing approaches (Ke and Ng 2019 ; Uto et al. 2020 ) because they provide different advantages. To receive both benefits, some hybrid models that integrate the two approaches have been proposed (Dasgupta et al. 2018 ; Uto et al. 2020 ).

One of the hybrid models is proposed by Dasgupta et al. ( 2018 ). Figure  7 shows the model architecture. As shown in the figure, it mainly consists of two DNNs. One processes word sequences in a given essay in the same way as the conventional RNN-based model (Taghipour and Ng 2016 ). Specifically, a word sequence is transformed into a fixed-length hidden vector \(\tilde{\varvec{h}}\) through a lookup table layer, a convolution layer, a recurrent layer, and a pooling layer. The other DNN processes a sequence of manually designed sentence-level features. Letting a given essay have I sentences, and letting \(\varvec{f}_{i}\) be a manually designed sentence-level feature vector for i -th sentence, the feature sequence \(\{\varvec{f}_{1},\varvec{f}_{2},\ldots ,\varvec{f}_{I}\}\) is transformed into a fixed-length hidden vector \(\tilde{\varvec{h}}_f\) through a convolution layer, a recurrent layer, and a pooling layer. The model uses LSTM for the recurrent layer and attention pooling for the pooling layer. Finally, after concatenating the hidden vectors \([\tilde{\varvec{h}}, \tilde{\varvec{h}}_f]\) , a linear layer with sigmoid activation maps it to a score.

Another hybrid model is formulated as a DNN-AES model incorporating manually designed essay-level features (Uto et al. 2020 ). Concretely, letting \(\varvec{F}\) be a manually designed essay-level feature vector, the model concatenates the feature vector with the hidden vector \(\tilde{\varvec{h}}\) , which is obtained from a DNN-AES model. Then, a linear layer with sigmoid activation maps the concatenated vector \([\tilde{\varvec{h}}, \varvec{F}]\) to a score value. Figure  8 shows the architecture of this model. This hybrid model is easy to construct using various DNN-AES models.
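A sketch of this second hybrid variant, which concatenates a manually designed essay-level feature vector with the DNN hidden vector before the output layer; dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridScoringHead(nn.Module):
    """Maps the concatenation of the DNN hidden vector and handcrafted features to a score."""
    def __init__(self, hidden_dim: int, feature_dim: int):
        super().__init__()
        self.output = nn.Linear(hidden_dim + feature_dim, 1)

    def forward(self, h_tilde, features):
        combined = torch.cat([h_tilde, features], dim=-1)   # [h_tilde, F]
        return torch.sigmoid(self.output(combined)).squeeze(-1)
```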

3.7 Improving robustness for biased training data

DNN-AES models generally require a large dataset of essays graded by human raters as training data. When creating a training dataset, essay grading tasks are generally shared among many raters by assigning a few raters to each essay to lower the burden of assessment. However, in such cases, assigned scores are known to be biased owing to the effects of rater characteristics (Rahman et al. 2017 ; Amidei et al. 2020 ). The performance of AES models drops when biased data are used for model training because the resulting model reflects the bias effects (Amorim et al. 2018 ; Huang et al. 2019 ; Li et al. 2020 ).

To resolve this problem, Uto and Okano ( 2020 ) proposed an AES framework that integrates item response theory (IRT), a test theory based on mathematical models. Specifically, they used an IRT model incorporating parameters representing rater characteristics (e.g., Eckes 2015 ; Uto and Ueno 2016 , 2018a ) that can estimate essay scores while mitigating rater bias effects. The applied IRT model is the generalized many-facet Rasch model (Uto and Ueno 2018b , 2020 ), which defines the probability that rater r assigns score k to n -th essay for a prompt as

\( P_{nrk} = \frac{\exp \sum _{m=1}^{k} \left[ \alpha _r (\theta _n - \beta _{r} - \beta _{rm}) \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l} \left[ \alpha _r (\theta _n - \beta _{r} - \beta _{rm}) \right] }, \)

where \(\alpha _r\) is the consistency of rater r , \(\beta _{r}\) is the severity of rater r , \(\beta _{rm}\) represents the strictness of rater r for category m , and K indicates the number of score categories. Furthermore, \(\theta _n\) represents the latent scores for n -th essay, which removes the effects of the rater characteristics.

Using this IRT model, Uto and Okano ( 2020 ) proposed training an AES model through the following two steps. 1) Apply the IRT model to observed rating data to estimate the IRT-based score \(\theta _n\) , which removes the effects of rater bias. 2) Train an AES model using the unbiased scores \(\varvec{\theta } =\{\theta _1, \ldots , \theta _N\}\) as the gold-standard scores based on the following loss function.

\( L = \frac{1}{N} \sum _{n=1}^{N} \left( \theta _n - {\hat{\theta }}_n \right) ^2, \)

where \({\hat{\theta }}_n\) represents the AES’s predicted score for n -th essay. Because the IRT-based scores are theoretically free from rater bias effects, the AES model will not reflect the bias effects.

In the prediction phase, the score for a new essay is calculated in two steps: (1) Predict the IRT score \(\theta\) for the essay using a trained AES model. (2) Given \(\theta\) and the rater parameters, calculate the expected score, which corresponds to an unbiased original-scaled score (Uto 2019 ), as

\( \frac{1}{R} \sum _{r=1}^{R} \sum _{k=1}^{K} k \, P_{rk}, \)

where R indicates the number of raters who graded essays in the training data and \(P_{rk}\) is the probability that rater r assigns score k given \(\theta\) . The expected score is used as a predicted essay score, which is robust against rater biases.

3.8 Integration of AES models

Conventional AES models, including those introduced above, have different scoring characteristics. Therefore, integrating multiple AES models is expected to improve scoring accuracy. For this reason, Aomi et al. (2021) proposed a framework that integrates multiple AES models while considering the characteristics of each model using IRT. In the framework, multiple AES models are first trained independently, and the trained models are used to produce prediction scores for target essays. Then, the generalized many-facet Rasch model introduced above is applied to the obtained prediction scores by regarding the rater characteristic parameters \(\alpha _r\), \(\beta _r\), and \(\beta _{rm}\) as characteristic parameters of the AES models. Given the estimated IRT score \(\theta\) for a target essay, the predicted essay score is calculated as the expected score based on Eq. (12).

This framework can integrate prediction scores from various AES models while considering the characteristics of each model. Consequently, it provides scores that are more accurate than those obtained by simple averaging or by a single AES model.
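The sketch below illustrates the integration idea under the same model. The three AES "raters" and their characteristic parameters are hypothetical, the latent score \(\theta\) is estimated by a simple grid-search maximum likelihood over the models' predictions, and the final score is the expected score of Eq. (12); the actual framework estimates the characteristic parameters from the models' predictions over many essays rather than assuming them.

```python
import numpy as np

def gmfrm_probs(theta, alpha, beta, beta_cat):
    # Same category-probability function as in the Sect. 3.7 sketch.
    steps = alpha[:, None] * (theta - beta[:, None] - beta_cat)
    numer = np.exp(np.cumsum(steps, axis=1))
    return numer / numer.sum(axis=1, keepdims=True)

# Illustrative characteristic parameters of three AES "raters" (models), K = 5.
alpha = np.array([1.1, 0.9, 1.0])
beta = np.array([0.1, -0.2, 0.0])
beta_cat = np.tile(np.array([0.0, -1.0, -0.2, 0.5, 1.2]), (3, 1))

# Predictions of the three models for one target essay (categories 1..5).
preds = np.array([4, 3, 4])

# Estimate theta by grid-search maximum likelihood over the model predictions.
grid = np.linspace(-4, 4, 801)
loglik = [np.log(gmfrm_probs(t, alpha, beta, beta_cat)[np.arange(3), preds - 1]).sum()
          for t in grid]
theta_hat = grid[int(np.argmax(loglik))]

# Integrated score = expected score of Eq. (12) under theta_hat.
probs = gmfrm_probs(theta_hat, alpha, beta, beta_cat)
print((probs @ np.arange(1, 6)).mean())
```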

4 Prompt-specific trait scoring

This section introduces DNN-AES models for the prompt-specific trait scoring task. Although this task is especially important for educational purposes, only a limited number of models have been proposed for it.

4.1 Use of multiple trait-specific models

Mathias et al. (2020) presented one of the first attempts to perform prompt-specific trait scoring with a DNN-AES model. Their study used the hierarchical representation model with an attention mechanism (Dong et al. 2017), introduced in Sect. 3.3, to predict trait-specific scores for each essay. Concretely, a separate AES model was trained independently for each trait, and trait scores were predicted using these trait-specific models.

Fig. 9 Architecture of the RNN-based model with multiple output layers for prompt-specific trait scoring

4.2 Model with multiple output modules

Hussein et al. (2020) proposed a model specialized for the prompt-specific trait scoring task that can predict multiple trait scores jointly. The model is formulated as a multi-output model based on the RNN-based model (Taghipour and Ng 2016), introduced in Sect. 3.1. Concretely, as shown in Fig. 9, they extended the RNN-based model by adding as many output linear layers as there are traits. Additionally, an optional fully connected neural network layer was added after the pooling layer. The loss function is defined as a linear combination of trait-wise MSE loss functions:

\[
\mathcal{L} = \sum_{d=1}^{D} \lambda_d \cdot \frac{1}{N}\sum_{n=1}^{N}\left(y_{nd} - {\hat{y}}_{nd}\right)^2, \qquad (13)
\]

where D is the number of traits, \(\lambda_d\) is the combination weight for the d-th trait, and \(y_{nd}\) and \({\hat{y}}_{nd}\) are the gold-standard and predicted d-th trait scores for the n-th essay, respectively.
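A minimal PyTorch sketch of such a multi-output trait scorer is shown below. The layer sizes, the LSTM encoder, and the unweighted sum of trait losses (i.e., \(\lambda_d = 1\)) are illustrative choices, not the exact configuration of Hussein et al. (2020).

```python
import torch
import torch.nn as nn

class MultiTraitScorer(nn.Module):
    """Shared recurrent encoder with mean-over-time pooling, an optional fully
    connected layer, and one linear output head per trait (illustrative sketch)."""

    def __init__(self, vocab_size, n_traits, emb_dim=50, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.hidden = nn.Linear(hidden_dim, hidden_dim)        # optional fully connected layer
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, 1) for _ in range(n_traits))

    def forward(self, word_ids):                               # (batch, seq_len)
        h, _ = self.rnn(self.embed(word_ids))                  # (batch, seq_len, hidden)
        pooled = torch.relu(self.hidden(h.mean(dim=1)))        # mean-over-time pooling
        return torch.cat([torch.sigmoid(head(pooled)) for head in self.heads], dim=1)

model = MultiTraitScorer(vocab_size=4000, n_traits=3)
x = torch.randint(1, 4000, (8, 120))                           # dummy batch of 8 essays
y = torch.rand(8, 3)                                           # normalized gold trait scores
pred = model(x)
loss = sum(nn.functional.mse_loss(pred[:, d], y[:, d]) for d in range(3))
loss.backward()
```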

5 Cross-prompt holistic scoring

The prompt-specific scoring models introduced above assume that the rated training essays and the unrated target essays are written for the same prompt. In practice, however, we often face situations in which no rated essays, or only a relatively small number of them, are available for the target prompt during model training, even though many rated essays written for other, non-target prompts are available. AES under such settings is generally called a cross-prompt scoring task. This section introduces cross-prompt holistic scoring models.

5.1 Two-stage learning models

One of the first cross-prompt holistic scoring models using DNNs was proposed by Jin et al. (2018). The method is constructed as a two-stage deep neural network (TDNN) approach: in the first stage, a prompt-independent scoring model is trained using rated essays for non-target prompts and is then used to generate pseudo rating data for unrated essays written for the target prompt; in the second stage, a prompt-specific scoring model for the target prompt is trained using the pseudo rating data. The TDNN is detailed below.

First stage (Training a prompt-independent AES model) In this stage, rated essays written for non-target prompts are used to train a prompt-independent AES model that uses manually designed prompt-independent shallow features, such as the number of typos, grammatical errors, and spelling errors. Here, a ranking support vector machine (Joachims 2002 ) is used as the prompt-independent model.

Second stage (Training a prompt-specific AES model) The trained prompt-independent AES model is used to produce scores for the unrated essays written for the target prompt, and these pseudo scores are used to train a prompt-specific scoring model. To train the prompt-specific model, only the confidently scored essays, namely those with the highest and lowest pseudo scores, are used instead of all the produced scores. The prompt-specific AES model in the study by Jin et al. (2018) is an extension of the RNN-based model (Taghipour and Ng 2016) that can process three types of sequential inputs, namely, a sequence of words, part-of-speech (POS) tags, and syntactic tags.
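The pseudo-label filtering used in the second stage can be sketched as follows. The 20% cut-off and the uniform dummy scores are illustrative assumptions, not values from Jin et al. (2018).

```python
import numpy as np

def select_confident(essays, pseudo_scores, fraction=0.2):
    """Keep only the essays whose pseudo scores fall in the top or bottom
    `fraction` of the score distribution, following the TDNN second stage."""
    pseudo_scores = np.asarray(pseudo_scores)
    order = np.argsort(pseudo_scores)
    k = max(1, int(len(essays) * fraction))
    keep = np.concatenate([order[:k], order[-k:]])            # lowest k and highest k
    return [essays[i] for i in keep], pseudo_scores[keep]

essays = [f"essay {i}" for i in range(10)]
pseudo = np.random.default_rng(0).uniform(0, 10, size=10)     # scores from the stage-1 model
train_essays, train_scores = select_confident(essays, pseudo)
```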

Li et al. ( 2020 ) pointed out that the TDNN model uses a limited number of general linguistic features in the prompt-independent AES model, which may seriously affect the accuracy of the generated pseudo scores for essays in a target prompt. To extract more efficient features, they proposed another two-stage framework called a shared and enhanced deep neural network (SEDNN) model. The SEDNN model consists of two stages, described as follows.

First stage As an alternative to a prompt-independent model with manually designed shallow linguistic features, the SEDNN uses a DNN-AES model that extends the hierarchical representation model with an attention mechanism (Dong et al. 2017 ), introduced in Sect.  3.3 . Concretely, in the model, a new output layer is added to jointly solve the AES task and a binary classification task that distinguishes whether a given essay was written for the target prompt. The model is trained based on a combination of the loss functions for the essay scoring task and the prompt discrimination task using a dataset consisting of rated essays written for non-target prompts and the unrated essays written for the target prompt.

Second stage As in the second stage of the TDNN model, scores for the unrated essays written for the target prompt are generated by the prompt-independent AES model, and these pseudo scores are used to train a prompt-specific scoring model. The prompt-specific scoring model in the study by Li et al. (2020) is a Siamese network that jointly uses the essay text and the text of the target prompt itself to learn prompt-dependent features more efficiently. In the model, the essay text is processed by a model similar to SKIPFLOW (Tay et al. 2018) and is transformed into vector representations. The word sequence in the prompt text is transformed into a fixed-length hidden vector representation by another neural architecture consisting of a lookup table layer, a convolution layer, a recurrent layer, and a mechanism that measures the relevance between the given essay and the target prompt text. After the two vector representations corresponding to the essay text and the prompt text are concatenated, a linear layer with sigmoid activation maps the result to a prediction score.
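The first-stage training objective of such a shared model, a weighted combination of a scoring loss and a prompt-discrimination loss, can be sketched as follows. The tiny encoder, the dummy data, and the weighting factor lam are illustrative assumptions; in practice the scoring loss is computed only on the rated non-target essays.

```python
import torch
import torch.nn as nn

# A shared encoder feeds two heads: one regresses the essay score, the other
# classifies whether the essay was written for the target prompt.
encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())       # stands in for the DNN encoder
score_head = nn.Linear(128, 1)
prompt_head = nn.Linear(128, 1)

essay_vecs = torch.randn(16, 300)                             # dummy essay representations
gold_scores = torch.rand(16, 1)                               # dummy normalized scores
is_target = torch.randint(0, 2, (16, 1)).float()              # 1 = written for the target prompt

h = encoder(essay_vecs)
scoring_loss = nn.functional.mse_loss(torch.sigmoid(score_head(h)), gold_scores)
discrim_loss = nn.functional.binary_cross_entropy_with_logits(prompt_head(h), is_target)
lam = 0.5                                                     # illustrative task weight
(scoring_loss + lam * discrim_loss).backward()
```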

5.2 Multi-stage pre-training approach model

Another cross-prompt holistic scoring approach incorporates pre-training processes. In this approach, an AES model is first pre-trained on a vast number of essays, with or without scores, written for non-target prompts, and is then fine-tuned using a limited number of rated essays written for the target prompt. The pre-training process enables a DNN model to capture a general language model for predicting essay quality; thus, using the pre-trained model as the initial model helps in obtaining a model for the target scoring task. The BERT-based AES models explained in Sect. 3.5 are examples of this pre-training and fine-tuning approach, which has become popular and highly successful across various NLP tasks.

For cross-prompt holistic scoring, Song et al. (2020) proposed training the hierarchical representation model with the attention mechanism (Dong et al. 2017), explained in Sect. 3.3, through the following three pre-training and fine-tuning steps.

Weakly supervised pre-training The AES model is first trained on a vast number of roughly scored essays written for diverse prompts collected from the Web. The study by Song et al. (2020) assumed that binary scores are given to these essays; this step is therefore called weakly supervised. The objective of this pre-training step is to have the AES model learn a general language representation that can roughly distinguish essay quality.

Cross-prompt supervised fine-tuning If we have rated essays written for non-target prompts, the pre-trained model is fine-tuned using the data.

Target-prompt supervised fine-tuning The model obtained from the above steps is fine-tuned using rated essays written for the target prompt. The study by Song et al. (2020) reported that incorporating the two preceding stages (weakly supervised pre-training and cross-prompt fine-tuning) improves the performance of target-prompt scoring.
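The overall schedule can be sketched as a sequence of training stages that reuse one generic routine with different data and losses. The tiny model, the dummy datasets, and the loss choices below are placeholders for illustration only, not the configuration of Song et al. (2020).

```python
import torch
import torch.nn as nn

def run_stage(model, loader, loss_fn, epochs=1, lr=1e-3):
    """Generic training stage; the same routine is reused for each pre-training /
    fine-tuning step with a different dataset and loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Illustrative three-stage schedule (datasets and the tiny model are placeholders):
model = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 1))
web_essays = [(torch.randn(4, 300), torch.randint(0, 2, (4, 1)).float())]   # rough binary labels
nontarget = [(torch.randn(4, 300), torch.rand(4, 1))]                        # rated non-target essays
target = [(torch.randn(4, 300), torch.rand(4, 1))]                           # few rated target essays

bce = lambda p, y: nn.functional.binary_cross_entropy_with_logits(p, y)
mse = nn.functional.mse_loss
run_stage(model, web_essays, bce)      # 1) weakly supervised pre-training
run_stage(model, nontarget, mse)       # 2) cross-prompt supervised fine-tuning
run_stage(model, target, mse)          # 3) target-prompt supervised fine-tuning
```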

5.3 Model with self-supervised learning

Cao et al. (2020) proposed another cross-prompt holistic scoring model designed to solve the AES task jointly with two prompt-independent self-supervised learning tasks. The two self-supervised tasks, which are appended to efficiently extract prompt-independent common knowledge, are a sentence reordering task and a noise identification task, as explained below.

Sentence reordering In this task, each essay is divided into four parts and then shuffled according to a certain permutation order. The sentence reordering task predicts an appropriate permutation for each given essay.

Noise identification In this task, each essay is transformed into noisy data by performing random insertion, random swap, and random deletion operations on 10% of the words in the essay. The noise identification task predicts whether a given essay is noisy or not.
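The two data-construction procedures just described can be sketched as follows. The four-way split, the 10% noise ratio, and the helper names are written from the description above and are illustrative rather than the exact implementation of Cao et al. (2020).

```python
import random

def reorder_parts(essay_sentences, rng):
    """Sentence-reordering task: split the essay into four parts and shuffle them;
    the label is the permutation that was applied."""
    n = len(essay_sentences)
    bounds = [0, n // 4, n // 2, 3 * n // 4, n]
    parts = [essay_sentences[bounds[i]:bounds[i + 1]] for i in range(4)]
    perm = list(range(4))
    rng.shuffle(perm)
    shuffled = [s for i in perm for s in parts[i]]
    return shuffled, perm

def add_noise(words, rng, ratio=0.1):
    """Noise-identification task: corrupt roughly `ratio` of the words by random
    insertion, swap, or deletion; the label is 'noisy' (1) vs. 'clean' (0)."""
    words = list(words)
    for _ in range(max(1, int(len(words) * ratio))):
        op = rng.choice(["insert", "swap", "delete"])
        i = rng.randrange(len(words))
        if op == "insert":
            words.insert(i, rng.choice(words))
        elif op == "swap" and len(words) > 1:
            j = rng.randrange(len(words))
            words[i], words[j] = words[j], words[i]
        elif op == "delete" and len(words) > 1:
            del words[i]
    return words, 1

rng = random.Random(0)
sentences = [f"sentence {i}." for i in range(8)]
print(reorder_parts(sentences, rng))
print(add_noise("this essay is about automated scoring".split(), rng))
```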

The above two self-supervised learning tasks are simultaneously trained with the AES task in a model. Figure  10 shows the model architecture. This model has a shared encoder that transforms an input word sequence into a fixed-length essay representation vector, and three task-specific output layers.

Fig. 10 Architecture of the cross-prompt holistic scoring model with self-supervised learning

The shared encoder is formulated as a hierarchical representation DNN model such as that introduced in Sect. 3.3. In this model, the sequence of words in each sentence is transformed into a fixed-length sentence representation vector through a lookup table layer, a recurrent layer, a self-attention layer, a fusion gate, and a mean-over-time pooling layer. Here, the fusion gate is an operation that combines the input and output of the self-attention layer as follows:

\[
\varvec{g}_{s_i} = \sigma\left(\mathbf{W}_{g1}\varvec{H}_{s_i} + \mathbf{W}_{g2}\tilde{\varvec{H}}_{s_i}\right), \qquad
\hat{\varvec{H}}_{s_i} = \varvec{g}_{s_i}\odot \varvec{H}_{s_i} + \left(1-\varvec{g}_{s_i}\right)\odot \tilde{\varvec{H}}_{s_i},
\]

where \(\varvec{H}_{s_i}\) and \(\tilde{\varvec{H}}_{s_i}\) are the input and output vector sequences of the self-attention layer for the i-th sentence, \(\hat{\varvec{H}}_{s_i}\) is the fusion gate output, \(\sigma\) is the sigmoid function, \(\odot\) denotes element-wise multiplication, and \(\mathbf{W}_{g1}\) and \(\mathbf{W}_{g2}\) are trainable parameters. The essay representation vector is calculated by averaging the obtained sentence vectors, and this vector is used for the AES task and the two self-supervised learning tasks. The model is trained based on a weighted sum of the MSE loss function for the AES task and the error loss functions for the two self-supervised learning tasks.
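The following PyTorch sketch implements a gate of the form given above. It is a generic fusion-gate formulation consistent with the description, not necessarily the exact parameterization used by Cao et al. (2020).

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Sigmoid gate computed from the self-attention input H and output H_tilde;
    the gate decides, element-wise, how much of each representation to keep."""

    def __init__(self, dim):
        super().__init__()
        self.w_g1 = nn.Linear(dim, dim, bias=False)
        self.w_g2 = nn.Linear(dim, dim, bias=False)

    def forward(self, h, h_tilde):                       # both: (batch, seq_len, dim)
        gate = torch.sigmoid(self.w_g1(h) + self.w_g2(h_tilde))
        return gate * h + (1 - gate) * h_tilde           # element-wise mixture

gate = FusionGate(dim=64)
h = torch.randn(2, 20, 64)          # input of the self-attention layer
h_tilde = torch.randn(2, 20, 64)    # output of the self-attention layer
fused = gate(h, h_tilde)            # (2, 20, 64)
```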

Furthermore, Cao et al. (2020) proposed a technique to improve the adaptability of the model to a target prompt. Concretely, during model training, this technique calculates the average essay representation vector for each prompt and shifts the representation of each essay toward the target prompt's average vector.
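One plausible reading of this adaptation step is sketched below: each essay representation is re-centered from its own prompt's mean vector to the target prompt's mean vector. The function name and the exact shifting rule are assumptions for illustration, not the published implementation.

```python
import torch

def shift_to_target_prompt(essay_vecs, prompt_ids, target_prompt):
    """Shift each essay representation by the difference between the target prompt's
    mean vector and the mean vector of the essay's own prompt (illustrative sketch)."""
    target_mean = essay_vecs[prompt_ids == target_prompt].mean(dim=0)
    shifted = essay_vecs.clone()
    for p in prompt_ids.unique():
        mask = prompt_ids == p
        shifted[mask] = essay_vecs[mask] - essay_vecs[mask].mean(dim=0) + target_mean
    return shifted

vecs = torch.randn(10, 64)                       # essay representation vectors
prompts = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
adapted = shift_to_target_prompt(vecs, prompts, target_prompt=2)
```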

6 Cross-prompt trait scoring

This section introduces cross-prompt trait scoring models that predict multiple trait-specific scores for each essay in a cross-prompt setting.

6.1 Use of multiple trait-specific models with self-supervised learning

Mim et al. (2019) proposed a method to predict two trait scores, namely coherence and argument strength, for each essay. They used a vast number of unrated essays written for non-target prompts to pre-train a DNN model, which was then transferred to the target AES task. They used the RNN-based model (Taghipour and Ng 2016), introduced in Sect. 3.1, as the base model. The detailed steps are as follows.

Pre-training based on self-supervised learning with non-target essays In this step, the base model is trained on unrated essays written for non-target prompts using a self-supervised learning task, namely a binary classification task that distinguishes artificially created incoherent essays from original ones. For this task, incoherent essays are created by randomly shuffling sentences, discourse indicators, and paragraphs in the original essays (see the sketch after these steps). This pre-training enables the base model to learn features for distinguishing logical text from illogical text.

Pre-training based on self-supervised learning with target essays The pre-trained model is retrained on essays written for the target prompt using the same self-supervised task described above. This step is introduced to alleviate the mismatch between essays written for non-target prompts and those written for the target prompt.

Fine-tuning for AES The pre-trained model is fine-tuned for the AES task using rated essays for the target prompt. Note that, for the AES task, the base model is extended by adding two RNN-based architectures that process the prompt text and a sequence of paragraph function labels (i.e., Introduction, Body, Rebuttal, and Conclusion). The fine-tuning is conducted independently for the two traits, coherence and argument strength.
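The construction of artificial negative examples for the first pre-training step can be sketched as follows. The discourse-indicator shuffling is omitted, and the helper name is hypothetical.

```python
import random

def make_incoherent(paragraphs, rng):
    """Create an artificial negative example for the coherence pre-training task by
    shuffling sentences within paragraphs and shuffling paragraph order."""
    shuffled_paragraphs = []
    for sentences in paragraphs:
        sentences = list(sentences)
        rng.shuffle(sentences)                 # shuffle sentences within a paragraph
        shuffled_paragraphs.append(sentences)
    rng.shuffle(shuffled_paragraphs)           # shuffle paragraph order
    return shuffled_paragraphs

rng = random.Random(0)
essay = [["First point.", "Evidence one.", "Evidence two."],
         ["Second point.", "More evidence."],
         ["Therefore, the conclusion follows."]]
positive = (essay, 1)                          # original (coherent) essay
negative = (make_incoherent(essay, rng), 0)    # artificially incoherent essay
```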

6.2 Model with multiple output modules

Ridley et al. (2021) proposed a model specialized for trait scoring that can predict multiple trait scores jointly. As shown in Fig. 11, the model is formulated as the following multi-output DNN model.

Fig. 11 Architecture of the cross-prompt trait scoring model with multiple output layers

Shared layers The model first processes an input through shared layers that are commonly used for predicting all trait scores. The shared layers consist of a POS embedding layer, a convolutional layer, and an attention pooling layer, as explained below.

The POS embedding layer takes the sequence of POS tags for the words in a given essay and transforms it into embedding representations, using the same operations as the lookup table layer. Note that this model uses a POS tag sequence as the input instead of a word sequence because word information depends strongly on the prompt, whereas POS information, which represents syntactic information, is more adaptable across prompts.

A convolutional layer extracts n-gram level features from a sequence of POS embeddings for each sentence in the same way as described in Sect.  3.1 .

An attention pooling layer applies an attention mechanism to produce a fixed-length vector representation for each sentence from the convolutional layer outputs.

Trait-specific layers The sequence of sentence representations produced by the shared layers is input into trait-specific layers, which predict each trait score through the following procedure.

The sentence representation sequence is transformed into a fixed-length vector corresponding to essay representation through a recurrent layer, and an attention pooling layer.

The essay representation vector is concatenated with prompt-independent manually designed features, similar to those used in the first stage of TDNN.

To obtain a final representation for each trait score, the model applies an attention mechanism so that each trait-specific layer can utilize the relevant information from the other trait-specific layers.

A linear layer with sigmoid activation maps the aggregated vector to a corresponding trait score.

The loss function for training this model is similar to Eq. (13). Note that, because different prompts are often designed to evaluate different traits, the model introduces a masking function. Concretely, letting \(mask_{nd}\) be a variable that takes the value 1 if the prompt corresponding to the n-th essay has a gold-standard score for the d-th trait, and 0 otherwise, the loss function with the mask is defined as

\[
\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N}\sum_{d=1}^{D} mask_{nd}\left(y_{nd} - {\hat{y}}_{nd}\right)^2.
\]

The mask function sets the loss values for traits without gold-standard scores to 0.
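A sketch of this masked loss in PyTorch is given below; it follows the loss written above, averaging the masked squared errors over essays. The shapes and dummy values are illustrative.

```python
import torch

def masked_trait_loss(pred, gold, mask):
    """Masked multi-trait MSE loss: traits that a prompt does not evaluate
    (mask = 0) contribute nothing. All arguments have shape (N, D)."""
    se = mask * (gold - pred) ** 2
    return se.sum() / gold.size(0)              # averaged over the N essays, as above

pred = torch.rand(4, 5, requires_grad=True)     # predicted trait scores, N=4 essays, D=5 traits
gold = torch.rand(4, 5)                          # gold scores (arbitrary where mask is 0)
mask = torch.tensor([[1, 1, 0, 1, 0],
                     [1, 1, 1, 1, 1],
                     [1, 0, 0, 1, 0],
                     [1, 1, 0, 0, 1]], dtype=torch.float)
masked_trait_loss(pred, gold, mask).backward()
```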

A special case of this model that has a single output layer for the holistic score has also been proposed as a cross-prompt holistic scoring model (Ridley et al. 2020 ).

7 Conclusions and remarks

This review has presented a comprehensive survey of DNN-AES models. Concretely, we classified the AES task into four types, namely, (1) prompt-specific holistic scoring, (2) prompt-specific trait scoring, (3) cross-prompt holistic scoring, and (4) cross-prompt trait scoring, and introduced the main ideas and the architectures of representative DNN-AES models for each task type.

As shown in our study, earlier DNN-AES models focused mainly on the prompt-specific holistic scoring task. The commonly used baseline is the RNN-based model (Taghipour and Ng 2016), which has been extended by incorporating efficient word embedding representations, the hierarchical structure of a text, coherence models, and manually designed features. We also described transformer-based models such as BERT, which have recently been applied to AES following their widespread adoption in machine learning research.

These prompt-specific holistic scoring models have been extended for prompt-specific trait scoring, which predicts multiple trait scores for each essay. Trait scoring is practically important, especially when we need to provide detailed feedback to examinees for educational purposes, although the number of papers for this task is still limited.

Although prompt-specific scoring tasks assume that we can use a sufficient number of rated essays for a target prompt, this assumption is not often satisfied in practice because collecting rated essays is an expensive and time-consuming task. To overcome this limitation, cross-prompt scoring models have provided frameworks that use a large number of essays for non-target prompts. Although the number of cross-prompt scoring models is still limited, this task is important for increasing the feasibility of applying DNN-AES models to practical situations.

Several corpora can be used to develop and evaluate AES models. The ASAP corpus, which was released as part of a Kaggle competition, has been commonly used for holistic scoring models. For trait scoring models, the International Corpus of Learner English (Ke et al. 2019) and the ASAP++ corpus (Mathias and Bhattacharyya 2018) are available. See Ke and Ng (2019) for a more detailed summary of these corpora.

A future direction for AES studies is the development of efficient and accurate trait scoring models and cross-prompt models. As described above, although the number of studies on such DNN-AES models is limited, they are essential for applying AES technologies in various situations. It is also important to develop methodologies that reduce the costs and noise involved in creating training data. Approaches to reducing rating costs include recently examined active learning methods (e.g., Hellman et al. 2019). To reduce scoring noise or biases, integrating statistical models such as the IRT models described in Sect. 3.7 would be a possible approach.

Another future direction is to analyze the quality of each essay test and the characteristics of an applied AES model based on test theory. From the perspective of test theory, evaluating the reliability and validity of a test and its scoring processes is important for discussing the appropriateness of the test as a measurement tool. Although AES studies tend to ignore these points, several works have considered the relationship between DNN-based AES tasks and test theory (e.g., Uysal and Doğan 2021 ; Uto and Uchida 2020 ; Ha et al. 2020 ).

The application of AES methods to various related domains is also desired. For example, AES methods would be applicable to various operations such as writing support systems (e.g., Ito et al. 2020 ; Tsai et al. 2020 ) and peer grading processes (Han et al. 2020 ).

References

Abosalem Y (2016) Beyond translation: adapting a performance-task-based assessment of critical thinking ability for use in Rwanda. Int J Secondary Educ 4(1):1–11


Alikaniotis D, Yannakoudakis H, Rei M (2016) Automatic text scoring using neural networks. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 715–725)

Amidei J, Piwek P, Willis A (2020) Identifying annotator bias: a new irt-based method for bias identification. In: Proceedings of the international conference on computational linguistics (pp. 4787–4797)

Amorim E, Cançado M, Veloso A (2018) Automated essay scoring in the presence of biased ratings. In: Proceedings of the annual conference of the north American chapter of the association for computational linguistics (pp. 229–237)

Aomi I, Tsutsumi E, Uto M, Ueno M (2021) Integration of automated essay scoring models using item response theory. In: Proceedings of the international conference on artificial intelligence in education (pp. 54–59)

Attali Y, Burstein J (2006) Automated essay scoring with e-rater v.2. J Technol, Learn Assessment 4(3):1–31


Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv

Beigman Klebanov B, Flor M, Gyawali B (2016) Topicality-based indices for essay scoring. In: Proceedings of the workshop on innovative use of NLP for building educational applications (pp. 63–72)

Bernardin HJ, Thomason S, Buckley MR, Kane JS (2016) Rater rating-level bias and accuracy in performance appraisals: the impact of rater personality, performance management competence, and rater accountability. Hum Resour Manage 55(2):321–340

Borade JG, Netak LD (2021) Automated grading of essays: a review. In: Intelligent human computer interaction (vol. 12615, pp. 238–249), Springer International Publishing

Cao Y, Jin H, Wan X, Yu Z (2020) Domain-adaptive neural automated essay scoring. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval (pp. 1011–1020), Association for Computing Machinery

Cao Z, Qin T, Liu TY, Tsai MF, Li H (2007) Learning to rank: From pairwise approach to listwise approach. In: Proceedings of the international conference on machine learning (pp. 129–136), Association for Computing Machinery

Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the international conference on machine learning (pp. 160–167), Association for Computing Machinery

Cozma M, Butnaru A, Ionescu RT (2018) Automated essay scoring with string kernels and word embeddings. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 503–509)

Dascalu M, Westera W, Ruseti S, Trausan-Matu S, Kurvers H (2017) Readerbench learns Dutch: building a comprehensive automated essay scoring system for Dutch language. In: Proceedings of the international conference on artificial intelligence in education (pp. 52–63)

Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the workshop on natural language processing techniques for educational applications, association for computational linguistics (pp. 93–102)

Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the annual conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 4171–4186)

Dong F, Zhang Y (2016) Automatic features for essay scoring—an empirical study. In: Proceedings of the conference on empirical methods in natural language processing (pp. 1072–1077), Association for Computational Linguistics

Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the conference on computational natural language learning (pp. 153–162), Association for Computational Linguistics

Eckes T (2015) Introduction to many-facet Rasch measurement: analyzing and evaluating rater-mediated assessments, Peter Lang Pub. Inc

Farag Y, Yannakoudakis H, Briscoe T (2018) Neural automated essay scoring and coherence modeling for adversarially crafted input. In: Proceedings of the annual conference of the north American chapter of the association for computational linguistics (pp. 263–271)

Ha LA, Yaneva V, Harik P, Pandian R, Morales A, Clauser B (2020) Automated prediction of examinee proficiency from short-answer questions. In: Proceedings of the international conference on computational linguistics (pp. 893–903)

Han Y, Wu W, Yan Y, Zhang L (2020) Human-machine hybrid peer grading in SPOCs. IEEE Access 8:220922–220934

Hellman S, Rosenstein M, Gorman A, Murray W, Becker L, Baikadi A, Foltz PW (2019) Scaling up writing in the curriculum: Batch mode active learning for automated essay scoring. In: Proceedings of the ACM conference on learning (pp. 1—10), Association for Computing Machinery

Hua C, Wind SA (2019) Exploring the psychometric properties of the mind-map scoring rubric. Behaviormetrika 46(1):73–99

Huang J, Qu L, Jia R, Zhao B (2019) O2U-Net: a simple noisy label detection approach for deep neural networks. In: Proceedings of the IEEE international conference on computer vision (pp. 3326–3334)

Hussein MA, Hassan HA, Nassef M (2019) Automated language essay scoring systems: a literature review. Peer J Comput Sci 5:e208

Hussein MA, Hassan HA, Nassef M (2020) A trait-based deep learning automated essay scoring system with adaptive feedback. Int J Adv Comput Sci Appl 11(5):287–293

Ito T, Kuribayashi T, Hidaka M, Suzuki J, Inui K (2020) Langsmith: an interactive academic text revision system. In: Proceedings of the conference on empirical methods in natural language processing (pp. 216–226), Association for Computational Linguistics

Jin C, He B, Hui K, Sun L (2018) TDNN: a two-stage deep neural network for prompt-independent automated essay scoring. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 1088–1097)

Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 133–142), Association for Computing Machinery

Kassim NLA (2011) Judging behaviour and rater errors: an application of the many-facet Rasch model. GEMA Online J Lang Stud 11(3):179–197

Ke Z, Inamdar H, Lin H, Ng V (2019) Give me more feedback II: Annotating thesis strength and related attributes in student essays. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 3994–4004)

Ke Z, Ng V (2019) Automated essay scoring: a survey of the state of the art. In: Proceedings of the international joint conference on artificial intelligence (pp. 6300–6308)

Li S, Ge S, Hua Y, Zhang C, Wen H, Liu T, Wang W (2020) Coupled-view deep classifier learning from multiple noisy annotators. In: Proceedings of the association for the advancement of artificial intelligence (pp. 4667–4674)

Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-based automated essay scoring using self-attention. In: Chinese computational linguistics and natural language processing based on naturally annotated big data (pp. 386–397), Springer International Publishing

Li X, Chen M, Nie JY (2020) SEDNN: shared and enhanced deep neural network model for cross-prompt automated essay scoring. Knowl-Based Syst 210:106491

Liu OL, Frankel L, Roohr KC (2014) Assessing critical thinking in higher education: current state and directions for next-generation assessment. ETS Res Rep Series 1:1–23

Liu T, Ding W, Wang Z, Tang J, Huang GY, Liu Z (2019) Automatic short answer grading via multiway attention networks. In: Proceedings of the international conference on artificial intelligence in education (pp. 169–173)

Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the association for the advancement of artificial intelligence (pp. 13389–13396)

Shermis MD, Burstein JC (2016) Automated essay scoring: a cross-disciplinary perspective. Taylor & Francis

Mathias S, Bhattacharyya P (2018) ASAP++: enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the eleventh international conference on language resources and evaluation (pp. 1169–1173)

Mathias S, Bhattacharyya P (2020) Can neural networks automatically score essay traits? In: Proceedings of the workshop on innovative use of nlp for building educational applications (pp. 85–91), Association for Computational Linguistics

Mayfield E, Black AW (2020) Should you fine-tune BERT for automated essay scoring? In: Proceedings of the workshop on innovative use of nlp for building educational applications (pp. 151–162), Association for Computational Linguistics

Mesgar M, Strube M (2018) A neural local coherence model for text quality assessment. In: Proceedings of the conference on empirical methods in natural language processing (pp. 4328–4339)

Mim FS, Inoue N, Reisert P, Ouchi H, Inui K (2019) Unsupervised learning of discourse-aware text representation for essay scoring. In: Proceedings of the annual meeting of the association for computational linguistics: student research workshop (pp. 378–385)

Myford CM, Wolfe EW (2003) Detecting and measuring rater effects using many-facet Rasch measurement: part I. J Appl Meas 4:386–422

Nadeem F, Nguyen H, Liu Y, Ostendorf M (2019) Automated essay scoring with discourse-aware neural models. In: Proceedings of the workshop on innovative use of NLP for building educational applications, association for computational linguistics (pp. 484–493)

Nguyen HV, Litman DJ (2018) Argument mining for improving the automated scoring of persuasive essays. In: Proceedings of the association for the advancement of artificial intelligence (pp. 5892–5899)

Phandi P, Chai KMA, Ng HT (2015) Flexible domain adaptation for automated essay scoring using correlated linear regression. In: Proceedings of the conference on empirical methods in natural language processing (pp. 431–439)

Rahman AA, Ahmad J, Yasin RM, Hanafi NM (2017) Investigating central tendency in competency assessment of design electronic circuit: analysis using many facet Rasch measurement (MFRM). Int J Inf Educ Technol 7(7):525–528

Ridley R, He L, Dai X, Huang S, Chen J (2020) Prompt agnostic essay scorer: a domain generalization approach to cross-prompt automated essay scoring. arXiv

Ridley R, He L, Dai X, Huang S, Chen J (2021) Automated cross-prompt scoring of essay traits. In: Proceedings of the AAAI conference on artificial intelligence (vol 35, pp. 13745–13753)

Rodriguez PU, Jafari A, Ormerod CM (2019) Language models and automated essay scoring. arXiv

Rosen Y, Tager M (2014) Making student thinking visible through a concept map in computer-based assessment of critical thinking. J Educ Comput Res 50(2):249–270

Schendel R, Tolmie A (2017) Assessment techniques and students’ higher-order thinking skills. Assess & Eval Higher Educ 42(5):673–689

Song W, Zhang K, Fu R, Liu L, Liu T, Cheng M (2020) Multi-stage pre-training for automated Chinese essay scoring. In: Proceedings of the conference on empirical methods in natural language processing (pp. 6723–6733), Association for Computational Linguistics

Sung C, Dhamecha TI, Mukhi N (2019) Improving short answer grading using transformer-based pre-training. In: Proceedings of the international conference on artificial intelligence in education (pp. 469–481)

Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the conference on empirical methods in natural language processing (pp. 1882–1891)

Tay Y, Phan MC, Tuan LA, Hui SC (2018) SKIPFLOW: Incorporating neural coherence features for end-to-end automatic text scoring. In: Proceedings of the AAAI conference on artificial intelligence (pp. 5948–5955)

Tsai CT, Chen JJ, Yang CY, Chang JS (2020) LinggleWrite: a coaching system for essay writing. In: Proceedings of annual meeting of the association for computational linguistics (pp. 127–133), Association for Computational Linguistics

Uto M (2019) Rater-effect IRT model integrating supervised LDA for accurate measurement of essay writing ability. In: Proceedings of the international conference on artificial intelligence in education (pp. 494–506)

Uto M, Okano M (2020) Robust neural automated essay scoring using item response theory. In: Proceedings of the international conference on artificial intelligence in education (pp. 549–561)

Uto M, Uchida Y (2020) Automated short-answer grading using deep neural networks and item response theory. In: Proceedings of the artificial intelligence in education (pp. 334–339)

Uto M, Ueno M (2016) Item response theory for peer assessment. IEEE Trans Learn Technol 9(2):157–170

Uto M, Ueno M (2018a) Empirical comparison of item response theory models with rater’s parameters. Heliyon, Elsevier 4(5):1–32

Uto M, Ueno M (2018b) Item response theory without restriction of equal interval scale for rater’s score. In: Proceedings of the international conference on artificial intelligence in education (pp. 363–368)

Uto M, Ueno M (2020) A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo. Behaviormetrika, Springer 47(2):469–496

Uto M, Xie Y, Ueno M (2020) Neural automated essay scoring incorporating handcrafted features. In: Proceedings of the international conference on computational linguistics (pp. 6077–6088), International Committee on Computational Linguistics

Uysal İ, Doğan N (2021) Automated essay scoring effect on test equating errors in mixed-format test. Int J Assess Tools Educ 8:222–238

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. In: Proceedings of the international conference on advances in neural information processing systems (pp. 5998–6008)

Wang Y, Wei Z, Zhou Y, Huang X (2018) Automatic essay scoring incorporating rating schema via reinforcement learning. In: Proceedings of the conference on empirical methods in natural language processing (pp. 791–797)

Yang R, Cao J, Wen Z, Wu Y, He X (2020) Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking. In: Findings of the association for computational linguistics: EMNLP 2020 (pp. 1560–1569), Association for Computational Linguistics

Yang Y, Zhong J (2021) Automated essay scoring via example-based learning. In: Brambilla M, Chbeir R, Frasincar F, Manolescu I (eds) Web engineering. Springer International Publishing, pp 201–208


Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 19H05663 and 21H00898.

Author information

Authors and Affiliations

Masaki Uto, The University of Electro-Communications, Tokyo, Japan


Corresponding author

Correspondence to Masaki Uto .

Ethics declarations

Conflict of interest.

The authors have no conflicts of interest directly relevant to the content of this article.

Additional information

Communicated by Kazuo Shigemasu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Uto, M. A review of deep-neural automated essay scoring models. Behaviormetrika 48 , 459–484 (2021). https://doi.org/10.1007/s41237-021-00142-y


Received : 18 June 2021

Accepted : 08 July 2021

Published : 20 July 2021

Issue Date : July 2021

DOI : https://doi.org/10.1007/s41237-021-00142-y


Keywords

  • Automated essay scoring
  • Deep neural networks
  • Natural language processing
  • Educational/psychological measurement