
Generative AI’s Ability to Interpret Idioms in Essay Scoring: A Comparative Study

TLDR: A study evaluated the ability of ChatGPT, Gemini, and DeepSeek to score student essays, focusing on the impact of idioms. All models showed high internal consistency, but Gemini demonstrated superior agreement with human raters, particularly when essays contained figurative language. The research suggests Gemini is the best candidate among the tested models for automated essay scoring, especially for handling complex linguistic expressions, and highlights the potential for effective hybrid human-AI assessment systems.

The rapid advancements in Generative AI have opened new avenues across various fields, including education. One particularly intriguing application is the use of AI for automated essay scoring (AES), a task traditionally demanding significant time and resources from human educators. However, a critical question arises: can these AI models accurately interpret complex linguistic nuances, especially figurative language like idioms?

A recent study by Enis Oğuz delves into this question, assessing how three prominent Generative AI models—ChatGPT, Gemini, and DeepSeek—perform when scoring student essays, specifically examining the influence of idioms on their evaluations. The research combines insights from Corpus Linguistics and Computational Linguistics to provide a comprehensive analysis.

The Challenge of Figurative Language for AI

Automated Essay Scoring systems have been around since the 1960s, evolving significantly with contributions from computer science and linguistics. While these tools offer efficiency and reliability, the advent of Generative AI, with its massive language models (GPT-3’s 175 billion parameters versus BERT’s 350 million, for example), has sparked debate about whether such models can surpass traditional AES systems. A key concern is AI’s ability to interpret figurative language: if an AI struggles with idioms, it could unfairly penalize students who skillfully incorporate such expressions, despite their linguistic proficiency.

How the Study Was Conducted

To investigate this, the study adopted an empirical design, utilizing 348 student essays from the PERSUADE 2.0 corpus. This corpus contains argumentative essays from 6th to 12th-grade native English speakers, already scored by trained human raters using a holistic rubric.

The researcher created an idiom list from established dictionaries and developed an R script to identify these idioms within the essays. After careful validation, 10,384 matched idioms were found. Two equally sized essay lists were then compiled: one with multiple idioms present in each essay (174 essays) and a control group of 174 essays with no idioms, carefully matched for human-assigned scores and word counts.
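
The study's own R script is not reproduced in the paper, but a minimal Python sketch of the same dictionary-based matching idea might look like the following. The idiom list and essay text here are purely illustrative, not taken from the study.

```python
import re

# Illustrative miniature idiom list; the study compiled its list from established dictionaries.
IDIOMS = ["break the ice", "hit the books", "piece of cake", "once in a blue moon"]

def count_idioms(essay_text, idioms=IDIOMS):
    """Count dictionary idioms appearing in an essay (case-insensitive, word-boundary match)."""
    text = essay_text.lower()
    counts = {}
    for idiom in idioms:
        # \b keeps each idiom from matching inside a longer, unrelated word sequence
        matches = re.findall(r"\b" + re.escape(idiom) + r"\b", text)
        if matches:
            counts[idiom] = len(matches)
    return counts

essay = "Writing the conclusion was a piece of cake once I learned to break the ice with a good hook."
print(count_idioms(essay))  # {'piece of cake': 1, 'break the ice': 1}
```

A production version would also need to handle inflected forms (for example, "broke the ice") and spelling variation, which exact string matching misses; the study's validation step addresses exactly this kind of noise.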

Three Generative AI models—ChatGPT (gpt-4o), Gemini (1.5 Pro), and DeepSeek (V3)—were tasked with scoring all essays in both lists three times over consecutive days, using the same rubric as human raters. This allowed for the evaluation of both internal consistency (how consistently an AI scores the same essay) and inter-rater reliability (how well AI scores align with human scores).
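
The study reports agreement as intraclass correlation coefficients (ICCs). As a hedged illustration, assuming a two-way, absolute-agreement, single-rater ICC(2,1) (the paper may use a different variant), the statistic can be computed from an essays-by-raters score matrix like this:

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `scores` is an (n_essays, n_raters) array, e.g. one model's three scoring rounds."""
    X = np.asarray(scores, dtype=float)
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)   # per-essay means
    col_means = X.mean(axis=0)   # per-round (or per-rater) means

    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between essays
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between rounds/raters
    sse = np.sum((X - grand) ** 2) - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy example: 5 essays scored in 3 rounds (values are illustrative, not from the study)
rounds = [[4, 4, 5], [3, 3, 3], [5, 5, 4], [2, 3, 2], [4, 4, 4]]
print(round(icc_2_1(rounds), 3))
```

The same function applies to both questions in the study: internal consistency uses one model's repeated rounds as the "raters", while inter-rater reliability pairs a model's scores with the human scores.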

Key Findings: Consistency, Reliability, and Idioms

The study yielded several important insights:

  • Overall Scoring Tendency: Generative AI models generally assigned lower scores to essays than human raters did. DeepSeek, in particular, showed limited variety in its scores, clustering them around 4 points, which, while contributing to high consistency, might limit its alignment with human nuances.
  • Internal Consistency: All three AI models demonstrated excellent consistency in scoring the same essays across different rounds. DeepSeek achieved the highest consistency (ICC of 0.807), followed closely by Gemini (0.796) and ChatGPT (0.753).
  • Inter-rater Reliability with Humans: Gemini emerged as the leader in aligning with human raters, achieving a notable ICC value of 0.735. DeepSeek followed with 0.695, and ChatGPT with 0.673. These values suggest that Generative AI can provide acceptable inter-rater reliability, with Gemini showing particular promise.
  • Influence of Idioms: The presence of idioms did impact reliability. While Gemini maintained a good level of reliability with human raters even when idioms were present, ChatGPT and DeepSeek showed lower inter-rater reliability in such cases. This supports previous research indicating AI’s challenges with figurative language, possibly due to its underrepresentation in training datasets.
  • Scoring Patterns with Idioms: Human raters showed an initial increase in scores for essays with a few idioms, followed by a decrease as the number of idioms grew, suggesting a potential penalty for idiom repetition. Gemini mirrored this nuanced pattern most closely. ChatGPT and DeepSeek, however, exhibited less varied scoring patterns, struggling to capture the subtlety of idiom usage and repetition as effectively as Gemini and human raters; a small sketch of this kind of score-by-idiom-count comparison follows this list.

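The rise-then-fall pattern described above is the kind of result that falls out of a simple score-by-idiom-count summary. A minimal sketch, with entirely made-up values and hypothetical column names, shows how such a comparison could be organized:

```python
import pandas as pd

# Hypothetical table: one row per essay, with its matched idiom count and scores.
# All values below are illustrative placeholders, not figures from the study.
df = pd.DataFrame({
    "idiom_count":  [0, 0, 1, 2, 2, 3, 4, 5],
    "human_score":  [3, 4, 4, 5, 4, 4, 3, 3],
    "gemini_score": [3, 4, 4, 5, 4, 4, 3, 2],
})

# Mean score per idiom count makes a rise-then-fall pattern easy to spot
pattern = df.groupby("idiom_count")[["human_score", "gemini_score"]].mean()
print(pattern)
```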
Implications for Essay Scoring

The findings suggest that while all Generative AI models show promising consistency, Gemini stands out for its ability to mimic human scoring patterns, especially concerning the presence and repetition of idioms. This indicates that Gemini possesses a more sophisticated understanding of figurative language compared to its competitors in this context.

The research supports the idea of a hybrid essay-scoring procedure that combines human raters with AI tools. While models like ChatGPT and DeepSeek could serve as a second rater, human oversight remains crucial, particularly for essays rich in figurative language. Gemini, however, shows strong potential not only to collaborate effectively in such a hybrid system but also to handle essay-scoring tasks independently in the future, thanks to its precision in analyzing idiom usage patterns much as trained human raters do.

For more detailed information, you can access the full research paper here.

Meera Iyer (http://edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
