ChatGPT and Enem: robot performs better than 80% of people – 04/05/2023 – Education

If it were a student, ChatGPT would have an average score of 612.3 on the Enem (National High School Exam) objective tests. It would do better than 98.9% of students in the humanities and 95.3% in languages and codes. Overall, it would outperform 78.9% of test-takers.

Performance across subjects, however, is uneven: the robot does much worse in math, surpassing only 27% of exam participants. Its performance in the exact sciences would be the biggest obstacle to admission to competitive programs at the country's main federal universities.

The data come from a DeltaFolha analysis of the artificial intelligence (AI) system's answers to five years of exams, from 2017 to 2021, the most recent edition with publicly available individual scores, which makes it possible to calculate the robot's final grade in each knowledge area.

ChatGPT answered 1,290 questions, making this a rare example of a study at this scale assessing the technology in Portuguese.

For 2020 and 2021, both sittings of the exam in each year were considered; the two sittings have entirely different questions.

The Enem score does not correspond exactly to the percentage of correct answers. Getting difficult questions right while missing easy ones, for example, can be interpreted as guessing, and this is reflected in the final grade. Folha's analysis reproduced this calculation in order to compare human and AI performance directly.

The tests evaluated GPT-3.5, the technology behind the original version of ChatGPT, using performance-analysis tools made by OpenAI, the robot's creator.

For the first sitting of the 2021 exam, the reporters asked the system to write an essay following the same exam prompt. To simulate the Ministry of Education's methodology, the text was graded by two specialists applying the Enem criteria. The bot's average grade was 700, better than 68% of students, whose average was 613.

Adding the essay score to the average of the 2021 objective tests (726.8 in the human sciences, 606.2 in languages and codes, 577 in the natural sciences and 433.6 in mathematics), ChatGPT's Enem score was 608.7.
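For reference, that final figure is the simple average of the five grades: (726.8 + 606.2 + 577 + 433.6 + 700) / 5 = 608.7.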

The result is better than that obtained by 79% of students that year, whose average was 535. It would be enough to guarantee admission to programs such as social work at the Federal University of Pernambuco and social sciences at Fluminense Federal University.

The evaluation considered the calculation that each program adopts (the weight of each subject differs depending on the degree). According to Sisu (Unified Selection System), the grade would guarantee admission to 63 of the 938 options offered by ten of the best-placed federal universities in the 2019 Folha University Ranking.
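As an illustration, a minimal sketch of how such a per-program score works as a weighted average. The weights below are hypothetical, since each program defines its own; only ChatGPT's 2021 grades come from the article:

```python
def sisu_score(grades: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of Enem grades, the general shape of a Sisu
    per-program score. Each program chooses its own weights per area."""
    total_weight = sum(weights.values())
    return sum(grades[area] * weights[area] for area in grades) / total_weight

# ChatGPT's 2021 grades (from the article), with hypothetical weights
# favoring the humanities, as a humanities-oriented program might do:
grades = {"humanities": 726.8, "languages": 606.2,
          "natural_sciences": 577.0, "math": 433.6, "essay": 700.0}
weights = {"humanities": 3, "languages": 2,
           "natural_sciences": 1, "math": 1, "essay": 2}
print(round(sisu_score(grades, weights), 1))  # 644.8 with these weights
```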

The humanities were the AI's strong point. Its five-year grade average was 725.3, against the students' 523.3 points. In 2017, with its best score (785.3), the robot was surpassed by only 775 candidates (out of 4.7 million).

ChatGPT also outperformed its organic competitors in languages and in the natural sciences. Its average grades were 641.4 (against 516.1) and 639.2 (against 492.5), respectively.

By comparison, its results on the math tests are almost dismal. On average, the robot scored 443.1 points, below the 527.1 obtained by real candidates. It got between 13.6% and 27.3% of the questions right in each sitting; someone guessing at random among the five alternatives would be expected to get about 20% right.

A scientific article released last Wednesday (29) carried out an analysis similar to Folha's. In it, researchers from the University of São Paulo (USP), São Francisco University (USF) and the University of Campinas (Unicamp) found a similar performance pattern, with poor grades in mathematics.

For Ricardo Primi, one of the authors, a possible explanation is that these questions require the robot to extract information from the problem and follow a line of reasoning, such as setting up the necessary calculation, to arrive at the answer. In the humanities and languages, it is enough to retrieve data the model has already seen, without having to compute anything.

In the group's study, the results improved with prompting: instead of just asking and waiting for a reply, the researchers first gave GPT a few examples of already-answered questions. Accuracy rose even further when they asked the technology to justify its answers.
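A rough sketch of the two prompting strategies described above: few-shot examples and asking for a justified answer. The sample questions and wording are illustrative placeholders, not the study's actual prompts:

```python
# Few-shot: prepend solved examples so the model sees the answer pattern.
FEW_SHOT_PREFIX = """Question: If 3x + 6 = 21, what is x? (A) 3 (B) 5 (C) 7
Answer: B

Question: What is 15% of 200? (A) 15 (B) 30 (C) 45
Answer: B

"""

def few_shot_prompt(question: str) -> str:
    """Build a prompt that includes worked examples before the real question."""
    return FEW_SHOT_PREFIX + f"Question: {question}\nAnswer:"

def justified_prompt(question: str) -> str:
    """Ask the model to reason step by step before committing to an
    alternative, the variant the study found raised accuracy further."""
    return (f"Question: {question}\n"
            "Explain your reasoning step by step, "
            "then give the letter of the correct alternative.")

print(justified_prompt("If 2x = 10, what is x? (A) 2 (B) 5 (C) 10"))
```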

"When a problem is presented in text, maybe it didn't have that same data in the training process. It didn't see the patterns of the reasoning steps explicitly," says Primi.

Mathematics has been the system's Achilles' heel since its launch. OpenAI even announced improvements in the area at the end of January.

In March, the company released an update to the system, GPT-4, but it is not yet widely available. In OpenAI's official tests, the new version outperformed its predecessor on exams designed for humans.

Essay

For the essay test, the reporters gave the robot the same instructions as the Enem, using the 2021 exam as the example. The prompt asked for a dissertative-argumentative text on "invisibility and civil registration: guaranteeing access to citizenship in Brazil".

The Enem considers five competencies in its essay evaluation (see graphic). Right off the bat, the two experts consulted pointed out that the text would exceed the 30-line limit.

According to Adriano Chan, who gave ChatGPT's essay a score of 760, the text was cohesive but fell short on the other items. The teacher notes that the robot made mistakes with commas and syntactic construction, showed little sociocultural repertoire, and failed both to argue with concrete data and to propose an intervention to solve the problem.

Teacher Jéssica Dorta's correction identified similar problems and resulted in a score of 640. She deducted more points for lack of cohesion and for the intervention proposals.

Methodology

The mathematical model adopted by the Enem, Item Response Theory (IRT), assumes items calibrated according to three parameters: discrimination (how sharply the item separates candidates by level of knowledge in the subject), difficulty, and the chance of a correct guess. Beyond the number of correct answers, the calculation considers which questions were answered correctly.
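A minimal sketch of the three-parameter logistic (3PL) function at the core of this model; the parameter names a (discrimination), b (difficulty) and c (guessing) follow standard IRT notation, and the example values are illustrative, not Inep's:

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Probability that a candidate with ability theta answers an item
    correctly under the 3PL IRT model: c is the guessing floor,
    b the difficulty, a the discrimination."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Example: an above-average candidate (theta=1.0) on an item of average
# difficulty (b=0), good discrimination (a=1.5) and a one-in-five
# guessing chance (c=0.2):
print(round(p_correct(theta=1.0, a=1.5, b=0.0, c=0.2), 2))  # ≈ 0.85
```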

To arrive at ChatGPT's final score, Folha reproduced this methodology using data from Inep (National Institute of Educational Studies and Research).

Through a programming interface, the robot answered each question only once, indicating the alternative it considered correct, without any prior example. Since the technology does not interpret images, the "ledor" versions of the Enem were used: the ones read aloud to candidates with visual impairments, which include official descriptions of photos and graphics.

GPT was also configured to be as uncreative as possible in its responses, in order to limit any "rambling". The chosen alternative was extracted from the robot's replies (see the complete list).
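A minimal sketch of what such a zero-shot, low-creativity query might look like with OpenAI's Python client; the model name, prompt wording and answer-extraction rule here are assumptions, not Folha's actual setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_question(question: str) -> str:
    """Ask the model one Enem question, zero-shot, with temperature=0
    so sampling is as deterministic ("uncreative") as possible."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # GPT-3.5, the technology the article tested
        temperature=0,          # limit "rambling": no creative sampling
        messages=[{
            "role": "user",
            "content": question + "\nRespond only with the letter (A-E) "
                                  "of the correct alternative.",
        }],
    )
    return response.choices[0].message.content.strip()
```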

Language systems like GPT run on training: they are fed billions of pieces of text, from which they extract patterns of how words follow one another. In this process, the robot may already have seen some of the questions from the exams applied.

The data ChatGPT was trained on go up to September 2021; that is, there is a chance it has already come across questions and answers from four of the tested Enem editions. This phenomenon, called contamination, however, appears to have a limited effect.

When unveiling GPT-4, researchers affiliated with OpenAI had the tool solve a series of tests, such as the SAT (a kind of American Enem) and the exam to become a lawyer in the US. They found that contamination had little impact on the final result: the score was similar even when the questions the AI already knew were disregarded.

An analysis of the Enem excluding the contaminated content is impossible, since OpenAI does not reveal which texts were used to train the machine. In the Brazilian exam, though, the results on the older tests were similar to the performance on the most recent one.
