ChatGPT and the Enem: understand where the AI does well and poorly on the test – 05/05/2023 – Education


Despite outperforming most of the humans who took the Enem (National High School Exam), ChatGPT struggles with tasks that require devising and executing a sequence of steps. At the same time, the tool's encyclopedic knowledge stands out as its main strength.

The assessment comes from an analysis by Deltafolha, the newspaper's data journalism team, which helps to reveal some of the tool's limitations.

Since its launch, ChatGPT has shaken up the field of artificial intelligence (AI) by demonstrating, in a chat format, a strong ability to process written content, follow user instructions and generate text that reads as if written by humans.

Because it is language-oriented software, however, it performs erratically on some tasks. It is still difficult to predict exactly where problems will occur, but it is known, for example, that it slips up in mathematics and sometimes invents information.

To evaluate the tool, the report had it solve the tests from eight editions of the Enem (applied from 2009 to 2017). Each question was labeled according to the type of knowledge needed to answer it. This made it possible to identify which kinds of task the tool struggles with most, rather than just mapping performance by area of knowledge (biology and physics, for example).

Since the artificial intelligence does not read images, questions that depended on them were removed, leaving a total of 766 (see the complete list). In addition, the questions were divided into difficulty levels based on students' rates of correct answers.
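The report does not publish its evaluation script, but the general approach it describes, sending each multiple-choice question to the model and comparing the reply with the official answer key, can be sketched roughly as below. This is a minimal illustration, assuming the ChatCompletion interface of the openai Python package as it existed at the time (versions prior to 1.0); the prompt wording, the `question` structure and the `gpt-3.5-turbo` settings are assumptions, not the report's actual setup.

```python
import openai

openai.api_key = "YOUR_KEY"  # placeholder


def ask_chatgpt(question: dict) -> str:
    """Send one Enem question to the model and return the letter it picks."""
    prompt = (
        "Answer the multiple-choice question below with only the letter "
        "of the correct alternative (A, B, C, D or E).\n\n"
        f"{question['statement']}\n\n"
        + "\n".join(
            f"{letter}) {text}"
            for letter, text in question["alternatives"].items()
        )
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # more deterministic replies make scoring easier
    )
    # Keep only the first character of the reply as the chosen alternative.
    return response["choices"][0]["message"]["content"].strip()[:1].upper()


def score(questions: list[dict]) -> float:
    """Fraction of questions where the model's letter matches the answer key."""
    hits = sum(ask_chatgpt(q) == q["answer"] for q in questions)
    return hits / len(questions)
```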

The questions used were classified by researchers Igor Cataneo Silveira and Denis Deratani Mauá, from the Institute of Mathematics and Statistics at USP (University of São Paulo). Their work aimed to create a set of Enem questions for assessing AIs. Each question received one or more of the following tags, with ChatGPT's hit rate shown for each (a sketch of how such a per-tag tally can be computed follows the list):

  • Encyclopedic knowledge: requires knowing a piece of information that is not in the question's statement but can be found in a book, such as stating Newton's 1st law. Correct on 223 of 264 (84.5%)

  • Text comprehension: involves extracting information presented in the question itself. Correct on 449 of 547 (82.1%)

  • Specific knowledge: requires some kind of inference or more advanced command of a domain, such as distinguishing, in practice, the concepts of heat and temperature. Correct on 105 of 148 (70.9%)

  • Knowledge of chemistry: requires manipulating formulas, such as interpreting the transformation of chemical elements. Correct on 10 of 19 (52.6%)

  • Mathematical reasoning: includes turning instructions into a mathematical formula, as in problems that ask for the value of x. Correct on 47 of 135 (34.8%)
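Because a single question can carry more than one tag, the per-tag totals above add up to more than the 766 questions. A minimal sketch of how such a tally could be computed, assuming each graded question is stored as a dict with its tags and a correct/incorrect flag (an illustrative structure, not the researchers' actual data format), might look like this:

```python
from collections import defaultdict

# Hypothetical records: each graded question carries its skill tags and
# whether the model's answer matched the official key.
results = [
    {"tags": ["encyclopedic", "text_comprehension"], "correct": True},
    {"tags": ["math_reasoning"], "correct": False},
    # ... one entry per question
]

hits = defaultdict(int)
totals = defaultdict(int)
for q in results:
    for tag in q["tags"]:
        totals[tag] += 1
        hits[tag] += q["correct"]  # True counts as 1, False as 0

for tag, total in totals.items():
    print(f"{tag}: {hits[tag]}/{total} ({100 * hits[tag] / total:.1f}%)")
```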

Encyclopedic knowledge, the category in which ChatGPT performed best, helped boost the bot's scores on the language and humanities tests. In natural sciences, the result was higher on the questions considered easier and lower on the more difficult ones. No mathematics question tested this skill.

Text comprehension, in turn, helped the bot do better on the human and natural sciences tests. This skill appears in practically every question of the language tests and in none of the mathematics tests, which makes it impossible to assess its effect there.

Mathematics, already known as one of the weak points of the OpenAI system, remains a limitation. In tests run by Folha at the beginning of the month, it was in this subject that the bot received, by far, its worst grade.

The current analysis shows that the need for this kind of reasoning also lowers the number of correct answers on the natural sciences test, the only one to demand every type of skill assessed. Even so, the technology outscored the human test-takers regardless of the type or difficulty of the question.

Another weak point was chemistry, present in 19 questions on the natural sciences tests; the bot got 10 of them right.

"ChatGPT would do very badly on a chemistry test because it doesn't understand what it's doing," says professor André Pimentel, of the Chemistry department at PUC-Rio. He is the author of a study evaluating the performance of artificial intelligence in the subject.

"The problem with chemistry arises when you use representations, such as formulas. It doesn't understand those as words," says the professor. "It understands context. If there is little information on a subject on the internet, something only a specialist knows, it is difficult. That is different from something like environmental chemistry or sustainability, where information is found more easily," he adds.

Questions that assess chemistry and math skills require not just understanding the context of what is being said, but creating and executing a sequence of steps to solve a problem.

This greater difficulty in the area also appeared in an article published in March by researchers from USP, Unicamp and USF (Universidade São Francisco), likewise evaluating the Enem.

In that study, the group got better results by asking the AI not only to point out the correct answer but also to explain the logic it used to reach the solution.
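The prompting difference the group describes amounts to asking the model to reason before committing to an alternative. A rough illustration of the two prompt styles, with wording invented here rather than taken from the paper, could be:

```python
def direct_prompt(question_text: str) -> str:
    """Ask only for the final letter."""
    return (
        "Answer the Enem question below with only the letter of the "
        "correct alternative.\n\n" + question_text
    )


def reasoning_prompt(question_text: str) -> str:
    """Ask the model to explain its logic before giving the letter."""
    return (
        "Solve the Enem question below. First explain, step by step, the "
        "reasoning that leads to the answer; only then state the letter of "
        "the correct alternative.\n\n" + question_text
    )
```

It was the second style, in which the model spells out its reasoning before answering, that the researchers associated with better results.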

ChatGPT's strengths and limitations also show up when each skill is analyzed in isolation, that is, looking only at questions tagged with a single skill. In those cases, the rate of correct answers rises further for encyclopedic knowledge (90.2%) and drops for mathematics (31.2%). For text comprehension, the result is slightly lower (80.6%). There are no questions tagged with chemistry alone, and very few require only specific knowledge (just 13).

Even with a better sense of where the technology’s strengths lie, confidence in its performance warrants caution.

"GPT-4 has a tendency to 'hallucinate', i.e. to produce content that is nonsensical or untruthful in relation to certain sources. This tendency can be particularly harmful as models become increasingly convincing, leading to overreliance by users," warns the paper that accompanied the launch of the most recent version of the OpenAI system, in March.

The use of batteries of questions is one of the main ways to measure the performance of AIs. More sophisticated versions of these tools have been evaluated with tests originally designed for humans.

"The Enem has very interdisciplinary questions, with several domains in the same question. And there is the fact that it is multiple choice, which makes it easier to compute the right answers," says Denis Deratani Mauá, associate professor at the Department of Computer Science at USP and one of those responsible for the study that classified the questions.

Folha's tests were run with GPT version 3.5. Version 4 may offer better results, but it is not yet widely available in the form used by programmers, which allows tests to be run automatically. The report requested access from OpenAI but received no reply.
