Using ChatGPT-generated test answers, researchers found the artificial intelligence responses offered “a compelling and sometimes better rationale” than what the official answer key shows.
How should a social worker coaching a domestic violence survivor offer advice about returning to an abusive partner? What are the first steps to help a foster child throwing temper tantrums whose parent is incarcerated? How can they tell if a client is maintaining sobriety?
On these and other questions, Michigan researchers have concluded that a computer-generated model scanning massive online datasets of texts, books, articles and webpages drew more accurate conclusions than the “correct” answers determined by the Association of Social Work Boards, which created and administers licensing exams for the nation’s social workers.
The findings, published in March by the journal Research on Social Work Practice, concluded that answers that would have been marked incorrect on the exams instead represented a safer and more ethical response — raising fundamental questions about the validity of the test that serves as the gateway to the country’s frontline social service jobs.
“When you asked for the rationale for how the questions were answered when they were ‘incorrect,’ it was perfectly logical,” said one of the study’s authors, Wayne State University School of Social Work Dean Sheryl Kubiak. “Any instructor would have marked it as ‘correct,’ because it was bringing in the context.”
According to the AI-generated responses, some “correct” answers relied on dated concepts or had little evidence to back them up. In other instances, the multiple-choice format failed to offer nuanced and informed approaches to delicate, high-stakes decisions, such as how to approach a deaf client when an interpreter cancels an appointment, or the best way to help children through the death of a parent.
The researchers were so concerned by the findings that their abstract calls on state regulators and legislators to temporarily suspend the exam created and administered by the Association of Social Work Boards, while a more “appropriate, effective, and ethical” test is developed.
They also offer cautious optimism for “generative AI” and “large language models” to inform the field of social work — joining a fast-growing chorus of critics and cheerleaders of computer programs that can simulate human thought processes and creation, methods now reaching into virtually all aspects of society.
In response to The Imprint’s request for comment, leaders of the Association of Social Work Boards (ASWB) said the licensing exams for clinical, bachelor’s and master’s level social work are rigorously vetted. They dismissed the study’s assertion that some of the questions are biased, outdated or based on poorly supported practice.
“Every correct answer is supported by a valid and current social work reference,” Senior Director of Examination Services Lavina Harless said in an email.
Harless said that the exam questions are “thoroughly and continually reviewed by testing experts” and that any questions that show potential bias or don’t appropriately evaluate social work competency are not included on the final tests.
CEO Stacey Hardy-Chandler also responded by email, stating her organization updates the licensure exams every five to seven years, and that the process involves surveying thousands of social workers to ensure the questions reflect current practice. The association’s next analysis of the exam will begin in 2024.
“ASWB is taking a rigorous and thoughtful approach to enhancing the licensing exams,” she said, but cautioned about any hasty changes.
“We are committed to being vigilant in our review of all available tools, including emerging technologies such as AI,” she added. “However, it is premature to jump to conclusions about the development of future exams based on the findings of a single study.”
Multiple-choice format eliminates nuance of social work
The study was prompted by alarming racial disparities in the social worker licensing test passing rates revealed last year. The researchers set out to determine if artificial intelligence could suss out potential bias in the questions and “move us toward a more valid and equitable exam.”
Lead author and Wayne State University School of Social Work professor Bryan Victor and his team did not analyze the exam, which is not publicly available. Instead, they relied on an official set of practice questions.
Hardy-Chandler characterized the practice questions as “retired,” and not reflected in the actual exam. But in defending the analysis, Victor noted that the practice questions were pulled from the social work board’s online portal, and they are used by educators preparing students for the licensing exams.
“Our review of available documentation related to these exam questions revealed no indication that these questions are outdated or not aligned with current research, which could potentially mislead educators or students,” Victor wrote in an email.
The study relied on ChatGPT, a program the authors describe as “currently taking the world by storm.” The artificial intelligence program was instructed to complete the practice questions for the licensing exam and explain why it selected its answers. This “think aloud” technique helped researchers pinpoint potential problems with the questions, and the limited way test-takers could provide appropriate answers, particularly when they could only select one option in a multiple-choice format.
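For readers curious what this kind of prompting looks like in practice, here is a minimal sketch of a “think aloud” prompt of the sort the researchers describe. The wording, the sample question and the function name are illustrative assumptions, not the team’s actual materials; the resulting string would then be sent to a chat model through an API client.

```python
def build_think_aloud_prompt(question: str, options: list[str]) -> str:
    """Combine a practice question with an instruction to answer and explain."""
    # Number the multiple-choice options one per line.
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        "You are taking a social work licensing practice exam.\n"
        f"Question: {question}\n"
        "Options:\n"
        f"{numbered}\n"
        "Select the single best option, then think aloud: explain step by step "
        "why you chose it and why you rejected each of the others."
    )

# Illustrative question loosely based on the domestic violence scenario
# discussed in the study; not an actual ASWB practice item.
prompt = build_think_aloud_prompt(
    "A client at a domestic violence shelter is considering returning to an "
    "abusive partner. What should the social worker do?",
    [
        "Encourage the client to further discuss their decision",
        "Allow the client to direct the conversation about their decision",
        "Refer the client to another agency",
        "Contact the client's partner",
    ],
)
# The prompt would then be submitted to a chat model, and the model's
# explanation reviewed alongside its selected answer.
```

The value of the approach is in the explanation: because the model is asked to justify its choice, researchers can inspect the reasoning behind an “incorrect” answer rather than just the letter it picked.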
In some cases, the difference between the “correct” answer the ASWB identified and the “incorrect” one ChatGPT produced was a matter of nuance and depth. In one example described by researchers, both answers focused on properly engaging a client at a domestic violence shelter in a discussion about the choice to return to an abusive partner.
“The correct response according to ASWB was to encourage the client to further discuss their decision,” the study states, an answer “consistent with the ethical principle of self-determination.”
ChatGPT selected a different response that it determined was more accurate. It specified that in discussions over the decision, the client, not the social worker, must guide the conversation. That distinction was key to a more accurate answer, the researchers stated, because the social worker must demonstrate acceptance of the client’s choice, even if it ends up being a return to the abusive partner.
Doing otherwise could “create further barriers to the client’s engagement,” the ChatGPT response concluded. “Allowing the client to direct the conversation and to express their thoughts and feelings can help to build trust and rapport between the client and the social worker,” which can increase “the likelihood that the client will be open to receiving support and services, regardless of their decision.”
The computer-generated response stated that the social worker should outline concerns about the risk of returning to the abusive partner, but noted the importance of doing so “in a non-judgmental and supportive manner.”
One of the study’s authors, Brian Perron, a professor of social work at the University of Michigan-Ann Arbor, described the significance of this analysis in a follow-up article published last month by Towards Data Science: “By neglecting the complexity and context of actual practice, the exam is not adequately assessing the competence of social workers.” Perron concluded that “we have serious reservations about considering the ASWB exam key as the gold standard. The exam has flaws and biases, including using empirically unsupported test items.”
Previous exam controversy
Problems with the social worker licensing exam date back to at least 2010, when a study concluded it evaluated test-taking abilities more than competency to practice. And last year, a study published by the Association of Social Work Boards about its own licensing exam revealed gaping disparities in the pass rates based on race, age and native language. The study revealed that just 45% of Black test-takers passed the exam on the first try, compared with 84% of white test-takers. The disparity remained consistent on follow-up attempts, with 91% of white people eventually passing the test, compared to 57% of Black people.
The ability to score well on standardized tests has long been found to reflect societal bias, pushing out people of color from success on qualifying exams for everything from college admission to the legal and pharmaceutical fields. According to a 2021 article published by the National Education Association, standardized tests “have been instruments of racism and a biased system.”
Not every state requires social workers to pass an exam and receive a license to practice, but many do. And even when a license is not required, it can give social workers a strong advantage for upward mobility in their careers.
The pass-rate disparity data rocked the social work field, confirming longstanding concerns about the consequential licensing exams.
“It was both horrible, but also validating,” said Anthony Estreet, CEO of the National Association of Social Workers. “Validating the concerns that people have had for years in terms of the bias that is in the exam, but also validating to those that have failed the exam that it’s not them.”
Adding to the critiques, earlier this year the National Association of Social Workers — representing more than 100,000 members — announced formal opposition to the licensing exams. The influential body called for alternatives that promote “the diversity and well-being of the social work profession, and the health and well-being of the populations social workers serve.”
“An entirely new era”
Artificial intelligence is the latest tool to be used in a critique of the social work licensing exam.
With its ability to mimic human output based on analysis of vast amounts of online data, AI has its own set of controversies, from privacy violations to job loss driven by automation, the spread of misinformation and algorithmic bias. Supporters argue that with sufficient regulation and proper use, the technology can also be a powerful and effective tool.
The Michigan researchers decided to analyze the social work licensing exam with AI after learning of similar analyses conducted on the bar exam and medical licensing tests for doctors. On both, computer systems received high marks.
“We recognize this is an entirely new era of rapidly growing technologies, which necessarily requires the field to be cautious moving forward,” the Michigan researchers state in the report, “Time to Move Beyond the ASWB Licensing Exams.” “Importantly, we want to be clear that we see generative AI models as tools that can help social workers, but we do not think these tools can replace social workers.”
They go on to state some possible ways the emerging technology can inform the field.
For example, the authors describe a section of the social worker practice exam that presents a scenario and asks test-takers to select from four possible responses as to what the social worker should do “next” or “first.”
In it, a foster child has an incarcerated father and a mother in a residential treatment program for alcoholism. The child is described as small for his age, behind on his speech, behind on schoolwork and throwing temper tantrums.
ChatGPT did not select any of the four responses provided on the practice test, which included “screen for fetal alcohol syndrome, develop a behavior modification plan, refer to special education, or pursue family reunification.”
Instead, it offered an answer that was not provided: The social worker should “gather more information and assess the child’s needs.” A comprehensive assessment would best determine the child’s developmental, behavioral and educational needs, as well as any past trauma or neglect, the program determined. Only then could the social worker decide on the appropriate interventions, such as a medical referral, special education services, working with foster parents on a behavior plan, or focusing on reunification with the biological mother, if appropriate.
“We think the response of ChatGPT is the better answer — as do our colleagues with clinical expertise who we consulted — because the scenario does not contain enough contextual information to make an informed decision,” the Michigan study authors concluded.
Another example in which ChatGPT diverged from the ASWB answer key involved a question about the “most likely emotional response” when adult children learn of a parent’s Alzheimer’s diagnosis. The computer-generated response determined “a range of emotional reactions” are likely. “Denial” was the correct test answer — based on the “five stages of grief” introduced by psychiatrist Elisabeth Kübler-Ross in 1969. “ASWB’s correct answer does not have sufficient empirical support to inform practice,” the Michigan study authors concluded, noting that health researchers have cautioned against the continued use of the Kübler-Ross model, which is widely considered to be outdated.
Professor Perron said inadequate context is a fundamental weakness of an exam based on just one correct answer that assesses the ability to perform a skill as subjective as social work. For that reason, he said, older test-takers whose answers may incorporate years of real-world professional and life experience are more likely to struggle with the ASWB exam format, compared with a recent graduate who is answering solely based on their classroom training.
Kubiak, who co-chairs the National Association of Deans and Directors of Schools of Social Work’s task force on the licensure exam, said people of color also bring a different perspective and set of experiences to social work scenarios that may not be captured or fairly assessed in the current licensing exam.
As a potential remedy, the research trio tested ChatGPT’s ability to “grade” short-answer questions. They fed the program a sample question, along with three potential responses, instructing it to determine which of the answers reflected safe and ethical social work practice. They reported that AI was able to do so, correctly identifying a plan to refer a queer client to conversion therapy as “harmful.”
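The grading exercise described above can be sketched in the same way. Again, the prompt wording, the function name and the sample answers are illustrative assumptions rather than the researchers’ actual instructions to the model.

```python
def build_grading_prompt(question: str, answers: list[str]) -> str:
    """Ask a model to evaluate short answers for safe, ethical practice."""
    # Label each candidate answer A, B, C, ... on its own line.
    labeled = "\n".join(
        f"Answer {chr(ord('A') + i)}: {a}" for i, a in enumerate(answers)
    )
    return (
        "You are grading short-answer responses to a social work exam question.\n"
        f"Question: {question}\n"
        f"{labeled}\n"
        "For each answer, state whether it reflects safe and ethical social "
        "work practice, and flag any response that would be harmful."
    )

# Illustrative use: three hypothetical short answers, one clearly harmful.
prompt = build_grading_prompt(
    "How should a social worker respond to a client questioning their "
    "sexual orientation?",
    [
        "Explore the client's feelings in a nonjudgmental, affirming way",
        "Refer the client to a support group if they request one",
        "Refer the client to conversion therapy",
    ],
)
# The model's reply would then be checked against expert judgment, as the
# researchers did when it flagged the conversion therapy plan as harmful.
```

In the researchers’ test, the model correctly singled out the conversion therapy referral, which is what led them to suggest that AI-assisted scoring could make a short-answer format feasible at scale.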
The researchers conclude the takeaway from this exercise was that shifting to a short-answer exam could better include the important context missing from the current multiple-choice format.
Yet as with other data applications in social work, such as predictive analytics, artificial intelligence must be fully analyzed before it is put to use more broadly in the human services field, the researchers concluded, “because generative AI models are just in their infancy and still prone to frequent errors.”
Of primary concern are the massive amounts of internet data AI relies on to form its “intelligence.” Any bias that exists in that data is baked into its output. Mindful of this, the Michigan researchers state, if AI is to be used in future redesigns of the social work licensing exam, it would require experts collaborating with computer scientists to build in guardrails.
“We’re a long way off from using AI in any kind of testing context,” Victor said in an interview. “We just wanted to put it on the table as one potential path forward that we think is worth exploring.”