Recently a friend shared with me a “Custom GPT” called “Accessibility Copilot”. It is one of many interesting applications of ChatGPT arriving on the market at breakneck speed, but unfortunately it is not something I would rely on for effective remediation of accessibility issues.

I’d like to preface this by saying that I’m not one to pooh-pooh AI. In fact, I have a paid subscription and use ChatGPT on a regular basis. I’ve used it to write marketing material, to improve existing code I’ve written, and on new development projects. ChatGPT is immensely useful, especially in the hands of someone who is already an expert in the thing they’re asking it to do. So why am I saying it isn’t ready to handle web accessibility remediation? Mostly because ChatGPT doesn’t have an opinion.

Why opinion is important and why ChatGPT doesn’t have one

Consumers of accessibility testing products are not using them merely to find bugs in their systems, but to fix those bugs. To do so, they need to know three things:

  1. What the issue is
  2. Where the issue is
  3. How to fix the issue

The importance of an opinion is mostly demonstrated in #1 and #3, above. Standards such as WCAG seek to remove the influence of opinion in #1 – and rightly so. Since WCAG 2.0, the W3C has aimed to make the requirements testable, and a key criterion for testability is that the requirements be both granular enough and clear enough that a test can be performed with a clear pass or fail result. That said, there are still some situations where enough subjectivity exists to make passing or failing a matter of opinion. #3, above, is where the most subjectivity exists in accessibility.

The W3C has created hundreds of techniques describing possible ways to conform to the standard. These techniques are by no means complete and are certainly not the only ways one could conform; nonetheless, there are dozens of techniques cataloged for WCAG 1.1.1 alone. How you conform to WCAG depends on myriad factors that are difficult to predict, which is exactly why the W3C isn’t prescriptive about which approaches to take.

Currently, one of the bigger shortcomings of accessibility testing tools is that although they all do a mostly good job of finding issues, they do not provide the exact code necessary to repair each issue found, opting instead to show an example of what passing code should look like.

This seems like exactly the type of thing ChatGPT would be perfect for. Unfortunately, it is also the area that demonstrates why generative AI is unable to handle accessibility remediation and why opinion is so important.

To understand why, it is important to understand how generative AI (like ChatGPT) works. At its simplest, generative AI works by predicting the most likely output in response to the input. As a user, getting good responses from ChatGPT depends largely on providing a good prompt. Even then, what comes back isn’t a programmed response but a prediction of what the proper response should be. As evidence, consider how bad ChatGPT is at math and how prone it is to simply making things up. It should be no wonder, therefore, that generative AI is poor at providing guidance on topics, like accessibility remediation, where there’s a fair amount of subjectivity.

How ChatGPT performs at providing remediation guidance

To check how useful ChatGPT was at providing remediation guidance, I ran several random URLs through Tenon. Tenon can export a CSV file with all issue information, including boilerplate issue descriptions and remediation guidance. The issue export from this testing was used to create the prompts fed to ChatGPT. Each prompt was structured to provide Tenon’s Issue Description, Remediation Guidance, and Error Snippet.

Each prompt provided to ChatGPT used the following structure:

This piece of code has accessibility errors. The error is: {issueDescription}, {remediationGuidance} The code that has the error is: `{errorSnippet}`. Revise the code so that it no longer has the error and explain what was done to fix the error.
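
As a hypothetical illustration (the issue description and guidance below are my own paraphrase, not Tenon’s actual boilerplate), a filled-in prompt might have read:

```
This piece of code has accessibility errors. The error is: This image has no alt attribute, Add an alt attribute whose value conveys the purpose of the image, or use an empty alt if the image is purely decorative. The code that has the error is: `<img src="logo.png" title="Acme Corporation">`. Revise the code so that it no longer has the error and explain what was done to fix the error.
```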

Although the prompts provided to ChatGPT were quite specific and even included some boilerplate remediation guidance, ChatGPT’s responses still tended to lack “understanding” of accessibility. For example, when given a form `<input>` that had no label but did have a `placeholder`, ChatGPT used the `placeholder` value to construct a label that was explicitly associated with the field while also keeping the existing `placeholder`. While this is an effective solution to the error, the best solution would also involve either removing the `placeholder` or changing its value to resemble the expected input.
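
To make that concrete, here is a minimal sketch of the scenario; the field name and wording are illustrative assumptions, not markup from the pages I tested:

```html
<!-- Hypothetical starting point: an input with a placeholder but no label -->
<input type="text" id="email" name="email" placeholder="Email address">

<!-- The kind of fix ChatGPT produced: a label is added, but the placeholder stays as-is -->
<label for="email">Email address</label>
<input type="text" id="email" name="email" placeholder="Email address">

<!-- The stronger fix described above: the label carries the field's name, and the
     placeholder is either removed or changed to show the expected input format -->
<label for="email">Email address</label>
<input type="text" id="email" name="email" placeholder="name@example.com">
```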

The above example is relatively minor but highlights the fact that ChatGPT lacks “opinion”. The GIGO principle applies strongly when using ChatGPT: it frequently left other accessibility issues behind – or created new ones – when resolving an issue. For example, for images that lacked an `alt` attribute but did have a `title`, ChatGPT would use the `title` attribute value as the value of a newly inserted `alt` attribute while keeping the `title`. That result disagrees with the guidance I would give a customer, because it risks a situation where the same text is repeated by text-to-speech software. Worse still are instances where it would add innerText to a link containing an image while also using that same text to construct an `alt` attribute for the image. While the former example’s duplicated read is subject to user settings, the latter is not.
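
Sketched with hypothetical markup (the image and link content are illustrative, not taken from the tested pages), the two patterns looked roughly like this:

```html
<!-- Image with a title but no alt -->
<img src="logo.png" title="Acme Corporation">

<!-- The kind of fix ChatGPT produced: the title is copied into a new alt and kept,
     risking the same text being announced twice depending on user settings -->
<img src="logo.png" title="Acme Corporation" alt="Acme Corporation">

<!-- The worse case: a linked image where ChatGPT added link text *and* used the
     same text for the alt attribute, guaranteeing a duplicated read -->
<a href="/">
  <img src="logo.png" alt="Acme Corporation">
  Acme Corporation
</a>

<!-- One way to avoid the duplication: let the visible link text carry the name
     and mark the image as decorative with an empty alt -->
<a href="/">
  <img src="logo.png" alt="">
  Acme Corporation
</a>
```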

The issues from which the prompts were derived were hand-picked by me as distinct examples of things that I expected to be fixable via automation. Of the 79 prompts I fed into ChatGPT, I cataloged problems in 41 of the responses:

  • Fixed, but did not recognize other issues in the error code that needed repair: 11
  • Inaccurate, probably because it did not recognize the intent of the code: 9
  • Guidance is acceptable but not opinionated: 8
  • Following the guidance would create new accessibility issues: 6
  • Wholly inaccurate guidance: 7

An error rate of 52% in any process is unacceptable, even compared with high-volume tasks performed by humans, where error rates tend to be 10% – or even lower – for many tasks. For example, exceeding a 3% error rate in captioning means users will barely be able to comprehend the main concepts and facts presented.

Remediation is still a task for humans

While I’m personally excited about the future prospects of using AI for creating accessible websites and software, I remain skeptical that it will be of much use in repairing accessibility errors in existing code. Doing so requires understanding the overall context of the issues found and the ability to form opinions about the best approach to fixing them, given that context.