Automated Web Accessibility Testing Tools Are Not Judges
Recently, social media has been abuzz about an article titled “ITIF: 92% of Top Federal Websites Fail to Meet Security, Speed, Accessibility Standards” – and for good reason. The article cites a study by ITIF which details rampant failings across websites of the US Government. American taxpayers, being both the audience for and the source of funding for these systems, have every right to expect those websites to be user-friendly, secure, and accessible. ITIF is to be applauded for doing this rigorous research and reporting this information.
But there’s another, more important thing to understand about this type of exercise: Automated Web Accessibility Testing Tools Are Not Judges.
Before I continue, I want to make sure that first-time readers understand my background on this topic. I have a long history with testing tools. In fact, my introduction to accessibility started with tools (as described here). As my resume shows, I worked for SSB BART Group and Deque, two of the major players in the accessibility testing tools market. I contributed significantly to the development of SSB’s AMP product. I’m the founder of Tenon.io and I’ve been doing accessibility consulting, accessibility testing, training, and web development for over a decade-and-a-half. It is – quite literally – my job to know what can and cannot be done with automated accessibility testing tools and I seek to stretch those boundaries every chance I can.
This cannot be said often enough or loudly enough: there are just too many things in accessibility that are too subjective and too complex for a tool to test with enough accuracy to be considered a judgment of a system’s level of accessibility. An automated testing tool cannot even tell with 100% certainty whether or not a web page passes WCAG 2.0 Success Criterion 1.1.1.
1.1.1 Non-text Content: All non-text content that is presented to the user has a text alternative that serves the equivalent purpose…
Read that again. “All non-text content”, which is defined as:
any content that is not a sequence of characters that can be programmatically determined or where the sequence is not expressing something in human language
Note: This includes ASCII Art (which is a pattern of characters), emoticons, leetspeak (which uses character substitution), and images representing text
The non-text content must have “… a text alternative that serves the equivalent purpose…”. Not only that, but the text must be programmatically determinable. If your text alternative is not truly equivalent, your document is non-conforming.
Think about this for a second. There are a lot of cool emerging technologies that allow computers to recognize the objects in an image. We can detect text inside of an image and we can even locate specific people’s faces. But add a lot of complexity to an image, and most computer vision falls apart. Even OCR, which has existed for around 50 years, can’t read the logo for a Death Metal band. And even if computer vision were perfect, we can’t determine why any specific “non-text content” was chosen. What was the web author trying to convey with that non-text content? Why was that specific non-text content chosen over other options? What benefit will that non-text content have for the user who can see it? Is that non-text content there for decoration, or is it critical to the content? Tenon has 25 tests for WCAG 1.1.1 and we have maybe a dozen more on our wishlist. Even then, we’ll never be able to determine the meaning that the user intended to convey via non-text content and therefore we’ll never be able to fully judge conformance with 1.1.1.
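To make that limitation concrete, here’s a minimal sketch (in Python, using only the standard library – this is an illustration, not any real tool’s implementation) of roughly what an automated 1.1.1 check can do. It can flag an image with no programmatically determinable text alternative; it has no way to verify that an alternative which *is* present actually serves an equivalent purpose.

```python
# Minimal sketch of what an automated WCAG 1.1.1 check can and cannot do.
# It can flag images with no programmatically determinable text alternative;
# it cannot judge whether a present alternative is actually equivalent.
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        if "alt" not in attrs:
            self.issues.append("img missing alt attribute")
        # An empty alt="" is valid for decorative images -- but a tool
        # cannot tell whether the image really is decorative.

checker = AltChecker()
# A picture of a dog labeled "cat": syntactically fine, semantically wrong.
checker.feed('<img src="dog.jpg" alt="cat"> <img src="logo.png">')
print(checker.issues)  # only the second image is flagged
```

Note that the first image sails through: the tool sees a non-empty `alt` attribute and has no basis for questioning it.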
There’s only so much that can be tested for automatically. Some things are so subjective or complex that trying to test for them would result in so many false positives vs. accurate findings that having the test would do more harm than good (like testing for 1.3.3). The team over at the UK Government Digital Service have done a great job of demonstrating the limitations of automated tools.
One thing that is missing across all automated testing tools is contextual understanding. Automated accessibility testing tools provide the ability to test against highly specific heuristics that are very tightly scoped. They have no insight into the broader document being tested. They have no ability to determine the specific purpose of the page as a whole. I have been present at usability tests where test scenarios failed because of one highly important technical failing that – within its own context – was relatively minor. Nevertheless, it caused all participants to fail the test scenario.
What can you base a grade on?
Given the above, we’re already at a significant disadvantage if we rely on an automated tool. Again, we can’t even definitively prove conformance with any given WCAG SC, so grading any specific SC as a “pass” is wholly impossible. This is important to consider, because it means that even if your tool of choice returns zero errors, you can still be non-conformant. A picture of a dog with an alt attribute of “cat” will pass all automated accessibility tools, despite being completely inaccurate even in terms of what is displayed in the image.
At this point, the only thing you can “grade” is one document’s failures against another’s – turning this from an exercise in deriving an absolute grade to an exercise in deriving a grade relative to other documents. An absolute claim of conformance would require an ability to fully test all of the criteria necessary for conformance. For instance, if a product contains a mark from Underwriters Laboratories, it means the product has been found to conform to all of the defined safety standards for that type of product. Because no automated accessibility testing tool can completely test the criteria for WCAG conformance, no absolute “grade” derived from an automated tool is at all relevant or accurate, which is why, at best, we’re left with relative grading.
Once you’ve settled on relative grading, what do you base that grade on?
- Using tests passing vs. tests failing fails to consider the volume and severity of the issues found for the tests that did fail.
- If you base it on issues per page, you’re failing to consider complexity differences between tested pages. You’re also failing to consider the severity of each issue.
- You can break down issues-by-page-by-severity, but the severity of an issue varies significantly based on the type of user impacted. Bad alt attributes don’t impact users who are color blind. A lack of keyboard accessibility doesn’t impact a mouse user or a voice dictation user who can still use Dragon’s mouse grid. The priority score calculated by Tenon handles this pretty well, but the truth remains that a generic measure of severity fails to consider the full specific nature of an issue.
- When it comes to complexity differences of each page, you can try to use Issue Density – that is, how many issues exist per kilobyte of document source. We’ve even proposed a mechanism for doing that. Unfortunately relative grading based on issue density is of little value, because even with a statistically significant sample size, standard deviation is ridiculously high.
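The high-variance problem with issue density can be illustrated with a toy example (the numbers below are invented for illustration; they are not Tenon data). Two sites can have nearly identical mean densities while one site’s spread across pages is so wide that ranking them against each other on the mean alone is meaningless.

```python
# Illustrative sketch: issue density (issues per KB of page source) across
# sampled pages of two hypothetical sites. All figures are made up.
import statistics

site_a = [0.2, 0.3, 0.25, 8.0, 0.1]   # issues per KB, page by page
site_b = [1.5, 1.8, 1.6, 1.7, 1.75]

for name, densities in [("A", site_a), ("B", site_b)]:
    mean = statistics.mean(densities)
    sd = statistics.stdev(densities)
    print(f"site {name}: mean={mean:.2f} issues/KB, stdev={sd:.2f}")

# Site A's single outlier page inflates both its mean and its standard
# deviation (stdev ~3.5 vs ~0.12 for site B), so the two means -- 1.77
# and 1.67 -- tell you almost nothing about which site is "better".
```

With a standard deviation that dwarfs the difference between the means, any ranking built on these numbers is noise.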
There are certainly ways to combine metrics in a way that can be used to perform relative scoring of one site against the next but it would require:
- Testing all pages of each site.
- Using a baseline derived from a statistically significant-sized sample of all other sites on the web and testing every page of those sites.
- Gathering and measuring the relevant metrics (assuming your tool gives you the necessary granularity, such as issue-density-by-severity-by-affected-population).
- Using a tool that uses the browser DOM, tests accurately and without false positives.
And even then, all you’d be left with is – at best – a “sniff test” based on the comparatively small number of things that an automated testing tool can test for. For the most part, you’d still have absolutely no idea how any of the sites performed for users who are deaf or hard-of-hearing, for instance.
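For the sake of argument, a back-of-the-envelope sketch of what such a combined relative score might look like follows. The severity weights, field names, and data here are assumptions made up for illustration – this is not Tenon’s actual priority model.

```python
# Hypothetical relative score: weight each issue by severity and by the
# share of users it affects, then normalize by document size in KB.
# Weights, field names, and data are illustrative assumptions only.
SEVERITY_WEIGHT = {"critical": 5, "serious": 3, "minor": 1}

def relative_score(issues, page_kb):
    weighted = sum(
        SEVERITY_WEIGHT[i["severity"]] * i["affected_population"]
        for i in issues
    )
    return weighted / page_kb  # weighted issues per KB of source

issues = [
    {"severity": "critical", "affected_population": 0.4},  # e.g. keyboard trap
    {"severity": "minor", "affected_population": 0.1},
]
print(relative_score(issues, page_kb=50))  # ~0.042 weighted issues per KB
```

Even a formula like this inherits every weakness discussed above: it only counts what the tool can detect, and the severity and population figures are themselves crude generalizations.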
Automated accessibility testing tools are not judges
Given all I’ve said above, it seems the only reasonable conclusion is that using an automated web accessibility testing tool to “grade” websites’ accessibility is an exercise in futility. The only claim you could make is that “given the things that can be tested for by tool x, the following issues surfaced…”. In other words, automated testing tools are great at diagnosing issues.
Automated accessibility testing tools have an important role to play in ensuring that web-based systems are accessible. They offer a level of efficiency that cannot be matched by humans and they perform at a scale no amount of humans could match. Since launching in 2014:
- Tenon averages over 1000 customer test runs per day (including holidays and weekends).
- We see non-bot traffic 24/7/365, across every timezone on the planet, from 66 distinct countries.
- Tenon has logged over 51,000,000 distinct issues across more than 1 million distinct URLs on approximately 30,000 distinct domains.
The real power of automated website accessibility testing tools is in their ability to quickly and accurately detect the specific issues they were programmed to find. They are a vital component of any robust website QA testing process. They cannot be used to “judge” or “grade” the level of accessibility of a system.