For several years I’ve been interested in finding and reading academic work in the field of web accessibility. I strongly believe that the things we say about web accessibility must be grounded in rigor, and I hold in higher esteem those who base their statements on fact rather than opinion and conjecture. Unfortunately, I often find much of the academic work in web accessibility to be deficient in many ways, likely because of a lack of experiential knowledge of the professional web development environment. Web development practices change at such a lightning-fast pace that even professional developers have trouble keeping up with what’s new. Academics who aren’t really developers in the first place are likely to have even greater trouble understanding not only the causes of accessibility issues in a web-based system but also how to test for those causes. I deal with those topics specifically 8-10 hours a day, and I still sometimes have to turn to coworkers for advice and collaboration.

This matters because out-of-date knowledge and experience leads to research methods that are also out of date. The most obvious evidence of this is when web accessibility researchers perform automated testing with tools that are out of date and/or technically incapable of testing the browser DOM. Testing the DOM is a vital feature for any accessibility testing tool, especially one used in academic research, because the DOM is what the end user actually experiences. It matters even more when studying accessibility because the DOM is what the accessibility APIs interpret in order to pass information about content and controls to the assistive technology employed by the user. Performing research with a tool that does not test the DOM is like measuring temperature with a thermometer you know to be broken: you have no chance of being accurate.
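
To make that point concrete, here is a minimal sketch of the difference, written by me for illustration rather than taken from any tool discussed here (the function names and the image path are hypothetical). A checker that only parses the HTML the server sends sees no image at all on a page whose content is built by script, while a check run against the rendered DOM catches the missing alt text immediately.

```typescript
// Minimal sketch (my own illustration, not any particular tool's code) of why
// DOM testing matters. The HTML the server sends contains no <img> at all, so a
// tool that only parses the source markup has nothing to flag. After this script
// runs in the browser, the rendered DOM contains an image with no alt attribute,
// a failure only visible to a tool that inspects the live DOM.

// Hypothetical client-side code that builds content after the page loads:
function injectHeroImage(): void {
  const img = document.createElement("img");
  img.src = "/hero.png"; // note: no alt text is ever set
  document.body.appendChild(img);
}

// A DOM-based check runs against what the browser actually rendered:
function findImagesMissingAlt(): HTMLImageElement[] {
  return Array.from(document.querySelectorAll("img")).filter(
    (img) => !img.hasAttribute("alt")
  );
}

injectHeroImage();
console.log(findImagesMissingAlt().length); // 1 issue; a source-only scan reports 0
```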

Recently I’ve been reading a paper titled “Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests”. This compellingly titled paper fails to show any instance of “sole reliance” on automated tests, and it further fails to demonstrate where such sole reliance caused actual “harm” to anyone or anything. Instead, the paper reads as if it were research performed to validate a predetermined conclusion. In doing so, the paper’s authors missed an opportunity for a much more compelling discussion: the vast performance differences between well-known accessibility testing tools. The title alludes to this with “Benchmarking Web Accessibility Evaluation Tools”, but the paper instead focuses on these ideas of “harm” and “sole reliance” while using bad results from bad tools as its evidence.

This point – that testing with automated tools only is bad – is so obvious that it almost seems unnecessary to mention. I’ve worked in accessibility and usability for a decade, and many of those years were spent as an employee of companies that make automated testing tools. I’ve also developed my own such tools and count among my friends others who have developed such tools. Not once do I recall the employees, owners, or developers of any such tools claiming that their automated testing product provides complete coverage. Training materials delivered by SSB BART Group and Deque Systems clearly disclose that automated testing is limited in its ability to cover all accessibility best practices. So, if “sole reliance” on automated testing is actually an issue, a better title for this paper would be “Measuring the Harm of Incomplete Testing Methodologies.” Instead, the reader is presented with what amounts to an either-or proposition through constant mention of the things the automated tools couldn’t find versus what human evaluators found. The paper thus implies that either you use an automated tool and miss a bunch of stuff, or you have an expert evaluate the site and find everything.

This implication begins in the very first paragraph of the Introduction, which states:

The fact that webmasters put compliance logos on non-compliant websites may suggest that some step is skipped in the development process of accessible websites. We hypothesise that the existence of a large amount of pages with low accessibility levels, some of them pretending to be accessible, may indicate an over-reliance on automated tests.

Unfortunately, nowhere else in the paper is any data presented to suggest that the above comments have merit. The fact that “…webmasters put compliance logos on non-compliant websites” could mean the sites’ owners are liars. It could mean the site was at one time accessible but something changed that harmed accessibility. It could mean the sites’ owners don’t know what accessibility means or how to measure it. In fact, it could mean almost anything. Without data to back it up, it really means nothing and is certainly no more likely to be evidence of “over-reliance on automated tests” than of any of the other possibilities. Instead the reader is left with the implied claim that this “over-reliance on automated tests” is the culprit.

Further unsupported claims include:

With the advent of WCAG 2.0 the use of automated evaluation tools has become even more prevalent.

This claim is backed up by no data of any kind. The reader is given no data from surveys of web developers, no sales figures for tools, no download numbers for free tools, not even anecdotal evidence. Instead, the paper continues:

In the absence of expert evaluators, organizations increasingly rely on automated tools as the primary indicator of their stated level.

And again no data is supplied to substantiate this claim. In fact, my empirical data from dealing with over seven dozen clients over the last decade suggests that organizations often don’t do any testing of any kind, much less automated testing. These organizations also tend to lack any maturity of process regarding accessibility in general, much less accessible development, manual accessibility testing, or usability testing. My experience is that organizations don’t “do accessibility” in any meaningful way, automated or not. The true smoking gun, as it were, for this so-called harm from “sole reliance” on automated testing could have been supplied simply by giving the reader actual data surrounding the above claim. It is not supplied, and there is no evidence that such data was even gathered.

Another issue with this paper is its nearly myopic treatment of accessibility as a topic concerned only with users who are blind. The most egregious example comes in the claim, referencing prior work (from 2005), that “Findings indicate that testing with screen readers is the most thorough, whilst automated testing is the least”. Later the paper states that during the expert evaluation process, “If no agreement was reached among the three judges a legally blind expert user was consulted.” While this is followed by a claim that this person is also a web accessibility expert, the paper states that “This protocol goes further and establishes a debate between judges and last resort consultation with end users.” I don’t consider the experience of a single blind user to be the same as “users”, and I further do not consider it likely that this single expert user’s opinion would represent the broad range of other blind users, much less all users with all disabilities. In the United States, the overall rates of disability for vision impairment and hearing impairment are roughly equal, while the rate for mobility impairments is more than double those two combined. Cognitive disabilities account for a larger population than the previous three types combined. Clearly the opinion, however skilled, of a single person who is blind is in no way useful as a means of measuring the accessibility of a website for all users with disabilities.

Further problems have to do with the ad hoc nature of the expert evaluation process:

The techniques used to assess the accessibility of each web page are diverse across judges: evaluation tools that diverge from the ones benchmarked (WAVE2), markup validators, browser extensions for developers (Firebug, Web Developer Toolbar, Web Accessibility Toolbar), screen readers (VoiceOver, NVDA) and evaluation tools based on simulation such as aDesigner[24]

The above passage betrays two rather significant flaws in both the paper itself and the evaluation process. The first is the rather humorous irony that some of the tools listed are, by their nature, automated testing tools. Both the WAVE web service and the WAVE toolbar provide a visual representation of automated test results for the page being tested. Markup validators are automated evaluation tools that access the code and automatically assess whether the markup itself is valid. In other words, the expert evaluation process itself used automated tools, which only reinforces the point that no skilled evaluator would rely solely on the results of automated tools. Adding to the irony, there is no discussion of any evaluation methods other than testing with screen readers, which further supports my argument that this paper has a myopic focus on blindness. The second and more important flaw is that there appears to have been no predefined methodology in place for the evaluation. Instead it appears to be assumed either that the reader will trust the reviewers’ expertise to speak for itself or that a rigorous methodology is unnecessary. Regardless of the reason, the fact that the paper doesn’t supply a detailed description of the expert evaluation methodology is cause to question the accuracy and completeness of, at the very least, the results of that evaluation.

If the purpose of the paper is to evaluate what is found by machines measured against the results uncovered by expert evaluators, then it is critical that the human evaluation methods be disclosed in explicit detail. Based on the information provided, it would appear that the expert evaluation happened in a much more ad hoc fashion, with each expert performing testing in whatever fashion they deemed fit. The problem with this practice is that regardless of the evaluators’ level of expertise, there will always be differences in what is tested and how. The importance of this cannot be overstated, and it is a frequent topic of discussion at every accessibility consulting firm I’ve worked for. The number and kinds of problems discovered can vary significantly depending upon who does the testing, and the looser the consulting firm’s methodology (or the more complete its absence), the more variance there is in what gets reported. In fact, at a previous job one client remarked, “I can tell who wrote which report just based on reading it”. That, to me, is a symptom of a test methodology that lacks rigor. On the upside, the paper does describe a seemingly collaborative workflow in which the judges discuss the issues found, but this is still not the same as having and following a predefined, rigorous methodology. A rigorous manual testing methodology would only have been strengthened further by the judges’ collaboration.

In this same section on Expert Evaluations, the paper states that “Dynamic content was tested conducting usability walkthroughs of the problematic functionalities…” and yet the process of conducting these “usability walkthroughs” is never discussed. The paper does not say how many participants (if any) took part in these usability walkthroughs and does not disclose any details about the participants, their disabilities, their assistive technologies, and so on. Again, the reader is expected to assume this was performed with rigor.

Exacerbating the above, the paper does not provide any details on what the expert evaluation discovered. Some of this data is posted at http://www.markelvigo.info/ds/bench12, but the data provided only discloses raw issue counts, not specific descriptions of what the issues were, where they existed, and why. There is also no discussion of the severity of the issues found. While I realize that listing this level of detail in the paper itself would be inappropriate, sharing the results of each tool and of each expert evaluator at the URL mentioned above would be helpful in validating the paper’s claims. In fact, the paper invalidates the expert evaluation results as a useful standard against which to measure the tools when it states:

Even if experts perform better than tools, it should be noted that experts may also produce mistakes so the actual number of violations should be considered an estimation…

If the experts make mistakes, and the likelihood of such mistakes is so high that “…the actual number of violations should be considered an estimation…”, then the results discovered by these persons are in no way useful as a standard for the subsequent benchmarking of the tools. Remember, the purpose of this paper is to supply some form of benchmark. You can’t measure something against an inaccurate benchmark and expect reliable or useful data.

The description of the approach to automated testing does not disclose the specific version of each tool used or the dates of the testing. The paper also does not disclose what level of experience the users of the tools had with those specific tools or what, if any, configuration settings were applied. The tool version can, at times, be critical to the nature and quality of the results. For instance, Deque’s Worldspace contained changes in version 5 that were significant enough to make a huge difference between its results and those of its predecessor. Similarly, SSB BART Group’s AMP is on a seasonal release schedule that has, in the past, produced big differences in test results. Historically, automated testing tools are well known for generating false positives. The more robust tools can be configured to avoid or diminish this, but whether that was done here is not disclosed. Not disclosing these details makes it difficult to verify the accuracy of the results. Were the results the tools found (or did not find) due to flaws in the tools, flaws in their configuration, or flawed use of the tools? Without more details it isn’t possible to know whether any of these factors influenced the results.

To that point, it also bears mentioning that some of the tools used in this paper do not test the DOM. Specifically, I’m aware that TAW, TotalValidator, and aChecker do not test the DOM. SortSite and Worldspace do test the DOM, and it is alleged that the latest versions of AMP do as well. This means there is a built-in discrepancy between what the tools employed actually test. That discrepancy quite obviously leads to significant differences in the results delivered and, considering the importance of testing the DOM, calls into question the reason for including half of the tools in this study. On one hand, it makes sense to include popular tools no matter what; on the other hand, using tools that are known to be broken sets up a case for a predetermined conclusion to the study. This skews the results and ensures that more issues are missed than should be.

The numerous flaws discussed above do not make this paper altogether worthless. The data gathered is very useful in providing a glimpse into the wide array of performance differences between automated testing tools. The issues I’ve discussed certainly invalidate the paper’s claim to be a “benchmark” study, but it is nonetheless compelling to see the differences between the tools, especially the observation that while a tool may outperform its peers in one area, it may underperform even more significantly in others. The findings paint a picture of an automated testing market where tool quality differs in wild and unpredictable ways that a non-expert customer may be unprepared to understand. Unfortunately, the data behind these conclusions isn’t exposed in a way that facilitates public review. As mentioned, some of it is available at http://www.markelvigo.info/ds/bench12. It is interesting to read the data on the very significant disparities between the tools, and also sad that it has to be presented in a paper that is otherwise seriously flawed and obviously biased.

An unbiased academic study into the utility and best practices of automated testing is sorely needed in this field. I’ve taken my own personal stab at describing what can be tested and how, and I stand by that information. I’ve attempted to discuss prioritizing the remediation of accessibility issues. I’ve recommended a preferred order for different testing approaches. At the same time, none of this is the same as a formal academic inquiry into these topics. We’ll never get there with academic papers that are clearly driven by bias for or against specific methodologies.

Update

Markel Vigo has updated the page I cited above, where some of the data from the paper can be found, with a response to this blog post. Like him, I encourage you to read the paper. In his response, he says:

We do not share our stimuli used and data below by chance, we do it because in Science we seek the replicability of the results from our peers.

My comments throughout this blog post remain unchanged. The sharing of raw issue counts isn’t enough to validate the claims made in this paper. Specifically:

  1. There is no data to substantiate the claim of “sole reliance” on tools
  2. There is no data to substantiate the claim of “harm” done by the supposed sole reliance
  3. There is no data shared on the specific methodology used by the human evaluators
  4. There is no data shared regarding the exact nature and volume of issues found by human evaluators
  5. There is no data shared regarding the participants of the usability walkthroughs
  6. There is no data shared regarding the exact nature and volume of issues found by the usability walkthroughs
  7. There is no information shared regarding the version(s) of each tool and specific configuration settings of each
  8. There is no data shared regarding the exact nature and volume of issues found by each tool individually
  9. There is no data shared which explicitly marks the difference between the exact issues found/not found by each tool vs. human evaluators

It is impossible to reproduce this study without this information.

In his response, Markel states that this blog post makes “…serious accusations of academic misconduct…”. I have no interest in making any such accusation against any person. Throughout my life I’ve developed what is apparently a rare ability to separate the person from their work. I realize that my statement about this paper’s bias can be interpreted as a claim of academic misconduct, but that’s simply an avenue down which I will not travel. Markel Vigo has contributed quite a bit to the academic study of web accessibility, and I wouldn’t dare accuse him or the other authors of misconduct of any kind. Nevertheless, the paper does read as though it were research aimed at a predetermined conclusion. Others are welcome to read the paper and disagree.

Finally, the response states:

Finally, the authors of the paper would like to clarify that we don’t have any conflict of interest with any tool vendor (in case the author of the blog is trying to cast doubt on our intentions).

Let me be clear to my readers: Nowhere in this blog post do I state or imply that there’s any conflict of interest with any tool vendor.

My company, AFixt, exists to do one thing: fix accessibility issues in websites, apps, and software. If you need help, get in touch with me now!