At the end of 2020, ISO published the technical report ISO/IEC TR 29119-11. Its title is “Guidelines on the testing of AI-based systems”.
A lot of companies want to know how to test their devices based on artificial intelligence (AI) techniques according to the state of the art. That’s why ISO/IEC TR 29119-11 has set out to provide concrete guidance on how to do this and thus to describe the state of the art. But does the standard achieve this aim?
If you can’t wait for the answer, go straight to section 3 “Summary”.
a) Scope of application
ISO/IEC 29119-11 is part of a series of standards on software testing. For example, part 2 describes the test processes, and part 4 the test techniques. Part 11, which we’re going to look at here, is intended to provide guidance for testing AI-based software, irrespective of the
b) Structure of the standard
ISO 29119-11 is 60 pages long and consists of 10 sections and an annex.
Although the title of the first section is “Scope”, it spends more time describing the goals and contents of the standard.
The standard does not mention any normative references.
The third section contains 88 definitions. That’s a lot. But there is largely no reference to sources. And some definitions are surprising to say the least. For example, experts in machine learning understand the term “false positive” somewhat differently to the authors, who define the term from the software tester’s point of view:
“incorrect reporting of a pass when in reality it is a failure”
ISO/IEC TR 29119-11 Section 3
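To make the contrast concrete: in the usual machine-learning sense, a false positive is a negative case that the model predicts as positive. The following minimal sketch counts it that way. The labels and the function name `confusion_counts` are made up for illustration and do not come from the standard.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1  # ML false positive: negative case flagged as positive
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1  # ML false negative: positive case that was missed
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical labels: 1 = "cancerous tissue", 0 = "healthy tissue"
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

Under this convention, the second sample (healthy tissue flagged as cancerous) is the false positive. The tester's definition quoted above instead describes a test verdict, not a model prediction.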
The very high-level introduction in section four covers, among other things, use cases, types of AI models, AI frameworks such as TensorFlow, and regulatory standards. Opinions may be divided on the timeliness, content and selection of these lists.
There are almost no regulations relevant to medical devices.
Read more on the regulatory requirements for the use of ML in medical devices.
The authors introduce the quality model described in ISO 25010 and come to the surprising conclusion:
However, AI-based systems have some unique characteristics that are not contained with this quality model, such as flexibility, adaptability, autonomy, evolution, bias, transparency/interpretability/explainability, complexity and non-determinism.
ISO/IEC TR 29119-11 Section 4
ISO/IEC TR 29119-11 even calls these attributes “non-functional characteristics.”
The assessment “not contained within this quality model” is hard to understand for several reasons:
The authors of ISO/IEC 29119-11 see the creation of specifications as particularly challenging. For example, the desired system output is often not known, which is also a problem when creating the test oracle.
A medical device manufacturer who tried to claim anything like this would be likely to run into problems with their notified body. The manufacturer would have to be able to specify precisely for test data whether the system should, for example, be capable of detecting cancerous tissue on a specific CT image or not. The manufacturer has to specify this during labeling.
ISO 29119-11 considers the concepts of unit, integration and system tests to be transferable to software that contains AI components. That is understandable. The standard does not give any specific instructions for how to perform these tests; it refers to the later sections.
With regard to unit testing, ISO 29119-11 does not distinguish between
From a software engineering perspective, this makes sense. From a regulatory perspective, however, the two cases have to be differentiated. With ISO 13485 and IEC 62304, two different standards actually apply.
In this context, pay attention to the regulatory requirements of ISO 13485 on computerized systems validation.
The seventh section, unlike the other sections, only talks about machine learning and not artificial intelligence. The standard does not reveal the reasons for this switch.
Anyone hoping to find concrete guidance on how to test ML systems in this section will be disappointed. The authors give each aspect, such as “test data quality,” its own sub-section but devote only two or three relatively general sentences to it.
To use one example, instead of guidance, the subsection on adversarial attacks contains a description of these attacks but no information on how you can test the robustness (ISO 25010 criterion) of a system against these attacks.
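As an illustration of what such a robustness test could look like, here is a minimal sketch that uses the fast gradient sign method (FGSM), one well-known attack, against a tiny logistic regression model. The standard does not prescribe this or any other procedure. The weights, test samples, perturbation size `eps` and all function names are assumptions chosen for the example.

```python
import math


def predict_proba(w, b, x):
    """Probability of the positive class for a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: shift each feature by eps in the direction that raises the loss."""
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]  # gradient of the log-loss w.r.t. the input
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def robust_accuracy(w, b, samples, eps):
    """Accuracy on inputs that were perturbed with FGSM before prediction."""
    correct = 0
    for x, y in samples:
        x_adv = fgsm(w, b, x, y, eps)
        correct += int((predict_proba(w, b, x_adv) >= 0.5) == bool(y))
    return correct / len(samples)


# Hypothetical model and labeled test data (features, label)
w, b = [2.0, -1.0], 0.0
samples = [([1.0, 0.0], 1), ([0.2, 0.1], 1), ([-1.0, 0.5], 0), ([-0.1, 0.2], 0)]

clean = robust_accuracy(w, b, samples, eps=0.0)     # no perturbation
attacked = robust_accuracy(w, b, samples, eps=0.2)  # perturbed inputs
print(clean, attacked)
```

A test could then compare `attacked` with an acceptance threshold that the manufacturer derives from risk management. In this toy data set, the two samples close to the decision boundary flip their class under the perturbation.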
Section 8 is also fairly superficial. For example, the sub-section on combinatorial testing describes in a few sentences what this involves. However, ISO 29119-11 does not give any details on what specifically needs to be done and how or whether combinatorial testing can be used for image data.
You can find a description of combinatorial testing here (German)
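For readers who have not met the technique, the following sketch shows one simple form of combinatorial testing: a greedy selection of test cases so that every pair of parameter values occurs at least once (pairwise testing). The parameters are hypothetical, and the greedy strategy is just one of several possible algorithms. The sketch does not answer the open question of how to apply the idea to raw image data.

```python
from itertools import combinations, product


def pairs_of(test, names):
    """All pairs of (parameter, value) settings that a single test case covers."""
    return {((a, test[a]), (b, test[b])) for a, b in combinations(names, 2)}


def pairwise_suite(params):
    """Greedily pick full combinations until every value pair is covered."""
    names = sorted(params)
    candidates = [dict(zip(names, values))
                  for values in product(*(params[n] for n in names))]
    uncovered = set().union(*(pairs_of(c, names) for c in candidates))
    suite = []
    while uncovered:
        # Take the candidate that covers the most pairs still missing
        best = max(candidates, key=lambda c: len(pairs_of(c, names) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best, names)
    return suite


# Hypothetical test parameters (3 x 2 x 2 = 12 exhaustive combinations)
params = {
    "modality": ["CT", "MRI", "X-ray"],
    "contrast_agent": [True, False],
    "patient_position": ["prone", "supine"],
}
suite = pairwise_suite(params)
for case in suite:
    print(case)
```

The resulting suite covers every pair of values with fewer test cases than the 12 exhaustive combinations. The saving grows quickly as the number of parameters increases.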
Some information on the following can be found in other subsections:
This section is indeed called “White-box testing of neural networks” and not “White-box testing of AI based systems.” It’s not clear why the authors restrict white-box testing to neural networks only.
This section briefly explains what neural networks, neurons, weights and hidden layers are. It then focuses on test coverage. It introduces several metrics:
ISO 29119-11 dedicates a couple of sentences to each one. However, the standard does not explain how valid these metrics are for different types of input (images, tabular data, text) or for different neural network architectures. This would have been helpful, especially since a lot of developers don’t rely only on fully-connected layers.
The authors do not address the role of activation functions either. Defining a neuron as activated when its output is greater than zero may work for a ReLU function, but this approach will fail with a sigmoid function, whose output is always greater than zero.
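The effect is easy to reproduce. The sketch below assumes that neuron coverage is the share of neurons whose output exceeds a threshold for at least one test input, which is a common reading of the metric but is our assumption here. The pre-activation values and the function names are invented for the example.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_coverage(pre_activations, activation, threshold=0.0):
    """Share of neurons whose output exceeds `threshold` for at least one input."""
    n_neurons = len(pre_activations[0])
    covered = [False] * n_neurons
    for row in pre_activations:
        for j, z in enumerate(row):
            if activation(z) > threshold:
                covered[j] = True
    return sum(covered) / n_neurons


# Hypothetical pre-activation values: 3 test inputs x 4 neurons
pre = [[1.2, -0.5, -2.0, -0.1],
       [0.3, -1.5, -0.7, -3.0],
       [-0.4, 0.8, -1.1, -0.2]]

relu_cov = neuron_coverage(pre, relu)                    # 0.5
sig_cov = neuron_coverage(pre, sigmoid)                  # 1.0
sig_cov_05 = neuron_coverage(pre, sigmoid, threshold=0.5)  # 0.5
print(relu_cov, sig_cov, sig_cov_05)
```

With a threshold of zero, the sigmoid layer reports full coverage for any test set, so the metric says nothing about test quality. Only a threshold adapted to the activation function (0.5 in this sketch) yields a meaningful value.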
References to corresponding literature are also largely missing.
The last section discusses test environments. It gives some advantages of virtual test environments.
One suggestion is helpful: the test environments should be based on the specific problems (accident reports, issues) not just the system requirements.
However, for medical devices, it is not primarily post-market data that should determine test environment selection but risk management.
The standard does not describe how manufacturers should draw conclusions about the test environment from, for example, accident reports.
ISO/IEC TR 29119-11 is, as its name already indicates, a technical report. It is common for technical reports to provide background information as well as specific requirements.
a) Non-specific and missing guidance
For a technical report, ISO/IEC TR 29119-11 stays at a very superficial level. The standard can act as an introduction even if a lot of it is explained more precisely, in more detail and more comprehensibly elsewhere.
Because ISO/IEC TR 29119-11 only touches on all the concepts, readers are not told what they can actually do. For example, the standard states that the quality of ML systems depends on the quality of the test and training data. This should come as no surprise to anyone. The standard requires:
The selection of training data in terms of the size of the dataset and characteristics such as bias, transparency and completeness should be documented and justified and confirmed by experts where the level of risk associated with the system warrants it (e.g., for critical systems).
ISO/IEC TR 29119-11 Section 7.5
What specifically should manufacturers do now?
Some sentences are almost tautologous. For example:
a system can be tested for bias by the use of independent testing using bias-free testing sets
ISO/IEC TR 29119-11 Section 6.1.8
b) Incomplete
No standard can claim to be exhaustive. The extent to which ISO 29119-11 leaves out existing knowledge is nevertheless astonishing. For example:
c) Not always comprehensible structure
Both the “macroscopic level” of the standard (e.g., section structure) and its “microscopic level” (individual sentences and bulleted lists, use of terms) raise doubts about the standard’s conceptual integrity.
d) Relevance
How relevant the predictions on the use of AI, given to two decimal places and dating from 2018, will be in a few years’ time is open to debate.
The same is true for the coverage levels selected and the focus on neural networks.
A lot of implicit restrictions (e.g., to some of the activation functions) further narrow the actual scope of the standard.
People who compile knowledge, structure it, and shape it into a standard deserve our recognition and thanks. This work is mostly done on a voluntary basis.
ISO 29119-11 looks like it was written by people with in-depth knowledge of software engineering. It doesn’t exude expertise in machine learning to the same degree.
To charge CHF 178.00 for a standard of this quality doesn’t seem reasonable.
For medical device manufacturers and auditors, it should be clear:
ISO/IEC TR 29119-11 does NOT describe the state of the art. Therefore, it should NOT be requested during audits.
Whether you want to spend your money buying the standard and then take the time to read it or would rather invest both in testing your AI-based systems is an individual decision.
In a future article, the Johner Institute will provide some concrete guidance on testing ML-based software. It has already published a comprehensive guide in the form of its free AI guidelines that are also used, in modified form, by notified bodies and will be used as the basis for a future WHO guideline.