Regulatory Requirements for Medical Devices with Machine Learning

Manufacturers of medical devices with machine learning are faced with the difficult task of having to demonstrate the conformity of their devices.

This is challenging for many manufacturers. They know the laws, but which standards and best practices do they need to pay attention to in order to demonstrate conformity and be able to talk to authorities and notified bodies on an equal footing?

This article will save you hours of research. It gives you an overview of the most important regulations and best practices that you need to know, sparing you hundreds of pages of reading.

If you pay attention to these regulations, you will be well prepared for the next audit.

1. Legal requirements for the use of machine learning in medical devices

a) MDR and IVDR

There are currently no laws or harmonized standards that specifically regulate the use of machine learning in medical devices. Nevertheless, these devices must comply with the existing regulatory requirements set out in the MDR and IVDR, such as:

  • The manufacturers must demonstrate the benefit and performance of the medical device. For devices that are used for diagnostic purposes, the diagnostic sensitivity and specificity, for example, must be demonstrated.
  • In Annex I, the MDR obliges manufacturers to ensure the safety of their devices. This includes ensuring that the software has been developed in a way that ensures repeatability, reliability, and performance (see, for example, MDR Annex I, 17.1 and IVDR Annex I, 16.1, respectively).
  • The manufacturer must provide a precise intended purpose (MDR/IVDR Annex II) and validate the device against the intended purpose and stakeholder requirements and verify it against the specifications (see, for example, MDR Annex I, 17.2 and IVDR Annex I, 16.2, respectively). They must also describe the methods they will use to do this.
  • If the clinical evaluation is based on a comparator device, this device must be technically equivalent. Demonstrating this equivalence explicitly requires an evaluation of the software algorithms (MDR Annex XIV, Part A, Section 3). This is even more difficult in the case of the performance evaluation of IVD medical devices. A clinical performance study can only be omitted in well-justified cases (IVDR Annex XIII, Part A, Section 1.2.3).
  • The development of software that will be part of the device must take into account “the principles of development life cycle, risk management, including information security, verification and validation” (MDR Annex I, 17.2 and IVDR Annex I, 16.2, respectively).
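
As an illustration of the first requirement above, diagnostic sensitivity and specificity can be computed from a labeled test set as in the following minimal Python sketch. The function names and the choice of a Wilson score interval to express statistical uncertainty are assumptions made for this illustration; neither the MDR nor the IVDR prescribes a particular method.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = condition present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95 %)."""
    if n == 0:
        raise ValueError("no samples")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half
```

With very small test sets the interval becomes wide, which is one way of showing why a point estimate alone is rarely convincing evidence of performance.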

b) (Harmonized) standards without specific reference to machine learning

The MDR and IVDR allow conformity to be demonstrated using harmonized standards and “common specifications.” For medical devices that use machine learning techniques, manufacturers should observe the following standards:

  • ISO 13485:2016
  • IEC 62304
  • IEC 62366-1
  • ISO 14971
  • IEC 82304-1

These standards contain specific requirements that are also relevant for medical devices with machine learning, e.g.:

  • The software used for data collection and processing, for labeling, and for the training and testing of models must be validated (computerized systems validation (CSV) according to ISO 13485:2016, 4.1.6).
  • Before development, manufacturers must determine and ensure the competence of the people involved (ISO 13485:2016 7.3.2 f).
  • IEC 62366-1 requires that manufacturers precisely characterize the intended users, the intended use environment, and the intended patients, including their indications and contraindications.
  • Manufacturers who use software libraries (which is almost always the case for software with machine learning) must specify and validate these libraries as SOUP/OTS (IEC 62304).
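
The last point above can be supported by tooling. The following sketch, using only the Python standard library, records which versions of the named libraries are actually installed and compares them with the versions listed in a SOUP specification. The function names and the structure of the specification (a simple name-to-version mapping) are assumptions for this illustration. Recording versions is only one part of specifying SOUP; it does not replace the validation itself.

```python
from importlib import metadata


def soup_inventory(package_names):
    """Return {package name: installed version, or None if not installed}."""
    inventory = {}
    for name in package_names:
        try:
            inventory[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            inventory[name] = None
    return inventory


def deviations_from_specification(inventory, specified_versions):
    """List (name, specified, installed) for every library that deviates.

    `specified_versions` is a hypothetical mapping taken from the SOUP list
    in the technical documentation, e.g. {"numpy": "1.26.4"}.
    """
    deviations = []
    for name, expected in specified_versions.items():
        installed = inventory.get(name)
        if installed != expected:
            deviations.append((name, expected, installed))
    return deviations
```

Running such a check in the build pipeline would make it visible when the training or runtime environment no longer matches the documented SOUP versions.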

Further information

Please read the article on the validation of ML libraries.

c) USA: FDA

The FDA has established comparable requirements, especially in 21 CFR Part 820 (including Section 820.30 on design controls). Numerous FDA guidance documents, including the documents on “software validation”, the use of off-the-shelf (OTS) software, and cybersecurity, are mandatory reading for companies that want to sell medical devices that are or contain software in the USA.

The FDA draft “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD)” is also mandatory reading.

Further information

You can find a detailed description of this framework in the article on “Artificial Intelligence in Medicine.”

d) China: NMPA

The Chinese NMPA has released the draft of the “Technical Guiding Principles of Real-World Data for Clinical Evaluation of Medical Devices” for comment.

The document is currently only available in Chinese, but we have had the table of contents translated automatically for you.

Download: China-NMPA-AI-Medical-Device

The document addresses:

  • Requirements analysis
  • Data collection and processing
  • Design of the model
  • Verification and validation (also clinical validation)
  • Post-market surveillance

The authority is also building up its staff and has established an “AI Medical Device Standardization Unit”. This unit is responsible for the standardization of terminology, technology and processes for development and quality assurance.

e) Japan

The Japanese “Ministry of Health, Labour and Welfare” is also working on AI standards. Unfortunately, the authority only publishes the progress reports on these efforts in Japanese. (Translation programs will help though.) No concrete results have been published yet.

2. Standards and best practices relevant to machine learning

a) “Artificial Intelligence in Healthcare” from COCIR

COCIR published the document “Artificial Intelligence in Healthcare” in April 2019. It refers to existing requirements rather than providing new ones and recommends the development of standards.

Conclusion: not very helpful

b) IEC/TR 60601-4-1

Technical Report IEC/TR 60601-4-1 gives guidance for “Medical electrical equipment and medical electrical systems employing a degree of autonomy.” This guidance, however, is not specific to medical devices that use machine learning.

Conclusion: slightly helpful

c) “Good Practices” from Xavier University

Xavier University has published the document “Perspectives and Good Practices for AI and Continuously Learning Systems in Healthcare.”

As the title makes clear, it is (also) about continuously learning systems. Nevertheless, many of the best practices mentioned can also be transferred to systems that do not learn continuously:

  • Define the performance requirements at the start
  • Collect information and gain understanding of how the system learns over time
  • Use a professional software development process, including verification and validation
  • New data that causes the system to learn/change should be subject to systematic quality control
  • Establish limits within which the algorithm can change over time
  • Define what can trigger changes to the algorithm
  • Develop the system so that it monitors its own performance in real time and reports the results to the user at regular intervals
  • Give users the ability to reject an algorithm update and/or roll back to a previous algorithm version
  • Users should be informed every time the learning has caused a significant change in behavior, and the change is clearly described
  • Make it clear how an algorithm has evolved and how it has reached a decision

This traceability/interpretability, in particular, is a challenge for many manufacturers.
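
Several of the practices listed above (limits within which the algorithm may change, the ability to reject an update, and the ability to roll back) can be expressed in code. The following is a minimal sketch under stated assumptions: the class names, the choice of sensitivity and specificity as the monitored metrics, and the limit values used in any example are hypothetical and are not taken from the Xavier University document.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    name: str
    sensitivity: float
    specificity: float


@dataclass
class ModelRegistry:
    """Accepts model updates only within predefined limits (hypothetical design)."""
    min_sensitivity: float   # absolute lower limit
    min_specificity: float   # absolute lower limit
    max_drop: float          # largest tolerated drop versus the active version
    history: list = field(default_factory=list)

    @property
    def active(self):
        return self.history[-1] if self.history else None

    def propose(self, candidate):
        """Return (accepted, reasons). A rejected candidate never becomes active."""
        reasons = []
        if candidate.sensitivity < self.min_sensitivity:
            reasons.append("sensitivity below absolute limit")
        if candidate.specificity < self.min_specificity:
            reasons.append("specificity below absolute limit")
        current = self.active
        if current is not None:
            if current.sensitivity - candidate.sensitivity > self.max_drop:
                reasons.append("sensitivity dropped more than allowed")
            if current.specificity - candidate.specificity > self.max_drop:
                reasons.append("specificity dropped more than allowed")
        if reasons:
            return False, reasons
        self.history.append(candidate)
        return True, []

    def roll_back(self):
        """Deactivate the active version and return to the previous one."""
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        return self.history.pop()
```

The list of reasons returned by `propose` could also serve as the basis for informing users about why an update was rejected.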

Further information

The training videos in Auditgarant introduce other important techniques, such as LRP, LIME, the visualization of neural network activation, and counterfactuals.

The document also discusses exciting questions, such as whether patients have to be informed when an algorithm has been updated and could come to a better or even a different diagnosis.

The guidelines contained in this document have been incorporated in the Johner Institute's AI guidelines.

Conclusion: helpful, especially for continuously learning systems

d) “Building Explainability and Trust for AI in Healthcare” from Xavier University

This document from Xavier University, which the Johner Institute helped draft, looks at best practices in the field of explainability. It provides useful guidance on which information has to be provided, for example, for “technical stakeholders”, in order to meet these explainability requirements.

Conclusion: at least partially helpful

e) “Machine Learning AI in Medical Devices” from the BSI and AAMI

The title of this BSI/AAMI document sounds promising. But, ultimately, it is only a position paper that you can download free of charge from the AAMI store. The position paper calls for the development of new standards with the involvement of the BSI and AAMI. Results are expected by the end of 2020.

Conclusion: not very helpful

f) DIN SPEC 92001-1:2019-04

The standard DIN SPEC 92001 “Artificial Intelligence – Life Cycle Processes and Quality Requirements – Part 1: Quality Meta Model” is also available for free download.

It presents a meta-model but does not give any specific requirements for the development of AI/ML systems. The document is not specific to any particular sector.

Part 2: Technical and Organizational Requirements is currently not available.

Conclusion: not very helpful

g) ISO/IEC CD TR 29119-11

The standard ISO/IEC CD TR 29119-11 “Software and systems engineering – Software testing – Part 11: Testing of AI-based systems” is still under development.

Conclusion: still too early, worth keeping an eye on

h) Syllabus from the Korean “Software Testing Qualification Board”

The Korean “Software Testing Qualification Board” has made a syllabus for testing AI systems entitled “Certified Tester AI Testing – Testing AI-Based Systems (AIT – TAI) Foundation Level Syllabus” available for download.

From chapter 3.8, the syllabus provides information on quality assurance for AI systems, which can mostly also be found in the Johner Institute’s guidelines.

In addition, chapter 6 of the document contains guidelines for the black box testing of AI models, such as combinatorial testing and “metamorphic testing”. The tips on neural network testing, for example, using “neuron coverage” and tools such as DeepXplore, are particularly worth looking at.
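
To make the idea of metamorphic testing more concrete: instead of checking each prediction against a known correct answer, the test checks that a defined transformation of the input changes the output only in the expected way. The following sketch uses a deliberately simple nearest-centroid classifier and the relation “shifting all training data and all queries by the same offset must not change the predicted labels.” The classifier, the relation, and all names are assumptions chosen for this illustration and are not taken from the syllabus.

```python
def centroid(points):
    """Component-wise mean of a list of equally long tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit(samples):
    """samples: list of (features, label). Returns {label: centroid}."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(model, features):
    """Return the label of the nearest centroid (squared Euclidean distance)."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: squared_distance(model[label]))


def shift(features, offset):
    return tuple(a + b for a, b in zip(features, offset))


def metamorphic_translation_test(samples, queries, offset):
    """Return all queries for which the metamorphic relation is violated."""
    source_model = fit(samples)
    follow_up_model = fit([(shift(f, offset), y) for f, y in samples])
    violations = []
    for query in queries:
        source = predict(source_model, query)
        follow_up = predict(follow_up_model, shift(query, offset))
        if source != follow_up:
            violations.append((query, source, follow_up))
    return violations
```

An empty list of violations does not prove that the model is correct; it only shows that this particular relation holds for the tested inputs. For image-based devices, relations such as small brightness changes or re-ordering of the training data would be more typical candidates.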

Conclusion: recommended

i) ANSI/CTA standards

The ANSI has published several standards together with the CTA (Consumer Technology Association):

As the titles suggest, the standards provide definitions. Nothing more and nothing less.

The CTA is currently working on additional specific standards, including one on “trustworthiness”.

Conclusion: only helpful as a collection of definitions

j) IEEE standards

The IEEE is currently working on a whole family of standards:

  • P7001 – Transparency of Autonomous Systems
  • P7002 – Data Privacy Process
  • P7003 – Algorithmic Bias Considerations
  • P7009 – Standard for Fail-Safe Design of Autonomous and Semi-Autonomous Systems
  • P7010 – Wellbeing Metrics Standard for Ethical Artificial Intelligence and Autonomous Systems
  • P7011 – Standard for the Process of Identifying and Rating the Trustworthiness of News Sources
  • P7014 – Standard for Ethical considerations in Emulated Empathy in Autonomous and Intelligent Systems
  • P2049.1 – Standard for Human Augmentation: Taxonomy and Definitions
  • P2049.2 – Standard for Human Augmentation: Privacy and Security
  • P2049.3 – Standard for Human Augmentation: Identity
  • P2049.4 – Standard for Human Augmentation: Methodologies and Processes for Ethical Considerations
  • P2801 – Recommended Practice for the Quality Management of Datasets for Medical Artificial Intelligence
  • P2802 – Standard for the Performance and Safety Evaluation of Artificial Intelligence Based Medical Device: Terminology
  • P2817 – Guide for Verification of Autonomous Systems
  • P3333.1.3 – Standard for the Deep Learning-Based Assessment of Visual Experience Based on Human Factors
  • P3652.1 – Guide for Architectural Framework and Application of Federated Machine Learning

Conclusion: still too early, worth keeping an eye on

k) ISO standards under development

Several working groups at ISO are also working on AI/ML-specific standards:

  • ISO 20546 – Big Data – Overview and Vocabulary
  • ISO 20547-1 – Big Data reference architecture – Part 1: Framework and application process
  • ISO 20547-2 – Big Data reference architecture – Part 2: Use cases and derived requirements
  • ISO 20547-3 – Big Data reference architecture – Part 3: Reference architecture
  • ISO 20547-5 – Big Data reference architecture – Part 5: Standards roadmap
  • ISO 22989 – AI Concepts and Terminology
  • ISO 23053 – Framework for AI using ML
  • ISO 23894 – Risk Management (based on ISO 31000, not ISO 14971)
  • ISO 24027 – Bias in AI systems and AI aided decision making
  • ISO 24029-1 – Assessment of the robustness of neural networks – Part 1: Overview
  • ISO 24029-2 – Formal methods methodology
  • ISO 24030 – Use cases and application
  • ISO 24368 – Overview of ethical and societal concerns
  • ISO 24372 – Overview of computational approaches for AI systems
  • ISO 24668 – Process management framework for Big data analytics
  • ISO 38507 – Governance implications of the use of AI by organizations

The first standards have already been completed (such as the one described below).

Conclusion: still too early, worth keeping an eye on

l) ISO 24028 – Overview of Trustworthiness in AI

ISO/IEC TR 24028 is entitled “Information Technology – Artificial Intelligence (AI) – Overview of trustworthiness in artificial intelligence.” It is not specific to any particular domain, but it does give examples for the healthcare sector.

The standard summarizes important hazards and threats as well as common risk minimization measures (see Fig. 1).

Fig. 1: Chapter structure of ISO/IEC TR 24028:2020 as a mind map

However, the standard remains quite general: it does not give any concrete recommendations and does not establish any specific requirements. It is useful as an overview and an introduction, and as a reference to other sources.

Conclusion: recommended, with conditions

m) WHO/ITU AI4H guidelines

The WHO and ITU (International Telecommunication Union) are developing a specific framework for the use of AI in healthcare, in particular for diagnosis, triage and treatment support.

This AI4H initiative includes several topic groups from various medical disciplines as well as working groups looking at cross-sectional topics. The Johner Institute is an active member of the regulatory requirements working group.

This working group is developing a guideline that is based on the Johner Institute’s previous guideline and will potentially replace it. The plan is to coordinate the results with the IMDRF.

If you would like to know more about this initiative, please contact the ITU or the Johner Institute.

Conclusion: highly recommended for the future

3. Audit questions you should prepare for

Notified bodies and authorities have still not agreed on a uniform approach and common requirements for medical devices with machine learning.

Therefore, manufacturers regularly find it difficult to prove that the requirements placed on the device, e.g. with regard to accuracy, correctness and robustness, have been met.

Dr. Rich Caruana, one of Microsoft's leading minds on artificial intelligence, even advised against the use of a neural network he himself had developed to propose the appropriate therapy for pneumonia patients:

“I said no. I said we don’t understand what it does inside. I said I was afraid.”

Dr. Rich Caruana, Microsoft

The existence of machines that users do not understand is nothing new. You can use a PCR without understanding it; in any case, there are people who know how the device works and what is inside. However, this is no longer always the case with artificial intelligence.

The questions that auditors should ask manufacturers include:

Key question: Why do you think that your device represents the state of the art?
Background: Classic starting question. In your answer, you should go into technical and medical aspects.

Key question: How did you come to the conclusion that your training data is free of bias?
Background: Otherwise the results would be wrong or only correct under certain conditions.

Key question: How did you avoid overfitting your model?
Background: Otherwise, the algorithm would only correctly predict the data it was trained with.

Key question: What makes you assume that the results are not just randomly correct?
Background: For example, an algorithm could correctly decide that an image contains a house. But it could be the case that the algorithm did not recognize a house, but the sky. Another example is shown in Fig. 2.

Key question: What requirements does the data have to meet so that your system correctly classifies it or predicts the correct results? Which framework conditions have to be observed?
Background: Since the model was trained with a certain set of data, it can only make correct predictions for data coming from the same population.

Key question: Would you not have achieved a better result with another model or with other hyperparameters?
Background: Manufacturers must minimize risks as far as possible. These risks also include risks resulting from incorrect predictions made by sub-optimal models.

Key question: Why do you assume that you have used enough training data?
Background: Collecting, processing and “labeling” training data is time-consuming. The bigger the dataset used to train a model, the more powerful it can be.

Key question: Which standard did you use when labeling the training data? Why do you consider the chosen standard to be the gold standard?
Background: Particularly if the machine starts to be superior to people, it becomes difficult to determine whether a physician, a group of “normal” physicians, or the world's best experts in a discipline are the reference.

Key question: How can you ensure reproducibility if your system continues to learn?
Background: Continuously learning systems (CLS), in particular, must ensure that the further training, at the very least, does not reduce performance.

Key question: Have you validated the systems you are using to collect, prepare, and analyze data, and to train and validate your models?
Background: An essential part of the work consists of collecting and processing the training data and using it to train the model. The software needed for this is not part of the medical device. However, it is subject to the computerized systems validation requirements.

Table 1: Potential questions during the audit of medical devices with machine learning, with the background to each question
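
One way to prepare for the question on overfitting is to compare performance on the training data with performance on data the model has never seen. The following sketch uses a 1-nearest-neighbour classifier on a single numeric feature, because such a model memorizes its training set and therefore makes the effect easy to see. The model, the function names, and the data in any example are assumptions for this illustration.

```python
def predict_1nn(train, x):
    """Return the label of the training sample whose feature is closest to x."""
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Share of samples in `data` that the 1-NN model built on `train` gets right."""
    hits = sum(1 for x, y in data if predict_1nn(train, x) == y)
    return hits / len(data)


def generalization_gap(train, held_out):
    """Return (training accuracy, held-out accuracy, difference)."""
    train_accuracy = accuracy(train, train)
    held_out_accuracy = accuracy(train, held_out)
    return train_accuracy, held_out_accuracy, train_accuracy - held_out_accuracy
```

A large difference between the two accuracies indicates that the model has learned the training data rather than the underlying relationship. What counts as “large” has to be justified by the manufacturer; no threshold is implied here.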

The questions mentioned above are typically also discussed in the course of the ISO 14971 risk management process and the clinical evaluation according to MEDDEV 2.7/1 Revision 4 (and the performance evaluation of IVD medical devices).
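
For the question on bias in the training data, a common first step is to compare performance across subgroups of the intended patient population. The sketch below computes the sensitivity per subgroup and flags subgroups that fall noticeably behind the best one. The subgroup attribute, the function names, and the gap threshold are assumptions for this illustration; which subgroups are relevant follows from the intended purpose.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records):
    """records: iterable of (subgroup, true_label, predicted_label), 1 = positive.

    Only subgroups with at least one positive case appear in the result.
    """
    true_positives = defaultdict(int)
    false_negatives = defaultdict(int)
    for group, truth, prediction in records:
        if truth == 1:
            if prediction == 1:
                true_positives[group] += 1
            else:
                false_negatives[group] += 1
    groups = set(true_positives) | set(false_negatives)
    return {
        g: true_positives[g] / (true_positives[g] + false_negatives[g])
        for g in groups
    }


def flag_gaps(per_group, max_gap):
    """Return subgroups whose sensitivity is more than max_gap below the best one."""
    best = max(per_group.values())
    return sorted(g for g, value in per_group.items() if best - value > max_gap)
```

A flagged subgroup is a prompt for further investigation (for example, whether the subgroup is under-represented in the training data), not in itself proof of bias.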

Fig. 2: Input data that only coincidentally looks like a certain pattern. In this example, pictures of Chihuahuas and muffins, each with an indication of how confident the program is

Further information

Tips on how manufacturers can meet these regulatory requirements for medical devices with machine learning can be found in the “Artificial Intelligence in Medicine” article.

4. Conclusion and summary

Regulatory requirements

The regulatory requirements are clear. However, it is still not clear to manufacturers and, in some cases, even authorities and notified bodies how they should be interpreted and implemented for medical devices that use machine learning methods.

Too many and only partially helpful “Best Practice Guides”

As a result, a lot of institutions feel obliged to help by publishing “best practices.” Unfortunately, a lot of these documents are only of limited use:

  • They repeat textbook knowledge about artificial intelligence in general and machine learning in particular.
  • As a result, the guidance documents discuss self-evident facts and banalities.
    Anyone who didn’t know before reading these documents that machine learning can lead to misclassifications and biases that can endanger or negatively affect patients should not be developing medical devices.
  • A lot of these documents simply list the specific machine learning problems that manufacturers need to address. But there are no best practices on how to minimize these problems.
  • Where there are recommendations, they are usually not very specific. They do not provide sufficient guidance.
  • It is difficult for manufacturers and regulatory authorities to extract truly testable requirements from such vague texts.

Unfortunately, no improvement seems to be in sight. On the contrary: more and more guidelines are being developed. For example, the OECD recommends the development of AI/ML specific standards and is currently working on one itself. It is the same with the IEEE and the DIN, and numerous other organizations.

Conclusion:

  • There are too many standards to keep track of them all. And more and more are continuously being added to the pile.
  • The standards overlap a lot and are generally of limited use. They do not contain any (binary) test criteria.
  • They arrive (too) late.

Quality not quantity

Medical device manufacturers need quality, not quantity, from the best practices and standards on machine learning.

Best practices and standards should provide guidance for actions and set verifiable requirements. The fact that the WHO is using the Johner Institute's guidelines as a basis gives us reason for cautious optimism.

It would be nice if notified bodies, authorities, and possibly also the MDCG were more actively involved in the (further) development of these standards. This process should be transparent. We have seen on several occasions recently what working in back rooms without (external) quality assurance can lead to.

A joint approach would make it possible to achieve a common understanding of how medical devices that use machine learning should be developed and tested. There would only be winners.


Notified bodies and authorities are cordially invited to participate in the further development of guidelines. Just an email to the Johner Institute is enough.

Manufacturers looking for support in the development and authorization of ML-based devices (e.g., for the review of the technical documentation or the validation of ML libraries) can get in touch with us by email or via the contact form.

With thanks to Pat Baird for the helpful input.

Author:

Prof. Dr. Christian Johner