Sponsored by Avenga
Federated learning is maturing faster than it is being given credit for
Jacek Chmiel, Director of Avenga Labs, updates the readers on the current state of federated learning in pharma and bioscience.
Pharma and bioscience in general want to benefit more and more from the vast amounts of data being generated every day by all the participants within the healthcare arena, including the drug discovery and development ecosystems.
The future of healthcare business entities and their patients is computational and it affects everything. The exponential growth of AI, including more traditional Machine Learning (ML) and large language models (LLMs), is a fact.
However, the pace of progress is limited by the elephant in the room, which is data availability. Organizations are heavily regulated and are no doubt going to protect their data for a variety of reasons, including market competitiveness, and regulatory and ethical considerations.
The broad adoption of federated learning (FL) was started years ago by tech giants, who used it to train models on users' devices and combine them into better models through federated aggregation. It then moved into other industries such as financial services and bioscience.
In the case of pharma and healthcare, the so-called siloed centralized FL strategy is the most popular choice of topology.
Imagine a hospital setting in which each site performs CT scans and stores large volumes of imaging and diagnostic data locally in its own IT infrastructure. This usually amounts to terabytes or petabytes of data, and it is also very sensitive in nature. The diagnostic process is supported by automated image segmentation that identifies lesions in the brain tissue and classifies them.
Each local hospital site trains the model on its own data and shares the resulting model, not the data, with a federated training coordinator, as do the other hospitals. The coordinator aggregates these local models and shares the aggregated model back with the local sites, which then train it again. Each such cycle is called a ‘round’, and tens or hundreds of rounds produce a well-performing global model that benefits, indirectly and without any data being shared, from all the local data.
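The round-based process described above can be illustrated with a minimal sketch of federated averaging. This is a toy example under stated assumptions, not the method of any particular product: the "model" is a one-dimensional linear fit, the three sites hold synthetic data, and the function names (local_train, federated_average, run_rounds, make_site) are hypothetical. Weighting each site by its sample count is one common aggregation choice among several.

```python
import random


def local_train(weights, data, lr=0.3, epochs=5):
    """One site's local step: gradient descent on a toy model y = w*x + b."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates):
    """Coordinator step: average site models, weighted by local sample count."""
    total = sum(n for _, n in updates)
    w = sum(model[0] * n for model, n in updates) / total
    b = sum(model[1] * n for model, n in updates) / total
    return (w, b)


def run_rounds(sites, rounds=100):
    """Repeat: send global model to sites, train locally, aggregate."""
    global_model = (0.0, 0.0)
    for _ in range(rounds):
        # Only model parameters and sample counts leave each site, never raw data.
        updates = [(local_train(global_model, data), len(data)) for data in sites]
        global_model = federated_average(updates)
    return global_model


def make_site(n, seed):
    """Synthetic local dataset drawn from y = 2x + 1 plus a little noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.01)) for x in xs]


sites = [make_site(40, 1), make_site(80, 2), make_site(20, 3)]
model = run_rounds(sites)
print(model)
```

In this toy setting the global parameters end up close to the slope 2 and intercept 1 used to generate the data, even though the coordinator never sees a single data point.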
Other prominent imaging examples are COVID-19-damaged lungs, breast cancer, diabetic retinopathy, pancreas tissue segmentation, etc.
FL can be applied to all data modalities. Examples include models that predict treatment efficacy and patient survival from diagnostic data, or models that assess the effectiveness of diabetes treatment while taking into account multiple sources of data, including continuous glucose monitoring (CGM) devices.
Being able to benefit from the data without moving it, while keeping it protected, is the key objective of the federated paradigm for ML and data analytics.
Even the sheer “not moving the data” element is very beneficial as it enables a faster time to market by skipping the building of a common data lake, complex ETL (extract, transform, load), or ELT (extract, load and transform) pipelines. Because of this, the federated approach is worth considering even when ML is done across trusted organizations or even departments.
Privacy of the data is the main deliverable of the FL approach, as the data is never seen by anyone except local data custodians and their teams. Everything is controlled by the local data custodians, and not by external entities. This is a fundamental value that goes far beyond usual security measures, as there’s no data to be leaked in the first place. What is required is a properly designed federated network, the right selection of tools (including privacy-enhancing technologies (PETs) such as differential privacy, and measures that address the too-curious data scientist problem), and the right configuration. Assuming this is done properly, the risk of data leaks is reduced to the point where the project is not considered a data-sharing project, thus conforming to GDPR and other privacy regulations. The decisions are risk-based and depend upon the given context and network configuration: is the guarantee high enough or, as a precautionary measure, does additional paperwork need to be signed just in case?
Crossing regulatory zones is a key benefit as well. There are federated projects between the EU and Canada, the EU and the US, etc. Entirely new scenarios, previously unavailable and considered a “no go”, are now worth considering again because of the much higher data privacy protection than in a classical approach. The same applies to local scenarios within the same organization, as the federated approach can be used for cross-department ML or cross-country ML in multinational companies.
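As an illustration of one such technique, differential privacy is often applied to model updates by bounding each update's size and then adding random noise before it leaves the site. The sketch below shows only that mechanical step. The function names and the default values of max_norm and noise_multiplier are hypothetical, and the actual privacy guarantee depends on the noise level, the number of rounds, and a proper privacy accounting that is not shown here.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def privatize_update(update, max_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip the update, then add Gaussian noise scaled to the clipping bound.

    max_norm and noise_multiplier are illustrative values, not recommendations.
    """
    rng = rng or random.Random()
    clipped = clip_update(update, max_norm)
    sigma = noise_multiplier * max_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]


# Example: a local update is clipped to norm 1.0 and perturbed before sharing.
shared = privatize_update([3.0, 4.0], max_norm=1.0, noise_multiplier=0.5)
print(shared)
```

More noise gives stronger protection but usually costs model accuracy, so the chosen values end up being part of the risk-based discussion among the network participants.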
In a common scenario, local nodes (i.e., hospitals) contribute their locally trained models in a privacy-preserving way to build a central and more generalizable model that is expected to perform better than their local model, especially for future patients. So there’s an immediate reward for participation in a federated network, as better models lead to better medical decisions and patients’ outcomes everywhere.
Model training and tuning also extend to generative models such as large language models (LLMs). FL thus supports the exponentially growing need for private model training and tuning, including Retrieval-Augmented Generation (RAG) techniques.
Federated learning (FL) is essentially an edge type of learning: it happens locally, close to the data, and without any data being shared. It requires an edge-oriented approach to federated architectures, which takes more than copy-pasting the usual ‘high scalability’ approaches for centralized transactional or data applications.
A federated network, from a logical point of view, is a set of FL agents that connect directly to their local data sources and perform training, model testing, and inference. In the typical centralized topology, there’s a coordination node that orchestrates the rounds of federated training and testing of the models.
A federated network also involves local authentication and authorization management, which is the opposite of the usual centralized approach. There are different solutions to address that need, which can include trusted third-party providers and different federated authentication frameworks.
The tools include open source projects, which share their code on GitHub and publish new versions, and black box solutions from commercial product vendors. The vendors also offer FL as a service with various degrees of privacy guarantees as well as their legal responsibilities for data protection.
The tools include mature players such as NVIDIA FLARE for FL and DataSHIELD for federated analytics, as well as multiple younger counterparts that have not yet reached version 1.0. They generally come with a set of ready-to-use trainers (for ML) and federated statistics functions, plus sets of examples that demonstrate how to use them. The built-in capabilities are augmented with vibrant communities sharing their experiences, code, libraries, and datasets.
The tools also provide extensibility mechanisms that enable custom ML algorithms, as well as different aggregation and workflow schemes.
There is a great deal of experimentation and research into new capabilities of FL frameworks and products, as well as growth in the maturity of the market leaders, both in open source and in commercially available products.
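To give a flavor of what a federated statistics function does, here is a minimal sketch in which each site returns only aggregate summaries and the coordinator combines them into a pooled mean and standard deviation. It does not use the API of DataSHIELD or of any other tool. The names local_summary and federated_mean_sd and the MIN_COUNT threshold are hypothetical, and the threshold stands in for the kind of disclosure rule a real network would agree on.

```python
import math

MIN_COUNT = 5  # hypothetical rule: sites with fewer records return nothing


def local_summary(values, min_count=MIN_COUNT):
    """Runs at the site: return aggregates only, or None if too few records."""
    n = len(values)
    if n < min_count:
        return None  # a tiny group could disclose individual values
    return {"n": n, "sum": sum(values), "sum_sq": sum(v * v for v in values)}


def federated_mean_sd(summaries):
    """Runs at the coordinator: pooled mean and sample standard deviation."""
    usable = [s for s in summaries if s is not None]
    n = sum(s["n"] for s in usable)
    if n < 2:
        raise ValueError("not enough data across sites")
    total = sum(s["sum"] for s in usable)
    total_sq = sum(s["sum_sq"] for s in usable)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(max(variance, 0.0)), n


site_a = [5.0, 6.0, 7.0, 8.0, 9.0]
site_b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
site_c = [100.0, 200.0]  # below the threshold, so it contributes nothing

mean, sd, n = federated_mean_sd([local_summary(s) for s in (site_a, site_b, site_c)])
print(mean, sd, n)
```

The coordinator obtains the same mean and standard deviation it would get from pooling the records of the two contributing sites, while only three numbers per site cross the network.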
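One way to picture such an extensibility point is an aggregation step that accepts any function with an agreed signature. The sketch below is an assumption for illustration only: the Aggregator type and the function names are hypothetical and do not correspond to the plug-in interface of any specific framework. It contrasts plain averaging with a coordinate-wise median, which is less sensitive to a single outlying site.

```python
from statistics import mean, median
from typing import Callable, List, Sequence

# Hypothetical plug-in contract: a list of site updates in, one update out.
Aggregator = Callable[[Sequence[Sequence[float]]], List[float]]


def mean_aggregator(updates: Sequence[Sequence[float]]) -> List[float]:
    """Default scheme: average each parameter across sites."""
    return [mean(column) for column in zip(*updates)]


def median_aggregator(updates: Sequence[Sequence[float]]) -> List[float]:
    """Custom scheme: coordinate-wise median across sites."""
    return [median(column) for column in zip(*updates)]


def aggregate(updates: Sequence[Sequence[float]],
              aggregator: Aggregator = mean_aggregator) -> List[float]:
    """Coordinator hook: apply whichever aggregation scheme was configured."""
    if not updates:
        raise ValueError("no updates received")
    return aggregator(updates)


# Two sites agree closely; the third sends an outlying update.
updates = [[1.0, 2.0], [1.2, 2.2], [100.0, -50.0]]
print(aggregate(updates))                     # pulled towards the outlier
print(aggregate(updates, median_aggregator))  # stays near the two agreeing sites
```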
Federated learning (FL) is a collaborative effort that requires multidisciplinary teams that are agile in spirit and open-minded enough to embrace the new paradigm. Experience shows that this is the main difference between the traditional ways of working and a modern federated approach.
For instance, data scientists are used to working in their local notebooks with full access to data: they can view it, modify it on the fly, and experiment with it. In a federated approach, once a project goes to production with real-world data (RWD), they are not allowed to see the other sources’ data directly, as that would violate the main privacy guarantee of the federated model. Working with data indirectly to detect problems with data, algorithms, and configuration may or may not be required in a given scenario, but when it is, it requires collaboration with the data scientists on the data owner's side.
A mind shift also applies to established DevOps experts who are used to centralized cloud-based approaches. They are prone to following established patterns, such as putting code into containers and centralizing authentication and authorization, instead of embracing a new privacy-preserving kind of architecture.
The change towards FL is a great chance to learn, and many people see the federated approach as an interesting opportunity because they are open to challenging themselves and embracing new ways of working.
Once the mental barriers are broken and new collaboration paths are established, the outcomes of the ongoing discussions and contributions are beneficial for all the parties involved.
Typical discussion themes in a federated setup are what is considered disclosive and what is private enough, the balance between model protection and model performance, and the quality of the data.
The benefits of the federated approach are more visible in longer-term consortia, as federated networks that were built can be reused for different scientific projects.
Federated learning (FL) has worked and proven its business value for years already, but still, there’s a lot of room for growth and improvement in its technological solutions.
Regulations are lagging, as they always do. However, we can refer to FL as one of the ‘state of the art’ privacy-preserving strategies, as it helps enforce conformity with privacy and security regulations. There’s hope that the federated approach will be directly defined and recognized in the regulations themselves. Meanwhile, the risk-based approach to the legal framework is not going away anytime soon.
Federated learning (FL) is actually growing faster within academic and research communities than in the enterprise commercial market. It is a modern approach, and since it is already beyond the technology experimentation phase, there’s no technological barrier to using it. However, its value is yet to be fully recognized by the healthcare and life sciences industry. In the AI arena, it should more often be considered as the primary method of conducting multi-organizational data analytics and ML projects.