Differential privacy allows us to assess data quality without ever accessing the data itself.

16 Jan 2026


When we hear the term biological material bank (biobank), we often imagine large freezers filled with test tubes stored in locked rooms. However, it is important to realize that biobanks store not only samples but also data. For this reason, biobanks function both as data providers and as data archivists. Through BBMRI.cz, a large research infrastructure coordinated by the Masaryk Memorial Cancer Institute, the biobank team is actively involved in the development of practical procedures and tools within the European BBMRI-ERIC consortium. Just as biological material (samples) must meet quality standards, data quality also needs to be continuously monitored and controlled. But how can data be made accessible without compromising its protection and security?

The winning poster at the EOSC CZ 2025 National Conference was created by a trio of authors from BBMRI-ERIC, the Masaryk Memorial Cancer Institute, and Masaryk University – Radovan Tomášik, Ivan Mahút, and Simona Menšíková. We spoke with Radovan Tomášik about how the idea for the award-winning project came about, why data quality is always a matter of context, how *differential privacy works, and why research data should have a “long life.” 

You work in the field of medical informatics and simultaneously pursue a PhD focused on data quality in a federated environment. What drew you to this field, and what do you enjoy most about working with data in medicine?

I actually came to medical informatics by coincidence during my bachelor’s studies when I started working as a developer in the newly formed IT team led by Zdenka Dudová for the biobank at the Masaryk Memorial Cancer Institute in Brno. At first, it was just a student programming job, but I soon realized that medical informatics isn’t “just IT.” I saw how well-designed software can significantly affect the quality of data on which research depends — and ultimately the quality of science itself.

Although I work with data, I’ve always been more fascinated by software architecture – how to design systems that are understandable, sustainable, and can adapt even to situations their original creators never imagined. And if you ever want to make a system truly complex, make it federated. But that’s precisely what makes federated systems fascinating: they are technically demanding but solve very real problems.

What I enjoy most is the combination of intellectual challenge and practical impact. During my PhD at the Faculty of Informatics, Masaryk University, I’m surrounded by great people who constantly push me forward. I often feel like the least intelligent person in the room – and that’s the best kind of motivation. Instead of mindless “coding” in a corporate setting, I get to solve problems that have real meaning and tangible impact on research and healthcare. And that’s deeply fulfilling.

“I saw how well-designed software can significantly affect the quality of data on which research depends — and ultimately the quality of science itself.”

Your winning poster is titled Privacy-Preserving Data Quality Assessment for Federated Health Data Networks. How would you explain the main idea to someone unfamiliar with this topic?

History teaches us that centralizing data — and power — is tempting, but risky. Having everything in one place seems convenient, but it also creates a single point of failure — technical, organizational, and even social. In healthcare, this is particularly problematic due to privacy concerns.

A federated approach offers an alternative: data stay where they originate — for instance, in hospitals — which can then be shared or summarized for research. It’s a more realistic and safer model because it maintains both responsibility and control.

The key challenge is how to assess data quality without having access to all the data at once. And even more fundamentally, what does “data quality” mean? Quality isn’t absolute; it’s about fitness for purpose. What’s “good enough” for one study may be entirely unsuitable for another.

My research focuses on evaluating data quality without direct access to the data themselves. Hospitals don’t share the actual data but rather securely processed characteristics that allow us to assess their quality without compromising patient privacy.

You worked on the poster with Ivan Mahút and Simona Menšíková. How did your collaboration between Brno and Graz work in practice?

All three of us are part of the Czech node of BBMRI-ERIC, based at the Masaryk Memorial Cancer Institute in Brno. I’ve also been working at the BBMRI-ERIC headquarters in Graz, so our team is naturally divided between the two cities.

In Brno, we work under the auspices of Assoc. Prof. Roman Hrstka, who brings the perspective of a biomedical researcher and practical experience with real-world hospital data. From Graz, we have strategic insight and informatics leadership from Assoc. Prof. Petr Holub, CIO of BBMRI-ERIC and my PhD supervisor.

This setup works beautifully because it connects two worlds that often operate separately – the day-to-day reality of working with data and the strategic framework of research infrastructures. Without this type of collaboration, our research would either lose touch with real practice or lack broader relevance. 

“It’s a bit like judging a book by its blurb — you don’t see the complete text, but you have enough information to decide whether it’s worth reading. Likewise, a researcher can determine whether a dataset is ‘good enough’ for their purposes without ever having physical access to it.”

You work with highly sensitive healthcare data. What are the biggest challenges in assessing their quality, and what role does the principle of differential privacy play here?

The biggest challenge isn’t technical but practical. Every hospital uses slightly different formats, terms, and data structures, and even the term “sample” can have three different meanings. When we can’t agree on terminology, it’s tough to compare data quality.

In a decentralized environment, we also can’t simply “look” at all the data to judge their quality. That’s why we use a different model: data stay local, and only anonymized characteristics are shared, providing information about data quality without revealing anything about patients.

And this is where differential privacy plays a crucial role. It’s a technique that adds controlled “noise” to shared values. The data remain useful for quality assessment, but can’t be exploited to identify individuals.

Put simply, differential privacy allows us to ask questions about data quality without ever seeing the actual data. It protects privacy even in extreme scenarios, not just under “normal” conditions.
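To make the “controlled noise” idea concrete, here is a minimal sketch of the Laplace mechanism, the classic way differential privacy perturbs a released number. This illustrates the general technique only, not the code used in the BBMRI-ERIC platform; the function names and the missing-diagnoses scenario are hypothetical:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query changes by at most 1 when one patient's record is
    added or removed (sensitivity 1), so Laplace noise with scale
    1/epsilon suffices for epsilon-differential privacy.
    """
    return true_count + laplace_noise(1.0 / epsilon)

# A hospital computes a quality metric locally -- e.g. how many records
# are missing a diagnosis code -- and shares only the noisy value.
missing_diagnoses = 42          # the exact figure stays inside the hospital
reported = dp_count(missing_diagnoses, epsilon=1.0)
```

A researcher receiving `reported` can judge the dataset’s completeness to within a few records, while the presence or absence of any single patient cannot be inferred from the released number; a smaller `epsilon` means stronger privacy but a noisier quality estimate.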

Your approach enables data quality assessment across institutions without sharing the actual data. What potential do you see for this method within national or European infrastructures, such as BBMRI-ERIC or EOSC?

The potential is enormous. In infrastructures working with sensitive health data, such as BBMRI-ERIC or EOSC, centralization isn’t feasible. A federated model is a natural choice, but it brings its own set of challenges. Our method helps address those.

It enables secure, decentralized, privacy-respecting assessment of data quality across institutions. Instead of sharing the data themselves, only their protected characteristics are exchanged.

In distributed systems like BBMRI-ERIC, this can significantly enhance both trust and usability of the entire ecosystem. It gives researchers confidence that they are working with high-quality data, even if they never directly see it.

“Put simply, differential privacy allows us to ask questions about data quality without ever seeing the actual data. It protects privacy even in extreme scenarios, not just under “normal” conditions.”

Where do you see your research heading in the coming years?

I’m more of a practitioner than a theorist, so my goal is to turn research results into tools that work in real-world conditions. The first pilot deployment is already running within the Federated Search Platform of BBMRI-ERIC, and I hope to extend our approach to other data types, such as digital pathology or sequencing data. This would enable us to assess the quality of a broader range of biomedical information, bringing the system even closer to everyday research and clinical practice. 

The EOSC CZ 2025 National Conference carried the subtitle Long Live Research Data. What does this slogan mean to you personally?

To me, it’s a wish for research data not to be a one-off product that gets forgotten after publication, but to stay alive, to be reused years or even decades later in new projects and new contexts. And that’s precisely what open science enables. Transparency, sharing, and common standards give data the chance to live on, to “come back to life” in new analyses, new questions, and the hands of new researchers. Perhaps it’s a bit idealistic, but I believe that this kind of openness and faith in the value of sharing is what drives science forward.

Source: Lucie Skřičková, edited by Kateřina Nováková


*Differential privacy is a modern data protection technique that allows the analysis of aggregate information without the possibility of identifying individuals. It works by adding controlled statistical “noise” that safeguards privacy while preserving the scientific value of the data.

