At times seen as a threat, at others as an opportunity, the advent of generative AI is profoundly impacting our society and science. SIB scientists are aware of the challenges and are actively tackling them. 

Applications of generative AI in bioinformatics already span a wide diversity of topics. However, one message cuts across these examples: there are no one-size-fits-all models, and caution must be exercised to ensure the benefits outweigh the costs. The road to trustworthy and ethical AI is indeed paved with challenges, from inaccuracies and toxic biases to environmental impact. SIB is the ideal environment where domain expertise and high-quality data come together to produce AI models that benefit research and society alike.

Need for large quantities of high-quality data

To generate accurate predictions and outputs, and to avoid biases that can lead to inequalities and ethical issues, models must be trained on reliable, structured, labelled data.
Democratizing data, so that they are accessible and understandable to both humans and machines, is at the heart of our work. We do this by ensuring our datasets follow the FAIR principles (Findable, Accessible, Interoperable and Reusable), for example through knowledge graphs: maps showing how different pieces of knowledge are connected to each other (for instance a species, its genes, its proteins and their bioactivity), which help us understand relationships and find useful insights more easily.
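To make the idea concrete, a knowledge graph can be thought of as a collection of subject-predicate-object triples, in the spirit of RDF. The following is a minimal sketch only; the entities (the human gene TP53 and its protein product) are real, but the relation names and the lookup function are illustrative, not a real knowledge-graph API:

```python
# Toy knowledge graph as subject-predicate-object triples.
# Relation names ("has_gene", "encodes", ...) are illustrative.
triples = [
    ("Homo sapiens", "has_gene", "TP53"),
    ("TP53", "encodes", "Cellular tumor antigen p53"),
    ("Cellular tumor antigen p53", "has_bioactivity", "tumour suppression"),
]

def neighbours(graph, entity):
    """Return every (predicate, object) pair linked to an entity."""
    return [(p, o) for s, p, o in graph if s == entity]

print(neighbours(triples, "TP53"))
# [('encodes', 'Cellular tumor antigen p53')]
```

Because every fact has the same triple shape, both humans and machines can traverse the graph the same way, which is what makes such structures useful for grounding AI models.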

The Swiss AI initiative aims to leverage the new Alps supercomputer at the Swiss National Supercomputing Centre (CSCS) to build academic instances of ChatGPT-like models. SIB scientists, including the group of Fabio Rinaldi, our Knowledge Representation Unit and the Swiss-Prot group, are contributing data and use cases to the project, such as the universal protein knowledgebase UniProt. Incorporating such authoritative sources of knowledge will help ensure advances towards trustworthy AI.

Environmental impact

The larger the model, the more computing power and time it takes to train and run, with a distinct impact on our carbon footprint.

Our teams fine-tune models to ensure the best fit for each need, from domain-specific models with relatively few parameters trained on datasets such as PubMed, to general language models like GPT-4 with much larger training datasets and many more parameters. An SIB-wide focus group is also dedicated to studying the environmental impact of our IT activity.
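The cost gap between such models can be sketched with a back-of-envelope calculation. The sketch below uses the common approximation of roughly 2 FLOPs per parameter per generated token; the parameter counts (a BERT-scale model versus a GPT-3-scale model) are illustrative stand-ins, not measurements of any specific SIB deployment:

```python
# Back-of-envelope: how inference compute scales with model size.
# ~2 FLOPs per parameter per token is a standard rough approximation.
def inference_flops(n_params, n_tokens):
    return 2 * n_params * n_tokens

domain_model = inference_flops(n_params=110e6, n_tokens=1000)   # BERT-scale
general_model = inference_flops(n_params=175e9, n_tokens=1000)  # GPT-3-scale

print(f"ratio: {general_model / domain_model:.0f}x")
# ratio: 1591x
```

For the same number of generated tokens, compute (and hence energy) scales linearly with parameter count, which is why a well-chosen small domain model can cut the footprint by three orders of magnitude.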

Finding the appropriate model

Researchers need to navigate a maze of increasingly diverse large language models (LLMs), each with its own specificities and prior training sets.

The benchmarking performed by SIB experts across models in specific domains (e.g. biodiversity, proteins and clinical text) serves as a guide to researchers worldwide.
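At its core, such a benchmark scores each model on labelled examples from a given domain. The sketch below hard-codes model outputs as stand-ins (a real benchmark would query each model); the model names and answers are invented for illustration:

```python
# Minimal domain-benchmark sketch: compare models on gold-labelled items.
def accuracy(predictions, gold):
    """Fraction of predictions matching the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

gold = ["yes", "no", "yes", "yes"]          # expert-curated answers
results = {                                  # hard-coded stand-in outputs
    "general-llm": ["yes", "yes", "yes", "no"],
    "domain-llm":  ["yes", "no", "yes", "yes"],
}

scores = {name: accuracy(preds, gold) for name, preds in results.items()}
print(scores)
# {'general-llm': 0.5, 'domain-llm': 1.0}
```

Reporting such per-domain scores side by side is what lets researchers pick the model that actually fits their question rather than the largest one available.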

Hallucinations

We have all witnessed mistakes in ChatGPT's answers, but such mistakes may not be obvious if you are not an expert on the topic.

SIB's domain experts critically evaluate the models: they are able to interpret the models' answers and detect mistakes in them. This is done, for instance, by developing specific tests of a model's output, such as mapping LLM-extracted biochemical reactions onto known ones to identify hallucinations.
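The reaction-mapping idea can be sketched as follows: canonicalise each extracted reaction and look it up in a set of known reactions, flagging anything unmatched as a possible hallucination. The normalisation and the reactions below are deliberately simplified illustrations (the first is the real hexokinase reaction, the second an invented GTP variant), not the actual SIB pipeline:

```python
# Hallucination check sketch: map extracted reactions onto known ones.
def normalise(reaction):
    """Canonicalise an 'A + B -> C + D' string, ignoring order and spacing."""
    lhs, rhs = reaction.split("->")
    substrates = tuple(sorted(s.strip() for s in lhs.split("+")))
    products = tuple(sorted(p.strip() for p in rhs.split("+")))
    return (substrates, products)

known = {normalise("glucose + ATP -> glucose-6-phosphate + ADP")}

extracted = [
    "ATP + glucose -> ADP + glucose-6-phosphate",  # matches, order-insensitive
    "glucose + GTP -> glucose-6-phosphate + GDP",  # invented: not in known set
]

flagged = [r for r in extracted if normalise(r) not in known]
print(flagged)
# ['glucose + GTP -> glucose-6-phosphate + GDP']
```

Any reaction that survives normalisation but has no counterpart in the curated reference set is handed back to a human expert rather than trusted automatically.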

Privacy concern for sensitive data

Unwanted third-party access to sensitive data such as personal information is a concerning aspect of the widespread use of LLMs.

The SIB group of Janna Hastings, which works with sensitive clinical data (e.g. historical clinical notes), is, for instance, setting up local instances of open-source models so that clinicians can use the technology for real-world studies without publicly sharing sensitive information.
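Keeping models local is one safeguard; a complementary one is stripping obvious identifiers from a note before any model sees it. The sketch below is illustrative only: the two regex patterns are toy examples and the group's actual tooling is not shown here, so this should not be mistaken for a validated clinical de-identification pipeline:

```python
import re

# Toy de-identification: replace obvious identifiers with placeholders
# before text reaches any model. Patterns are illustrative examples only.
PATTERNS = {
    "DATE": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[ -]\d{3}[ -]\d{2}[ -]\d{2}\b"),
}

def redact(text):
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient seen on 12.03.2021, callback 079 555 12 34."
print(redact(note))
# Patient seen on [DATE], callback [PHONE].
```

Combined with locally hosted models, this kind of pre-processing means the sensitive fields never leave the institution's infrastructure.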

Interdisciplinary work between model developers and domain experts

To improve the explainability and accuracy of LLMs, it is crucial that developers and domain experts work hand-in-hand. 

As bioinformaticians and computational biologists, we have both the biological domain expertise and the ability to evaluate which algorithms are appropriate in a given context. This makes us strategic partners in the dialogue with LLM engineers on life science topics.