The Role of AI in Cultural Heritage

Interview with Clemens Neudecker - Head of Data Science at Berlin State Library

A N3XTCODER series

Implementing AI for Social Innovation

In this series we are looking at ways in which Artificial Intelligence can be used to benefit society and our planet - in particular the practical use of AI for Social Innovation projects.

In this instalment, we are very pleased to speak with Clemens Neudecker, Head of Data Science at Berlin State Library.

Question N3XTCODER: Clemens, please can you explain to us what you do, and why AI is important to your work?

Clemens Neudecker: "I am head of data science in the information and data management department of the Berlin State Library (Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, https://staatsbibliothek-berlin.de/). Here I lead a team of 10 researchers and engineers where we work mostly in third-party funded research and development projects (see e.g. https://mmk.sbb.berlin/). Our research centres on computer vision, natural language processing and pattern recognition in the context of digitisation and cultural heritage, i.e. we develop and adapt algorithms and AI models that can be used to recognize and analyse the content in the documents our library holds and continuously digitises. We then make it available and accessible to the public online in the richest and most useful ways possible. The overarching goal is to have all the knowledge that is contained in the bookshelves available digitally and freely on the web for everyone to access and use."

Question N3XTCODER: In the past year, with the release of ChatGPT, attention on AI has suddenly exploded. Do you see that as, on balance, a positive development for technology and society, or do you have concerns?

Clemens Neudecker: "Within our team, we began using AI - or rather machine learning/deep learning - as early as 2016. By then, pre-trained convolutional neural networks (CNNs) had started to outperform most other methods in computer vision tasks. In 2018, we incorporated LLMs into our work, following the emergence of BERT and the Transformer architecture with attention. Up to that point, this work required coding skills and expertise in mathematics and/or machine learning, and a lot of reading and understanding the relevant scientific papers. Then, when ChatGPT was introduced, what we saw was that due to the simplicity of the chat interface, all of a sudden every colleague in the library started to play around with it. While this is mainly a positive development, the communication around AI is currently dominated by companies who are marketing their AI solutions as magically capable of doing everything. In combination, this quickly led to expectations skyrocketing. We spend a lot of time debunking these views and sobering up the opinions of colleagues and management about AI’s actual capabilities and limitations when it comes to our data and use cases. This is where I have started to have some concerns - that we are getting overwhelmed by the way AI is introduced into our lives by big tech. This often means we do not allow ourselves the time and effort to enter into an inclusive, reflective and democratic process to determine and establish the culture of AI we actually want as a society."

Question N3XTCODER: How do the AI models you use compare to the more popular models available to the public?

Clemens Neudecker: "Popular AI models are typically trained on data that can be found on the web, and are built to work with contemporary content, whereas the documents that we digitise are predominantly of a historical nature. Due to copyright, libraries, archives and museums are bound by laws that prohibit them from digitising and redistributing materials online that were originally published less than 70 or even 100 years ago. This means that in our work we have to deal with documents and sources from a timespan of more than 400 years, from Gutenberg's invention of the printing press up to the first half of the 20th century. And that content comes with an enormous variation in language, layout and other challenging features. Standard AI models struggle to properly extract the content found in these historical documents. This led us to develop our own algorithms and to adapt common AI models, so that they can also be applied successfully in the historical and cultural contexts of the digitised cultural heritage domain. This then opens up opportunities for researchers and scholars from the digital humanities to analyse the digitised and machine-readable data in quantitative ways, using computational methods. Last but not least, whenever we publish a model, we not only provide the source code but also always release the data that it was trained on, complemented by a datasheet (cf. https://doi.org/10.5334/johd.124) that documents it, following and adapting the approach formulated in Gebru et al., 2018 (https://arxiv.org/abs/1803.09010)."

Question N3XTCODER: I was interested to see you write about libraries consciously attempting to manage bias. Can you tell us about what practices libraries use to manage bias and how this can be used in the management of bias in AI?

Clemens Neudecker: "Biases are introduced through the data that an AI model is trained on. In my view, this is the core aspect where libraries can really play an important role in the further development of AI. For literally hundreds of years, libraries have employed subject experts (curators) across all scientific disciplines, whose main job is to determine what is the relevant information and knowledge in those disciplines that deserves to be added to the collected memory of humankind, and onto the library shelves. But it does not end there: once something has been added to our collection, we also curate it, which means indexing, structuring, annotating and contextualising the information contained according to standardised rules of library information management that have been developed by a worldwide community. Therefore, we can revisit the past selection policies that determined what became part of the collection, examine what biases were reflected in these, and by extension in the collection, and make that transparent. And we can of course also use AI to detect contentious content, as we did in a recent workshop. Furthermore, whenever our work in AI is targeting tasks that are performed by humans, we try to assess the impact with the help of experts and methods like Ethical Foresight Analysis."

Question N3XTCODER: What is “Ethical Foresight Analysis”? And how does it work?

Clemens Neudecker: "Ethical Foresight Analysis is a collection of “hermeneutic” methods to support the predictive identification and assessment of ethical risks that may arise from the introduction of new technologies, services or applications, for different groups. For example, in one of our projects we are working on an AI tool that should help librarians by automatically suggesting keywords for new documents acquired for our collections. Here we collaborated with a researcher from Humboldt University Berlin who conducted structured interviews with members from different departments in our library, as well as with users and AI experts. This analysis helped us to identify different perspectives on the anticipated project outputs and how they are likely going to be used - which we could then incorporate better into our development process, e.g. by adding checks and constraints or by determining where we need an extra human-in-the-loop."
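To make the idea of "checks and constraints" and an "extra human-in-the-loop" more concrete, here is a minimal sketch of how keyword suggestions could be routed by model confidence. It is an illustration only and not the tool described in the interview: the `Suggestion` type, the `route_suggestions` function and both threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """A keyword proposed by a (hypothetical) model, with its confidence in [0, 1]."""
    keyword: str
    confidence: float


def route_suggestions(suggestions, show_threshold=0.9, review_threshold=0.3):
    """Split suggestions into those shown directly and those flagged for extra review.

    Both thresholds are assumed values for illustration; in practice they
    would have to be calibrated together with the librarians who use the tool.
    """
    shown, needs_review = [], []
    for s in suggestions:
        if s.confidence >= show_threshold:
            # High confidence: offer the keyword to the librarian as a suggestion.
            shown.append(s.keyword)
        elif s.confidence >= review_threshold:
            # Medium confidence: flag for an additional human check.
            needs_review.append(s.keyword)
        # Low confidence: drop the suggestion entirely.
    return shown, needs_review


suggestions = [
    Suggestion("cartography", 0.95),
    Suggestion("atlas", 0.50),
    Suggestion("noise", 0.10),
]
shown, needs_review = route_suggestions(suggestions)
print(shown, needs_review)
```

In this sketch a librarian still makes the final decision in every case; the thresholds only decide how much attention a suggestion is asked to receive.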

Question N3XTCODER: You have some experience self-hosting hardware for machine learning, whereas platforms such as OpenAI offer a price-per-token approach (a model-as-a-service offering). Meanwhile, quasi-open-source models like Llama allow for fine-tuning (but you need compute from somewhere) and offer some sort of in-between level of autonomy. How do you perceive the economic trade-offs of these approaches? Are any of these economically prohibitive, and is there a political necessity to address this in some way?

Clemens Neudecker: "We are indeed glad to have a small number of V100/A100 GPUs available to work with locally at the library, which allows us to play around with new models and perform model training without having to first worry about the potential cost of computation. I see this as a very valuable asset for our team, because I am afraid our devs might otherwise restrict or burden themselves too much with determining the expected cost of running an experiment in advance, as it can be very complex and error-prone to e.g. accurately estimate the number of tokens in our datasets. Also, if we e.g. find out during training that we need to adapt the hyper-parameters, we can just do so and start a new training process without worrying about the costs entailed. On the other hand, we as a library cannot and should not try to keep up with the speed and obsolescence of AI hardware development in industry. Instead, for larger training tasks, we try to collaborate with supercomputing centres who provide dedicated cloud computing for academia and the public sector. Since we are bound by our own strict policies as well as by laws like GDPR when it comes to data protection, this sometimes means that our data must not leave our premises or virtual private network to be processed with AI. We therefore also conduct projects in this area, e.g. we collaborated with Gaia-X (https://gaia-x.eu/) to establish an environment where a federated network with built-in trust via OceanProtocol (https://oceanprotocol.com/) allows using the computation provided by the Gaia-X platform to run large-scale jobs on our data without it ever having to leave the secured environment."
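The difficulty of estimating token counts in advance can be illustrated with a rough sketch. Everything in it is an assumption made for illustration: the `estimate_tokens` and `estimate_cost` helpers are hypothetical, the ratio of four characters per token is only a common rule of thumb for modern English text (historical texts may tokenise very differently), and the price is a placeholder rather than any provider's actual rate.

```python
import random


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count.

    The chars_per_token ratio is an assumed rule of thumb, not the
    output of a real tokenizer.
    """
    return max(1, round(len(text) / chars_per_token)) if text else 0


def estimate_cost(documents, price_per_1k_tokens, sample_size=100, seed=0):
    """Extrapolate total tokens and cost for a dataset from a random sample."""
    if not documents:
        return 0, 0.0
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    if len(documents) <= sample_size:
        sample = documents
    else:
        sample = rng.sample(documents, sample_size)
    mean_tokens = sum(estimate_tokens(d) for d in sample) / len(sample)
    total_tokens = round(mean_tokens * len(documents))
    cost = total_tokens / 1000 * price_per_1k_tokens
    return total_tokens, cost


# Ten placeholder documents of 400 characters each, at a made-up price.
docs = ["a" * 400] * 10
total_tokens, cost = estimate_cost(docs, price_per_1k_tokens=0.002)
print(total_tokens, cost)
```

Even in this simplified form, the estimate depends on two uncertain inputs (the character-to-token ratio and how representative the sample is), which is in line with the point made above that such estimates are error-prone.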

Question N3XTCODER: Recently there has been a huge amount of political debate on regulating AI. Do you think this debate is necessary? And if so, do you have a sense of what issues regulation should focus on?

Clemens Neudecker: "A caveat: these are my personal and very subjective opinions on the matter. While I think that a broad debate on AI regulation is useful and necessary, I see it as currently too dominated by extreme views, be it from the AI hype side or from AI doomers. As with any new and complex technology that disrupts society, we need to ensure the widest democratic and participative processes rather than leaving this up to only a few. Generally, I believe we should not start to regulate the number of GPUs or the number of parameters of a model, but look at the use cases and conditions under which AI will be employed and what needs regulation. For example, RAIL (Responsible AI Licences, https://www.licenses.ai/) attempts this."

Question N3XTCODER: What would be your practical advice to a website owner who wants to use AI-derived texts and images on their website and in their marketing communication?

Clemens Neudecker: "Full disclosure! Be bold, but be open about it - and ensure there are ways to collect feedback from users."

Question N3XTCODER: What developments do you think we’re likely to see in AI in the future?

Clemens Neudecker: "A more recent development in AI is the rise of multi-modal models that are able to deal with both text and image data. I am curious to see what will happen once we see the first models that are trained with, and capable of working seamlessly with, audio or video data as well. This should only be a question of computation, but it could open up a lot of additional data for training new and even more capable AI models."

N3XTCODER: Clemens, thank you very much for your time!

This article has been realised with the help of

Bundesministerium für Wirtschaft und Klimaschutz