Since 2019, Hamburg has been home to the DASHH Graduate School, which is short for Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter. The graduate school provides young researchers with interdisciplinary and application-oriented training in the processing and analysis of large amounts of data produced through fundamental research on the structure of matter. The DESY Research Centre is the Helmholtz Centre responsible for the project and one of eight DASHH partners. Researchers from various institutes and universities in Hamburg and its surroundings are participating as partners in DASHH, including scientists from HAW Hamburg.
The goal of the graduate school is to train a new generation of data scientists and provide them with the skills necessary to design and evaluate the data-intensive experiments of the future. Such experiments are taking place worldwide at leading large-scale research facilities – for example, the particle accelerators PETRA III and the Large Hadron Collider (LHC) or the European XFEL x-ray laser – in order to decipher the structure and function of matter. Matter can refer to the smallest particles in the universe, the building blocks of life, or viruses such as SARS-CoV-2. The 15 doctoral students currently studying at the graduate school are investigating various questions stemming from structural biology, medicine, particle physics, materials science and the operation of modern accelerator facilities.
Above all, though, they are learning to use innovative approaches to evaluate the large amounts of data resulting from their work. To this end, each student in the DASHH programme is supervised by a tandem team made up of a subject specialist and a mathematician or computer scientist. Data science is an interdisciplinary field that combines the natural sciences with mathematics and computer science, regardless of the discipline or whether the research is pure or applied.
'Data is the new gold.'
'Data science solutions are used to obtain knowledge from data,' explains Marina Tropmann-Frick, professor of data science in the HAW Hamburg Department of Computer Science. She and other colleagues from her department now belong to the roughly 70 principal investigators (PIs) at the DASHH Graduate School. 'Today, academic institutions, government agencies and companies routinely collect and process data. But this data, which is usually a large, unstructured mass stemming from heterogeneous sources, is not valuable in and of itself. What is valuable is the information generated via targeted analysis and user-friendly preparation with the help of data science.'
This is done using methods from statistics, machine learning and artificial intelligence. In addition to data management and evaluation, modern data processing focuses above all on the visualisation of data. This makes data easier to understand, makes it possible to draw conclusions from it and means it can be used to make predictions. However, this requires asking the right questions in order to find the correct answers. Often, solutions are being sought for complex questions that have not yet been fully formulated. 'In order to simplify and further develop this kind of question, semi-automatic methods that keep humans in the loop are used,' says Tropmann-Frick, who holds a PhD in engineering. Put simply, this means that the focus is always on people.
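As a loose illustration of the human-in-the-loop idea described above, the Python sketch below routes a model's low-confidence predictions to a human reviewer; the function names, the data fields and the confidence threshold are assumptions for illustration and not part of any DASHH or HAW Hamburg tooling.

```python
# Minimal human-in-the-loop sketch (illustrative only): a model labels records,
# and any prediction below a confidence threshold is handed to a human reviewer.
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float


def model_predict(record: dict) -> Prediction:
    # Placeholder for a real statistical or machine-learning model.
    score = 0.9 if "dosage" in record else 0.4
    return Prediction(label="complete", confidence=score)


def ask_human(record: dict) -> str:
    # In a real system this would open a review task in a user interface;
    # here the record is simply flagged for manual inspection.
    return "needs_review"


def label_records(records: list[dict], threshold: float = 0.8) -> list[tuple[dict, str]]:
    labelled = []
    for record in records:
        prediction = model_predict(record)
        if prediction.confidence >= threshold:
            labelled.append((record, prediction.label))   # trust the model
        else:
            labelled.append((record, ask_human(record)))  # keep the human in the loop
    return labelled


if __name__ == "__main__":
    sample = [{"name": "pain medication xy", "dosage": "600"},
              {"name": "pain medication xy"}]
    print(label_records(sample))
```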
A better understanding of data helps with predictions
Tropmann-Frick explains what this looks like in reality using the example of a current research project. The project deals with software solutions for analysing data from pharmacovigilance, which is the monitoring of adverse side effects from medications. National and international databases collect this information, evaluate it and make it available to various users. The best-known examples are the Canadian DrugBank and the databases of the U.S. Food and Drug Administration (FDA).
In Germany, the Federal Institute for Drugs and Medical Devices (BfArM) is responsible for doing this. Together with this institute and pharmacologists from Christian-Albrechts-Universität zu Kiel, a team including Tropmann-Frick is working on the development of a web-based application that can be opened in a browser. This will make it available to everyone, not just doctors, so that anyone can search it for particular medications. 'There is always a large number of algorithms running behind this kind of web programme,' says Professor Tropmann-Frick. 'That's where all the intelligence is.'
Recognising patterns, visualising data, extracting knowledge
It starts with the collection and analysis of data sets from different sources. This raises a number of questions for computer scientists: Where do the data come from and how were they collected? How do we integrate dozens or even hundreds of different sources? How do we clean the data of duplicates? How do we deal with missing or incorrect data – for example, when the medication is listed with its dosage, 'pain medication xy 600', in one place and only the name is given in another? This is very difficult. The data also needs to be 'normalised', which means that it is converted to the same unit, such as grams to milligrams or vice versa.
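To make these cleaning steps a little more concrete, the following sketch shows how duplicate removal and unit normalisation might look in Python with pandas; the column names, the unit table and the pandas-based approach are illustrative assumptions rather than the project's actual pipeline.

```python
# Illustrative sketch of unit normalisation followed by de-duplication with pandas.
# Column names and the conversion table are assumptions, not the project's real schema.
import pandas as pd

# Convert every dosage to milligrams so that records become comparable.
UNIT_TO_MG = {"g": 1000.0, "mg": 1.0, "µg": 0.001}


def clean(reports: pd.DataFrame) -> pd.DataFrame:
    reports = reports.copy()
    # 1. Normalise units: express all doses in milligrams.
    reports["dose_mg"] = reports["dose"] * reports["unit"].map(UNIT_TO_MG)
    reports = reports.drop(columns=["dose", "unit"])
    # 2. Only after normalisation can '600 mg' and '0.6 g' be recognised as duplicates.
    reports = reports.drop_duplicates(subset=["name", "dose_mg"])
    # Missing dosages remain as NaN instead of being silently filled in.
    return reports


if __name__ == "__main__":
    raw = pd.DataFrame({
        "name": ["pain medication xy", "pain medication xy", "pain medication xy"],
        "dose": [600.0, 0.6, None],   # the last record lists only the name, no dosage
        "unit": ["mg", "g", None],
    })
    print(clean(raw))
```

Handling missing values and merging dozens of heterogeneous sources would require far more than this, but the sketch shows why the order of operations matters: duplicates can only be detected reliably once all records use the same units.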