Since 2019, Hamburg has been home to the DASHH Graduate School, which is short for Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter. The graduate school provides young researchers with interdisciplinary and application-oriented training in the processing and analysis of large amounts of data produced through fundamental research on the structure of matter. The DESY Research Centre is the Helmholtz Centre responsible for the project and one of eight DASHH partners. Researchers from various institutes and universities in Hamburg and its surroundings are participating as partners in DASHH, including scientists from HAW Hamburg.
The goal of the graduate school is to train a new generation of data scientists and provide them with the skills necessary to design and evaluate the data-intensive experiments of the future. Such experiments are taking place worldwide at leading large-scale research facilities – for example, the particle accelerators PETRA III and the Large Hadron Collider (LHC) or the European XFEL x-ray laser – in order to decipher the structure and function of matter. Matter can refer to the smallest particles in the universe, the building blocks of life, or viruses such as SARS-CoV-2. The 15 doctoral students currently studying at the graduate school are investigating various questions stemming from structural biology, medicine, particle physics, materials science and the operation of modern accelerator facilities.
Above all, though, they are learning to use innovative approaches to evaluate the large amounts of data resulting from their work. To this end, each student in the DASHH programme is supervised by a tandem team made up of a subject specialist and a mathematician or computer scientist. Data science is an interdisciplinary field that combines the natural sciences with mathematics and computer science, regardless of the discipline or whether the research is pure or applied.
'Data is the new gold.'
'Data science solutions are used to obtain knowledge from data,' explains Marina Tropmann-Frick, professor of data science in the HAW Hamburg Department of Computer Science. She and other colleagues from her department now belong to the roughly 70 principal investigators (PIs) at the DASHH Graduate School. 'Today, academic institutions, government agencies and companies routinely collect and process data. But this data, which is usually a large, unstructured mass stemming from heterogeneous sources, is not valuable in and of itself. What is valuable is the information generated via targeted analysis and user-friendly preparation with the help of data science.'
This is done using methods from statistics, machine learning and artificial intelligence. In addition to data management and evaluation, modern data processing focuses above all on the visualisation of data. This makes data easier to understand, makes it possible to draw conclusions from it and means it can be used to make predictions. However, this requires asking the right questions in order to find the correct answers. Often, solutions are being sought for complex questions that have not yet been fully formulated. 'In order to simplify and further develop this kind of question, semi-automatic methods that keep humans in the loop are used,' says Tropmann-Frick, who holds a PhD in engineering. Put simply, this means that the focus is always on people.
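As a loose illustration of the human-in-the-loop idea described above, the Python sketch below routes a model's low-confidence predictions to a human reviewer; the function names, the data fields and the confidence threshold are assumptions for illustration and not part of any DASHH or HAW Hamburg tooling.

```python
# Minimal human-in-the-loop sketch (illustrative only): a model labels records,
# and any prediction below a confidence threshold is handed to a human reviewer.
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float


def model_predict(record: dict) -> Prediction:
    # Placeholder for a real statistical or machine-learning model.
    score = 0.9 if "dosage" in record else 0.4
    return Prediction(label="complete", confidence=score)


def ask_human(record: dict) -> str:
    # In a real system this would open a review task in a user interface;
    # here the record is simply flagged for manual inspection.
    return "needs_review"


def label_records(records: list[dict], threshold: float = 0.8) -> list[tuple[dict, str]]:
    labelled = []
    for record in records:
        prediction = model_predict(record)
        if prediction.confidence >= threshold:
            labelled.append((record, prediction.label))   # trust the model
        else:
            labelled.append((record, ask_human(record)))  # keep the human in the loop
    return labelled


if __name__ == "__main__":
    sample = [{"name": "pain medication xy", "dosage": "600"},
              {"name": "pain medication xy"}]
    print(label_records(sample))
```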
A better understanding of data helps with predictions
Tropmann-Frick explains what this looks like in reality using the example of a current research project. The project deals with software solutions for analysing data from pharmacovigilance, which is the monitoring of adverse side effects from medications. National and international databases collect this information, evaluate it and make it available to various users. The best-known examples are the Canadian DrugBank and the databases of the U.S. Food and Drug Administration (FDA).
In Germany, the Federal Institute for Drugs and Medical Devices (BfArM) is responsible for doing this. Together with this institute and pharmacologists from Christian-Albrechts-Universität zu Kiel, a team including Tropmann-Frick is working on the development of a web-based application that can be opened in a browser. This will make it available to everyone, not just doctors, so that anyone can search it for particular medications. 'There is always a large number of algorithms running behind this kind of web programme,' says Professor Tropmann-Frick. 'That's where all the intelligence is.'
Recognising patterns, visualising data, extracting knowledge
It starts with the collection and analysis of data sets from different sources. This raises a number of questions for computer scientists: Where do the data come from and how were they collected? How do we integrate dozens or even hundreds of different sources? How do we clean the data of duplicates? How do we deal with missing or incorrect data – for example, when the medication is listed with its dosage, 'pain medication xy 600', in one place and only the name is given in another? This is very difficult. The data also needs to be 'normalised', which means that it is converted to the same unit, such as grams to milligrams or vice versa.
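To make these cleaning steps a little more concrete, the following sketch shows how duplicate removal and unit normalisation might look in Python with pandas; the column names, the unit table and the pandas-based approach are illustrative assumptions rather than the project's actual pipeline.

```python
# Illustrative sketch of unit normalisation followed by de-duplication with pandas.
# Column names and the conversion table are assumptions, not the project's real schema.
import pandas as pd

# Convert every dosage to milligrams so that records become comparable.
UNIT_TO_MG = {"g": 1000.0, "mg": 1.0, "µg": 0.001}


def clean(reports: pd.DataFrame) -> pd.DataFrame:
    reports = reports.copy()
    # 1. Normalise units: express all doses in milligrams.
    reports["dose_mg"] = reports["dose"] * reports["unit"].map(UNIT_TO_MG)
    reports = reports.drop(columns=["dose", "unit"])
    # 2. Only after normalisation can '600 mg' and '0.6 g' be recognised as duplicates.
    reports = reports.drop_duplicates(subset=["name", "dose_mg"])
    # Missing dosages remain as NaN instead of being silently filled in.
    return reports


if __name__ == "__main__":
    raw = pd.DataFrame({
        "name": ["pain medication xy", "pain medication xy", "pain medication xy"],
        "dose": [600.0, 0.6, None],   # the last record lists only the name, no dosage
        "unit": ["mg", "g", None],
    })
    print(clean(raw))
```

Handling missing values and merging dozens of heterogeneous sources would require far more than this, but the sketch shows why the order of operations matters: duplicates can only be detected reliably once all records use the same units.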