Finding Important Relationships in Large Datasets

An interview with Ann-Kristin Becker, a PhD student at the Greifswald University Hospital, about how Bayesian networks can be used in systems biology to identify what variables are interdependent and to what degree

Can’t see the forest for the trees? It can be very difficult to find relevant relationships in large, diverse datasets. Using Bayesian networks (BNs), the bioinformatics expert Ann-Kristin Becker, a member of the LiSyM network for systems biological research, was able to clearly identify which of 500 biomedical parameters relate to nonalcoholic fatty liver disease, or NAFLD. To achieve this, Becker, who is also a PhD student in Professor Dr. Lars Kaderali’s lab at the Institute of Bioinformatics at the Greifswald University Hospital, fine-tuned the BN approach. Not only did she reduce the manual work and computing time necessary, she also obtained more meaningful results. The method can also be used to solve problems outside of biomedicine. In the following interview, Becker talks about the improvements she made and how BNs work in general.

1. What are Bayesian networks?

Ann-Kristin Becker: Simply put, they are models used to identify how things are connected. For example, they can show what factors increase the risk of developing a disease, or what happens when several such factors occur at the same time. The networks also allow you to assess whether these relationships are direct or indirect as well as how weak or strong they are. Bayesian networks are named after the English mathematician and statistician Thomas Bayes, who lived in the 1700s.

2. What can BNs be used for besides biomedicine?

Becker: They can be used in bioinformatics and whenever patterns need to be analyzed. They are widely used in speech recognition, image processing, medical diagnoses, the analysis of consumer behavior, spam filters, and many other fields. In these cases, BNs are used to calculate how well single or multiple variables correspond to a pattern while also providing the probability for this. For example, with emails, BNs decide whether an email is spam or not within fractions of a second. They can be used whenever you have many factors interacting with each other and decisions must be made.

3. Can you explain this with a simple example?

Figure: Bayesian network with the nodes “rain,” “sprinkler,” and “grass wet.” Source: Wikipedia

Becker: Wikipedia has an example that is easy to understand. It consists of three variables, or nodes as they are called in Bayesian networks: “rain,” “sprinkler,” and “grass wet.” When it rains or when the sprinkler is on, the grass gets wet. The BN graphical representation shows an arrow pointing from each of the upper two nodes for “rain” and “sprinkler” to the node “grass wet” below. There is also an arrow pointing from “rain” to “sprinkler” that represents the causal relationship between them in which the sprinkler is unlikely to run when it rains.

4. Can arrows between two nodes also point in opposite directions?

Becker: No. All arrows linking nodes go only in one direction in BNs, as in the example on Wikipedia. The arrows are like one-way streets. They also never come full circle, not even when there are three, four, or more nodes. The result is therefore a directed network. This type of network allows you to see how one node is connected to another, and whether this connection is direct or indirect via one or more nodes.
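The one-way, never-circular structure described here can be sketched in a few lines of Python. This is only an illustration, assuming the three nodes of the Wikipedia example; the names `parents`, `is_acyclic`, and `with_cycle` are made up for this sketch. It stores each node's parents and uses the standard library's `graphlib` to check that the arrows never come full circle.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to its parents, i.e. the nodes whose arrows point to it.
parents = {
    "rain": set(),
    "sprinkler": {"rain"},
    "grass wet": {"rain", "sprinkler"},
}


def is_acyclic(parent_map):
    """Return True if the arrows never form a closed loop."""
    try:
        # static_order() raises CycleError when the graph contains a cycle.
        list(TopologicalSorter(parent_map).static_order())
        return True
    except CycleError:
        return False


# An ordering in which every node appears after all of its parents.
order = list(TopologicalSorter(parents).static_order())

# Hypothetical broken variant: an extra arrow from "grass wet" back to "rain".
with_cycle = {**parents, "rain": {"grass wet"}}

print(order)
print(is_acyclic(parents), is_acyclic(with_cycle))
```

The second network is rejected because following its arrows leads from “rain” back to “rain,” which a Bayesian network does not allow.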

5. Where do the probabilities in the tables come from?

Becker: Normally, from the data available. I suspect that the Wikipedia example is fictitious, however. If it were based on real data, the table under the rain node would mean that it rained in 20 percent of all recorded cases and did not rain in 80 percent. So, if you have a lot of relevant data, you can deduce or estimate probabilities, like the 20 percent likelihood of rain on a given day in the Wikipedia example. Likewise, the table for the sprinkler node would show that the sprinkler was active on only one rainy day out of 100, making the probability of the sprinkler running on rainy days 1 percent (0.01), as opposed to 40 percent on days without rain. The large difference between these two values indicates a dependency between those nodes. In the fictitious Wikipedia BN example, rain clearly affects whether the sprinkler runs or not.
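Estimating such table entries from data amounts to counting. The following minimal sketch assumes a made-up list of 500 daily records, constructed so that the relative frequencies reproduce the numbers quoted from the Wikipedia example; the records and variable names are hypothetical and not taken from any real dataset.

```python
from collections import Counter

# Hypothetical daily records as (rain, sprinkler) pairs.
records = (
    [(True, True)] * 1          # rainy day, sprinkler on
    + [(True, False)] * 99      # rainy day, sprinkler off
    + [(False, True)] * 160     # dry day, sprinkler on
    + [(False, False)] * 240    # dry day, sprinkler off
)

counts = Counter(records)
n_total = len(records)
n_rain = counts[(True, True)] + counts[(True, False)]
n_dry = n_total - n_rain

# Relative frequencies serve as estimates of the table entries.
p_rain = n_rain / n_total
p_sprinkler_given_rain = counts[(True, True)] / n_rain
p_sprinkler_given_dry = counts[(False, True)] / n_dry

print(f"P(rain) = {p_rain:.2f}")
print(f"P(sprinkler | rain) = {p_sprinkler_given_rain:.2f}")
print(f"P(sprinkler | no rain) = {p_sprinkler_given_dry:.2f}")
```

With real data, the counts would of course come from observations, and the estimates become more reliable the more records are available.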

6. Who or what calculates these dependencies?

Becker: There are many ready-to-use tools for BNs that are based on machine learning and modern artificial intelligence. Usually, thousands of possible models must be tested to find the one that fits best. The more data you have to process and the greater the number of variables, the more time-consuming it becomes. For my research project, I learned which dependencies or relationships exist between 500 different variables taken from the medical data of 4,000 patients. This process occurs automatically with the help of algorithms, of course, and always results in a directed network that shows all of the relationships between all variables in my data.
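To give a feeling for why the search becomes so time-consuming, the sketch below counts how many different directed networks without cycles exist for a given number of nodes, using Robinson's recurrence. It is only meant to illustrate the growth of the search space; it is not the learning algorithm used in the project, and the function name `count_dags` is chosen for this example.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of directed acyclic graphs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in (3, 4, 5, 10):
    print(n, count_dags(n))
```

Three nodes allow 25 possible networks, five nodes already 29,281, and ten nodes roughly 4 × 10^18. Testing every candidate is therefore out of the question for hundreds of variables, which is why learning tools rely on heuristic searches that evaluate only a small fraction of the possible models.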

7. How can you find a link to a certain variable, in your case to NAFLD?

Becker: You just have to look at the area in the network where the variable you want is located. All of the variables connected to my NAFLD node by arrows, either directly or via a few intermediate nodes, have a strong relationship to it. Variables that are connected with NAFLD only via many other nodes lie on the edge of the relevant area, or outside it. These variables have only a weak relationship to my variable, or none at all. This is a major advantage of good Bayesian networks: They differentiate between direct and indirect influences.
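One simple way to make “close to the NAFLD node” concrete is to count how many arrows separate each variable from it. The sketch below does this with a breadth-first search on a tiny, entirely hypothetical network; the placeholder names `var_a` to `var_e` do not stand for real findings. As a simplification, it ignores the direction of the arrows and only measures the distance.

```python
from collections import deque

# Hypothetical arrows (parent, child); placeholders, not results from the study.
edges = [
    ("var_a", "NAFLD"),
    ("NAFLD", "var_b"),
    ("var_c", "var_a"),
    ("var_d", "var_e"),
]


def distances_from(target, edge_list):
    """Count the arrows between target and every node reachable from it."""
    neighbors = {}
    for parent, child in edge_list:
        neighbors.setdefault(parent, set()).add(child)
        neighbors.setdefault(child, set()).add(parent)

    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


distances = distances_from("NAFLD", edges)
print(distances)
```

Here `var_a` and `var_b` are direct neighbors, `var_c` is one step further away, and `var_d` and `var_e` do not appear at all because no chain of arrows links them to the NAFLD node.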

In Bayesian networks, algorithms are used to learn models of the direct and indirect relationships between factors.

8. Can we differentiate between correlations and causal relationships?

Becker: That is a question of interpretation. For one, Bayesian networks only describe dependencies. If the Wikipedia example were based on a real situation, then the probability that the grass gets wet would depend strongly on whether it was raining or if the sprinkler was on. The grass is highly likely to be wet if it is raining or if the sprinkler is on. If neither is the case, then the probability is naturally much lower. This means that rain is a possible cause for the grass being wet, as is indicated by the arrow. Whether or not dependencies can really be interpreted as causal relationships depends on the quality of the BN and the quality of my data. The network gives me at least hypotheses that I can test further.
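How the probability of wet grass depends on rain and the sprinkler can be worked out by multiplying the table entries along the arrows and summing over everything that is not fixed. The sketch below assumes the values of the Wikipedia example: the entries for rain and the sprinkler are the ones quoted in this interview, while the entries for “grass wet” are an assumption recalled from that example and should be checked against it. All function and variable names are illustrative.

```python
from itertools import product

P_RAIN = 0.2
# P(sprinkler on | rain), keyed by rain.
P_SPRINKLER = {True: 0.01, False: 0.4}
# P(grass wet | sprinkler, rain), keyed by (sprinkler, rain); assumed values.
P_WET = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Probability of one complete state: product of the three table entries."""
    p_r = P_RAIN if rain else 1 - P_RAIN
    p_s = P_SPRINKLER[rain] if sprinkler else 1 - P_SPRINKLER[rain]
    p_w = P_WET[(sprinkler, rain)] if wet else 1 - P_WET[(sprinkler, rain)]
    return p_r * p_s * p_w


def prob(**fixed):
    """Sum the joint probability over all states consistent with the fixed values."""
    total = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        state = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if all(state[name] == value for name, value in fixed.items()):
            total += joint(rain, sprinkler, wet)
    return total


p_wet_given_rain = prob(wet=True, rain=True) / prob(rain=True)
p_wet_given_neither = (
    prob(wet=True, rain=False, sprinkler=False) / prob(rain=False, sprinkler=False)
)
p_rain_given_wet = prob(wet=True, rain=True) / prob(wet=True)

print(f"P(wet | rain) = {p_wet_given_rain:.4f}")
print(f"P(wet | no rain, no sprinkler) = {p_wet_given_neither:.4f}")
print(f"P(rain | wet) = {p_rain_given_wet:.4f}")
```

Under these assumed values, the grass is wet in about 80 percent of rainy cases and never when neither rain nor sprinkler is present. The last line runs the reasoning backwards: given wet grass, rain is the explanation in roughly 36 percent of cases. These numbers describe dependencies only; whether they reflect causes is, as stated above, a matter of interpretation.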

9. What did you change in this approach?

Becker: As I said, I had about 500 variables to work with. There were just too many possible ways for them to be connected. Learning the model would have been extremely time consuming and would have resulted in a network with too much information that would be very difficult to understand. I wanted to have a well-organized graph which quickly showed the most important relationships. I also wanted to keep the computing time in check. So, simply put, I used a cluster method to structure my variables into meaningful groups.

10. What are the benefits of grouping variables?

Becker: It reduces the number of nodes. I took similar variables, like body fat, body mass index, and other factors relating to body weight, and I grouped them together into one node. It is much easier to learn a network with fewer, connected nodes. So as not to overlook anything important for NAFLD, I developed another method. Step by step, I broke down the most important groups—the ones most closely associated by arrows with NAFLD—into smaller subgroups. This lets me zoom in on the area in the network that is important for NAFLD automatically, so to speak. This area can be studied in detail while maintaining an overview of the entire network.
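The idea of merging similar variables into one node can be illustrated with a deliberately simple stand-in: variables whose measurements correlate strongly are placed in the same group. This is not the cluster method used in the project, which is not described in detail here. The threshold of 0.8, the tiny dataset, and all names are assumptions made for this sketch.

```python
from itertools import combinations
from math import sqrt

# Hypothetical measurements for five patients; invented for illustration.
data = {
    "body_fat": [20, 25, 30, 35, 40],
    "bmi": [22, 24.5, 27, 29.5, 32],
    "waist": [80, 86, 90, 97, 102],
    "age": [50, 30, 60, 30, 50],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def group_variables(table, threshold=0.8):
    """Merge variables whose absolute correlation reaches the threshold."""
    parent = {name: name for name in table}

    def find(name):
        # Follow the links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(table, 2):
        if abs(pearson(table[a], table[b])) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in table:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


groups = group_variables(data)
print(groups)
```

In this toy dataset, the three weight-related variables end up in one group and age stays on its own, so a network would be learned on two nodes instead of four. Splitting an important group back into its members corresponds to the step-by-step refinement described above.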

11. What skills do you need to be able to work with Bayesian networks?

Becker: You need a basic understanding of Bayesian networks. If you know a little about computer science and probability theory, you should be able to read up on it. In general, it’s not so difficult to work with BNs. It’s when you have large amounts of heterogeneous data that it gets difficult. Tables are ideal for organizing data, but there are many other methods of working with Bayesian networks that are useful for many different types of data and data formats.

12. Are you satisfied with your results, and would you recommend BNs and your approach to others?

Becker: My project went very well. I would recommend BNs especially for interdisciplinary projects, because these networks make connections more accessible than statistics alone. My cluster approach has major benefits when working with a lot of heterogeneous data that would otherwise make it difficult for researchers who are experts in their particular fields to maintain an overview. This method reduces the manual work and computing time required, while also making the information Bayesian networks provide more meaningful.
