Node similarity in networks
A problem that Bioinformatics, Google and (probably) Facebook have in common
It’s now been a couple of months since I started an internship at a research institute, in a bioinformatics team that works on cancer research, more precisely on trying to figure out why certain cancerous cells survive treatment while others die. One area of research that I had to familiarize myself with was a rather interesting one: finding similarities between biological entities. A big example of this is predicting a given protein’s function: you look for the proteins it is most similar to, check what those proteins do, and use that information to predict the function of the protein in question.
Another interesting use of something that could be interpreted as similarity is in the search for genes responsible for the expression of a given phenotype. Nowadays biologists can sequence the genome of an organism and compare the expression levels of its genes with the expression of a given phenotype, trying to find the genes most likely responsible for that phenotype. The problem is that the experimental results are still somewhat noisy: genes that are important might appear less expressed, and genes that are less important might appear more expressed. So one way to refine those readings is to look not only at the genes that were most expressed, but also at the genes most “similar” to them, the idea being that strongly expressed genes that are “similar” to other strongly expressed genes are more likely to actually be linked to the phenotype.
Alright then, but how do we define similarity? That’s where networks come in, biological networks (or bionets) to be more precise. Now, there are different types of networks, and each type has different concrete networks with their own upsides and downsides, so I won’t get too much into it. But one thing you can do to establish similarity between two nodes is to use the structure of the network itself. A simple way to measure this is the distance between two nodes: nodes that have an edge between them are probably more similar than nodes that don’t. This idea was central to Google’s original PageRank algorithm: pages were nodes, edges indicated whether one page linked to another, and pages that initially ranked low could have their ranks raised if they were directly connected to pages that ranked highly. But as you probably already suspect, I didn’t pull Google and PageRank out of thin air. There is a paper from 2005 (Morrison et al., 2005) investigating exactly what I was talking about before: how to best rank genes with respect to a given phenotype based on expression data. In it, a method called GeneRank is developed.
So what they did was basically this: the gene expression data gave a natural initial ranking to the genes, and then networks relating the genes were used to boost the ranks of genes closely connected to already well-ranked genes.
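To make that concrete, here is a minimal sketch of a GeneRank-style iteration. This is my own toy reconstruction, not the paper’s exact formulation: the function name and parameters are mine, and I’m assuming a simple undirected 0/1 gene network.

```python
import numpy as np

def gene_rank(adjacency, expression, d=0.85, n_iter=100, tol=1e-9):
    """GeneRank-style iteration: the expression data sets the initial
    ranking, and the network redistributes it, PageRank-style."""
    degrees = adjacency.sum(axis=1).astype(float)
    degrees[degrees == 0] = 1.0           # isolated genes: avoid division by zero
    ex = expression / expression.sum()    # normalized expression = initial ranking
    r = ex.copy()
    for _ in range(n_iter):
        # each gene passes its current rank to its neighbours, but the
        # "teleport" term is the expression data instead of a uniform vector
        r_new = (1 - d) * ex + d * (adjacency @ (r / degrees))
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r
```

The parameter d plays the same role as PageRank’s damping factor: with d = 0 you get back the raw expression ranking, and the larger it is, the more the network is allowed to reshape that ranking.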
To me it was extremely interesting to see an algorithm developed for one purpose (ranking pages in a search engine) being so lightly adapted to a completely different one (finding which genes are most closely related to a phenotype), and it changed the way I looked at these kinds of problems from then on. Which brings me to another really interesting paper, from 2013, that tackles another bioinformatics problem, the prediction of protein functions, and whose efficacy I best understood in terms of a social network.
The paper (Cao et al., 2013) introduces a metric for similarity between nodes in a network that they call Diffusion State Distance (or DSD). It consists of computing, for each node, how often a random walk starting there is expected to visit every other node in the graph (random walks are used because they are a more robust way of measuring distances on a noisy graph, maybe more on that later), and then taking the difference between these results for two nodes. The result is a measure of similarity, or rather of difference, since a higher value means more different. At first I just accepted that this was correct; I was already used to the idea of measuring distance with random walks, and using distance to measure similarity is very intuitive in a network. But they go further than this. The network they use is a Protein-Protein Interaction Network (PPIN), which encodes proteins as nodes and connects proteins that are known to interact with each other (these are extremely useful and are the basis of what I’m currently working on). The idea this paper introduces, then, is that instead of just looking at how easy or hard it is to get from one protein to another as a measure of similarity, we should also look at what those proteins’ neighborhoods look like, the intuition being that two proteins that are close to the same proteins might be more similar than two proteins that are slightly closer to each other. But the intuition that really stuck in my head was this: while you are probably considerably similar to all of your friends, you’re probably even more similar to someone who shares a lot of those friends with you, even if they, themselves, aren’t your friends.
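Here is a toy sketch of that idea on a small unweighted graph, assuming every node has at least one edge. The fixed walk length k and the function name are my simplifications; the paper actually studies the behaviour of this distance as the walks get longer and longer.

```python
import numpy as np

def dsd_matrix(adjacency, k=5):
    """Pairwise Diffusion State Distances from k-step random walks.
    Assumes an unweighted graph where every node has at least one edge."""
    n = adjacency.shape[0]
    P = adjacency / adjacency.sum(axis=1, keepdims=True)  # one-step transition probabilities
    # He[i, j] = expected number of visits to node j by a k-step walk starting at i
    He = np.eye(n)
    step = np.eye(n)
    for _ in range(k):
        step = step @ P
        He += step
    # DSD(u, v) = L1 distance between the visit-expectation rows of u and v
    return np.abs(He[:, None, :] - He[None, :, :]).sum(axis=-1)
```

Two nodes whose rows of He are close end up with a small DSD even when there is no direct edge between them, which is exactly the shared-friends intuition above.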
This immediately got me thinking about social networks, and how it might be possible to use this kind of method to find people who are more “similar” to each other based solely on who they’re friends with. Then I thought about using this idea for the so-common suggestion feature that these kinds of platforms usually have: one could use the similarity between two people to figure out whether or not they might know each other. OK, truth is that this is probably over-engineering the thing, and the feature was probably implemented using something simpler and more along the lines of the PageRank idea: if a person is friends with a lot of people you’re friends with, chances are they’re your friend too (something like the sketch below). Or, even more probably nowadays, the network is fed to a machine learning model that estimates the probability that one person knows another, and that probability decides whether or not to make the suggestion. And this brings me nicely to a last point, and a good one to end this post on: why not just use machine learning everywhere and be done with it?
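Before getting to that question, here is what that bare-bones mutual-friends heuristic could look like, purely illustrative and certainly not what any real platform runs:

```python
def suggest_friends(friends, person, top=5):
    """Rank people `person` doesn't know yet by how many friends they share."""
    counts = {}
    for mutual in friends[person]:
        for candidate in friends[mutual]:
            if candidate != person and candidate not in friends[person]:
                counts[candidate] = counts.get(candidate, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:top]

friends = {
    "ana":  {"bia", "carl"},
    "bia":  {"ana", "carl", "dani"},
    "carl": {"ana", "bia", "dani"},
    "dani": {"bia", "carl"},
}
print(suggest_friends(friends, "ana"))  # ['dani'], who shares two friends with ana
```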
Well, in the case of social networks and search engines, that move was probably made years ago, but in bioinformatics one thing is very important to keep in mind: a lot of the work is used to inform or decide how to proceed with further work in biology, many times being a way to decide which experiments to run. What’s more, a result isn’t easily evaluated: if you have to rank 1000 genes and you can only test 20 at a time, it takes a long time to figure out whether the top 10 produced by the computational method really were the top 10. This is why methods in bioinformatics are more often white-box models, that is, models whose functioning can be well understood. And this is why, even when machine learning models are used, there has to be a good justification for why they produce the results they produce; otherwise it becomes a matter of faith in the method.
Anyway, for the better part of the last couple of months I have been learning about this small part of the world of bioinformatics, a world I previously had no particular intention of diving into, but I have been amazed at how interesting the connection between the biological problems and the theoretical-computer-science solutions turned out to be, so it’s a story I thought worth sharing.