Friday, October 3, 2014

The topological structure of big data

One interesting talk at DICE2014 was given by Mario Rasetti, on understanding the big data of our age.

You may wonder what this has to do with physics, but please let me explain. First, when we say big data, what are we really talking about? The number of cataloged stellar objects is \( 10^{21}\). Pretty big, right? But consider this: in 2013, 300 billion emails were sent, 25 billion SMS messages made money for phone companies, and 500 million pictures were uploaded to Facebook. In total, from those activities mankind produced \( 10^{21}\) bytes in 2013. And every year we produce more and more data.

How much is \( 10^{21}\) bytes? How about 323 billion copies of War and Peace, or 4 million copies of the entire Library of Congress. In four years it is estimated that we will produce \( 10^{24}\) bytes, which is larger than Avogadro's number!

Now how can we get from data to information to knowledge and then wisdom? From computer science we know how to lay all this data out sequentially, and people have considered vector spaces for this. But does this make sense? For example, if we take a social network like Twitter, what we have are simplicial complexes. What Mario Rasetti proposed was to extract the topological information from those kinds of large data sets. In particular, he computes the homology groups and Betti numbers, which were discussed in prior posts on this blog, because they can be computed in polynomial time.

We know that if we triangulate a manifold and omit one point we obtain different topological invariants, just like puncturing a three-dimensional balloon results in a two-dimensional surface. Therefore, in computing the Betti numbers we get fluctuations, but as more and more nodes are included in the computation the fluctuations stabilize.
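To make the last two paragraphs concrete, here is a minimal sketch of how Betti numbers can be computed for a small simplicial complex, and how they change when one piece is omitted. This is my own toy illustration, not Rasetti's code: the function names are made up, coefficients are taken in Z/2 to keep the linear algebra simple, and removing one triangle from a hollow tetrahedron stands in for the puncture.

```python
from itertools import combinations

def closure(top_simplices):
    """Return every face (of every dimension) of the given simplices."""
    faces = set()
    for s in top_simplices:
        s = tuple(sorted(s))
        for k in range(1, len(s) + 1):
            faces.update(combinations(s, k))
    return faces

def rank_gf2(rows):
    """Rank over Z/2 of a matrix whose rows are integer bitmasks."""
    basis = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]  # eliminate the leading bit and keep reducing
    return len(basis)

def betti_numbers(top_simplices):
    """Betti numbers b_0, b_1, ... with Z/2 coefficients (an assumption)."""
    faces = closure(top_simplices)
    max_dim = max(len(f) for f in faces) - 1
    by_dim = [sorted(f for f in faces if len(f) == d + 1)
              for d in range(max_dim + 1)]
    # ranks[d] is the rank of the boundary map from d-simplices to
    # (d-1)-simplices; ranks[0] and ranks[max_dim + 1] stay zero.
    ranks = [0] * (max_dim + 2)
    for d in range(1, max_dim + 1):
        index = {f: i for i, f in enumerate(by_dim[d - 1])}
        rows = []
        for simplex in by_dim[d]:
            mask = 0
            for omit in range(len(simplex)):
                face = simplex[:omit] + simplex[omit + 1:]
                mask |= 1 << index[face]
            rows.append(mask)
        ranks[d] = rank_gf2(rows)
    # b_d = dim C_d - rank(boundary_d) - rank(boundary_{d+1})
    return [len(by_dim[d]) - ranks[d] - ranks[d + 1]
            for d in range(max_dim + 1)]

# Hollow tetrahedron: a triangulated sphere (the "balloon").
sphere = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
# Same complex with one triangle omitted: the punctured balloon.
punctured = sphere[:-1]

print(betti_numbers(sphere))     # [1, 0, 1]
print(betti_numbers(punctured))  # [1, 0, 0]
```

The closed sphere has one connected component and one enclosed void, while the punctured version loses the void. On real data one would keep adding nodes and watch whether the list of Betti numbers settles down.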

The link with physics and Sorkin's Causal Set theory is obvious, and the same techniques can be applied there. However, Rasetti did not go in this direction and instead cited the application of the method to biology. In particular, he was able to clearly distinguish whether a patient took a specific drug or a placebo from the analysis of brain MRI images which looked identical to the naked eye.

Recently I saw an article on what Facebook sees in posting patterns when we fall in love:

Now all this looks really scary. Imagine the power of information gathering and topological data mining in the hands of (bad) governments around the world. And not only governments. Big companies like Facebook are abusing the trust of their users and performing unconscionable sociological tests, for example by manipulating advertising. In the biological area, human cloning is rejected because the general population understands the risks, but the public's understanding of big data, and of the ability to mine it for correlations and knowledge, is badly lagging behind the current technical ability. More privacy-violation scandals will occur before public opinion puts pressure on the abusers of trust to curb their bad behavior.
