Published: 21.01.09
Risk research

What links Open Source and literature?

The frequency of words in texts, the size of companies and the links between components of Linux software distributions all follow approximately the same mathematical distribution: they obey Zipf’s law. ETH Zurich researchers have tested how this distribution arises, using Linux packages as their data.

Niklaus Salzmann
The number of packages (y axis) to which more than C links point (x axis). On a double logarithmic scale, all four Debian Linux distributions that were studied yield straight lines with a gradient of approximately -1, which corresponds to Zipf’s law.

In the first half of the twentieth century, the American linguist George Kingsley Zipf studied how often each word occurs in literary texts. A few words, e.g. «the» and «and», were very frequent, but the majority of words occurred only rarely. The resulting pattern could be expressed numerically: the most frequent word occurred about twice as often as the second most frequent and three times as often as the third most frequent, i.e. the frequency of a word was inversely proportional to its rank. This has since been called Zipf’s law.
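The inverse relationship between rank and frequency can be illustrated with a minimal Python sketch. The toy sample text and the function names rank_frequency and zipf_prediction are invented for illustration; they are not taken from Zipf’s work or from the ETH Zurich study.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_prediction(top_count, rank):
    """Count that Zipf's law predicts for a rank, given the top word's count."""
    return top_count / rank


# Hypothetical toy text, constructed so that the counts are easy to follow.
sample = "the " * 6 + "and " * 3 + "of " * 2 + "law"
table = rank_frequency(sample)
for rank, word, count in table:
    # Compare the observed count with the count predicted from the top word.
    print(rank, word, count, zipf_prediction(table[0][2], rank))
```

In a real text the observed counts only approximate the prediction, which is why the law is usually checked on a double logarithmic plot rather than word by word.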

Scientists have discovered that this distribution holds true – more or less – for quite different systems, e.g. the numbers of visitors to web sites, the size of towns and the size of companies in numerous countries. Researchers suspected that this recurring pattern is associated with the growth process of the systems being studied.

Free raw material thanks to Open Source

Doctoral student Thomas Maillart and Didier Sornette, Professor at the Chair of Entrepreneurial Risks, together with Sebastian Späth and Georg von Krogh, Professor at the Chair of Strategic Management and Innovation at ETH Zurich, have now demonstrated empirically the conditions under which a distribution obeying Zipf’s law occurs. They did this by examining the links between Linux software packages. Their results were published in the scientific journal Physical Review Letters and mentioned in Nature as a Research Highlight.

In an earlier publication, Sornette had already suggested carrying out an empirical test of Zipf’s law. When searching for a subject for his thesis, his doctoral student Thomas Maillart came across an article about open source software by Sebastian Späth and Georg von Krogh. Maillart realised that this contained data with which the origin of Zipf’s law could be verified.

Linux is an operating system similar to Microsoft Windows or Mac OS. Many versions of it are available to download free of charge via the Internet. Each Linux distribution consists of various software packages, which thus represent free raw material for the scientists to use in their research. Debian Linux, the distribution studied by the ETH Zurich researchers, comprised only 474 packages in 1996, whereas there were already more than 18,000 in 2007.

Characteristic distribution arises as a result of growth

The packages are connected by numerous links through which they call one another. First of all, for four versions of Debian, Maillart examined whether the number of incoming links per package obeys Zipf’s law. This was confirmed (see graphic). The scientists then studied how the number of links referring to a package develops over time. They assumed a proportional growth pattern: the more links that already lead to a package, the faster the number of links increases.
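The kind of check described here, counting how many packages have more than C incoming links and reading off the gradient on a double logarithmic scale, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names ccdf_points and loglog_slope and the list of link counts are hypothetical, and the simple least-squares fit is not necessarily the estimation method the researchers used.

```python
import math


def ccdf_points(link_counts):
    """For each distinct value C, count the packages with more than C links."""
    points = []
    for c in sorted(set(link_counts)):
        n_above = sum(1 for x in link_counts if x > c)
        # Skip points that cannot be placed on a logarithmic axis.
        if c > 0 and n_above > 0:
            points.append((c, n_above))
    return points


def loglog_slope(points):
    """Least-squares gradient of log(count) plotted against log(C)."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical incoming-link counts for a handful of packages (not real data).
example_counts = [1, 1, 1, 1, 2, 2, 3, 5, 8, 20]
print(loglog_slope(ccdf_points(example_counts)))
```

A gradient close to -1 on such a plot is what the caption of the graphic describes as corresponding to Zipf’s law.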

The evaluation of the Linux package data showed that the researchers’ model was correct. For new packages, the number of links deviated from Zipf’s law, and the characteristic distribution arose only as a result of the growth of the Linux distribution. A condition that the researchers had used in their model was also confirmed: the fluctuation in the number of links pointing to a package becomes larger as that number grows. Consequently, the number of links can drop back to zero even if it is very large, which means that the package is no longer being used.
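The ingredients named above (new packages entering, changes proportional to the current number of links, fluctuations that grow with size, and packages that fall out of use) can be combined in a toy simulation. The following Python sketch is not the researchers’ model itself: the function name simulate_links, the parameters steps, births_per_step and sigma, and the choice of Gaussian fluctuations are all assumptions made for illustration.

```python
import random


def simulate_links(steps=200, births_per_step=5, sigma=0.3, seed=0):
    """Toy proportional-growth process for incoming link counts."""
    rng = random.Random(seed)
    links = []
    for _ in range(steps):
        # New packages enter with a single incoming link (assumed).
        links.extend([1.0] * births_per_step)
        # Proportional growth: the random change scales with the current
        # link count, so larger packages fluctuate more in absolute terms.
        links = [c * (1.0 + rng.gauss(0.0, sigma)) for c in links]
        # A package that falls below one link is treated as unused and dropped.
        links = [c for c in links if c >= 1.0]
    return sorted(links, reverse=True)


counts = simulate_links()
# Number of surviving packages and the largest link counts among them.
print(len(counts), counts[:5])
```

Because the fluctuations are multiplicative, even a package with many links can shrink and eventually be dropped, which mirrors the observation that a large number of links is no guarantee of continued use.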

Conclusions on entrepreneurial risks

Thomas Maillart describes himself as a risk manager. He says that he had already calculated risks as a Civil Engineering student at EPFL, where these risks were connected with the safety of building structures. He then worked in a company insuring Internet risks. He has now written the paper on Zipf’s law in the context of his thesis on Internet risks at the Chair of Entrepreneurial Risks at ETH Zurich.

Being able to estimate the growth of Linux packages is exciting from an entrepreneurial point of view. However, the significance of the paper extends far beyond this specialist area, because the findings apply to all systems obeying Zipf’s law. The size of companies is one example: just as a large number of links pointing to a Linux package does not guarantee that the package will remain in use, a company’s size provides no certainty that the company will survive, as the financial crisis has confirmed.

Literature reference

Maillart T, Sornette D, Spaeth S & von Krogh G. Empirical Tests of Zipf’s Law Mechanism in Open Source Linux Distribution. Phys. Rev. Lett. 2008; 101 (218701). doi:10.1103/PhysRevLett.101.218701.
