Big data, big risks?
Big data is hot news. The opportunities for analysing huge amounts of unstructured data are highly valued in industry and science, yet there is also concern about data protection. ETH computer science professor Donald Kossmann researches and teaches in the field of big data and is convinced that the benefits will outweigh the risks.
Mr Kossmann, what do
you think has been the most fascinating application of big data to date?
It’s difficult to say, let me think for a moment... oh
yes, Google Translate. Over the years, linguists have tried to develop
functional models for language without any great success. Today, Google
Translate delivers better quality than any of these models, simply by drawing on
experience: it compares translations that already exist on the Internet.
One often gets the
impression that big data is interpreted in different ways by different people.
How would you define big data and what is new about it?
Big data is first and foremost the automation of
experience. Conventional IT automates processes: you first consider how
something might work at its best and then develop a programme that automates
precisely this process. With big data, it doesn’t stop there – the process is
continually adapted in line with experience.
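To make the distinction concrete, here is a toy sketch in Python (purely illustrative, not drawn from the interview; the scenario, names and numbers are invented): a conventional program applies a fixed decision rule, while a big-data-style program keeps adjusting its rule as new observations come in.

```python
# Toy illustration: a fixed rule vs. a rule that adapts with experience.
# All names and thresholds here are made up for illustration.

FIXED_THRESHOLD = 100.0

def fixed_rule(order_value):
    """Conventional automation: the decision rule never changes."""
    return "review" if order_value > FIXED_THRESHOLD else "approve"

class AdaptiveRule:
    """'Automation of experience': the threshold drifts towards
    the running average of the order values seen so far."""
    def __init__(self, threshold=100.0, learning_rate=0.05):
        self.threshold = threshold
        self.learning_rate = learning_rate

    def decide(self, order_value):
        decision = "review" if order_value > self.threshold else "approve"
        # Update the rule with the new observation (the "experience").
        self.threshold += self.learning_rate * (order_value - self.threshold)
        return decision

rule = AdaptiveRule()
for value in [80, 95, 240, 130, 60]:
    print(value, rule.decide(value), round(rule.threshold, 1))
```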
What are the technical
fundamentals required for this?
It is becoming more affordable to store huge amounts of
data, and computers are becoming more and more powerful. At the same time,
companies such as Google developed completely new software structures in the
1990s, meaning that they were no longer reliant on a mainframe computer to
analyse large amounts of data and were instead able to fall back on hundreds or
thousands of small computers. Practices that previously took place in Google’s
labs have become accessible to everyone over the last few years as a result of
open source developments.
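The shift he describes, from one large machine to many small ones, is the divide-and-merge pattern known from frameworks such as Google’s MapReduce and the open-source Hadoop (named here for context; the interview does not mention them): split the data, process the pieces in parallel, then combine the partial results. A minimal, single-machine sketch of the idea, with Python processes standing in for cluster nodes and invented example data:

```python
# Minimal map/reduce-style word count; processes stand in for cluster nodes.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """'Map' step: each worker counts words in its own slice of the data."""
    return Counter(chunk.lower().split())

def merge(counters):
    """'Reduce' step: combine the partial counts into one result."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    chunks = [
        "big data is first and foremost the automation of experience",
        "the process is continually adapted in line with experience",
        "hundreds or thousands of small computers analyse the data",
    ]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)
    print(merge(partial_counts).most_common(5))
```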
You also give lectures
on big data. To what extent has big data changed teaching and research at ETH?
From a technical perspective, big data is not a
revolution; its fundamental technologies have been known for a long time. As a
result, the courses we offer have not changed fundamentally. However, we have
considerably expanded areas such as “Machine Learning” at ETH, i.e. the
algorithmic and mathematical foundations for big data analyses. Initial attempts
to develop completely new data science courses are currently being carried out
at American universities. We believe, however, that broad, well-grounded IT
training is still very much in demand even in the age of big data. The
continued high level of industry demand for our graduates proves our point.
What has changed in
terms of research with big data?
There are more and more collaborations with industry
in this area. At the same time, there has been a considerable increase in
interest from other scientific disciplines – in biology, for example, where we
are supporting the SystemsX.ch systems biology initiative, or in sociology,
where we are involved in the FuturICT project.
In 2010, you founded
the big data spin-off Teralytics together with an ETH graduate. What do you
offer your clients?
A platform for big data analyses, i.e. software that
can process and analyse very large amounts of data in real time. Very often,
these types of analyses then run on hundreds of computers at the same time.
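One ingredient of real-time analysis is processing events as they stream past rather than storing everything first. A generic sketch of the idea in Python (not Teralytics’ software; the window size and values are invented): keep a rolling statistic over the most recent events and update it with each new one.

```python
# Toy sketch of real-time (streaming) analysis: maintain a rolling statistic
# over the most recent events without storing the whole stream.
from collections import deque

class RollingAverage:
    """Keeps the average of the last `window` values as events stream in."""
    def __init__(self, window=1000):
        self.values = deque(maxlen=window)
        self.total = 0.0

    def add(self, value):
        if len(self.values) == self.values.maxlen:
            self.total -= self.values[0]   # value about to be evicted
        self.values.append(value)
        self.total += value
        return self.total / len(self.values)

stream = RollingAverage(window=3)
for event in [10, 20, 30, 40]:
    print(stream.add(event))
```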
And who are your
clients?
I would rather not name them, because big data still
has quite a negative connotation in the public sphere – contributed to, of
course, by the current debates surrounding data analysis by the NSA in the USA.
But big data’s bad image is undeserved: it has, after all, many useful
applications.
Such as...?
If we succeed, by analysing anonymised health data, in developing effective
new treatments for types of cancer that cannot currently be treated, for
example, then public support will soon grow. There are risks, of course, but
sometimes the benefit of new technologies is so great that society simply has
to accept these risks.
Where are the biggest
technical challenges?
Efficiency is an important area. The amounts of data are growing far faster
than our computing and storage capacities. Given the cost and the energy
required, it is not always practical today to analyse all of the available
data. The question, therefore, is how much data we have to analyse to obtain
significant results. We will continue to research improved real-time data
analysis. Our aim is to obtain the information needed for decisions more
quickly, something that is crucial in crisis situations such as natural
catastrophes. And, of course, data protection is an ongoing concern for us. We
are developing new hardware structures to encrypt and aggregate data so that
we can guarantee that even insiders are unable to draw conclusions about
individuals.
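His point that not all of the data has to be analysed can be illustrated with standard sampling: a random sample yields an estimate whose uncertainty depends on the sample size, not on how large the full data set is. A generic Python sketch (not the ETH or Teralytics method; the synthetic data and figures are invented):

```python
# Illustration: estimate an average from a sample instead of scanning everything.
# The "full" data set here is synthetic; in practice it would be far too large to scan.
import random
import statistics

random.seed(42)
population = [random.expovariate(1 / 50.0) for _ in range(1_000_000)]  # e.g. transaction values

sample_size = 10_000
sample = random.sample(population, sample_size)

estimate = statistics.fmean(sample)
# Standard error of the mean: shrinks with the square root of the sample size,
# essentially independent of the population size.
std_error = statistics.stdev(sample) / sample_size ** 0.5

print(f"estimate: {estimate:.2f} +/- {1.96 * std_error:.2f} (95% confidence)")
print(f"true mean: {statistics.fmean(population):.2f}")
```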
Many big data
applications today access data that is freely available on the Internet. Is
technology that sanitises and encrypts personal data used here?
No, this is the sole responsibility of the user. In
the case of services such as Facebook or Twitter, you agree to let the
companies use your data; these companies can do what they like with it. But
it is, of course, up to you what you make available on the Internet.
Are we on the way to a
“post privacy” society and an all-encompassing public sphere, as predicted by
certain authors?
No, privacy is a basic human need. Perhaps young
people value data protection a little less now, but this generation will also
learn to protect its privacy better using the technical possibilities out
there. In addition, new, different platforms will become available on the
Internet that will allow users more privacy than the current options such as
Facebook. Overall, I am optimistic that people will ultimately use big data to
their benefit.
Donald Kossmann has been Professor of Computer Science at the Institute of Information Systems at ETH Zurich since August 2004.