Recently, our Chief Scientific Officer Dan Bogdanov and STACC CEO Kalev Koppel visited Äripäev, an Estonian business-focused radio station, to talk about various issues concerning the abundance of data.
The discussion revolved around the quality of data today and how accessible it really is. Companies are trying to approach customers in ever more personal ways, but the main obstacles are regulations and the risk of data falling into the hands of criminals. Listen to the discussion in Estonian here.
“There are an estimated 44 zettabytes of data across various systems altogether. And it will double in the next 2 years,” said Kalev Koppel.
1 zettabyte is 1 sextillion (10^21) bytes, or 1 trillion gigabytes. To put it into perspective, scientists estimate that there are, very roughly, 7 quintillion (10^18) grains of sand on our planet. So, what are we doing with all this data, and with all the data that will accrue in the coming years?
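To get a feel for those numbers, here is a quick back-of-the-envelope calculation using the figures cited above (the sand-grain count is a rough estimate, not a measured value):

```python
# Scale comparison based on the figures in the article
bytes_per_zettabyte = 10**21        # 1 sextillion bytes
bytes_per_gigabyte = 10**9
grains_of_sand = 7 * 10**18         # rough estimate of sand grains on Earth

data_now = 44 * bytes_per_zettabyte  # estimated global data today

# 1 zettabyte is exactly 1 trillion gigabytes
print(bytes_per_zettabyte // bytes_per_gigabyte)  # → 1000000000000

# Roughly how many bytes of data exist per grain of sand on Earth
print(data_now // grains_of_sand)                 # → 6285
```

In other words, there are already several kilobytes of stored data for every grain of sand on the planet.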
Every organisation produces a great deal of data in the course of its operations. Many keep collecting it without a clear purpose in mind, forgetting to question its quality or whether it can even be used within regulations. Often, organisations get nothing out of it at all – they simply believe that any and all data might become useful at some point. Since data has become a universally valuable asset for any company, demand for data scientists has surged – a profession that, not long ago, was hardly considered desirable.
Naturally, the next step would be to sort and filter the existing data and delete everything that is not worth keeping. But this would consume tremendous time and resources. So, can we hand this task over to a computer? Could data science become fully automated?
The short answer is no. “Working with data has its limitations, and so does machine learning. In theory, machine learning is supposed to adapt to our world, but things constantly change – and they might change in ways that make the base data the machine works with irrelevant,” said Koppel.
Nevertheless, AI has made a giant leap in data synthesis, notably in art – a field generally thought to be dominated by humans. “There is now AI technology that can synthesise images from any text prompt you can come up with – the algorithms have been trained on a huge database of existing art and literature,” commented Dan Bogdanov. The same principle of data synthesis can be very useful in data science. “Synthetic data that is based on real data can help us disclose certain data which would normally be deemed too delicate,” he continued.
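The idea behind synthetic data can be sketched in a few lines: fit simple statistics to the real, sensitive values, then sample fresh values from that model, so the synthetic set preserves the overall shape without containing any real record. This is a minimal, hypothetical illustration (the `synthesize` function and the sample figures are invented for this example; real synthetic-data tools model far richer structure):

```python
import random
import statistics

def synthesize(real_values, n, seed=0):
    """Sample n synthetic values matching the mean and spread of real_values."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded for reproducibility
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical sensitive measurements (e.g. patient readings)
real = [41.2, 39.8, 43.5, 40.1, 42.7, 38.9, 44.0, 41.8]

# 1,000 synthetic values: statistically similar, but no real record appears
synthetic = synthesize(real, 1000)
```

An analyst can then study the synthetic set in place of the delicate original, which is the disclosure use case Bogdanov describes.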
And at the end of the day, we have to ask ourselves: how can we protect the unfathomable amount of data on our hands? Currently, we operate on the basis of user consent and, of course, various data protection regulations. “But the same regulations do not apply in every region, and we cannot predict how future governments might act. Because of that, there need to be certain ‘handcuffs’ in place for data protection, to ensure ethical data analysis,” Dan said.
Where do we stand now? “Currently, expectations regarding data usage are well managed, and most companies understand the limitations set on both the data and its analysis,” commented Kalev Koppel. Thanks to a better understanding of how data can be useful, businesses are increasingly using it to automate certain processes. The key takeaway from the discussion is that we have to start taking control of the data we produce and store – and make a habit, in both personal and business contexts, of curating our data and minimising the storage of useless data.