You may well be a big fan of the hit TV series “The Big Bang Theory” if you want to know the workings behind Data Science, since obviously this is purely geek stuff.
But have you ever wondered how, each time you play an audio CD in your CD-ROM drive, the likes of Apple iTunes or Windows Media Player automatically retrieve the name of the artist, complete with the album title and song details?
You may say it’s all about the vast online databases, such as CDDB and Gracenote, that these applications draw on. Simple enough.
Gracenote, which now owns CDDB, alone has over 130 million tracks kept in its database vault. Describing it as vast is indeed an understatement. Mining those millions of songs to retrieve the details of the audio CD in your CD-ROM drive in 15 seconds or less is data science at work.
So, what is data science?
Data science can be described as the science of extracting information from large datasets – now identified in the enterprise storage world as big data – and presenting it in a form that is useful to non-data experts. These are usually businesses, whose end products eventually reach us, the consumers on the other end.
In a McKinsey Quarterly interview in 2009, Google’s chief economist, Hal Varian, stated that “the sexy job in the next ten years will be statisticians… the ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill.”
So, who are the people who practice this science of extracting information from big data and turning it into something useful for consumers? These people are called – tada! – data scientists.
At EMC World 2012 in Las Vegas, we were introduced to some of these unique individuals who make a living out of other people’s information: their shopping habits, their tastes, their moods, the way they tweet a message, their entire social network behavior, practically everything.
When Facebook suggests people who may be related to you, or whom you may want to be friends with, the top social networking website is not simply guessing who you may want to connect with. The same thing happens on LinkedIn or Twitter. As for Amazon.com or even our very own Sulit.com.ph, the other photos that appear on the page are not a random selection of products being promoted online but rather other items that you may actually be interested in. Data science mines not only what you may want but also what you may need.
Data scientists mine, dig deeper and scrape, run the statistics and analyze all the data, and try to define their findings accurately or at least find a tipping point in them. Every individual’s exposure online, or even offline for that matter, is potential information for data scientists to scrutinize and analyze.
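To make the idea of a non-random suggestion concrete, here is a minimal sketch in Python of one simple approach: ranking people you are not yet connected to by the number of friends you have in common. This is only an illustration. The `suggest_friends` function and the sample names are hypothetical, and nothing here is meant to represent how Facebook, LinkedIn or Twitter actually produce their suggestions.

```python
from collections import Counter


def suggest_friends(friends, user, limit=3):
    """Rank people who are not yet friends with `user` by shared friends.

    `friends` maps each person to the set of people they are connected to.
    This is a hypothetical helper for illustration only.
    """
    direct = friends.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the user and anyone already connected to the user.
            if candidate != user and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties are broken alphabetically.
    ranked = sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


# Made-up sample network.
friends = {
    "ana": {"ben", "carla"},
    "ben": {"ana", "dino", "ella"},
    "carla": {"ana", "dino"},
    "dino": {"ben", "carla"},
    "ella": {"ben"},
}

print(suggest_friends(friends, "ana"))  # ['dino', 'ella']
```

In this made-up network, dino shares two friends with ana and ella shares one, so dino is suggested first.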
A summit for data science
Coinciding with EMC World was a summit for data science backed by no less than Greenplum, an enterprise solutions company that focuses on large-scale data warehousing and analytics.
The gathering brought together different personalities from academia, social enterprise, start-ups, and the public sector to help EMC World attendees explore and define the path forward in a data-driven world.
Most of the speakers focused on the science part of data, its application, and different statistical issues in the analysis of big data.
Speakers at the summit included Nate Silver, a statistician and political forecaster for the New York Times; Michael Brown, chief technology officer at ComScore; Tony Jebara, co-founder of Sense Networks and a professor at Columbia University; Michael Chui, author of the McKinsey Big Data Report; Hadley Wickham, an author and educator; and Nora Denzel, senior vice president, big data marketing and social at Intuit.
Although there was a variety of opinions and approaches to handling today’s data, one thing was certain among the people who shared their insights at the summit: there is a need to extract correct and useful information from the abundance of both structured and unstructured data. In other words, the task is to mine valuable data out of purely raw data.
“It’s a messy data set (unstructured data), there’s a lot of junk in it,” said Ted Neely, founder and chief executive of Network Insights, commenting on the huge amount of data available, particularly on the Internet. “Constantly evolving the classification systems we developed over time to be able to handle the new colloquialisms that are causing noise versus signal… and understanding over time the qualifications between noise and signal will give you better indicators (in qualifying the bad data sets).”
Unstructured data is a generic label for any corporate information that is not stored in a database, whether textual or non-textual. Data produced by email messages, PowerPoint presentations, Word documents, collaboration software and instant messages is considered textual unstructured data, while non-textual unstructured data is generated in media such as JPEG images, MP3 audio files and Flash video files.
Today, the accurate data gathering that brings value to businesses and consumers is done through extensive data mining, clear statistics, and advanced analytics, which have given rise to a new field of science. Data scientists have become an integral part of this field, so it is no wonder that the enterprise world, besieged by today’s big data sets, is fast taking notice.
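The textual versus non-textual split described above can be sketched in a few lines of Python. This is a rough illustration only: the `classify` function is hypothetical, and the file-extension lists are assumptions chosen to match the examples in the text, not an official or complete taxonomy.

```python
from pathlib import PurePath

# Assumed extension lists, based on the examples mentioned in the text.
TEXTUAL = {".eml", ".msg", ".ppt", ".pptx", ".doc", ".docx", ".txt"}
NON_TEXTUAL = {".jpg", ".jpeg", ".mp3", ".flv", ".swf"}


def classify(filename):
    """Label a file as textual or non-textual unstructured data."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in TEXTUAL:
        return "textual"
    if suffix in NON_TEXTUAL:
        return "non-textual"
    return "unknown"


for name in ["quarterly-report.docx", "holiday.JPG", "podcast.mp3", "sales.db"]:
    print(name, "->", classify(name))
```

A real classifier would look at the content of a file rather than its name, since an extension says little about what is actually inside.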
Big Data, what is it?
Big data generally refers to the massive growth of stored data in the computing world.
By 2010, the entire digital universe had accumulated around 1.2 zettabytes, according to data provided last year by enterprise storage solutions provider EMC.
But before we go any further, let’s qualify what a zettabyte is: a zettabyte is equal to a billion terabytes or, to put it in a more overwhelming context, one trillion gigabytes! (Most consumer computers nowadays are in the gigabyte-storage range.)
Furthermore, EMC said that in 2011 the digital universe had already produced over 300 quadrillion files. And it gets even better, or in other words, bigger: by 2020, the same digital universe is projected to grow to 35 zettabytes.
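For readers who want to check the arithmetic, the figures above can be worked out in a short Python sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000, which is the convention behind "a billion terabytes". The 500-gigabyte drive is a hypothetical consumer drive used only for scale.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
TERABYTE = 10**12
ZETTABYTE = 10**21

# One zettabyte expressed in smaller units.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE   # one billion
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE   # one trillion

# Figures quoted in the article, in zettabytes.
size_2010_zb = 1.2
size_2020_zb = 35
growth_factor = size_2020_zb / size_2010_zb

# Hypothetical 500 GB consumer drive, used only to give a sense of scale.
drive_bytes = 500 * GIGABYTE
drives_for_2010 = (12 * ZETTABYTE // 10) // drive_bytes

print(f"{terabytes_per_zettabyte:,} terabytes in a zettabyte")
print(f"{gigabytes_per_zettabyte:,} gigabytes in a zettabyte")
print(f"Growth from 2010 to 2020: about {growth_factor:.0f} times")
print(f"{drives_for_2010:,} drives of 500 GB to hold 1.2 zettabytes")
```

Under these assumptions, holding the 2010 digital universe would take about 2.4 billion such drives, and the 2020 figure is roughly 29 times larger.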
To store all that data, enterprise storage products are already available from top-tier companies such as EMC, IBM, HP, Dell and NetApp. Cloud computing, which involves delivering hosted services over the Internet, has also become an important part of solving many companies’ big data storage problems, and online cloud infrastructure is already in place from the likes of Amazon, Google and Microsoft, to name a few.
Aside from cloud computing, virtualization – the creation of a virtual (rather than actual) version of something, such as an operating system, a server, a storage device or network resources – has become another fixture in big data management. EMC, for its part, believes that virtualized infrastructure will become more pervasive and will form the foundation for cloud computing.