According to Chris Snijders, Uwe Matzat and Ulf-Dietrich Reips’ 2012 publication in the International Journal of Internet Science, “Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard statistical software.” It often follows a “collect now, sort later” philosophy.
Relational databases are generally unable to thoroughly categorize massive amounts of data at the moment it is added. Over time, though, systems can learn to correctly categorize the data within their sets. Fei-Fei Li explained in a TED talk how computers are being taught to read images, and Google Photos has become remarkably good at this. When I upload a video to YouTube, it not only duplicates the file at different render sizes but also transcribes the audio to create closed captioning. This further processing after upload allows the big data within the system to become more useful over time.
Thomas Sowell’s classic economics book Knowledge and Decisions discusses how the flow of information (its acquisition, processing, and use) enables decision making. He also warns that the time required to collect information across a complex system eventually trades off against the usefulness of that information. Using this as a basis, I would argue that big data can be absolutely useless to an organization or very useful, depending on the tools available. Peter Maass wrote an article on this problem in the context of NSA data collection: if you’re searching for a needle in a haystack, a larger haystack only makes the needle harder to find.
Large data sets are also liabilities. Hospitals have become prime targets for hackers. Why? Because the emphasis on hospitals being HIPAA compliant doesn’t necessarily equate to an audit of technical best practices. The information is sensitive and easily exploitable. The larger your data set, the bigger a target you become.
Although storage is inexpensive today compared to prices in the past, it still comes with a cost. Large data sets cost money to store and maintain, and if the data on them isn’t being utilized, these data sets are nothing more than liabilities. The data storage debate is about to enter a new phase in 2016 as ZFS gets added to Ubuntu while legal questions continue to be raised. ZFS has already emerged as the file system of choice for large data sets, and its adoption by one of the most popular server distributions in the world will be a game changer, but only if the legality of the move is secure; otherwise, businesses will be hesitant to implement a solution that may die in a courtroom somewhere down the line.
Big data can be extremely useful to businesses in giving them an edge over competitors if they are capable of managing the data and using it in a way that provides value to their customers. Amazon is a great example of this. Their suggested purchase features are based on complex algorithms using an individual’s purchasing history, previous shipping addresses and thousands of other data points.
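Amazon’s actual algorithms are proprietary, but the core idea behind purchase suggestions can be illustrated with a minimal item-based sketch: score items the user hasn’t bought by how often they co-occur with the user’s purchases in other customers’ histories. The data and function names below are entirely hypothetical, invented for illustration.

```python
from collections import Counter

# Hypothetical purchase histories (all names and items are made up).
histories = {
    "alice": {"book", "lamp", "desk"},
    "bob": {"book", "lamp", "mug"},
    "carol": {"lamp", "mug"},
}

def recommend(user, histories, top_n=2):
    """Suggest items the user hasn't bought, ranked by how often they
    co-occur with the user's own purchases in other customers' histories."""
    owned = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(owned & items)  # shared purchases act as a similarity weight
        for item in items - owned:    # only score items the user doesn't own yet
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("alice", histories))  # ['mug']
```

A production system would weigh far more signals (shipping addresses, browsing behavior, and those thousands of other data points), but the principle is the same: value comes from the tooling that turns raw histories into a ranked suggestion.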
Netflix has over 83,000 categories for its content and knows who’s watching what and for how long. Now that it is creating its own content, this gives it an edge over the traditional TV studios (NBC, CBS, etc.) because it assumes less risk about whether customers truly want the content being created. Netflix already knows what types of shows its audience is interested in watching and can deliver content based on that demand.
If you plan on using big data, you need to have the tools ready to digest it in order to make it effective.