The whole of Big Data is greater than the sum of its parts


How to win in the application economy by finding business value in information and potential data.

David Hodgson, February 19, 2015

“The Imitation Game,” the current Oscar-nominated film about code breaker Alan Turing, reminds us that from the early days of computers we have thought of “data” as something organized into rows and columns; something that could be structured. A newspaper or book clearly contained information, but it wasn’t really data because to a computer it appeared to have a random organization; it was unstructured and couldn’t be processed usefully by computer programs.

The revolution underlying the application economy is the emergence of new tools, and enough processing power, to glean value from unstructured information, thus turning it into data. We call it “Big Data” because this revolution gives access to much more data than we had before. The code has changed forever.

Myth busting

There is a myth that “Big Data analytics” is all about NoSQL databases and unstructured data. In fact, a lot of the clever analysis that companies are doing with high-volume transactional data is achieved with structured data alone. This is Big Data too. The two archetypal use cases today are:

  • Recommendation engines: used in real time or post-sale to suggest additional purchase options based on what other similar buyers have bought
  • Fraud detection: usually real time, alerting on unusual behavior patterns (access point, transaction time, purchase type, etc.) that might be indicative of fraud.
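At its core, the first archetype can be reduced to co-occurrence counting over past purchases. The sketch below uses invented toy baskets, and a production recommendation engine would use far more sophisticated collaborative filtering over millions of transactions, but the underlying idea is the same: suggest the items most often bought alongside what the customer already has.

```python
# Minimal co-occurrence recommendation sketch (toy data, not a real engine).
from collections import Counter
from itertools import combinations

# Hypothetical past purchase baskets.
baskets = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"camera", "tripod"},
    {"laptop", "mouse"},
]

# Count how often each ordered pair of items appears together.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most often bought alongside `item`."""
    paired = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [other for other, _ in paired.most_common(k)]

print(recommend("camera"))  # ['sd_card', 'tripod']
```

The heavy lifting in a real system is not the counting logic but doing it at scale, which is exactly where the compute-power revolution described below comes in.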

The revolution that has happened here is based simply on the availability of abundant compute power. This makes it possible to run complex queries that used to take hours in seconds, or in periods short enough to deliver effective real-time value.

In the mainframe world, IBM achieved this by offloading complex DB2 SQL queries to its DB2 Accelerator – the ex-Netezza device that attaches directly as an extension to the mainframe to receive data and processing instructions at lightning speed, transparently to the applications spawning the requests. The downside is that this technology is expensive and only for the well-heeled elite. Hadoop democratized large-scale compute power by making it available through massively parallel use of commodity servers. The cloud providers democratized access to it with IaaS and SaaS offerings at low prices on a pay-for-use business model.

The sum of the parts

Many innovative and valuable analyses are being done purely on unstructured data. For instance, by analyzing text data like Tweets, Facebook posts and emails sent to customer service, companies might discover whole new emerging problems that they could build products to solve. At a minimum they might be able to validate or eliminate new ideas and “fail fast,” as the axiom of lean innovation teaches us. Getting there earlier than the competition is all that’s needed; he who has the best data scientist wins!

However, the most immediate ways to augment or create new business processes probably come from combining the two types of data, structured and unstructured. The use of social data, or person-based feeds, described above is called “sentiment analysis.” By combining data from the product catalog with sales data and a sentiment analysis, companies can quickly get an early grasp of the shape and size of customer dissatisfactions. This allows product managers to make changes and then use the same data sources as a feedback loop to see whether the investment fixed the customer issues.
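The combination can be sketched in a few lines. Everything below is invented for illustration: the keyword list stands in for a real NLP sentiment model, and the sales dictionary stands in for the structured sales database, but it shows how weighting sentiment by sales volume sizes a dissatisfaction.

```python
# Toy sketch: combine unstructured posts (sentiment) with structured sales data.
NEGATIVE = {"broken", "slow", "refund", "disappointed"}  # stand-in for an NLP model

# Hypothetical social posts and support emails, tagged by product.
posts = [
    ("widget-a", "love this, fast shipping"),
    ("widget-a", "arrived broken, want a refund"),
    ("widget-b", "disappointed, app is slow"),
    ("widget-b", "slow and broken"),
]

sales = {"widget-a": 5000, "widget-b": 200}  # units sold, from the sales system

def negativity(text):
    """Fraction of words in `text` that are negative keywords."""
    words = set(text.replace(",", " ").split())
    return len(words & NEGATIVE) / max(len(words), 1)

# Average negativity per product, then scale by units sold to estimate impact.
scores = {}
for product, text in posts:
    scores.setdefault(product, []).append(negativity(text))

for product, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(product, round(avg, 2), "estimated affected units:", int(avg * sales[product]))
```

The same two data sources then serve as the feedback loop: after a product change, rerun the report and watch whether the negativity scores fall.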

And the increasingly important use case of fraud detection can be hugely enriched by unstructured data. Logs of movement through the Internet, or other infrastructure, can reveal deeper patterns than transaction origin alone. Activity on social media might indicate buying (or other) behavior patterns preceding a fraud, or help illuminate correlations with post-fraud selling activity. These are just two simple examples.

The imitation game

Some analysts tell us that 80 percent of the data a company has today is unstructured data. As the IoT becomes a reality, unstructured data will become more like 99.9 percent of the data a company has. The winners in the application economy will be those that can find business value in all this information and potential data.

A survey from last December revealed that 67 percent of large companies are in production with Big Data analytics.

Although that figure is probably on the high side, as survey results tend to be, the chances are that if you are not doing it yet, your competition is.

You had better start playing the imitation game quickly and bust the new code of Big Data for yourself.

Big Data – It’s a zoo out there


Even the data analytics darling of today could be extinct tomorrow – how to tame the beast that is Big Data with the right IT skills and tools.

David Hodgson, February 12, 2015

Hadoop became a rock star in 2014, emerging into mainstream IT from relative obscurity and being recognized by analysts in formal market analyses. But equally important as Hadoop itself is the plethora of other tools in its ecosystem, also fueled in the main by the influential Apache Software Foundation.

The revolution in data analytics we see today just would not have happened without the confluence of open source software and very cheap processing power, whether from the cloud or from commodity servers in-house. Those two forces were like the finger of God in the software world, kicking off the equivalent of a Cambrian explosion of engineering creations.

My illustration below gives a brief overview of some of the major parts of the Hadoop ecosystem, but there are actually many others; this was all I could fit easily on one PowerPoint slide for a recent talk I gave on the subject.


Large animal pictures

The peculiar, perhaps Indian-sounding name Hadoop was taken from the creator’s son’s toy elephant, hence also the logo. Following this theme, Mahout is an Indian term for the elephant keeper, the person who leads and maintains control over the elephant. And Ambari is the name of a special sort of howdah, the seat or throne that an elephant can carry on its back in India.

It’s this complexity, with its implication of arcane knowledge known only to insiders, that is still holding many companies back from being successful. Yes, we have yet another IT skills shortage, and we will be fighting over the best talent in this area for a while yet, probably until more tools emerge that either bring order to the chaos or entirely remove the need for the lower-level knowledge.

With all this complexity it really is a zoo out there, hence the need for Apache ZooKeeper, a service that lets you track the configuration data for all these components and maintain the connections between them as you move systems and components around or move new projects into production.
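Concretely, ZooKeeper stores that configuration as a small hierarchy of nodes ("znodes") that every component reads from one well-known place. The layout below is purely illustrative, invented here to show the kind of tree a Hadoop shop might keep:

```
/cluster
  /config
    /hive        → metastore connection string
    /hbase       → list of region servers
  /services
    /worker-01   → ephemeral node; disappears if the worker dies
    /worker-02
```

Because every component looks the configuration up in the tree rather than in local files, moving a system or promoting a project to production means updating one znode instead of chasing settings across the zoo.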

Natural selection or genetic engineering?

Great diversity is always indicative of creative change – the evolutionary forces are certainly at work here. Many new species and varieties appear constantly and certainly some of the creations we see today will be extinct tomorrow. Preserved perhaps, stuffed and inactive in a museum of software, but no longer a part of the living zoology.

We are already seeing a decline in the use of the original MapReduce processing model and the growing use of SQL layers to process Hadoop data. Even Hadoop itself, today’s data analytics darling, could be extinct tomorrow, displaced in the dominant gene pool by Pachyderm: software related only by the inference in its name. The latter is an exciting new startup that uses Docker containers to store the data and builds on CoreOS for the processing infrastructure.

Perhaps saying, “It’s a zoo out there,” is an understatement and really it is like the actual jungle where only the fittest will survive this initial bloom of new life. Hearing this, the timid may well decide that they don’t want to come outside to play; they will stay indoors with their RDBMS and traditional data warehouses. I suspect the Dodo did that!

If you want to avoid becoming a fossil yourself you cannot hang back; now is the time for IT to learn this stuff, and for lines of business (LOBs) to start demanding access to it via their IT departments, or to simply bypass IT and start playing with it on the cloud somewhere.

IT as zookeeper or ringmaster

So how does IT tame this jungle, circus or whatever metaphor you like best for this wild ride? How can it manage this diversity, giving its LOBs the tools that will drive a competitive edge for the company while containing costs at the same time?

CA Technologies has the answers and will be talking about them at the Gartner BI and Analytics event in March. Be there or risk the likelihood of becoming a stony artifact of your former self!

How are you taming big data within your organization? Leave me a comment below.