The answers to your Big Data questions are everywhere


I follow up my previous post with questions that you should be asking yourself when it comes to getting more out of the data your organization probably isn’t using – unstructured data.

David Hodgson, March 3, 2015

Where is unstructured data? As they said about the ’60s radio character Chickenman, “He’s everywhere!”

It’s inside your organization, under your nose; outside your organization, ripe for the picking like low-hanging fruit; and in strange places, needing a degree of pre-processing and parsing.

In my last post I talked about the power of combining structured and unstructured data to unlock the business value realized by recent revolutions in data analytics technologies. But what is unstructured data? If 80 to 90 percent of the data you have today is unstructured, what is it? Where is it? How can it be used? And how can you get more?

Given that 80 percent of the valuable business data most companies use today is structured, they’re not getting business value from the majority of the data available to them right now.

You must accept this challenge: the winners in the application economy will be those that find business value in unstructured data and use it in combination with the structured data that undergirds their existing mission-critical business systems.

Being the biggest loser is not a positive statement in the world of Big Data!

The big picture: what is it and where is it?

What is it? Any source of information that doesn’t have a defined format or structure intended for generalized processing as rows and columns is probably unstructured or semi-structured data, and it could be valuable to you: a report, a log, an image, a form or any sort of document or file.

Even Excel files that are visually organized in rows and columns are considered semi-structured data for the purposes of this discussion – only the Excel application knows how to do anything with it. In the big picture, this isn’t very useful.

Inside your organization think of where the most valuable, prescient data really is: could it be in notes people take, emails that people exchange, Excel spreadsheets they create, logs of their activity, CRM records or social media interactions with customers?

Sure the reference data in your business systems is critical, but the data that is driving daily business decisions and longer-term strategic decisions may be elsewhere. Could you access that? Would it be valuable if you could?

Outside your organization, where is the data that describes your adjacent markets, or the next innovation that you will (or should) either create or capitalize on? Where is the low-hanging fruit? Only you really know, but could it be on news websites, in discussions on social media, in stock price reports or in SEC filings on company websites? Could it be on a government website that lists foreclosures or competitive bids, in weather reports or news reports? It could be anywhere accessible via the Internet.

Sure, your employees could read all this material and process it mentally to your advantage, but can they really, and do they? How could you get hold of it in an automated way if it were valuable? This is the low-hanging fruit, usually available through published APIs, for purchase, or collectable using simple, free open source tools.
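To make “automated collection” concrete, here is a minimal Python sketch that parses a government-style foreclosure listing. The CSV payload, columns and county names are invented for illustration; in practice you would fetch the text from a published API with an HTTP client.

```python
import csv
import io

# Hypothetical payload, shaped like what a public listings API might return.
PAYLOAD = """county,address,auction_date,opening_bid
Travis,101 Oak St,2015-04-01,150000
Travis,22 Elm Ave,2015-04-08,98000
Hays,9 Pine Rd,2015-04-01,120000
"""

def load_foreclosures(text):
    """Parse CSV text into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["opening_bid"] = int(row["opening_bid"])
        rows.append(row)
    return rows

listings = load_foreclosures(PAYLOAD)
travis = [r for r in listings if r["county"] == "Travis"]
print(len(listings), len(travis))  # 3 listings, 2 in Travis county
```

Once the feed is parsed into records like these, it can be joined against your own structured data, which is where the business value appears.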

The big secret: how do I get hold of it and where do I keep it?

Unstructured data usually requires new tools and processes to extract intelligence and deliver business value. Absent structure, you need ways to extract or create context and metadata about the data: what it is about, when it was created and by whom.

For the purposes we are discussing, this metadata cannot be created manually. To be useful, these processes need to be scalable and real-time. They also need to be relevant to the business, and they can’t cost more than the value they derive.
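As a small illustration of extracting that who/when/what metadata automatically, the Python standard library’s email parser can pull it from a message’s headers without anyone reading the body. The message below is invented for the example:

```python
from email.parser import Parser

# A raw email: the body is unstructured text, but the headers
# already carry ready-made metadata about author, time and topic.
RAW = """From: jane@example.com
To: support@example.com
Date: Tue, 03 Mar 2015 09:30:00 -0500
Subject: Checkout keeps failing

The new checkout page errors out whenever I pay with a gift card.
"""

def extract_metadata(raw_message):
    """Pull who/when/what metadata out of an email without reading the body."""
    msg = Parser().parsestr(raw_message)
    return {
        "author": msg["From"],
        "created": msg["Date"],
        "topic": msg["Subject"],
    }

meta = extract_metadata(RAW)
print(meta["author"], "|", meta["topic"])
```

Run at scale across a mail archive, this kind of cheap, automatic tagging is what makes the unstructured bodies searchable and analyzable later.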

Enter the magic of open source tools and commodity processing power, either on-premises or in the cloud. Without these ingredients you would not be able to get hold of Big Data or store it in a cost-effective way.

Forget your conventional data warehouses – while they’re not going away, they’re also not your go-forward tools. In the age of the Internet of Things (IoT), you will be looking at one or several of the new file systems and so-called NoSQL databases that are available today.

The table below gives some idea of the popular offerings available and what you might use them for. What is it you want to do?


How can I be the ‘biggest winner’?

You can’t be the ‘biggest winner’ without asking the right questions and finding the answers. All the questions above, and the specific questions for your business, will help you uncover your own secret sauce. Will you mine data others have collected, or create a new collection for yourself?

Take a look at what other companies have done:

Twitter has got millions of people to enter their thoughts on every subject under the sun. LinkedIn has got people to enter their career summaries and their contacts. Nike found personal health data. Some clever electronic medical records vendors have found drug usage data. Every web-commerce site is a potential source of profiling data and geospatial data.

To ensure you don’t get left behind, ask the questions, get engaged with the potential for your business and carve out your winning, differentiated position in the application economy.

After all, the answers are everywhere.

Image credit: Sergei Golyshev

The whole Big Data is greater than the sum of its parts


How to win in the application economy by finding business value in information and potential data.

David Hodgson, February 19, 2015

“The Imitation Game,” the current Oscar-nominated film about codebreaker Alan Turing, reminds us that from the early days of computers we have thought of “data” as something organized into rows and columns; something that could be structured. A newspaper or book clearly contained information, but it wasn’t really data, because to a computer it appeared to have a random organization; it was unstructured and couldn’t be processed usefully by computer programs.

The revolution underlying the application economy is the emergence of new tools, and enough processing power, to glean value from unstructured information, thus turning it into data. We call it “Big Data” because this revolution gives access to much more data than we had before. The code has changed forever.

Myth busting

There is a myth that “Big Data analytics” is all about NoSQL databases and unstructured data. In fact, a lot of the clever analysis that companies are doing with high-volume, transactional data is achieved with structured data alone. This is Big Data too. The two archetypal cases most used today are:

  • Recommendation engines: used in real time or post-sale to suggest additional purchase options based on what other, similar buyers have bought
  • Fraud detection: usually real-time alerting on unusual behavior-pattern data (access point, transaction time, purchase type, etc.) that might be indicative of fraud.
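The first case can be sketched with simple co-occurrence counting over purchase histories. This is a toy version with invented baskets and item names, not a production recommender:

```python
from collections import Counter

# Hypothetical purchase histories; a real system would read these
# from transaction records at far larger scale.
BASKETS = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"laptop", "keyboard"},
    {"phone", "case"},
]

def recommend(item, baskets, top=2):
    """Suggest the items most often bought together with `item`."""
    together = Counter()
    for basket in baskets:
        if item in basket:
            together.update(basket - {item})
    return [other for other, _ in together.most_common(top)]

print(recommend("laptop", BASKETS))  # "mouse" ranks first: co-bought twice
```

Real recommendation engines add similarity weighting and matrix factorization on top, but the core signal – “buyers of X also bought Y” – is exactly this count.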

The revolution that has happened here is based simply on the availability of lots of compute power. This gives the ability to process complex queries that used to take hours in seconds, or in periods short enough to afford effective real-time value.

In the mainframe world IBM achieved this by offloading complex DB2 SQL queries to their DB2 Accelerator – the ex-Netezza device that attaches directly as an extension to the mainframe, receiving data and processing instructions at lightning speed in a way that is transparent to the applications spawning the requests. The downside is that this technology is expensive and only for the well-heeled elite. Hadoop democratized large-scale compute power by making it available through massively parallel use of commodity servers. The cloud providers then democratized access with their IaaS and SaaS offerings, priced cheaply on a pay-for-use business model.
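The divide-and-conquer idea that Hadoop scales across commodity servers can be sketched in a few lines of plain Python: map each chunk of input to partial results independently (the step that parallelizes across machines), then reduce the partials into a total. This is the classic word-count illustration of the pattern, not Hadoop’s actual API:

```python
from collections import Counter
from functools import reduce

# Input split into chunks, as a cluster would split files into blocks.
CHUNKS = [
    "big data big value",
    "big compute cheap compute",
]

def map_chunk(chunk):
    """Map step: count words in one chunk, independently of all others."""
    return Counter(chunk.split())

def reduce_counts(a, b):
    """Reduce step: merge two partial counts into one."""
    return a + b

totals = reduce(reduce_counts, map(map_chunk, CHUNKS), Counter())
print(totals["big"], totals["compute"])  # 3 2
```

Because each `map_chunk` call touches only its own chunk, thousands of them can run at once on cheap servers; only the small partial counts travel to the reducer.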

The sum of the parts

Many innovative and valuable analyses are being done purely using unstructured data. For instance, by analyzing text data like tweets, Facebook posts and emails sent to customer service, companies might visualize whole new emerging problems that they could build products to solve. Minimally, they might be able to validate or eliminate new ideas and “fail fast,” as the axiom of lean innovation teaches us. Doing it earlier than the competition is all that’s needed; he who has the best data scientist wins!

However, the most immediate ways to augment or create new business processes probably come from combining the two types of data, structured and unstructured. The use of social data, or person-based feeds, described above is called sentiment analysis. By combining data from the product catalog with sales data and a sentiment analysis, companies can quickly get an early grasp on the shape and size of dissatisfactions. This allows product managers to make changes and then use the same data sources as a feedback loop to see whether the investment fixed the customer issues.
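As a toy sketch of that combination, assume hypothetical product names, structured sales figures, and a sentiment score already derived from social posts. Weighting negative sentiment by sales volume shows where a dissatisfaction matters most:

```python
# Structured data: units sold per product (invented numbers).
SALES = {"model-x": 12000, "model-y": 300}
# Derived from unstructured data: mean sentiment of posts, in [-1, 1].
SENTIMENT = {"model-x": -0.6, "model-y": -0.7}

def dissatisfaction_exposure(sales, sentiment):
    """Weight negative sentiment by sales volume: which fix comes first?"""
    scores = {
        product: sales[product] * -min(sentiment.get(product, 0.0), 0.0)
        for product in sales
    }
    return max(scores, key=scores.get)

# model-y has the angrier posts, but model-x's complaint reaches 40x
# more customers, so it dominates the exposure score.
print(dissatisfaction_exposure(SALES, SENTIMENT))  # model-x
```

The point of the sketch is the join itself: neither the sales table nor the sentiment feed alone ranks the problems correctly.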

And the increasingly important use case of fraud detection can be hugely enriched by the use of unstructured data. Logs of movement through the Internet, or other infrastructure, can reveal deeper patterns than transaction origin alone. Activity on social media might indicate buying (or other) behavior patterns preceding a fraud, or help illuminate correlations with post-fraud selling activity. These are just two simple examples.
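One classic heuristic of this kind, sketched here with invented log events, flags a user who appears in two countries within a window too short for travel:

```python
from datetime import datetime, timedelta

# Hypothetical access-log events: (user, timestamp, country).
EVENTS = [
    ("alice", datetime(2015, 3, 1, 9, 0), "US"),
    ("alice", datetime(2015, 3, 1, 9, 20), "RU"),
    ("bob",   datetime(2015, 3, 1, 9, 0), "US"),
    ("bob",   datetime(2015, 3, 1, 18, 0), "US"),
]

def flag_impossible_travel(events, window=timedelta(hours=1)):
    """Flag users seen in two countries within one window."""
    flagged = set()
    last_seen = {}  # user -> (time, country) of most recent event
    for user, when, country in sorted(events, key=lambda e: e[1]):
        last = last_seen.get(user)
        if last and last[1] != country and when - last[0] <= window:
            flagged.add(user)
        last_seen[user] = (when, country)
    return flagged

print(flag_impossible_travel(EVENTS))  # {'alice'}
```

Production systems score many such signals together rather than acting on one, but each signal is a small pattern mined from logs just like this.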

The imitation game

Some analysts tell us that 80 percent of the data a company has today is unstructured data. As the IoT becomes a reality, unstructured data will become more like 99.9 percent of the data a company has. The winners in the application economy will be those that can find business value in all this information and potential data.

A survey from last December revealed that 67 percent of large companies are in production with Big Data analytics.

Although this is on the high side for survey results, the chances are that if you are not doing it yet, your competition is.

You’d better start playing the imitation game quickly and start busting the new code of Big Data for yourself.

Big Data – It’s a zoo out there


Even the data analytics darling of today could be extinct tomorrow – how to tame the beast that is Big Data with the right IT skills and tools.

David Hodgson, February 12, 2015

Hadoop became a rock star in 2014, emerging into mainstream IT from relative obscurity and being recognized by analysts in formal market analyses. But equally important as Hadoop itself is the plethora of other tools in its ecosystem, also fueled, in the main, by the influential Apache Software Foundation.

The revolution in data analytics we see today just would not have happened without the confluence of open source software and very cheap processing power, whether that’s cloud or commodity servers in-house. Those two forces were like the finger of God in the software world, kicking off the equivalent of a Cambrian explosion of engineering creations.

My illustration below gives a brief overview of some of the major parts of the Hadoop ecosystem, but there are actually many others; this was all I could fit easily on one PowerPoint slide for a recent talk I gave on the subject.


Large animal pictures

The peculiar, perhaps Indian-sounding name Hadoop was taken from the creator’s daughter’s toy elephant, hence also the logo. Following this theme, Mahout is the Indian term for the elephant keeper, the person who leads and maintains control over the elephant. And Ambari is the name of a special sort of howdah, the seats or thrones that elephants carry on their backs in India.

It’s this complexity, with its implication of arcane knowledge known only to insiders, that is still holding many companies back from being successful. Yes, we have yet another IT skills shortage, and we will be fighting over the best talent in this area for a while yet; probably until more tools emerge that either bring order to the chaos or entirely remove the need for the lower-level knowledge.

With all this complexity it really is a zoo out there, hence the need for Apache ZooKeeper, a product that lets you track the configuration data for all these components and maintain the connections between them as you move systems and components around or move new projects into production.

Natural selection or genetic engineering?

Great diversity is always indicative of creative change – the evolutionary forces are certainly at work here. Many new species and varieties appear constantly and certainly some of the creations we see today will be extinct tomorrow. Preserved perhaps, stuffed and inactive in a museum of software, but no longer a part of the living zoology.

We are already seeing a decline in the use of the initial MapReduce process and the growing use of SQL layers to process Hadoop data. Even Hadoop itself, today’s data analytics darling, could be extinct tomorrow, displaced in the dominant gene pool by Pachyderm: software related only by the allusion in its name. The latter is an exciting new startup that uses Docker containers to store the data and is built on CoreOS for the processing infrastructure.

Perhaps saying, “It’s a zoo out there,” is an understatement and really it is like the actual jungle where only the fittest will survive this initial bloom of new life. Hearing this, the timid may well decide that they don’t want to come outside to play; they will stay indoors with their RDBMS and traditional data warehouses. I suspect the Dodo did that!

If you want to avoid becoming a fossil yourself, you cannot hang back; now is the time for IT to learn this stuff and for lines of business (LOBs) to start demanding access to it via their IT departments, or to simply bypass IT and start playing with it in the cloud somewhere.

IT as zookeeper or ringmaster

So how does IT tame this jungle, circus or whatever metaphor you like best for this wild ride? How can they manage this diversity and both give their LOBs the tools that will drive a competitive edge for the company, and contain costs at the same time?

CA Technologies has the answers and will be talking about them at the Gartner BI and Analytics event in March. Be there or risk the likelihood of becoming a stony artifact of your former self!

How are you taming big data within your organization? Leave me a comment below.


It’s a New Year for Big Data analytics


While many people have already rung in the New Year, the Chinese New Year is still a month away, which gives us a bit more time to ponder what lies ahead.

David Hodgson, January 14, 2015

As you may know, the Chinese New Year in 2015 is February 19. On this calendar 2015 is the year of the Sheep, but it might also be called the year of the Goat, or even the Ram. I am not sure what Chinese philosophy would say this means in terms of predictions for the world of data analytics, but for sure we will see many more following the leaders like sheep, much goat-like stupidity and a few charging ahead into new territories like rams!

So, late for Jan 1st, but well in time for Feb 19, here are five of my predictions for the world of data analytics in 2015:

Big Data will become even more talked about

Buzz terms don’t die easily. Look at ‘cloud’! Despite the fact that many find the term ‘Big Data’ ill-defined and simplistic to the point of being meaningless, the industry will continue to use it, and in fact it will become better understood by more people. Both the general public and people in IT will become more educated in the terms and tools of modern data analytics, but we will exit 2015 with more hits on a search for articles on Big Data than we started the year with. The term will not die this year.

Every major enterprise will have a Big Data strategy by the end of 2015

Surveys in 2014 revealed that enterprises were ahead of smaller businesses in plans for adoption of the new data analytics technologies. Despite this, there was a lingering feeling that there was a lot of talk and not much real action. A lot of the initial Hadoop adoption was still at ‘science experiment’ status and not production business systems. But 2015 will see a rush forwards as some firms get competitive advantage and others rush to catch up. By the end of 2015 every large enterprise will have dedicated real money and staff to this area and have on paper a definite strategy for how they are going to extract business value from new, unstructured data sources.

Data agility will become an aspirational driver of Big Data strategies

Gone are the days when all a company’s business data could be stored in one place, be it a mainframe database or a distributed data warehouse. As we learn to derive business value from new sources of data we need to become agile in both access and storage methods.

Gartner’s aspirational vision of a ‘Logical Data Warehouse’ that encompasses governance and use over many data sources will become the norm for data management strategies in 2015. In 2014, established players tried to head this off by combining Hadoop into their platforms (e.g. Teradata partnering with MapR). However, the adoption of new tools is as much about lowering the cost of IT and preserving future choice as it is about new capability. Data agility and vendor lock-in are antithetical concepts. This year will see a further erosion of the incumbents as enterprises make strategic choices that achieve their immediate goals and afford them flexibility for the future.

Hadoop will remain the predominant tool for Big Data strategies

In 2014 Hadoop went from relative obscurity to mainstream IT tool, with several supported distributions and many ancillary tools like Apache Drill. In 2015 Hadoop, its derivatives and its associated product ecosystem will remain the most-used tools for new data analytics projects, although groups will also increasingly deploy Cassandra and MongoDB for specific needs, including real-time analysis.

The Internet of Things will start to shape Big Data strategies

Every time I watch CNN on the latest phase of the recovery of AirAsia flight 8501, I marvel at our inability to track planes and collect data from them. Surely now the airlines and the FAA will be forced to solve this in 2015 and figure out how to collect and store data continuously on all flights. Will we as the public settle for less, now that we know the truth about how planes can be lost?

Similarly in 2015 all large enterprises will start to think about large-scale data collection and storage strategies as part of their future directions. The data from mobile devices, things tagged with RFID trackers and the gradual instrumentation of everything will provide too much potential business value for them to ignore it. And the Internet of Things will move from a fuzzy, somewhat joke concept, to a solid part of the landscape that IT departments must manage.

So much happens in one year in the world of IT that five predictions will hardly scratch the surface of what will unfold. I remain excited to see what really happens but confident that my five predictions will be in the mix somewhere. The world of data analytics and data management is changing quickly and perhaps at the forefront of IT evolution. All that said I am absolutely sure we will still be calling it Big Data next December!

Image credit: ILRI



What’s the big deal about big data?


How companies like Nike and Rent The Runway are leaders in using data to transform their business model in the application economy

David Hodgson, November 25, 2014

Underneath the varied landscape of the application economy, the new raw material is data, and learning to mine and craft it is the key to success.

So what is this data and how can you transform your business?

Structured versus unstructured data

You often hear statements like 70 percent of business data is on the mainframe, or 80 percent of business data is structured data and only 20 percent is unstructured. These statements might be true, but they can mask what’s important for you to know.

Some analysts estimate that 80 percent of a company’s data is unstructured and 20 percent is structured. The apparent disconnect here with the previous statements is explained by the word “business.” The fact is that most companies today have a ton of data, but they are only deriving business value from a small part of it and that part is predominantly structured data.

Deriving value from unstructured data

The companies that are differentiating themselves as we enter the application economy era are those that have learned to derive business value from unstructured data. This has been achieved largely through new analytical tools like Hadoop, or by combining analysis of unstructured data with structured data in platforms such as Oracle, Teradata and DB2.

These might be established companies reinforcing their existing business models by using analytics and unstructured data to improve their operations. For example, banks can detect fraudulent activity from logs, geospatial data and buying-pattern analysis. Another example could be a car company monitoring Twitter activity for sentiment analysis around a new car model to predict potential recalls.

Other established companies such as Nike have re-imagined their business model by aligning their sportswear to initiatives like health monitoring and data collection and analysis around the concepts known as the quantified self, biometric data and activity logging. Nike consumer technology officer Chris Satchell, who spoke on a panel hosted by CA EVP Amit Chatterjee on day two of CA World’14, said the company got into “wearables” not because it wanted to be in consumer devices, but because it wanted to be closer to its athletes.

And of course disruptive companies like Twitter, Rent The Runway and Kaggle thrive on generating and using data to create new business models that threaten the incumbents. CEO and co-founder of Rent The Runway Jennifer Hyman, who spoke on CA CEO Mike Gregoire’s panel on day one of the show, said: “The customer is willing to give out huge amounts of data if you give her something worthwhile for it.” 

Discover “free” business value

There are several sorts of unstructured data that you could turn into valuable business data with the right collection and analysis techniques. This could be “free” business value hiding right under your nose:

Data you might already have in your organization’s assets:

Log data of various sorts: examples every company has are system logs of user activity, system events and errors, and comments in service desk tickets or customer surveys.

Raw machine data from infrastructure like point-of-sale networks. Usually a subset of transaction data is loaded into formal databases, but the other data items, such as teller ID, might be useful for analysis.

Data from the devices comprising the Internet of Things. This data source is still emerging and growing, and the bulk of it will be outside your organization, but today you might use RFID tracking, allow people to bring their own devices, or have sensors in devices in a manufacturing facility. Collecting data from any of these sources could yield new business value if you set some data scientists onto it!
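As a sketch of turning such logs into analyzable records, here is a minimal Python parser. The log format, user names and field names are invented for the example; real formats vary, but most yield to a regular expression like this one:

```python
import re

# Hypothetical application log lines: timestamp, then key=value pairs.
LOG = [
    "2015-03-03 10:02:11 user=jsmith action=login status=ok",
    "2015-03-03 10:02:15 user=jsmith action=export status=denied",
]

LINE = re.compile(r"(?P<ts>\S+ \S+) (?P<kv>.+)")

def parse(line):
    """Turn one free-text log line into a structured record (dict)."""
    m = LINE.match(line)
    record = {"ts": m.group("ts")}
    record.update(kv.split("=", 1) for kv in m.group("kv").split())
    return record

records = [parse(line) for line in LOG]
denied = [r for r in records if r["status"] == "denied"]
print(denied[0]["user"], denied[0]["action"])  # jsmith export
```

Once log lines become records, they can be filtered, aggregated and joined against your structured business data like any other table.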

Data you could add to your unstructured databases for almost zero cost:

Social data from LinkedIn, Facebook, Twitter (or your own facility, like Salesforce Chatter) might be a rich source of market trend or sentiment data.

Public data from websites with APIs might be anything from weather reports to traffic data or government statistics. Then there is the textual data that could be mined from sources such as public financial listings and earnings reports.
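A small sketch of consuming such a public API, with a JSON payload invented and inlined here rather than fetched over HTTP (the station name, field names and readings are all hypothetical):

```python
import json

# Shaped like what a public weather API might return; in practice you
# would fetch this over HTTP from the published endpoint.
RESPONSE = json.dumps({
    "station": "KAUS",
    "observations": [
        {"time": "2015-03-03T09:00Z", "temp_c": 4.0},
        {"time": "2015-03-03T10:00Z", "temp_c": 6.5},
    ],
})

def latest_temperature(payload):
    """Extract the most recent reading from an API response."""
    data = json.loads(payload)
    # ISO 8601 timestamps sort correctly as plain strings.
    latest = max(data["observations"], key=lambda o: o["time"])
    return data["station"], latest["temp_c"]

print(latest_temperature(RESPONSE))  # ('KAUS', 6.5)
```

A few lines like these, scheduled to run regularly, are all it takes to start accumulating an external data feed alongside your internal sources.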

Where are you at with discovering what data you have and what new data you could collect? What percentage of your business data is unstructured? 

The answers to these questions could well be a leading indicator of your success in the application economy. I’d love to hear your thoughts in the comments section below, or connect with me on Twitter at @dmgh7.

Image credit: Elif Ayiter