June 19, 2012

5 Questions to ask before embarking on a Big Data Project

When an organisation embarks on a Big Data project, it sets out on a journey laden with landmines. Failing to manage even a single risk can derail the project, even if every other risk was successfully evaded. As they say, “A chain is only as strong as its weakest link”. So what are the weak links in a Big Data project? What are the 5 key questions to ask before embarking on one, so that the risks from these weak links are mitigated and the project is steered towards a successful implementation? Based on real-life experience from the trenches, Flutura Decision Sciences has outlined the top 5 Big Data questions that we feel are vital to pose upfront, before spending dollars on a Big Data project.
The “Dent” test: What is the $ denting business use case we are enabling?
Many engagements are “data forward” as opposed to “use case backward”. It is very important to fully understand the $ impact of the use case being instantiated and the business value of the new data pools being streamed for analysis. For example, how much increase in revenue do we expect when building a recommender engine on a Hadoop cluster to increase the breadth of purchase for online customers? Can a value tracker then track the incremental revenue attributed to recommendations that converted into a sale through the big data solution? Along with identifying the use cases, one of the first tasks at hand is to identify the pools of big data lying untapped by answering the following questions: Do I have big data within my premises? How do I identify it? What can I do with it?
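As a rough illustration of the value tracker idea, the sketch below sums the revenue of order lines that came from a recommendation. The field names (`amount`, `from_recommendation`) and the sample orders are hypothetical, and flagging a line as recommended is a simplifying assumption; a true incremental figure would need a comparison against customers who saw no recommendations.

```python
def recommendation_revenue(orders):
    """Return (revenue attributed to recommendations, its share of total revenue)."""
    attributed = sum(o["amount"] for o in orders if o["from_recommendation"])
    total = sum(o["amount"] for o in orders)
    # Guard against an empty order list
    share = attributed / total if total else 0.0
    return attributed, share


# Hypothetical order lines; in practice these would come from the order system
orders = [
    {"amount": 100.0, "from_recommendation": False},
    {"amount": 40.0, "from_recommendation": True},
    {"amount": 60.0, "from_recommendation": True},
]

attributed, share = recommendation_revenue(orders)
print(f"Attributed revenue: ${attributed:.2f} ({share:.0%} of total)")
```

Tracking this number from the first release makes it possible to compare the expected $ dent with what the solution actually delivers.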
The “Intersect” test: Which event data streams are we finding value in?
Often value lies at the intersect of multiple data streams. For example, in an engagement Flutura worked on in the telecom industry, there were many devices at the periphery emitting events: cell phone towers, firewalls, routers, switches, application logs, OS logs etc. Whenever an adverse event happens, say a denial-of-service attack on a provider, it is important to triangulate the effect across router logs, firewall logs and application logs.
Similarly, in a real-life engagement Flutura worked on for an online travel agency (OTA), the value of the new scenarios enabled by the Hadoop cluster lay at the intersect of three event streams. Apache log files recorded every click event of the user along with the cookie id and IP address. These were overlaid on search events (from city, to city, date, number of passengers flying) recorded in a MySQL search log, which were in turn correlated with the actual booking/payment events stored in an Oracle database. The new Hadoop cluster enabled the organisation to compute the look-to-book ratio at a customer level as opposed to an aggregate corridor level. So which intersect of high-velocity / unstructured event streams do we need to look at for value?
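To make the customer-level look-to-book idea concrete, here is a minimal sketch in Python. It assumes that both search events and booking events can be keyed by the same cookie id; the field names and sample events are hypothetical, not taken from the actual engagement, and a real implementation would run as a job over the cluster rather than in memory.

```python
from collections import Counter


def look_to_book(search_events, booking_events):
    """Return {cookie_id: searches per booking}; None if the customer never booked."""
    looks = Counter(e["cookie_id"] for e in search_events)
    books = Counter(e["cookie_id"] for e in booking_events)
    return {
        cid: (looks[cid] / books[cid] if books[cid] else None)
        for cid in looks
    }


# Hypothetical events: customer c1 searched 4 times and booked twice,
# customer c2 searched 3 times and never booked
searches = (
    [{"cookie_id": "c1", "from_city": "BLR", "to_city": "DEL"}] * 4
    + [{"cookie_id": "c2", "from_city": "BOM", "to_city": "GOI"}] * 3
)
bookings = [{"cookie_id": "c1"}, {"cookie_id": "c1"}]

ratios = look_to_book(searches, bookings)
print(ratios)
```

Keeping the ratio per cookie id, rather than per corridor, is what lets the business single out customers who look a lot but rarely book.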
The “Tool Components” test: How do I know which components are relevant for my use case (columnar DBs, document databases, machine data tools, complex event processing etc.)?
The Big Data landscape is laden with tools: columnar databases (Infobright, Vertica), appliances (HANA, Exadata), complex event processing frameworks (S4), algorithmic libraries (R, Mahout etc.), machine data tools (Splunk), document databases (CouchDB, Lucene, MongoDB) etc. There is very little guidance on which scenarios require which kind of constructs. So a very pertinent question would be: what is the decision tree I need to use to arrive at the architectural constructs required to deliver my business use case?
The “Chunk” test: Are we delivering a high impact business output in 60-90 days?
In most organisations with a traditional DW mindset, it is not uncommon for the first deliverable to be exposed to the business 8-12 months after the start of a project. While executing a Big Data project, it makes sense to “chunk” the use cases into 60-90 day deliverables, so that it builds momentum with the business and accelerates the much-needed funding to set up

The “Co-existence” test: What’s your co-existence strategy with traditional BI solutions?
Even though new age big data solutions have dramatically increased performance expectations and information handling capability, that does not mean the end of traditional BI solutions. One must have a co-existence strategy with traditional BI solutions, as those data processes have a lot of embedded business rules and one should not spend money recreating them. So how can the new age Big Data solutions co-exist with existing BI solutions and other components in our existing IT ecosystem?

Data scientists at Flutura Decision Sciences have seen the importance of managing weak links in a Big Data implementation by asking 5 important questions. To summarize, the 5 key questions to ask are:
1. “Dent” test: What is the $ denting use case using the big data stack?
2. “Intersect” test: Which event data streams are we finding value in?
3. “Tool Components” test: Which Big Data components are required and when?
4. “Chunk” test: Are we delivering a high impact business output in 60-90 days?
5. “Co-existence” test: How do we co-exist with existing DW/BI solutions in place?
More than a century ago, Louis Pasteur made a profound statement: “Chance favours the prepared mind”. Flutura Decision Sciences strongly believes this statement holds true even in today’s “data soaked” world: when organisations embark on Big Data solutions, the 5 key Big Data questions pave the way for a successful implementation.