As big data use cases proliferate in telecom, health care, government, Web 2.0, retail, etc., there is a need to create a library of big data workload patterns. Flutura has created a set of big data workload design patterns to help map out common solution constructs. Eleven distinct workloads are showcased, each reflecting patterns common to many business use cases.
A big data workload design pattern is a template for identifying and solving commonly occurring big data workloads. The big data workloads stretching today's storage and computing architectures could be human generated or machine generated. A given design pattern may manifest itself in many domains, such as telecom or health care, but irrespective of the domain in which it appears, the same solution construct can be used. Once the set of big data workloads associated with a business use case is identified, it is easy to map out the architectural constructs required to service them - columnar stores, Hadoop, name-value (key-value) stores, graph databases, complex event processing (CEP) and machine learning processes.
Data Workload-1: Synchronous streaming real-time event sense-and-respond workload
It essentially consists of matching incoming event streams with predefined behavioural patterns and, as those signatures unfold in real time, responding to them instantly.
Example: In a registered-user digital analytics scenario, one specifically examines the last 10 searches done by a registered digital consumer, so as to serve a customized and highly personalized page consisting of the categories he/she has been digitally engaged with. Also, depending on whether the customer has done a price-sensitive search or a value-conscious search (which can be inferred by examining the search order parameter in the click stream), one can render budget items first or luxury items first.
Similarly, let's now switch over to a health care situation. In hospitals, patients are tracked across three event streams – respiration, heart rate and blood pressure – in real time. (An ECG is said to record about 1,000 observations per second.) These event streams can be matched for patterns which indicate the beginnings of fatal infections, so that medical intervention can be put in place.
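To make the sense-and-respond construct concrete, here is a minimal Python sketch of the registered-user example above. The event fields (user_id, sort_order) and the in-memory window are illustrative assumptions; a production system would typically push this matching into a CEP engine.

    from collections import defaultdict, deque

    WINDOW = 10  # last N searches per user, per the example above

    # hypothetical in-memory state; a CEP engine would hold this in practice
    last_searches = defaultdict(lambda: deque(maxlen=WINDOW))

    def render_page(user_id, budget_items_first):
        print(f"user {user_id}: budget items first = {budget_items_first}")

    def on_search_event(event):
        """Called for every incoming click-stream search event."""
        window = last_searches[event["user_id"]]
        window.append(event)
        # 'sort_order' stands in for the search order parameter noted above
        price_sensitive = sum(1 for e in window if e.get("sort_order") == "price_asc")
        render_page(event["user_id"], budget_items_first=price_sensitive > len(window) / 2)

    on_search_event({"user_id": "u1", "sort_order": "price_asc"})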
Data Workload-2: Ingestion of high-velocity events - insert only (no update) workload
This is a unique workload widely experienced while ingesting terabytes of sensor and machine generated data. These are insert-only workloads, with no update or lookup traffic.
Example: Ingesting millions of micro events streaming from log files, firewall alarms, sensor data, and the click stream data torrent. It is estimated that a Boeing aircraft has the potential to generate 200 terabytes of data on a single flight - data from vibration sensors, temperature sensors, strain gauges, position, speed, etc. Imagine ingesting all this data for all the flights!
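A minimal sketch of the insert-only shape of this workload, assuming a newline-delimited log segment as the sink (the batch size and file layout are illustrative, not part of the pattern itself):

    import json

    BATCH_SIZE = 10_000  # illustrative; tuned to the ingest rate in practice

    class AppendOnlyIngestor:
        """Batches high-velocity events and appends them to a log segment."""

        def __init__(self, path):
            self.path = path
            self.buffer = []

        def ingest(self, event):
            self.buffer.append(event)
            if len(self.buffer) >= BATCH_SIZE:
                self.flush()

        def flush(self):
            # append-only: open in 'a' mode; nothing is ever updated or read back
            with open(self.path, "a") as segment:
                for event in self.buffer:
                    segment.write(json.dumps(event) + "\n")
            self.buffer.clear()

    ingestor = AppendOnlyIngestor("sensor_events.log")
    ingestor.ingest({"sensor": "strain_gauge_7", "value": 0.83})
    ingestor.flush()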
Workload-3: High-node-count social graph traversal
This is a unique workload where finding the interrelationships around nodes in a network is vital. This workload is computation and read intensive, as node statistics need to be computed and the children of a node need to be read dynamically.
Example: In the telecom industry, where there are millions of prepaid and postpaid subscribers, the CDR (Call Detail Record) data consists of terabytes of switch logs which contain important patterns regarding the interrelationships between subscribers. These can be mined using graph databases to understand whether certain newly downloaded gaming applications or apps are going viral within friends and family circles, via computation-intensive graph traversals.
Similarly, in social websites, millions of interrelationships are stored as a graph, and one needs to traverse large, complex graphs to map the key influencers capable of swaying a marketing outcome, or to recommend a friend and so expand the social network out to its edges.
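A minimal sketch of this traversal-heavy read pattern, using a toy in-memory adjacency list in place of a real graph database; the two-hop "influence score" is an illustrative stand-in for real node statistics.

    from collections import deque

    # toy adjacency list standing in for a CDR-derived subscriber graph
    graph = {
        "alice": ["bob", "carol"],
        "bob": ["alice", "dave"],
        "carol": ["alice"],
        "dave": ["bob"],
    }

    def neighbourhood(start, max_hops):
        """Breadth-first traversal: each node's children are read dynamically."""
        seen, frontier = {start}, deque([(start, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if hops == max_hops:
                continue
            for child in graph.get(node, []):
                if child not in seen:
                    seen.add(child)
                    frontier.append((child, hops + 1))
        return seen

    # a crude influencer score: the size of each node's 2-hop neighbourhood
    scores = {n: len(neighbourhood(n, 2)) for n in graph}
    print(max(scores, key=scores.get))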
Workload-4: 'Needle in a haystack' workloads
Looking for a small string or attribute value in terabytes of data spanning multiple attributes is a very common read workload, specifically in machine data use cases.
Example: While processing terabytes of sensor data from engines, one may look for the specific temperature and RPM conditions behind an automobile breakdown. Similarly, security specialists investigating a network breach incident may wade through streams of granular log data from multiple devices before homing in on the crucial events that give clues about the cause of the attack.
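A minimal sketch of such a scan, assuming a hypothetical comma-separated telemetry layout of sensor id, temperature and RPM; the thresholds are illustrative.

    def needle_scan(lines, temp_threshold=110.0, rpm_threshold=6000):
        """Stream through granular sensor records, keeping only the rare hits."""
        for line in lines:
            try:
                sensor_id, temp, rpm = line.split(",")
                if float(temp) > temp_threshold and int(rpm) > rpm_threshold:
                    yield line  # the 'needle': a record matching the breakdown signature
            except ValueError:
                continue  # skip malformed records rather than abort the scan

    sample = [
        "engine_1,92.4,3100",
        "engine_1,118.7,6450",  # the needle
    ]
    for hit in needle_scan(sample):
        print(hit)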
Workload-5: Multiple event stream mash-up and cross-referencing of events across streams
Usually, events in isolation may not have significance, but taken together as a string of events occurring on a timeline, their importance amplifies, especially across multiple event streams.
Example: In telecom there is a need to mash up firewall events on a timeline along with router events, to detect the patterns of a distributed denial of service (DDoS) attack.
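A minimal sketch of the mash-up, assuming both streams arrive time-ordered as (timestamp, source, message) tuples; the five-second correlation window and the event payloads are invented for illustration.

    import heapq

    # toy stand-ins for time-ordered firewall and router event streams
    firewall = [(100, "fw", "SYN flood alarm"), (161, "fw", "SYN flood alarm")]
    router = [(103, "rt", "queue overflow"), (165, "rt", "queue overflow")]

    WINDOW = 5  # seconds; illustrative correlation window

    def correlated(stream_a, stream_b):
        """Mash both streams onto one timeline and flag cross-stream pairs."""
        merged = list(heapq.merge(stream_a, stream_b))  # single shared timeline
        for (t1, src1, msg1), (t2, src2, msg2) in zip(merged, merged[1:]):
            if src1 != src2 and t2 - t1 <= WINDOW:
                yield (t1, msg1, t2, msg2)

    for pair in correlated(firewall, router):
        print("possible DDoS signature:", pair)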
Workload-6: Text indexing workload on large volumes of semi-structured data
While processing semi-structured data, tools like Lucene need to index the strings.
Example: In medical scenarios, one needs to identify all encounters of a patient with the doctor that contain specific disease keywords, and then analyze the health outcome of the patient.
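A toy inverted index illustrates the construct that Lucene provides at scale; the encounter notes and disease keyword below are invented for the sketch.

    from collections import defaultdict

    def build_index(documents):
        """Toy inverted index: term -> set of document ids containing it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    # hypothetical doctor-encounter notes keyed by encounter id
    encounters = {
        "enc-01": "patient reports chest pain consistent with angina",
        "enc-02": "routine follow-up, blood pressure stable",
    }

    index = build_index(encounters)
    print(index["angina"])  # all encounters mentioning the disease keyword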
Workload-7: Looking for the absence of events in event streams within a moving time window
While most pattern detection looks for the behaviours/patterns actually exhibited, it also makes sense to look out for the ABSENCE of specific events across moving time windows, as it may alert us to a risk or a revenue opportunity.
Example: In an online travel website, it is important to sort through the avalanche of log file data flowing in and isolate the search instances which did NOT result in a booking event. So we are traversing a moving time window looking for a sequence of search events which have no corresponding booking event.
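A minimal sketch of absence detection over such a stream, with an illustrative 60-second booking window and invented event tuples.

    # toy click-stream: (timestamp_seconds, user_id, event_type), time-ordered
    events = [
        (0, "u1", "search"), (40, "u1", "book"),
        (90, "u2", "search"), (130, "u2", "search"),
    ]

    TIMEOUT = 60  # seconds; illustrative booking window

    def searches_without_booking(stream, now):
        """Flag users whose search window expired with no booking event."""
        pending = {}  # user_id -> timestamp of earliest unanswered search
        for ts, user, kind in stream:
            if kind == "search":
                pending.setdefault(user, ts)
            elif kind == "book":
                pending.pop(user, None)  # a booking cancels the pending search
        return [u for u, ts in pending.items() if now - ts > TIMEOUT]

    print(searches_without_booking(events, now=200))  # -> ['u2']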
Workload-8: High-velocity, concurrent inserts and updates workload
It is very common to have thousands of users across the world updating or inserting records in booking or gaming applications.
Example: Thousands of flight bookings and online payment transactions.
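A minimal sketch of the concurrency concern, using per-key locks in a single process; a real booking system would lean on the database's own concurrency control rather than application locks.

    import threading

    class SeatInventory:
        """Toy bookings store: many concurrent writers, per-flight locking."""

        def __init__(self, seats):
            self.seats = dict(seats)  # flight -> seats remaining
            self.locks = {f: threading.Lock() for f in self.seats}

        def book(self, flight):
            with self.locks[flight]:  # serialize updates to one flight
                if self.seats[flight] > 0:
                    self.seats[flight] -= 1
                    return True
                return False  # sold out

    inventory = SeatInventory({"BA123": 2})
    threads = [threading.Thread(target=inventory.book, args=("BA123",)) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(inventory.seats["BA123"])  # never drops below 0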
Workload-9: Semi-structured and unstructured data ingestion
It is said that 80% of the world's information is unstructured, and bringing it into repositories for analysis may yield previously untapped intelligence.
Example: Medical records – X-ray and ECG results need to be digitized (unstructured), and the doctor's observations on the patient (semi-structured) need to be recorded.
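A minimal sketch of one way to pair the two kinds of data in a single ingestible record; the field names and the base64 encoding of the image bytes are illustrative assumptions.

    import base64
    import json

    def make_record(patient_id, xray_bytes, notes):
        """Pair an unstructured artifact with semi-structured observations."""
        return {
            "patient_id": patient_id,
            "xray": base64.b64encode(xray_bytes).decode("ascii"),  # unstructured blob
            "notes": notes,  # semi-structured: free text plus a few known fields
        }

    record = make_record(
        "p-42",
        b"\x89fake-image-bytes",
        {"author": "Dr. Rao", "text": "mild opacity in left lung, follow up in 2 weeks"},
    )
    print(json.dumps(record)[:80], "...")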
Workload-10: Sequence analysis workloads
It is very common to chunk pieces of events together and examine whether there are patterns which tell a story about the problem context.
Example: In genome and life sciences, DNA sequencing is crucial. Similarly, in the telecom industry there are many dropped calls at a switch which need to be analyzed using sequence analysis processes to understand the events leading up to that outcome of interest.
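A minimal sketch of one simple form of sequence analysis: counting the event n-grams that immediately precede the outcome of interest. The switch event codes below are invented for illustration.

    from collections import Counter

    # toy switch log: one event code per step, time-ordered
    events = ["setup", "handover", "weak_signal", "drop",
              "setup", "weak_signal", "drop",
              "setup", "talk", "teardown"]

    def sequences_before(stream, outcome, length=2):
        """Count the event n-grams that immediately precede the outcome."""
        counts = Counter()
        for i, e in enumerate(stream):
            if e == outcome and i >= length:
                counts[tuple(stream[i - length:i])] += 1
        return counts

    print(sequences_before(events, "drop").most_common(3))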
Workload-11: Chain-of-thought ad hoc workload for data forensic work
This workload is primarily triggered by power users or analysts - the 'Data Marco Polos' exploring large oceans of data with questions not previously thought of. They cast a wide net and often come up with few patterns, but when they do identify a pattern, it has huge repercussions for the organisation.
Example: Pricing analysts want to investigate consumer behaviour before they price a service. They may have a series of hypotheses to test, in a certain sequence, before arriving at the optimal price point. Similarly, infrastructure specialists may want to confirm or reject hypotheses about the effect of newly launched apps on digital traffic, by sequencing a specific set of hypotheses regarding app engagement and its effect on network infrastructure load.
So far we have seen a draft articulation of the workload patterns. It is our endeavour to make the list collectively exhaustive and mutually exclusive in subsequent iterations.
As Leonardo da Vinci said, “Simplicity is the ultimate sophistication.” Big data workload design patterns help simplify the decomposition of business use cases into workloads. The workloads can then be mapped methodically to the various building blocks of a big data solution architecture. Yes, there is a method to the madness. :)