Ad Hoc Data Analysis on Big Data Sets
The potential for ad hoc analytics is growing as technologies such as Hadoop make it practical to query big data sets. In this short article you'll see that potential through an example of creating a 360-degree view of a customer. You'll also learn why ad hoc analytics on an on-premise Hadoop deployment limits performance, scalability and usability, and why the cloud is a real game changer: it enables fast, interactive queries that business analysts can run themselves, with capacity that scales on demand.
What is ad hoc analysis on big data sets?
Ad hoc analytics is the discipline of analyzing data on an as-needed or requested basis. It has always been challenging, and running it on big data sets rather than relational databases adds a new layer of complexity because of larger data volumes, faster data velocity, greater data variety and more sophisticated data models.
What is driving ad hoc analysis on big data sets?
Organizations are experiencing an increasing need to enable ad hoc analytics on big data sets to optimize their sales and marketing initiatives, discover new revenue opportunities, enhance customer service and improve operational efficiency.
Let's look at how ad hoc analytics on big data sets helps build a 360-degree view of customers for an organization trying to understand why its customer churn has increased. By querying its structured, internal data, the company can determine things like which products are losing customers, which price changes might have driven defection, and how customer service metrics have changed. But these only tell part of the story.
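As a rough illustration of this kind of structured query, the sketch below uses Python's built-in sqlite3 module as a stand-in for whatever SQL engine holds the internal data. The `customers` table, its columns and the sample rows are all hypothetical, invented for the example.

```python
import sqlite3

# Hypothetical internal data: one row per customer, with the product they
# bought and whether they have since cancelled (churned = 1).
ROWS = [
    ("c1", "basic", 1),
    ("c2", "basic", 1),
    ("c3", "basic", 0),
    ("c4", "premium", 0),
    ("c5", "premium", 1),
    ("c6", "premium", 0),
    ("c7", "premium", 0),
]


def churn_by_product(rows):
    """Return (product, customers, churn_rate) tuples, highest churn first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id TEXT, product TEXT, churned INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT product,
               COUNT(*) AS customers,
               ROUND(AVG(churned), 2) AS churn_rate
        FROM customers
        GROUP BY product
        ORDER BY churn_rate DESC
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for product, customers, churn_rate in churn_by_product(ROWS):
        print(product, customers, churn_rate)
```

The point of the example is the shape of the question, not the engine: an analyst decides on the spot to group churn by product, and could just as easily regroup by price tier or region.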
An extended 360-degree view using ad hoc analytics on big data sets allows the organization to bring in additional unstructured information, both internal and external to the company, to understand other factors relevant to customer churn. Data like social media comments, the results of customer satisfaction surveys, call detail records for customer service, and complaints received via email help the company fully understand and respond to its high rate of customer churn.
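To make the idea concrete, here is a minimal sketch of combining free-text feedback with a structured churn flag. Everything in it is an assumption made for illustration: the sample comments, the `CHURNED` lookup and the keyword-to-topic map. A real pipeline would likely use far richer text analysis than keyword matching.

```python
from collections import Counter

# Hypothetical free-text feedback keyed by customer id (social posts, emails).
COMMENTS = [
    ("c1", "Support never answered my email, cancelling soon"),
    ("c2", "The price went up again and the app is slow"),
    ("c3", "Love the new dashboard"),
    ("c1", "Still waiting on support"),
]

# Hypothetical structured record of which customers have churned.
CHURNED = {"c1": True, "c2": True, "c3": False}

# Hypothetical keyword-to-topic map used to tag each comment.
TOPICS = {"support": "service", "price": "pricing", "slow": "performance"}


def topics_among_churned(comments, churned, topics):
    """Count how often each topic appears in comments from churned customers."""
    counts = Counter()
    for customer_id, text in comments:
        if not churned.get(customer_id, False):
            continue  # only look at customers who actually left
        words = {w.strip(".,!?").lower() for w in text.split()}
        for keyword, topic in topics.items():
            if keyword in words:
                counts[topic] += 1
    return counts


if __name__ == "__main__":
    print(topics_among_churned(COMMENTS, CHURNED, TOPICS).most_common())
```

Joining the two sources on customer id is what turns scattered comments into an explanation for the churn numbers seen in the structured data.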
Why is ad hoc analysis difficult with on-premise Hadoop?
Most businesses processing ad hoc analytics on big data sets use Hadoop because it's designed to handle huge volumes of data across many nodes that can be added as needed. It leverages parallel processing across commodity servers, making it more affordable to scale than other options for big data. Plus, Hadoop accommodates any data format, requires no specialized schema and provides high availability.
With all of this data in one place, you can see:
- What customers are saying about your company
- How customers interact with you
- How well you serve customers
- How to improve every interaction
Armed with this type of information, businesses can not only understand how customers make decisions but also help marketers divide markets into smaller groups. This process, called micro-segmentation, is a more granular form of segmentation that usually separates potential customers by demographics or psychographics. More refined segmentation allows marketers to create more targeted and effective ads and messaging.
However, on-premise Hadoop deployment presents many challenges for ad hoc analysis. Here, we'll focus on a few of the major ones.
Batch Processing
Hadoop and its primary programming model, MapReduce, are designed for batch-oriented processing of big data sets. Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. It's considered one of the de facto tools for Hadoop because it provides a SQL-based query language that makes it easy to query big data sets. However, queries run with Hive are usually very slow when it relies on MapReduce to execute them.
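To see why the batch model adds latency, it helps to picture how a simple aggregate is carried out. The sketch below is plain Python imitating the map, shuffle and reduce phases for a query along the lines of `SELECT product, COUNT(*) ... GROUP BY product`. It is not Hive or Hadoop code, and the records are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, 1) pair for every input record."""
    for record in records:
        yield record["product"], 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_product(records):
    """Run the three phases in order, one complete pass after another."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical input records.
RECORDS = [{"product": "basic"}, {"product": "premium"}, {"product": "basic"}]

if __name__ == "__main__":
    print(count_by_product(RECORDS))
```

On a real cluster each phase runs as distributed tasks that are scheduled, started and fed intermediate data over disk and network, so even a small query pays a fixed overhead before any result comes back.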
To get fast, interactive query performance, organizations typically run in-memory or low-latency engines such as Apache Spark and Presto (originally developed at Facebook) alongside Hadoop. Unfortunately, these open source tools can be very difficult for some organizations to deploy and support.
Inelastic Processing
With on-premise deployments, fixed clusters mean that ad hoc data queries can easily run out of capacity or take far too long to process. As a result, companies either limit the number of queries they run, avoid running queries during peak usage, or overspend by over-provisioning capacity to guarantee acceptable performance under any condition.
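The over-provisioning trade-off can be sketched with simple arithmetic. The demand figures and the node-hour cost below are hypothetical, and the sketch assumes for simplicity that a node-hour costs the same whether the cluster is fixed or elastic.

```python
def fixed_cluster_cost(hourly_demand, node_hour_cost):
    """Cost of sizing a fixed cluster for the peak hour and running it throughout."""
    return max(hourly_demand) * len(hourly_demand) * node_hour_cost


def elastic_cluster_cost(hourly_demand, node_hour_cost):
    """Cost of paying only for the nodes each hour actually needs."""
    return sum(hourly_demand) * node_hour_cost


# Hypothetical number of nodes needed in each hour of an 8-hour working day.
DEMAND = [2, 2, 4, 10, 10, 4, 2, 2]
# Hypothetical cost of one node for one hour, in arbitrary units.
NODE_HOUR_COST = 1.0

if __name__ == "__main__":
    print("fixed:", fixed_cluster_cost(DEMAND, NODE_HOUR_COST))
    print("elastic:", elastic_cluster_cost(DEMAND, NODE_HOUR_COST))
```

With these made-up numbers the fixed cluster pays for 80 node-hours to cover a workload that needs only 36. Sizing it smaller would close that gap, but only by leaving the two peak hours short of capacity.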
Specialized Skill Requirements
Creating and executing ad hoc analytics in an on-premise Hadoop environment requires developers and data scientists with specialized MapReduce, Pig and Hive skills. In addition, users often need technical assistance to start and stop clusters every time they run a query.
Why is ad hoc data analytics easier in the cloud?
A Big Data as a Service (BDaaS) solution, available on Amazon Web Services, Google Compute Engine and Microsoft Azure, removes the ad hoc analysis obstacles associated with on-premise Hadoop. In fact, we advocate "Everything as a Service": MapReduce, Hive, Pig, Oozie and Sqoop, plus Apache Spark and Presto, open source cluster computing frameworks for interactive queries on data stored in Hive, HDFS, HBase and Amazon S3. Our solution allows users to launch and provision Spark or Presto clusters and start running queries in minutes.