At 3Cloud we’re getting a lot of questions from our clients about social sentiment analysis, one of the hottest solution areas for Big Data. In our latest blog article on Real-world Big Data, we’ll provide a basic understanding of how tools and technologies for unstructured data, such as Hadoop, are used in combination with SQL Server to gain insights from social media. These solutions help marketing, sales, and executive teams understand what customers are saying about their company, products, and services.
Today’s consumers are heavily involved in social media, and many hold accounts on multiple social media services. Social media gives users a platform to communicate with friends, family, and colleagues, and also a platform to talk about their favorite (and least favorite) brands. This “unstructured” conversation can give businesses valuable insight into how consumers perceive their brand, and allows them to make informed business decisions to maintain their image.
Historically, unstructured data has been very difficult to analyze using traditional data warehousing technologies. New cost-effective solutions, such as Hadoop, are changing this and making data of high volume, velocity, and variety much easier to analyze. Hadoop is a massively parallel technology designed to be cost-effective by running on commodity hardware. Today businesses can use Microsoft’s Hadoop implementation, HDInsight, together with SQL Server 2012 to understand and analyze unstructured data, such as social media feeds, alongside existing Key Performance Indicators.
Where do I start?
Social networks like Twitter and Facebook manage hundreds of millions of interactions each day. Because of this large volume of traffic, the first step in analyzing social media is to understand the scope of the data that needs to be collected. Quite often the data can be limited to certain hashtags, accounts, and keywords. Posts can then be downloaded and loaded into Hadoop using familiar tools like SQL Server Integration Services, or purpose-built tools like Apache Flume. Data can often be gathered for free directly from a social media service’s public application programming interfaces (APIs), though these sometimes come with limitations, or from an aggregation service, such as DataSift, which pulls many sources together into a standard format.
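As a rough illustration of limiting the scope, the sketch below filters a batch of downloaded posts down to those that mention a tracked hashtag, account, or keyword. It is a minimal sketch, not a production collector: the tracked terms, the `is_relevant` and `filter_posts` helpers, and the assumption that each post is a dictionary with a `text` field are all hypothetical and would need to match whatever format your feed or aggregation service actually delivers.

```python
import re

# Hypothetical tracking lists; real values depend on the brand being monitored.
TRACKED_HASHTAGS = {"#examplebrand"}
TRACKED_ACCOUNTS = {"@examplebrand"}
TRACKED_KEYWORDS = {"examplebrand", "examplephone"}
TRACKED_TERMS = TRACKED_HASHTAGS | TRACKED_ACCOUNTS | TRACKED_KEYWORDS

# Matches plain words as well as #hashtags and @accounts.
TOKEN = re.compile(r"[#@]?\w+")


def is_relevant(text):
    """Return True if the post text contains at least one tracked term."""
    tokens = {token.lower() for token in TOKEN.findall(text)}
    return bool(tokens & TRACKED_TERMS)


def filter_posts(posts):
    """Keep only posts in scope (assumes each post is a dict with a 'text' field)."""
    return [post for post in posts if is_relevant(post["text"])]


if __name__ == "__main__":
    sample = [
        {"text": "Loving my new #ExampleBrand laptop"},
        {"text": "Nice weather today"},
        {"text": "Hey @examplebrand, my order is late"},
    ]
    for post in filter_posts(sample):
        print(post["text"])
```

Filtering this early keeps the volume loaded into Hadoop manageable. In practice the same terms would usually also be passed to the source API or aggregation service so that out-of-scope posts are never downloaded at all.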
Once the data is loaded into Hadoop, the next step is to transform it into a format that can be used for analysis. Data transformation in Hadoop is performed using a process called MapReduce. MapReduce jobs can be written in a number of programming languages, including .NET, Java, Python, and Ruby, or can be generated by tools such as Hive (a SQL-like language for Hadoop that many data analysts would be immediately comfortable with) or Pig (a procedural scripting language for Hadoop). MapReduce allows us to take unstructured data and transform (map) it into something meaningful, and then aggregate (reduce) it for reporting. All of this happens in parallel across the nodes in the Hadoop cluster.
A simple example of MapReduce could map social media posts to a list of words with a count of their occurrences, and then reduce that list to a count of occurrences of each word per day. In a more complex example, we can use a dictionary in the map process to cleanse the social media posts, and then use a statistical model to determine the tone of an individual post.
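The simple word-count example can be sketched in plain Python to show the shape of the two phases. This runs in a single process and only imitates what Hadoop would distribute across a cluster. The post layout (a dictionary with `created_at` as an ISO 8601 timestamp and a `text` field) and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Lowercase words, digits, and apostrophes; punctuation is dropped.
WORD = re.compile(r"[a-z0-9']+")


def map_post(post):
    """Map step: emit ((day, word), 1) for every word in one post."""
    day = post["created_at"][:10]  # assumes ISO 8601, e.g. 2013-05-01T10:00:00Z
    for word in WORD.findall(post["text"].lower()):
        yield (day, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (day, word) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def word_counts_per_day(posts):
    """Run the map step over every post, then reduce the combined output."""
    pairs = (pair for post in posts for pair in map_post(post))
    return reduce_counts(pairs)


if __name__ == "__main__":
    posts = [
        {"created_at": "2013-05-01T10:00:00Z", "text": "Love the new phone"},
        {"created_at": "2013-05-01T12:30:00Z", "text": "love it!"},
        {"created_at": "2013-05-02T09:15:00Z", "text": "Phone arrived broken"},
    ]
    for (day, word), count in sorted(word_counts_per_day(posts).items()):
        print(day, word, count)
```

On a real cluster, the map function runs independently on each node's share of the posts and the framework groups the emitted keys before the reduce step, which is what lets the work scale out. The same aggregation could also be expressed as a GROUP BY query in Hive.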
So now what?
Once MapReduce has done its magic, the meaningful data now stored in Hadoop can be loaded into an existing enterprise business intelligence (BI) platform or analyzed directly using powerful self-service tools like PowerPivot and Power View. Customers using SQL Server as their enterprise BI platform have a variety of options for accessing their Hadoop data, including Sqoop, SQL Server Integration Services, and PolyBase (SQL Server PDW 2012 only).
Having social media data loaded into an existing enterprise BI platform allows dashboards to be created that give at-a-glance information on how customers feel about a brand. Imagine how powerful it would be to visualize how customer sentiment is affecting top-line sales over time! This type of analysis gives businesses the insight needed to adapt quickly, and it’s all made possible through Hadoop.