Recently I’ve been talking with a number of customers with similar problems. They have systems that generate high volumes of transaction-level data, but they are unable to collect and store that data in order to analyze it. For example, most manufacturers have sensors on their production lines. Every piece of equipment has one or more sensors collecting data about the operation. Information like tolerances, temperatures and cycle times all give valuable insight into how the process is going. Very often, these systems are measuring multiple times a minute or even a second. This data is used to keep track of processes as they happen, but because the data volume is so high, it is either archived or discarded after it is a week or even just a day old. At that volume, it’s not really feasible to keep 1, 5 or 10 years’ worth of it around. But if you had that deep history, imagine how much insight you could gain into your processes. Trend analysis could detect anomalies that might be missed in a span of days but stick out over months, and those anomalies could indicate places where you could improve processes and quality.

Manufacturing is one example but many other types of businesses have data that they regularly throw away or ignore because they don’t think it’s relevant to business operations or there is too much of it in a format that is hard to deal with. Web server and other software logs usually accumulate until someone decides to either delete them or archive them. How about survey results? Companies do lots of surveys and then act on the outcome. Do you keep the data accessible so that you can combine past results with new ones and analyze them over time? Do you cross-compare them with your sales results? How about clicks, tweets and likes? Do you have customer comments or product reviews on your web site? Can you correlate them over time and show trends in sentiment toward your products and demonstrate how they impact sales?

The promise of big data is that it will give you access to all of your data no matter what the volume, source or format. But many companies are saying things like “we want to look into big data but we don’t have time/resources/money right now”. OK, so you’re going to do big data in six months. You still don’t have to throw all that valuable data away between now and then. Those log files, process streams and social feeds have value. Don’t just arbitrarily say “we’ll get to it when we can”. Start collecting that data so that when you are ready to try some analysis on it, you’ll already have built up a stockpile.

“But wait!” I hear you saying. “IT told me it’s too expensive to keep all that data around! They told me all that disk space costs too much!” IT has a point. Disk space in the data center is limited and it can be costly to add more. One great way to mitigate that is to use cloud storage like Azure Blob Storage. If you’re new to the concept of cloud beyond the buzzword, it really just means computing services that are hosted by a service provider and made available over the Internet. With cloud-based services, you don’t have to buy any infrastructure or software; you just pay a monthly fee. Microsoft Azure is Microsoft’s umbrella for its cloud-based offerings, one of which is data storage. The term “Blob” (written correctly it’s BLOb) is an acronym for Binary Large Object. It just means a collection of binary data that is stored together. So, put that all together and Azure Blob Storage is Microsoft’s cloud offering for storing large blocks of data. Whew!

Cloud storage has a lot of advantages over storage in the data center:

  • There’s nothing to buy up front. You just sign up for an account and pay monthly.
  • You only pay for what you use. You can reserve 100 terabytes if you want but if you only use one, you only pay for one.
  • Data is stored redundantly either multiple times in the same data center or multiple times around the world. You get to choose how safe you want it to be and you don’t have to worry about backups.
  • You can shut it down at any time if you decide you don’t need it and you’re not out any investment beyond what you paid monthly.
  • IT won’t keep asking you when you’re going to delete all those old files.

Using cloud storage is a bit different from using folders on your hard drive or network, but not too different. There are a number of tools available that make it as easy as dragging and dropping files around on your PC. One that is easy to use and free is CloudBerry Explorer. There are also ways to automate the transfer of files into cloud storage that you may want to explore.
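If you want a feel for what automating the transfer might look like, here is a minimal Python sketch. It assumes the `azure-storage-blob` package (version 12 or later) is installed and that you have a storage account connection string. The function names `make_container_client` and `upload_folder`, the container name and the folder path are hypothetical placeholders for illustration, not part of any Microsoft tool.

```python
from pathlib import Path


def make_container_client(connection_string: str, container_name: str):
    """Return an Azure container client (assumes azure-storage-blob v12+)."""
    # Imported here so the rest of the sketch loads even without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    return service.get_container_client(container_name)


def upload_folder(container, folder: Path, pattern: str = "*.csv") -> list:
    """Upload every file in `folder` matching `pattern`; return the names uploaded."""
    uploaded = []
    for path in sorted(folder.glob(pattern)):
        with path.open("rb") as handle:
            # overwrite=False asks the SDK to raise an error rather than
            # replace a blob that already exists under the same name.
            container.upload_blob(name=path.name, data=handle, overwrite=False)
        uploaded.append(path.name)
    return uploaded


# Hypothetical usage; the connection string, container and folder are placeholders:
# container = make_container_client("<your-connection-string>", "raw-data")
# upload_folder(container, Path("C:/exports/surveys"))
```

A script like this could be run on a schedule (nightly, for example) so that new files land in Blob storage without anyone having to remember to copy them.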

Don’t worry about the file formats, just start saving the files and let them pile up. Name them so that you can keep straight what’s in them. If they are log files, they’ll probably be dated anyway. Otherwise, make sure to include the source and date in the file name. An example might be “ProductSurveyResults-20140718.csv”. This collection becomes the beginning of your Data Lake.
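To make the source-plus-date naming convention concrete, here is a small Python sketch that builds names in the same shape as the example above. The function name `archive_name` and the rule for stripping unusual characters are my own assumptions for illustration; adapt them to whatever convention suits your files.

```python
import re
from datetime import date


def archive_name(source: str, day: date, extension: str = "csv") -> str:
    """Build a file name of the form '<Source>-<YYYYMMDD>.<ext>'."""
    # Keep only letters, digits, underscores and hyphens so the name
    # stays safe to use as a file or blob name (an assumed rule).
    safe_source = re.sub(r"[^A-Za-z0-9_-]", "", source)
    if not safe_source:
        raise ValueError("source must contain at least one letter or digit")
    # Year-month-day order means names sort chronologically.
    return f"{safe_source}-{day:%Y%m%d}.{extension.lstrip('.')}"


print(archive_name("ProductSurveyResults", date(2014, 7, 18)))
# prints: ProductSurveyResults-20140718.csv
```

Putting the year first in the date is a deliberate choice: when the files pile up, an alphabetical listing is also a chronological one.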

Once you have some data accumulated and you’re getting ready to think about using it, the really great news is that the analysis can all be done from the cloud too!

  • On a small scale, Microsoft’s Power Query can read data directly out of Blob storage into Excel, where you can build data models with Power Pivot and visualizations with Power View and Power Map. All of these tools fall under the heading of Power BI.
  • If you’re ready for a big data solution, Hadoop is the way to go. Microsoft offers their HDInsight Hadoop distribution in Azure. HDInsight is Hortonworks Hadoop optimized for the cloud. It is easy to set up and can read directly from Blob storage.
  • If a larger Hadoop cluster is what you’re after, it is possible to create a full Hadoop infrastructure in the cloud using Azure Virtual Machines and Blob storage.

It’s time to start thinking about big data. If you’re not, your competitors are. If you’re not ready to get started, at least stop throwing all of that valuable data away!

If you have questions about big data, Microsoft Azure or Hadoop or would like to have a conversation about your data analytics, please contact us.