A Beginner’s guide to Big Data, Analytics and Cloud.

I have been hearing a lot about the buzzwords in the title for a couple of years now, and luckily had a couple of opportunities to work on them over the past 6 months or so, thanks to my own consultancy, which made me read a few books and articles on these subjects and got me curious to know more.  With certainly no claim to being an expert in these areas, I have managed to gain a fair understanding that I thought I would share here purely for educational purposes, so that everyone who hears these buzzwords knows what they are all about and can have a good conversation around them.  I must thank the references below, as most of the writing here is just a highly edited summation of the details found in them, kept at the layman level.  Consider this a crisp primer for the uninitiated from both the technology and the consumption angle.

     Business Process Outsourcing (BPO), which made India its right-sourcing capital during the first decade of this century, has slowly moved on to the shores of the Philippines and Vietnam.  Now everyone seems to be talking about the Analytics and Cloud wave that has hit India, either through the typical right-sourcing to analytics companies here or through captive analytics centers for big multinational companies.  I certainly see the value being created in this wave as higher than in the BPO wave, and it looks like India can establish its credentials using its early-mover advantage.  Big names like HCL and IBM have won major contracts to maintain the entire IT departments of Fortune 500 companies on a cloud model, and this is a growing area.

     Digital economy (sales) is ready to surpass the physical economy.  Nowadays, all organizations are asking what their customers want and what they generally do.  With your private information on social media not exactly as private as you think, they want to know who your friends are and what they like, who influences you and whom you influence.  They have large quantities of data to analyze this information from, and to design a product, or a feature of a product, that you are bound to be happy with.  Companies are thinking on their feet, in real time, and are very quick to react to the feedback you have given them… at least good companies strive to do this and are placing their bets in this direction.  It is no longer about a group of customers or a cross-section of people that they want to study; they want to know YOU as a customer individually.  Oh how lovely you feel!

     Analytics is being used in both B2C and B2B, but the former is more challenging than the latter because predicting an end consumer’s buying behavior, which is usually emotional and irregular, is tough.  Businesses buy or consume in a more regular and rational fashion, usually following a well-known process.  What makes B2C modelling much harder is that the data here is more complex due to its volume and variety, as more than half of it is ‘unstructured’.


    Big Data is just ‘data that is so large that it cannot be processed by conventional methods’.  McKinsey defines Big Data as large data sets that cannot be captured or analyzed by typical database software tools.  So today’s Big Data may not be tomorrow’s Big Data: the tools will have caught up enough to analyze today’s Big Data, but tomorrow’s Big Data will be orders of magnitude larger, so the same problem remains.  Hence it is safe to say that we are just at the beginning of a data explosion.  Big Data is all about the Internet of Things, social and mobile put together.

   The industry has defined Big Data across three V characteristics: Volume, Variety and Velocity; sometimes a fourth gets added, Veracity.  Volume is measured by the sheer size of the data, variety refers to the assortment of data (structured vs unstructured), and velocity is the speed at which the data gets created or processed.  Veracity is about the accuracy of these huge data sets, the trust behind the data sources, and how to remove the noise to arrive at useful information that makes good business sense.  The data can be either machine generated or human generated.  Machine-generated data includes structured data such as sensor data, web log data and transaction data, and unstructured data such as satellite images, scientific data and multimedia.  Human-generated data includes structured data such as survey input and click-stream data, and unstructured data such as emails, social media data and SMS.  Each kind of unstructured data can be an analytics domain of its own, like text analytics, and lots of research is going into them.  Structured data is usually stored in tables in an RDBMS and can be queried through SQL.
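To make the point about structured data concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for an RDBMS. The table name, column names and sample rows are made up for illustration only.

```python
import sqlite3

def total_spend_by_customer(rows):
    """Load (customer, amount) rows into a table and total them per customer with SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT customer, SUM(amount) FROM transactions "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample data: structured rows fit neatly into columns.
sample_rows = [("asha", 120.0), ("ravi", 80.0), ("asha", 30.0)]
print(total_spend_by_customer(sample_rows))
```

Unstructured data such as an email body or a video has no such fixed columns, which is why it needs different tools.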

     Traditionally we knew only ‘structured’ data, the kind that can be put into a database.  For the past few years, thanks to the explosion of social media and smartphones, we have ‘unstructured’ data in the form of text (emails, SMS), multimedia (audio, video), (A)GPS and other location-based data, data from sensors, etc., that seems to be exploding daily.  This ‘unstructured’ data is becoming less private because you like to share it across social platforms, and corporations want to have a strong direct relationship with you based on it.  They want to do everything they can to acquire new customers and to retain and cross-sell to existing customers.  If you sneeze, the corporations catch a cold: this is how close they get to you.

    Analytics is the way in which corporates handle this complexity and speed of data to arrive at business value that gives them a competitive advantage.  Analytics is just an interface between this large data and the business model.  It uses mathematics to derive meaning from data.  Most of analytics has its roots in Google, Yahoo and Amazon, who are considered pioneers in this field and in the technology being used.  In the earlier days, analysis was done only on ‘samples’, smaller subsets of the data, discarding all the outliers, to make some predictions.  Nowadays, with the availability of affordable storage, networking and computing power, and even pay-as-you-use models, all the generated data gets analyzed to arrive at deeper and broader insights.  Since decisions are becoming more data-centric, it is imperative that there is proper transformation and cultural change across the corporation in terms of people, process and strong leadership.

   Big Data analytics has moved from being descriptive (using statistics on past information, i.e. Business Intelligence, to understand what happened) to inquisitive (why it happened) to predictive (using past information to predict future outcomes, i.e. data mining and forecasting what is likely to happen) to prescriptive (using past information to direct future results, i.e. optimization to arrive at what should happen).  The world has moved from models created from small ‘samples’ to using ALL the data to create more complex models and simulate evolving scenarios.  All these outcomes of information management, in the form of reports, dashboards or animated visualizations, get escalated to the senior leadership team to arrive at qualified decisions, which become the baseline for the way ahead for corporations.  The talent required to do all this modelling is essentially a combination of data scientists with a good maths, statistics and technology background and business managers with good economics, behavioral science and social skills.

   Cloud is just a means to provide shared computing resources on a pay-as-you-go basis, and in IT jargon it is often referred to as XaaS, where X can stand for I, H, P, S, etc.  IT services are seen as utilities and one pays only for the time the resources are being used, hence cloud is also referred to as utility computing.  Infrastructure as a Service (IaaS, also called Hardware as a Service or HaaS) is the most common of all cloud services and delivers computing resources on a rental basis.  Platform as a Service (PaaS) is a means by which tools and middleware get integrated with IaaS to provide a comprehensive, consistent platform.  Software as a Service (SaaS) is an application that is created and hosted by the developer in a multi-client mode and sits on top of a PaaS or an IaaS.  Cloud is essential for Big Data, be it private (owned and operated by the organization itself), public (owned and operated by a vendor) or hybrid (a combination of private and public).  Examples of IaaS would be Amazon EC2 (Elastic Compute Cloud) and RightScale, of PaaS would be Microsoft Azure, and of SaaS would be a CRM like Salesforce.com.  Google has also introduced Data as a Service (DaaS), where one can use the cloud to store and retrieve data.  Cloud computing still has some nagging issues of security, privacy and standardization (or lack thereof), which are slowly falling into place, and the old IT organization and the CIO role are being transformed to take this new paradigm into account.


    There are many Big Data technologies in use, but the most common today is the Apache Hadoop framework, an open-source platform for both storage and processing of all data varieties.  The two critical components of Hadoop are the Hadoop Distributed File System (HDFS), used for storage, and MapReduce, which does the analysis on the data, both in a distributed fashion.

     MapReduce was originally designed by Google; it distributes the problem across many machines and later aggregates the results in batch mode.  Hadoop’s HDFS was modelled on the Google File System (GFS), Google’s distributed storage system, while Google’s Bigtable inspired HBase, a database that runs on top of HDFS.
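The "distribute, then aggregate" idea can be sketched on a single machine in a few lines of Python. This is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are made up for illustration, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: aggregate the values of each key (here, sum the counts)."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    """Run the whole pipeline; each document could be mapped on a different machine."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))

# Hypothetical input: two tiny "documents".
print(word_count(["big data is big", "data about data"]))
```

Because each document is mapped independently, adding more machines lets the same job handle more data, which is the property that makes this model attractive for Big Data.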

    Hardware, networking and storage have become more affordable and are constantly getting cheaper, enabling distributed computing in a big way.  Cloud gives you all of these through a subscription-based service, with no upfront capital or maintenance costs.

     Open source software is key.  It was made more prevalent by Google through its Android mobile OS and is the way forward for any new technology to be embraced quickly: the ecosystem builds up around open source efficiently and quickly, and is thus able to deliver all sorts of solutions at a very low cost.  Smaller companies seem to be more agile than the big software vendors in delivering a solution for a customer need, and this is creating competition where size does not matter.  Software has moved from a classic licensing model to a royalty-based model to an annual-fee-based model, benefiting the end user, who always has the latest version to work with.

   Distributed computing is a fundamental technology that allows independent computing resources to be networked seamlessly together across a huge geographical area so that they look like one single coherent environment.  The resources being shared can range from computing entities to memory to networks to storage, but they all have to work together to execute a program.  Over the years, distributed computing has evolved from mainframe computing, where a large computer with multiple processors and massive I/O capacity was used for batch and transactional processing, to cluster computing, where several cheap commodity machines were connected by a high-bandwidth network and controlled by specific software tools for parallel computing, to grid computing, an evolution of clusters in which grids are an aggregation of geographically dispersed clusters connected by the Internet and users can ‘consume’ resources just like any other utility.

   Distributed computing can be regarded as a superset of parallel computing, the latter implying a tightly coupled system of mostly homogeneous components sharing the same physical memory.  Distributed computing encompasses all architectures that use heterogeneous computing elements that are not necessarily co-located.  The difference between the two is getting blurred, as the terms are often used loosely to mean the same thing: both are used to perform multiple activities in parallel.  Since in Big Data the data complexity is high due to its volume, variety and distribution, and the computational complexity may also be high, distributed heterogeneous computing fits well for statistical models and simulations.  Cloud technologies support Big Data well by providing large computing resources on demand, large storage for keeping the data, and frameworks for optimized processing of large amounts of data.

     The foundation of cloud computing is virtualization, which separates resources and services from the underlying physical system.  This logical split can happen at the server level, through a thin software layer on top of the hardware called a virtual machine monitor (VMM) or hypervisor; at the application level, to make applications OS independent; at the memory level, where memory gets decoupled from the server; at the networking level, through software that makes a pool of connectivity available; or at the storage level.  The abstraction that virtualization gives provides just the relevant information and hides the exact details that may not be relevant, and it makes applications portable across different hardware and software environments.  Although not the same thing, this software abstraction is loosely similar to the green-font hardware terminals called ‘XTerm’ used by DEC and Sun during the 1980s, which acted as front ends for the servers that did the computing at the back.  The most common technologies used here are Xen, VMware and Microsoft Hyper-V.


   Analytics has become prevalent in some key areas now and is slowly changing the way we do business:

Financial-Banking and Insurance – perhaps the most prevalent users of analytics, and early adopters as well

Credit Card Fraud:   The customer’s transaction is validated in real time against the customer’s records and past transactions, travel schedule (obtained from the travel sites where the booking was made) and place of transaction, to identify any abnormal activity.  Certain rules are set for each customer based on his/her history, and each transaction gets checked against them.  If a transaction is believed to be ‘suspicious’, an additional ‘verification’ step is added to the transaction to make it more secure.
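The rule-based check described above can be sketched in Python as follows. Everything here is hypothetical: the CustomerProfile fields, the "three times the usual maximum" threshold and the suspicious_reasons function are illustrative assumptions, not any bank's actual rules.

```python
from dataclasses import dataclass, field

@dataclass
class CustomerProfile:
    home_country: str
    typical_max_amount: float  # largest amount seen in past transactions
    travel_countries: set = field(default_factory=set)  # from known travel bookings

@dataclass
class Transaction:
    amount: float
    country: str

def suspicious_reasons(txn, profile, amount_factor=3.0):
    """Return the rules a transaction breaks; an empty list means it looks normal."""
    reasons = []
    expected_places = {profile.home_country} | profile.travel_countries
    if txn.country not in expected_places:
        reasons.append("unexpected location")
    if txn.amount > amount_factor * profile.typical_max_amount:
        reasons.append("unusually large amount")
    return reasons

# Hypothetical customer who lives in India and has booked a trip to Singapore.
profile = CustomerProfile("IN", 500.0, {"SG"})
print(suspicious_reasons(Transaction(120.0, "SG"), profile))   # no rule broken
print(suspicious_reasons(Transaction(2000.0, "BR"), profile))  # both rules broken
```

A non-empty list would be the trigger for the extra verification step, for example a one-time password sent to the customer's phone.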

Credit Risk analysis:  Banks want to play safe and ensure they can recover the loan from their customers, so they look at the past credit history against your name to see if you are a ‘safe bet’.  Thanks to credit rating agencies like Crisil, which do this as their main line of business, information on credit transactions of all kinds is available to banks and lenders to verify the details before disbursing a loan or giving you a credit card or line of credit.

Insurance Risk analysis:  Right now, your vehicle insurance premium is based on the city you live in, the risk of the neighborhood you are in, the driving points against you and the prior claims you have made.  In the USA, a few insurance companies are calculating the premium for INDIVIDUAL customers, customized to them as a pay-as-you-drive insurance policy.  The onboard telematics sends feeds to your insurer on your braking and acceleration habits, the distance you travel, and the roads you frequently travel on (using GPS).  Thanks to this sensor data, a higher premium is charged for more irresponsible driving.  This serves two purposes: it makes insurance companies more profitable and also improves one’s driving habits.  A shining example of not only where the ‘rubber meets the road’ but also where the ‘engine meets the wallet’!
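As a rough illustration of how telematics data could feed into a pay-as-you-drive premium, here is a small Python sketch. The formula, the rates and the cap are invented for this example; real insurers use their own, far more elaborate, pricing models.

```python
def pay_as_you_drive_premium(base_premium, km_driven, hard_brakes, rapid_accels,
                             km_rate=0.02, event_surcharge=1.5, max_multiplier=2.0):
    """Toy monthly premium: base + usage charge + surcharge per risky driving event."""
    usage_charge = km_driven * km_rate
    behaviour_charge = (hard_brakes + rapid_accels) * event_surcharge
    premium = base_premium + usage_charge + behaviour_charge
    # Cap the premium so it never exceeds a multiple of the base (an assumption).
    return round(min(premium, base_premium * max_multiplier), 2)

# Two hypothetical drivers covering the same distance in a month.
careful = pay_as_you_drive_premium(100.0, km_driven=500, hard_brakes=2, rapid_accels=1)
aggressive = pay_as_you_drive_premium(100.0, km_driven=500, hard_brakes=30, rapid_accels=25)
print(careful, aggressive)
```

The careful driver pays less than the aggressive one for the same distance, which is the incentive the insurer is trying to create.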


     The biggest bang for the buck for analytics, in my opinion, would be in two areas: healthcare, which impacts everyone’s life, and retail, to understand customers better.  Healthcare today comes at a cost and depends heavily on the facilities of the hospital or clinic where you are being treated, and on the knowledge of the doctor attending to you.  Healthcare is a critical industry, like power, where the government needs to ensure that it is affordable to all its citizens and, at the same time, that the best available care reaches everyone.

       For all this to happen, a good start would be a health record of the patient that is available electronically across the nation and the globe.  This would carry the patient’s history of ailments, conditions, surgeries and medications, and the results of regular health check-ups; this is the Electronic Health Record (EHR) available in the US and other countries.  The second would be the availability of data on all clinical trials that are in progress, on treatments that are already FDA-approved, on side effects of all medications, on diseases prevalent in certain parts of the world, and of course the insurance data of the individual.  With these two together, any doctor anywhere in the world can give guidance on the best cure and care for the patient, the best medicine from any pharma company for a particular condition, and the best insurance plan for an individual and his family based on the risks they carry.  Data now drives most of these integrated decisions, along with the doctor’s experience in suggesting a remedy; compare this with yesteryear, when such data would not have been available.  This progresses further into telemedicine, where a soldier injured in battle lies in an operating tent with medical gadgets streaming data to experienced doctors sitting elsewhere, who guide the surgical procedure and get him out of danger quickly.


     All your purchase patterns and transactions are being collected and analyzed carefully to send you targeted advertisements with e-coupons, to help companies do location-based marketing, to get data on customers who are leaving, where they are going and why, to manage the effectiveness of an ad campaign, and to know the details of acquired customers to improve cross-selling.  The better retailers know their customers, the better their sales will be in an industry with thin margins.

      The other areas in retail where big data analytics is already in use are inventory management, logistics optimization, merchandise assortment and pricing optimization, fraud and loss prevention, and vendor rationalization.

    Classic examples of analytics are Amazon’s “you may also want” prompts and Netflix’s “what your friends thought” movie suggestions, both of which show good results for the retailer.


      Many travel sites collect the log files from all the searches made by users and, based on your preferences, strive to increase their booking ratio.  They would also have text analytics reports on your TripAdvisor reviews; based on what you like and do not like, and on your past history on their site and other sites, they will be able to offer optimized flight and hotel options that take into account the budget and time inputs you have given.


      Volvo, along with Sweden’s transportation department, is using a cloud service for car-to-car communication to warn drivers ahead of icy and slippery roads, thus making safety a priority.  They collect data from the sensors (ESPs) fitted inside their cars; the ESP stabilizes the car and also sends signals about hazardous road conditions through the mobile network to the cloud.  This real-time information is shared with the cars behind that are about to use the same road, so that they are pre-warned about the actual condition of the road, and it complements any blanket weather warning that drivers automatically receive.


   A major part of advertising is the reach and conversion that one gets through any form of media, be it mobile, TV, Web or classic print.  Advertising is what brings money to the media houses.  Despite the numerous ads that appear on any website, only a few get clicked, and only a small percentage of these clicks actually turn into a purchase.  The marketing world is always challenged with how an ad can be made more effective so that the hit ratio increases.  Now, with digital cable and dish TV clearly revealing your viewing patterns, your online purchases and shop transactions revealing your buying patterns, websites holding a history of your visits in some format, the operator knowing what value-added services you have enrolled in, and the world knowing what paper you read, all these combined through analytics would clearly describe a ‘path-to-purchase’ pattern that enables the media houses to focus their ads appropriately.  It will not be too long before ads customized to your likes stream into your TV or mobile.
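The "few clicks, even fewer purchases" funnel is simple arithmetic, sketched here in Python. The function name funnel_rates and the campaign numbers are hypothetical and only meant to show how the ratios relate to each other.

```python
def funnel_rates(impressions, clicks, purchases):
    """Compute the basic ratios of an ad funnel, guarding against division by zero."""
    click_through = clicks / impressions if impressions else 0.0
    conversion = purchases / clicks if clicks else 0.0
    overall = purchases / impressions if impressions else 0.0
    return {
        "click_through_rate": click_through,  # share of ads shown that get clicked
        "conversion_rate": conversion,        # share of clicks that become purchases
        "overall_hit_ratio": overall,         # share of ads shown that end in a purchase
    }

# Hypothetical campaign: 100,000 ads shown, 500 clicks, 10 purchases.
for name, value in funnel_rates(100_000, 500, 10).items():
    print(f"{name}: {value:.4%}")
```

Better targeting aims to raise the first two ratios, and the overall hit ratio is just their product.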

    We already have news websites that customize your viewing page automatically based on your interests, using data collected and analyzed from your previous visits to the website.


   The business problems that get tackled here through analytics are classified into three buckets: 

  • Sales and Marketing to understand their sales force effectiveness and resource optimization, market assessment and competitive analysis
  • Research and Development for clinical trials and reporting to FDA, safety analysis for the product, and licensing
  • Pricing and contracting for inventory and logistics management, and for setting up contracts and buybacks and rebates etc.

     The other prevalent applications, some of which you use daily without being aware that they are cloud based, are Google Docs, Gmail and Yahoo Mail; wearable health devices with sensors that routinely monitor vital patient data and feed it back to the hospital or doctor, who can act immediately on any anomalies; gene profiling and protein structure modelling done using community clouds from research institutions; satellite image processing, now used by several countries for natural disaster management; opinion polls during elections; online document storage like Dropbox or iCloud by Apple; all the social networking sites like Facebook and Twitter; online gaming; and casino gambling predictions.

Transformation in the future

    How would you feel if some complex tool used by a company predicted your next behavior with reasonable accuracy?  How can companies use the data you provide and analyze it to make you BUY?  How can healthcare be more focused on your particular problem and provide the best care at the cost you want?  How can you get the best travel package for you and your family, based on your likes and dislikes, that would enhance the memories of the trip?  How can your insurance be tailor-made for you based on your own defensive driving habits and your history of no claims?  How can banks give you the best bang for your buck by automatically understanding your financial goals and delivering a better return for you as a privileged customer?  How can airlines make you fly with them frequently by enhancing your particular travel experience every time?

     Big Data and its associated analytics are used to take on each customer one at a time and enhance their experience.  Alternatively, we can still take the old route and use the 80/20 rule, which says that one can draw about 80% of the conclusions and decisions from the top 20% of the overall customer data.  The choice is clear.
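A quick way to see whether the 80/20 rule holds for a given customer base is to measure how much of the total comes from the top 20% of customers. The sketch below is in Python; the function name top_share and the spending figures are made up for illustration.

```python
def top_share(values, top_fraction=0.2):
    """Return the share of the total contributed by the top `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    top_n = max(1, round(len(ordered) * top_fraction))  # at least one customer
    total = sum(ordered)
    return sum(ordered[:top_n]) / total if total else 0.0

# Hypothetical yearly spend of ten customers.
spend = [500, 300, 40, 35, 30, 25, 25, 20, 15, 10]
print(f"Top 20% of customers account for {top_share(spend):.0%} of spend")
```

In this invented data set the top two customers account for 80% of the spend; real data may be more or less skewed, which is exactly what such a check reveals.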


  • Big Data, Big Analytics – Michael Minelli et al., Wiley, 2013
  • Big Data For Dummies – Judith Hurwitz et al., Wiley, 2013
  • Mastering Cloud Computing – Rajkumar Buyya et al., McGraw Hill, 2013

Many thanks to the reviewers of this blog for their valuable feedback: Vishoo, Venki and John, all of them from either an analytics or an e-commerce background.
