Hadoop gives companies the power to store and analyze information quickly, efficiently and at a lower cost than ever before. Its power and flexibility make it well suited to problems that involve large, complex data sets and that demand new approaches to processing and analysis.
Hadoopable problem #4: Delivering Search Results
The Hadoopable Problem:
Good search tools have been a boon to web users and the companies they visit online. But as the amount of data online has grown, organizing it has become increasingly difficult. Users today are more likely to search for information with keywords than to browse through folders looking for what they need.
Good search tools are hard to build. They must store massive amounts of information, much of it complex text or multimedia files. They must be able to process these files to extract keywords or other attributes to use in searches. The amount of data and its complexity demand a scalable and flexible platform for indexing.
Besides the difficulty of handling the data, a good search engine must be able to assess the intent and interest of the user when a search query arrives. The word "chip", for example, can refer to food or to electronic components - context is vital to delivering useful search results. Delivering meaningful results depends on analyzing user preferences, recent browsing history, and a number of other data sources.
A major online retailer meets the challenge of delivering good search results by building its indexing infrastructure on Hadoop. The platform has scaled easily to the data volume required. Just as importantly, it runs complicated indexing algorithms, and Hadoop's distributed, parallel architecture lets the retailer index very large amounts of information quickly. Search is particularly important to retailers because revenues are highly dependent on search success. If the search system delivers results that are not of interest to the buyer, there is no sale, and the user may well choose another retailer. Search relevance therefore drives both profitability and customer retention.
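The indexing work described above can be illustrated with a small sketch. This is not the retailer's actual system: it is a single-process Python imitation of the map and reduce steps that a Hadoop job would spread across many machines, and the function names and the three-item catalog are hypothetical.

```python
import re
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (keyword, doc_id) pair per word in a document."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield word, doc_id


def reduce_postings(pairs):
    """Reduce step: collect the ids of all documents containing each keyword."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_index(documents):
    """Run the map step over every document, then reduce into one index."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_document(doc_id, text)
    )
    return reduce_postings(pairs)


# Hypothetical catalog used only for illustration.
catalog = {
    "sku-1": "Potato chip variety pack",
    "sku-2": "Memory chip for laptops",
    "sku-3": "Laptop sleeve",
}
index = build_index(catalog)
print(index["chip"])  # ['sku-1', 'sku-2']
```

Because each document is mapped independently, the map step can run on many machines at once, which is the property that lets a distributed platform index very large collections quickly.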
In addition to using information about individuals, their history, and their preferences when building search indexes, effective search engines track user behavior in response to the searches themselves (which results were clicked, and which were ignored) and use it to refine results for subsequent users and future searches.
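One simple way to picture this feedback loop is to reorder results by how often users clicked them for the same query. The sketch below assumes a smoothed click-through rate as the relevance signal; the `ClickFeedback` class and the sample data are hypothetical, and real search engines combine many more signals.

```python
from collections import Counter


class ClickFeedback:
    """Tracks which results were shown and clicked for each query."""

    def __init__(self):
        self.shown = Counter()
        self.clicked = Counter()

    def record(self, query, shown_ids, clicked_ids):
        """Log one search: the results displayed and the ones clicked."""
        for doc_id in shown_ids:
            self.shown[(query, doc_id)] += 1
        for doc_id in clicked_ids:
            self.clicked[(query, doc_id)] += 1

    def click_rate(self, query, doc_id):
        """Smoothed click-through rate; unseen results score 0.5."""
        key = (query, doc_id)
        return (self.clicked[key] + 1) / (self.shown[key] + 2)

    def rerank(self, query, doc_ids):
        """Order results so the most-clicked ones come first."""
        return sorted(
            doc_ids,
            key=lambda doc_id: self.click_rate(query, doc_id),
            reverse=True,
        )


feedback = ClickFeedback()
feedback.record("chip", ["sku-1", "sku-2"], ["sku-2"])
feedback.record("chip", ["sku-1", "sku-2"], ["sku-2"])
feedback.record("chip", ["sku-1", "sku-2"], ["sku-1"])
print(feedback.rerank("chip", ["sku-1", "sku-2", "sku-3"]))
# ['sku-2', 'sku-3', 'sku-1']
```

The smoothing keeps a result that has never been shown from being ranked below one that was shown often and mostly ignored.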
The Bottom Line:
Online search is a problem of massive data sets, rapid analysis, and meaningful results. Combining the storage capacity of Hadoop with its ability to analyze data in parallel, regardless of format, lets businesses economically process large amounts of data to provide results that take into account user behavior, preference and history, and match query results to user needs.
For businesses like online retailers, accurate and meaningful search results drive profitability and customer retention - helping build engaged and satisfied customers.
Hadoopable problem #5: Predicting Utility Outages
The Hadoopable Problem:
Energy utilities run large, complicated, expensive systems to generate and distribute power. Each generator, regardless of type, contains sophisticated sensors that monitor output: voltage, current, frequency, and other important operating parameters. Operating a single generator requires paying close attention to the energy source, energy production, and all the data constantly streaming from the sensors attached to it.
Utilities operate many generators that are spread across multiple locations. The locations are connected to one another, and each utility is connected to a public power grid.
Monitoring the health of the entire grid requires capture and analysis of data from multiple energy providers, every generator, and the grid itself.
Failures in energy generation often result in cascading outages as the larger generation and distribution network struggles to contain a surge or supply a shortfall. One small problem can manifest as a utility outage for tens of thousands of consumers and can be extraordinarily costly for the utility operators.
The volume of data required to predict and prevent outages is enormous. A clear picture of the health of the grid depends on both real-time and after-the-fact forensic analysis of huge amounts of collected data in a variety of formats. Spotting facilities or grid components at risk of failure before they fail, and doing preventative maintenance or separating them from the grid before they impact service delivery, is critical to preventing costly outages.
One power company uses Hadoop clusters to capture and store the data streaming off sensors in both energy production and the grid. It built a continuous analysis system that watches the performance of individual generators, looking for fluctuations that might suggest trouble. It also watches for problems among generators - differences in phase or voltage that might cause trouble on the grid as a whole.
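As a rough illustration of watching a single generator for fluctuations, the sketch below flags any reading that departs sharply from the readings just before it. The rolling window, the three-standard-deviation threshold, and the sample frequency values are all assumptions chosen for the example; the source does not describe the power company's actual method.

```python
from collections import deque
from statistics import mean, stdev


def find_fluctuations(readings, window=5, threshold=3.0):
    """Return the indexes of readings that deviate sharply from recent ones.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` readings.
    """
    recent = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                flagged.append(i)
        recent.append(value)
    return flagged


# Hypothetical output-frequency samples (Hz) from one generator.
frequency_hz = [60.00, 60.01, 59.99, 60.00, 60.02, 60.01, 58.70, 60.00]
print(find_fluctuations(frequency_hz))  # [6]
```

The same comparison could be run across generators rather than across time, for example by checking each generator's voltage against the average of its neighbors.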
Hadoop stores the data from the sensors inexpensively, so the power company can afford to keep long-term historical data in usable form for forensic analysis. As a result, the power company can see, and react to, long-term trends and emerging problems in the grid that are not apparent in the instantaneous performance of any particular generator.
The Bottom Line:
While the use of Hadoop in power generation and outage prevention is highly specialized, it has an analog in other complex systems and infrastructure grids. In particular, the same tools and approaches are often used in large-scale data centers or other applications that deliver utility-type services.
In a large data center with thousands of servers, understanding what the systems and applications are actually doing can be difficult. Existing tools often don't scale, and operational data doesn't conform to simple formats. IT infrastructure can capture system-level logs that describe the behavior of individual servers, routers, storage systems, and more. Higher-level applications also generally produce logs that describe the health and activity of application servers. Combining all the data into a single repository and analyzing it together can help IT organizations better understand their infrastructure and improve efficiencies across the network. Hadoop can store and analyze log data in virtually any format and can build a higher-level picture of the health of the data center as a whole.
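A small sketch can show what analyzing mixed-format logs together might look like. It assumes two made-up formats, a plain "host LEVEL message" line and a JSON line with `host`, `level`, and `msg` fields; both formats, the host names, and the function names are hypothetical and stand in for whatever a given data center actually produces.

```python
import json
import re

# Hypothetical plain-text format: "<host> <LEVEL> <message>"
TEXT_PATTERN = re.compile(r"^(?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def parse_line(line):
    """Normalize one log line into a dict with host, level, and message."""
    line = line.strip()
    if line.startswith("{"):
        # Hypothetical JSON format produced by an application server.
        record = json.loads(line)
        return {
            "host": record.get("host", "unknown"),
            "level": record.get("level", "INFO").upper(),
            "message": record.get("msg", ""),
        }
    match = TEXT_PATTERN.match(line)
    if match:
        return match.groupdict()
    # Keep unrecognized lines rather than dropping them.
    return {"host": "unknown", "level": "UNKNOWN", "message": line}


def errors_by_host(lines):
    """Count ERROR records per host across all log formats."""
    counts = {}
    for record in map(parse_line, lines):
        if record["level"] == "ERROR":
            counts[record["host"]] = counts.get(record["host"], 0) + 1
    return counts


logs = [
    "web-01 ERROR disk latency high",
    '{"host": "web-01", "level": "error", "msg": "request timed out"}',
    "router-07 INFO link up",
    '{"host": "db-02", "level": "error", "msg": "replication lag"}',
]
print(errors_by_host(logs))  # {'web-01': 2, 'db-02': 1}
```

Once every source is reduced to the same record shape, a system-level error and an application-level error on the same host can be counted together, which neither log would reveal on its own.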
Source: Cloudera White Paper