Sunday, July 28, 2013

Big Data Stream

Stream computing is a new paradigm necessitated by new data-generating scenarios, such as the ubiquity of mobile devices, location services, and sensor pervasiveness. A crucial need has emerged for scalable computing platforms and parallel architectures that can process vast amounts of generated streaming data.

In static data computation (the left-hand side of the attached diagram), questions are asked of static data. In streaming data computation (the right-hand side), continuously arriving data is evaluated against standing questions.

Let me give a simple example. In a financial trading platform, applications have traditionally been written to analyse historical records in batch mode. That is, the data is preserved in a data warehouse, and a result is produced and returned to the consumer only in response to a user request or query. This is the first use case.

With big-data streaming technology, the queries (such as the market trend of IT stocks) are pre-built. As the data arrives on the stream, results are published to the prescribed subscribers/consumers. Isn't it cool to taste this technology?
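The contrast between the two use cases can be sketched in a few lines of Python. This is purely illustrative: the `StandingQuery` class, the event shape, and the "IT sector" predicate are all assumptions for the sketch, not part of any real trading platform.

```python
# Minimal sketch contrasting batch computation with a standing (pre-built)
# query over a stream. All names here are illustrative assumptions.

from collections import deque

class StandingQuery:
    """A pre-registered question evaluated on every arriving event."""
    def __init__(self, predicate, subscribers):
        self.predicate = predicate      # e.g. "is this an IT-sector trade?"
        self.subscribers = subscribers  # consumers notified on each match

    def on_event(self, event):
        if self.predicate(event):
            for notify in self.subscribers:
                notify(event)

# First use case (batch): data sits in a store, the question arrives later.
warehouse = [
    {"symbol": "ABC", "sector": "IT", "price": 101.5},
    {"symbol": "XYZ", "sector": "Energy", "price": 55.0},
]
batch_result = [t for t in warehouse if t["sector"] == "IT"]

# Second use case (streaming): the question is registered first, and
# results are pushed to subscribers as data arrives.
published = deque()
query = StandingQuery(lambda t: t["sector"] == "IT", [published.append])
for tick in [{"symbol": "ABC", "sector": "IT", "price": 102.0},
             {"symbol": "QRS", "sector": "IT", "price": 12.3},
             {"symbol": "XYZ", "sector": "Energy", "price": 54.8}]:
    query.on_event(tick)
# `published` now holds the two IT-sector ticks; the consumer never asked.
```

The inversion is the whole point: in batch mode the data waits for the question, while in streaming mode the question waits for the data.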

Thursday, July 25, 2013

Storm at Yahoo

Yahoo! is enhancing its web properties and mobile applications to provide users a personalized experience based on interest profiles. To compute user interests, we process billions of events from our over 700 million users and analyze 2.2 billion content items every day. Since users' interests change over time, we need to update user profiles to reflect their current interests.

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!'s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of the design patterns applicable to many Yahoo! use cases.
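A micro-batch version of the interest-profile use case can be sketched as follows. This is not Yahoo!'s actual pipeline; the event shape, the per-batch decay factor, and the scoring scheme are assumptions chosen to show how profiles can be kept current as events stream in.

```python
# Illustrative micro-batch update of user interest profiles.
# The (user, topic) event shape and DECAY factor are assumptions.

from collections import defaultdict

DECAY = 0.9  # older interest fades a little after each micro-batch

# user -> topic -> interest score
profiles = defaultdict(lambda: defaultdict(float))

def process_micro_batch(events):
    """Fold one small batch of (user, topic) events into the profiles."""
    for user, topic in events:
        profiles[user][topic] += 1.0
    # Decay every score so profiles track *current* interests,
    # not the all-time history a pure batch job would accumulate.
    for topics in profiles.values():
        for topic in topics:
            topics[topic] *= DECAY

process_micro_batch([("alice", "sports"), ("alice", "sports"), ("bob", "finance")])
process_micro_batch([("alice", "finance")])
# alice's "sports" score has decayed twice; her "finance" score only once.
```

In a real deployment the `profiles` dictionary would live in shared storage such as HBase, which is exactly the hand-off point between Storm and Hadoop applications described below.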

Yahoo!'s big-data platform enables Hadoop applications and Storm applications to share data via shared storage such as HBase. Yahoo! engineering teams are developing technologies to enable Storm applications and Hadoop applications to be hosted on a single cluster.

Sunday, July 21, 2013


Hadoop, the clear king of big-data analytics, is focused on batch processing. This model is sufficient for many cases (such as indexing the web), but other use models exist in which real-time information from highly dynamic sources is required. Solving this problem resulted in the introduction of Storm from Nathan Marz (now with Twitter by way of BackType). Storm operates not on static data but on streaming data that is expected to be continuous. With Twitter users generating 140 million tweets per day, it's easy to see how this technology is useful.

Storm is more than a traditional big-data analytics system: It's an example of a complex event-processing (CEP) system. CEP systems are typically categorized as computation and detection oriented, each of which can be implemented in Storm through user-defined algorithms. CEPs can, for example, be used to identify meaningful events from a flood of events, and then take actions on those events in real time.
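A detection-oriented CEP flow in the spirit of a Storm topology (a spout feeding a bolt) can be sketched in plain Python. To be clear, this is not the real Storm API; the class names, the spout/bolt wiring, and the spike-detection rule are assumptions that only mimic Storm's model of tuple-emitting spouts and tuple-processing bolts.

```python
# Sketch of a detection-oriented CEP flow in the spirit of a Storm
# topology (spout -> bolt). Plain Python, not the actual Storm API.

class TweetSpout:
    """Emits raw events into the topology (here, from a fixed list)."""
    def __init__(self, events):
        self.events = iter(events)

    def next_tuple(self):
        return next(self.events, None)

class SpikeDetectBolt:
    """Detection-oriented CEP: flag any term whose count hits a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}
        self.alerts = []

    def execute(self, term):
        self.counts[term] = self.counts.get(term, 0) + 1
        if self.counts[term] == self.threshold:
            # The "meaningful event" picked out of the flood, in real time.
            self.alerts.append(term)

spout = TweetSpout(["storm", "hadoop", "storm", "storm"])
bolt = SpikeDetectBolt(threshold=3)
while (term := spout.next_tuple()) is not None:
    bolt.execute(term)
# bolt.alerts == ["storm"]
```

In actual Storm the spout and bolts run as distributed tasks and the framework handles tuple routing and replay on failure; the user-defined logic lives in methods very much like `next_tuple` and `execute` above.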

Thursday, July 4, 2013


Splunk is getting on board with a new Hadoop-based application it cheekily calls Hunk. Hunk takes Splunk's popular analytics platform and puts it to work on data stored in Hadoop, so businesses that use Hadoop can now use Splunk for exploration and visualization of that data. Its key features are:
  • Splunk Virtual Index: Splunk virtual index technology enables the “seamless use” of the entire Splunk technology stack, including the Splunk Search Processing Language (SPL), for interactive exploration, analysis and visualization of data stored anywhere, as if it was stored in a Splunk software index.
  • Explore data in Hadoop from one place: Hunk is designed for interactive data exploration across large, diverse data sets on top of Hadoop.
  • Interactive analysis of data in Hadoop: Hunk enables users to drive deep analysis, detect patterns, and find anomalies across terabytes and petabytes of data.
  • Create custom dashboards: Hunk users can combine multiple charts, views and reports into role-specific dashboards which can be viewed and edited on laptops, tablets or mobile devices.
Splunk has 5,600 customers, which includes half of the Fortune 100. It says the new Hunk product will target both new and existing customers, as long as they use Hadoop.