Big Data: Ride the Apache Spark!!


There is so much data out there today that no one can possibly process it all. Many companies, for example, already hold data that could tell them how their customers actually feel, and when and why those customers might switch to a competitor. The problem is that most companies do not know what they don't know.

Fortunately, Apache Spark can save the day for those who are savvy enough to use it cleverly. Apache Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was started at UC Berkeley in 2009 and is now developed at the vendor-independent Apache Software Foundation. Since its release, Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Yahoo, eBay and Netflix have deployed Spark at massive scale, processing multiple petabytes of data on clusters of over 8,000 nodes. The Spark project has also built the largest open source community in big data, with over 1,000 contributors from 250+ organizations.

Data transfer is one of the most pressing problems for companies in the telecom industry today: as data requirements grow from month to month, the cost of dealing with the mass grows just as fast. SK Telecom used Spark to tackle this data deluge and to provide a network quality analytics platform for its data scientists and network operators. The key questions were how to process real-time network data more efficiently, and how Spark Streaming, Spark SQL and MLlib can be used for high-speed, large-scale enterprise data processing and analytics.

SK Telecom's network quality analysis work aims to provide a reliable wireless network and to support network optimization. When network traffic increases suddenly, the problem has to be diagnosed quickly and accurately, and various approaches have been tried. Using Spark Streaming, MLlib and Spark SQL, the company moved to real-time processing of quality statistics coming from its base stations and set up a foundation for proactive analysis.
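As a rough illustration of that kind of pipeline, here is a minimal Spark Streaming sketch that aggregates hypothetical per-base-station quality records arriving over a socket. The host, port, record format (station id plus latency in milliseconds) and the 10-second batch interval are all assumptions made for the example, not details of SK Telecom's actual system.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Assumed setup: 10-second micro-batches on an existing cluster.
sc = SparkContext(appName="BaseStationQuality")
ssc = StreamingContext(sc, 10)

# Hypothetical feed of "station_id,latency_ms" lines from network collectors.
lines = ssc.socketTextStream("collector-host", 9999)

# Parse each record into (station_id, latency) pairs.
records = lines.map(lambda line: line.split(",")) \
               .map(lambda parts: (parts[0], float(parts[1])))

# Average latency per base station within each batch: sum and count, then divide.
sums = records.mapValues(lambda latency: (latency, 1)) \
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg_latency = sums.mapValues(lambda s: s[0] / s[1])

# A real deployment would write to a dashboard or alerting system instead.
avg_latency.pprint()

ssc.start()
ssc.awaitTermination()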

Apache Spark's in-memory infrastructure has the potential to deliver up to 100 times better performance than Hadoop's disk-based MapReduce paradigm. The in-memory model lets user programs load data into the cluster's memory and query it repeatedly, relying on a cluster manager and a distributed storage system underneath. This makes it a natural fit for advanced applications and algorithms (e.g., topic modelling, deep neural networks) and for massively scalable learning platforms that exploit both application- and infrastructure-specific optimizations (data sparsity, parameter servers, etc.). Spark brings top-end data analytics, with the same performance level and sophistication you would get from expensive specialised systems, to a commodity Hadoop cluster. It runs in the same cluster to let you do more with your data.
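As a minimal sketch of that in-memory model, the snippet below caches a dataset once and then runs several queries against it without going back to disk. The file path and column names are placeholders invented for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

# Hypothetical CSV of call detail records; path and schema are assumptions.
cdr = spark.read.csv("hdfs:///data/cdr.csv", header=True, inferSchema=True)

# Pin the dataset in cluster memory so repeated queries avoid re-reading disk.
cdr.cache()

# Each action below reuses the cached data instead of rescanning the source.
total_calls = cdr.count()
dropped_calls = cdr.filter(cdr["call_status"] == "DROPPED").count()
busiest_cells = cdr.groupBy("cell_id").count().orderBy("count", ascending=False)
busiest_cells.show(10)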

If you think about Spark as merely a replacement for Hadoop, you are short-changing yourself. Instead of replacing Hadoop, consider Spark a complementary technology that should be used in partnership with Hadoop. Keep in mind that Spark can run separately from the Hadoop framework, where it can integrate with other storage platforms and cluster managers. It can also run directly on top of Hadoop, where it can easily leverage its storage and cluster manager. Simply put, Spark can run on Hadoop, Mesos, standalone, or in the cloud.
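As a small illustration of that flexibility, the same PySpark application can be pointed at different cluster managers simply by changing the master setting; the URLs below are placeholders, not real cluster addresses.

from pyspark.sql import SparkSession

# The master URL decides where the job runs; the application code stays the same.
master = "local[4]"                      # single machine, 4 cores
# master = "spark://spark-master:7077"   # Spark standalone cluster
# master = "mesos://mesos-master:5050"   # Apache Mesos
# master = "yarn"                        # Hadoop YARN (needs HADOOP_CONF_DIR set)

spark = (SparkSession.builder
         .appName("PortableJob")
         .master(master)
         .getOrCreate())

# A trivial job that runs unchanged on any of the cluster managers above.
print(spark.range(1000).count())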

Some insight on Hadoop : https://maliksadiq13.wordpress.com/2013/11/30/big-data-why-telcos-need-a-closer-look-at-hadoop/

Yahoo has two Spark projects in the works, one for personalizing news pages for Web visitors and another for running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and to categorize news stories as they arise so it knows which types of users would want to read them.

Another early Spark adopter is Conviva, one of the largest streaming video companies on the Internet, with about 4 billion video feeds per month (second only to YouTube). As you can imagine, such an operation requires pretty sophisticated behind-the-scenes technology to ensure a high quality of service. As it turns out, it's using Spark to help deliver that QoS by avoiding dreaded screen buffering. Conviva uses Spark Streaming to learn network conditions in real time and feeds this information directly into the video player, say the Flash player on your laptop, to optimize streaming speeds. The system has been running in production for over six months to manage live video traffic.

Automating machine learning is an area where Apache Spark really shines. Spark is designed for data science, and its abstractions make data science easier. Data scientists commonly use machine learning, a set of techniques and algorithms that can learn from data. These algorithms are often iterative, and Spark's ability to cache the dataset in memory greatly speeds up such iterative processing, making Spark an ideal engine for implementing them.
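As a hedged sketch of that pattern, the snippet below caches a small training DataFrame and fits a logistic regression model with Spark MLlib; the toy data and column names are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("IterativeML").getOrCreate()

# Tiny invented dataset of (features, label); real data would come from storage.
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1, 0.1]), 1.0),
     (Vectors.dense([2.0, 1.0, -1.0]), 0.0),
     (Vectors.dense([2.0, 1.3, 1.0]), 0.0),
     (Vectors.dense([0.0, 1.2, -0.5]), 1.0)],
    ["features", "label"])

# Caching matters because each training iteration re-reads the same data.
train.cache()

# Logistic regression is iterative: maxIter passes over the cached DataFrame.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)

print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)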

Machine learning is changing not only how we interact with machines, but how we relate to the world around us. During the past decade it has given us self-driving cars, speech recognition, effective web search and a vastly improved understanding of the human genome. With Spark you can, for example, automatically determine the best way to train your learning algorithm, a technique commonly referred to as hyperparameter tuning.
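One hedged way to do that in Spark MLlib is with ParamGridBuilder and CrossValidator, as sketched below; the candidate values, and the cached train DataFrame with features and label columns carried over from the previous snippet, are assumptions for illustration.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Reuses the cached `train` DataFrame from the previous sketch; in practice
# you would tune on a much larger dataset.
lr = LogisticRegression(maxIter=10)

# Candidate hyperparameter values to try, chosen arbitrarily for the example.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.1, 0.01, 0.001])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

# 3-fold cross-validation evaluates every combination and keeps the best one.
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cv_model = cv.fit(train)
best_model = cv_model.bestModel
print("Best average AUC across folds:", max(cv_model.avgMetrics))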

Expanded data analytics, enabling better and faster decisions, is expected to improve process and product utilization, deepen consumer and market understanding, and reduce risk in less time. New business requirements and usage models are emerging and driving the need for new big data analysis paradigms. In particular, there is increasing demand from organizations to discover and explore data using advanced analytics algorithms (e.g., large-scale machine learning, graph analysis, statistical modeling) for deep insights. For this and more we have Apache Spark!!

Sadiq Malik ( Telco Strategist )
