Matthew, David and Antoine, my co-authors, are three of the world's very top engineers of data-science solutions, and it's been really exciting to collaborate with people of such deep expertise and talent.
This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting-edge techniques. This book assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof-of-concept studies and built prototypes.
This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more.
So here's the conundrum. Why are there no books which present Spark in this way, recognizing that one of the best reasons to work in Spark is its application to production data science? If you scan the bookshelves (or look at tutorials online), all you will find is toy models and a review of the Spark APIs and libraries. You will find little or nothing about how Spark fits into the wider architecture, or about how to manage data ETL in a sustainable way.
I think you will find that the practical approach taken by the authors in this book is different. Each chapter takes on a new challenge, and each reads as a voyage of discovery where the outcome was not necessarily known in advance of the exploration. And the value of doing data science properly is set out clearly from the start. This is one of the first books on Spark for grown-ups who want to do real data science that will make an impact on their organisation. I hope you enjoy it.
The purpose of data science is to transform the world using data, and this goal is mainly achieved through disrupting and changing real processes in real industries. To operate at that level, we need to be able to build data science solutions of substance: ones that solve real problems, and which can run reliably enough for people to trust and act upon.
Therefore, whilst this book certainly contains useful code and, in many cases, unique implementations, it also dives deep into the techniques and skills required to truly master data science, some of which are often overlooked or not considered at all. Drawing on many years of commercial experience, the authors have leveraged their extensive knowledge to bring the real and exciting world of data science to life.
Chapter 1, The Big Data Science Ecosystem, is an introduction to an approach and accompanying ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies that will be used in later chapters, and introduces the environment and how to configure it appropriately. Additionally, it explains some of the non-functional considerations relevant to the overall data architecture and long-term success.
Chapter 2, Data Acquisition, covers one of a data scientist's most important tasks: accurately loading data into a data science platform. Rather than relying on uncontrolled, ad hoc processes, this chapter explains how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data.
Chapter 7, Building Communities, addresses a common use case in data science and big data. With more and more people interacting, communicating, exchanging information, or simply sharing a common interest in different topics, the entire world can be represented as a graph. A data scientist must be able to detect communities, find influencers and top contributors, and detect possible anomalies.
Chapter 8, Building a Recommendation System, demonstrates how to recommend music content using raw audio signals. If one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere; their popularity is down to their versatility, usefulness, and broad applicability.
We presume that the data scientists reading this book are knowledgeable about data science, common machine learning methods, and popular data science tools, and have, in the course of their work, run proof-of-concept studies and built prototypes. We offer this audience a book that introduces advanced techniques and methods for building data science solutions, showing them how to construct commercial-grade data products.
This chapter is an introduction to an approach and ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies. It introduces the environment, and how to configure it appropriately, but also explains some of the nonfunctional considerations relevant to the overall data architecture. While there is little actual data science at this stage, it provides the essential platform to pave the way for success in the rest of the book.
Molly is a native San Franciscan who grew up in Wyoming chasing cows and developing a lifelong love of geology. After returning to California and spending years in Santa Cruz County delivering firewood, cleaning fish, and racing cars, she moved back to the Bay Area and taught Middle School math and science for more than a decade. An early adopter of technology in education, her focus shifted to information literacy, meshing nicely with her love of libraries and data. For fun she digs holes, BBQs, gardens, reads, and plays as much music as possible. Vinyl preferred.