Highlights
- Understand the need for Spark in data processing
- Understand the Spark architecture and how it distributes computations to cluster nodes
- Become familiar with basic installation/setup/layout of Spark
- Use Spark for interactive and ad-hoc operations
- Use DataSet/DataFrame/Spark SQL to efficiently process structured data
- Understand the basics of RDDs (Resilient Distributed Datasets), data partitioning, pipelining and computations
- Understand performance implications and optimisations when using Spark
- Understand Spark's data caching and usage
- Become familiar with Spark Graph Processing and SparkML machine learning
Course Details
Module 1 - Introduction to Spark - Getting started
- What is Spark and what is its purpose?
- Overview, Motivations, Spark Systems
- Spark Ecosystem
- Spark vs. Hadoop
- Typical Spark Deployment and Usage Environments
- Components of the Spark unified stack
- Resilient Distributed Dataset (RDD)
- Downloading and installing Spark standalone
- Python overview
- Launching and using the Python shell
Module 2 - Resilient Distributed Dataset and DataFrames
- Understand how to create parallelized collections and external datasets
- Work with Resilient Distributed Dataset (RDD) operations
- Utilize shared variables and key-value pairs
- RDD Concepts, Partitions, Lifecycle, Lazy Evaluation
- Working with RDDs - Creating and Transforming (map, filter, etc.)
- Caching - Concepts, Storage Type, Guidelines
- DataFrames and DataSets - Introduction and Usage
- Creating and Using a DataSet
- Working with JSON
- Using the DataSet DSL
- Using SQL with Spark
- Data Formats
- Optimizations: Catalyst and Tungsten
- DataSets vs. DataFrames vs. RDDs
Module 3 - Spark application programming
- Understand the purpose and usage of the SparkContext
- Initialize Spark with the Python programming language
- Describe and run some Spark examples
- Pass functions to Spark
- Create and run a Spark standalone application
- Submit applications to the cluster
- Overview, Basic Driver Code, SparkConf
- Creating and Using a SparkContext/SparkSession
- Building and Running Applications
- Application Lifecycle
- Cluster Managers
- Logging and Debugging
Module 4 - Introduction to Spark libraries
- Understand and use the various Spark libraries
Module 5 - Spark configuration, monitoring and tuning
- Understand components of the Spark cluster
- Configure Spark by modifying Spark properties, environment variables, or logging properties
- Monitor Spark using the web UIs, metrics, and external instrumentation
- Understand performance tuning considerations
- The Spark UI
- Narrow vs. Wide Dependencies
- Minimizing Data Processing and Shuffling
- Caching - Concepts, Storage Type, Guidelines
- Using Caching
- Using Broadcast Variables and Accumulators
Module 6 - Spark Streaming (optional)
- Overview and Streaming Basics
- Structured Streaming
- DStreams (Discretized Streams)
- Architecture, Stateless, Stateful, and Windowed Transformations
- Spark Streaming API
- Programming and Transformations
Who should attend
Python or Java/Scala developers who need to learn how to build Big Data and ML solutions with Apache Spark
Feedback
4.8 out of 5 average
"Good introduction to Apache Spark. The trainer was great at talking us through the information, specifically optimisation methods." RL, Financial Crime Technologist, Apache Spark, April 2021
“JBI did a great job of customizing their syllabus to suit our business needs and also bringing our team up to speed on the current best practices. Our teams varied widely in terms of experience and the Instructor handled this particularly well - very impressive”
Brian F, Team Lead, RBS, Data Analysis Course, 20 April 2022