Module | Topics | Demo/Project | ||
1 | Introduction to Big Data and Hadoop | This lesson covers traditional systems, the problems associated with traditional large-scale systems, and what Hadoop and its ecosystem are | Introduction to Big Data | |
Big Data Analytics | ||||
What is Big Data | ||||
Four Vs of Big Data | ||||
Challenges of Traditional System | ||||
Distributed Systems | ||||
Introduction to Hadoop | ||||
Components of Hadoop Ecosystem |
2 | Hadoop Architecture Distributed Storage (HDFS) and YARN | This lesson covers distributed processing on a cluster, HDFS architecture, how to use HDFS, YARN as a resource manager, YARN architecture, and how to work with YARN. | What is HDFS and Why It Is Needed | |
Regular File System Vs HDFS | ||||
HDFS Architecture and components | ||||
High Availability Cluster Implementation | ||||
HDFS Component File System and Namespace | ||||
Data Block Split | ||||
Data Replication and Rack Awareness | ||||
HDFS Command Line | Demo | |||
Resource Management: YARN | ||||
Resource Management: YARN Architecture | ||||
Resource Management: Working with YARN | ||||
Walk Through of the Cluster | Demo |
3 | Distributed Processing using MapReduce | This lesson covers the distributed processing framework, MapReduce and its characteristics, and advanced MapReduce concepts | Distribution Phases | |
MapReduce Framework | ||||
Word Count Example | Demo | |||
MapReduce Jobs | ||||
Joins in MapReduce |
4 | Data Ingestion and ETL | This lesson covers Sqoop, basic imports and exports in Sqoop, improving Sqoop’s performance, limitations of Sqoop and Sqoop 2, Apache Flume, Flume architecture, Flume sources, Flume sinks, Flume channels, Flume configurations, Apache Kafka, its data model, and its architecture with Zeppelin integration. | Apache Sqoop and its Use | |
Import and Export using Sqoop from MySQL to HDFS | Demo | |||
Sqoop Connectors | ||||
Sqoop Demo | ||||
Limitations of Sqoop | ||||
Sqoop 2 | ||||
Apache Flume and its Use | ||||
Flume Model and Scalability | ||||
Flume Architecture | ||||
Configuring Flume Components | ||||
Ingest real time Twitter data | Demo | |||
Apache Kafka | ||||
Aggregating user activity using Kafka | ||||
Kafka Data Model and Partitions | ||||
Kafka Architecture | ||||
API – Producer side and Consumer side | ||||
Setting up Kafka Cluster | Demo | |||
Creating Sample Kafka Data Pipeline Using Producer and Consumer | Demo | |||
Practice Project: Data Ingestion Into Big Data Systems and ETL | Project |
5 | Apache Pig | This lesson covers Apache Pig, the components of Pig, Pig vs SQL, and how to work with Pig | Introduction to Pig | |
Advantage of Pig over MapReduce | ||||
Pig Architecture | ||||
Pig Data Model | ||||
Pig Modes | ||||
Pig Operations and Relations | ||||
Analysing Sales Data using Pig | Demo | |||
Word count problem using Pig | Demo | |||
Practice Project : Airline Data | Project |
6 | Apache Hive | This lesson introduces Hive and Impala, why to use them, the differences between Hive and Impala, how Hive works, how Hive compares to traditional databases, and advanced Hive concepts | Introduction to Hive | |
Hive over MapReduce | ||||
Hive vs Impala | ||||
Hive Architecture | ||||
Hive Metastore | ||||
Hive DDL and DML | Demo | |||
Hive Operations | ||||
Data types and validations | ||||
File format types | ||||
HCatalog | ||||
Data Serialization | ||||
Hive Optimization | ||||
Hive Partitioning | ||||
Hive Bucketing | ||||
Hive Sampling | ||||
CRUD operations in Hive | ||||
Hive Functions | ||||
UDF and UDAF | ||||
Practice Project : Movie Awards Data | Project |
7 | NoSQL database HBase | This lesson gives an introduction to HBase, HBase architecture, data storage in HBase, and HBase vs RDBMS. | NoSQL Introduction | |
HBase Overview | ||||
HBase Architecture | ||||
Data Model | ||||
Connecting to HBase | ||||
Working with HBase | Demo |
8 | Basics of Functional Programming and Scala | This lesson introduces Scala and functional programming | Introduction to Scala | |
Scala Installation | Demo | |||
Functional Programming | ||||
Programming with Scala | ||||
Basic Literals and Arithmetic Operators | ||||
Logical Operators | ||||
Type Inference, Classes, Objects and Functions | ||||
Collections and Types | ||||
Operations on Lists | ||||
Scala REPL | ||||
Practice Project: Companies Data | Project |
9 | Apache Spark | This lesson covers Apache Spark, how to use the Spark shell, RDDs, and functional programming in Spark | Introduction to Spark and its History | |
Limitations of MapReduce/Hadoop | ||||
In-memory Processing | ||||
Hadoop Ecosystem vs Spark | ||||
Architecture and Components of Spark | ||||
Spark Cluster in Real World | ||||
Running Scala Programs in Spark Shell | Demo | |||
Setting up Spark IDE | Demo | |||
Spark Web UI | Demo |
10 | Spark RDD | This lesson covers RDDs in detail and all the operations associated with them, including key-value pair RDDs and other pair RDD operations. You will learn about RDD lineage, caching, distributed persistence, the storage levels of RDD persistence, how to choose the correct storage level, RDD fault tolerance, RDD partitions, how to partition file-based RDDs, HDFS and data locality, parallel operations in Spark, Spark stages, and how to control the level of parallelism | Introduction to Spark RDD | |
Creating Spark RDD | ||||
Pair RDD | ||||
RDD Operations | ||||
Spark Transformations using Scala/Python | Demo | |||
Spark Actions using Scala/Python | ||||
Caching and Persistence | ||||
Storage Levels | ||||
Lineage and DAG | ||||
Debugging in Spark | ||||
Partitioning in Spark | ||||
Scheduling in Spark | ||||
Shuffling in Spark | ||||
Sort and Shuffle | ||||
Aggregating Data with Pair RDD | ||||
Different File Formats | ||||
Real World Application | ||||
Optimizing Spark Jobs | ||||
Practice Project: Bus breakdown and delay | Project |
11 | Spark SQL and DataFrames | In this lesson you will learn about Spark SQL and SQL Context, creating DataFrames, transforming and querying DataFrames, and comparing Spark SQL with Impala. It also covers Spark Streaming concepts with window and join operations | Spark SQL Introduction | |
Spark SQL Architecture | ||||
Dataframes | Demo | |||
Various data formats | ||||
Dataframe Operations | ||||
UDF and UDAF | ||||
RDD vs Dataframe vs Dataset | ||||
Practice Project- Companies Detail | Project |
12 | Spark MLlib | In this lesson you will learn about Spark use cases, iterative algorithms in Spark, machine learning, and the k-means algorithm | Introduction to Spark MLlib | |
Modelling Big Data with Spark | ||||
Analytics in Spark | ||||
Machine Learning | ||||
Supervised and Unsupervised Learning | ||||
Linear Regression | Demo | |||
Clustering | Demo | |||
K-Means | ||||
Reinforcement Learning | ||||
Semi-supervised Learning | ||||
MLlib Pipelines | ||||
Practice Project- Spark MLlib – Diamond Pricing | Project |
13 | Spark Streaming | In this lesson you will learn about Spark Streaming concepts with window and join operations | Streaming Overview | |
Real Time Processing of Big Data | Demo | |||
Data Processing Architecture | ||||
Spark Streaming | ||||
Writing Spark Streaming Application | Demo | |||
Introduction to DStreams | ||||
Transformations on DStreams | ||||
Design Patterns | ||||
Windowing | Demo | |||
Join Operations | ||||
Processing Twitter dataset | Demo | |||
Structured Spark Streaming | ||||
Output Sinks | ||||
Structured Streaming APIs | ||||
Streaming Pipelines | Demo | |||
Practice Project – Spark Streaming | Project |
14 | Spark GraphX | In this lesson you will learn about graph processing and analysis | Spark GraphX | |
Introduction to graphs | ||||
Graph Operators | ||||
Join Operators | ||||
Graph Parallel System | ||||
Algorithms in Spark | ||||
Pregel API | ||||
GraphX Vertex Predicate | Demo | |||
PageRank Algorithm | Demo | |||
Practice Project – Flight times | Project |
15 | Course End Project | Car Insurance Analysis | Project | |
Transactional Data Analysis |