BIG DATA ANALYTICS
COURSE DESCRIPTION
This course is an introduction to large-scale data analytics. Big Data analytics
is the study of how to extract actionable, non-trivial knowledge from massive
data sets. The class focuses both on the cluster computing software tools and
programming techniques used by data scientists and on the mathematical and
statistical models used to learn from large-scale data. On the tools side, we
will cover the basic systems and techniques for storing large volumes of data,
as well as modern cluster computing systems based on MapReduce-style patterns,
such as Hadoop MapReduce, Apache Spark, and Apache Flink. Students will
implement data mining algorithms and run them on real cloud platforms such as
Amazon AWS, Google Cloud, or Microsoft Azure using educational accounts. On the
modeling side, the course will cover the main standard supervised and
unsupervised models and introduce techniques for improving them.
MODULE 1: INTRODUCTION TO BIG DATA PROCESSING
· Introduction to Big Data Analytics. What is Big Data? What are the challenges?
· Introduction to Apache Hadoop and MapReduce. Apache Spark.
· Spark programming (Python and PySpark)
· Spark: Resilient Distributed Datasets (RDDs)
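To make the MapReduce pattern named in this module concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop or Spark code; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. A real framework would run the map and reduce steps in parallel across a cluster and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data analytics"]))
    # prints {'big': 2, 'data': 2, 'is': 1, 'analytics': 1}
```

In PySpark the same idea is usually written as a chain of RDD transformations, with the shuffle handled by the framework.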
MODULE 2: LARGE-SCALE DATA PROCESSING WITH PYSPARK
· Spark: RDDs, DataFrames, Spark SQL
· PySpark + NumPy + SciPy, Code Optimization, Cluster Configurations
· Large-Scale Linear Algebra Computation
· Distributed File Storage Systems
MODULE 3: DATA MODELING AND OPTIMIZATION PROBLEMS
· Introduction to modeling: numerical vs. probabilistic vs. Bayesian
· Introduction to Optimization Problems
· Batch and Stochastic Gradient Descent
· Newton’s Method
· Expectation-Maximization
· Markov Chain Monte Carlo (MCMC)
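The difference between batch and stochastic gradient descent can be sketched on a one-dimensional least-squares fit. The code below is an illustrative example under stated assumptions, not course material: the function names, learning rates, epoch counts, and toy data are all hypothetical. Batch gradient descent averages the gradient over the full data set before each update, while the stochastic variant updates after every single example.

```python
import random


def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b, using the whole data set for each update."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error, averaged over all points.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def stochastic_gradient_descent(xs, ys, lr=0.01, epochs=2000, seed=0):
    """Fit y ~ w*x + b, updating after every individual example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order
        for x, y in data:
            error = w * x + b - y
            # Gradient of the squared error on this single example.
            w -= lr * 2 * error * x
            b -= lr * 2 * error
    return w, b


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2 * x + 1 for x in xs]  # noise-free line, assumed for illustration
    print(batch_gradient_descent(xs, ys))
    print(stochastic_gradient_descent(xs, ys))
```

On large data sets the stochastic form is often preferred because each update touches only one example (or a small mini-batch) instead of the full data set.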
MODULE 4: LARGE-SCALE SUPERVISED LEARNING
· Introduction to Supervised Learning
· Generalized Linear Models and Logistic Regression
· Regularization
· Support Vector Machine (SVM) and the kernel trick
· Outlier Detection
· Spark ML library
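As an illustration of how logistic regression and regularization fit together, the following is a minimal sketch of L2-regularized logistic regression trained by batch gradient descent in plain Python. It does not use the Spark ML API; the function names, the hyperparameters (lr, epochs, l2), and the toy data are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(features, labels, lr=0.1, epochs=1000, l2=0.01):
    """Minimize the mean log-loss plus an L2 penalty on the weights."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            for j in range(d):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        # The l2 * w term shrinks the weights; the bias is left unpenalized.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


if __name__ == "__main__":
    features = [[-2.0], [-1.0], [1.0], [2.0]]  # toy data, assumed
    labels = [0, 0, 1, 1]
    weights, bias = fit_logistic_regression(features, labels)
    print([predict(weights, bias, x) for x in features])
```

Raising l2 pulls the weights toward zero, which trades a looser fit on the training data for a smoother decision function.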
MODULE 5: LARGE-SCALE UNSUPERVISED LEARNING
· Introduction to Unsupervised Learning
· K-means / K-medoids
· Gaussian Mixture Models
· Dimensionality Reduction
· Spark MLlib for Unsupervised Learning
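The K-means topic in this module can be illustrated with a short sketch of Lloyd's algorithm in plain Python: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. This is a single-machine illustration, not the Spark MLlib implementation; the function names, the random initialization from the data points, and the toy data are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=20, seed=0):
    """Cluster points (a list of tuples) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                mean = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, clusters = k_means(points, k=2)
    print(sorted(centroids))
```

K-means only finds a local optimum, so in practice the result can depend on the initial centroids; the well-separated toy data above avoids that issue.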
MODULE 6: LARGE-SCALE TEXT MINING
· Latent Semantic Indexing
· Topic models
· Latent Dirichlet Allocation
· Spark ML library for NLP