BIG DATA ANALYTICS

COURSE DESCRIPTION

This course is an introduction to large-scale data analytics. Big Data analytics is the study of how to extract actionable, non-trivial knowledge from massive data sets. The class focuses both on the cluster computing software tools and programming techniques used by data scientists and on the mathematical and statistical models used to learn from large-scale data. On the tools side, we will cover the basic systems and techniques for storing large volumes of data, as well as modern cluster computing systems based on the MapReduce pattern, such as Hadoop MapReduce, Apache Spark, and Apache Flink. Students will implement data mining algorithms and run them on real cloud platforms such as Amazon Web Services (AWS), Google Cloud, or Microsoft Azure using educational accounts. On the modeling side, the course will cover the standard supervised and unsupervised models and introduce techniques for improving them.

MODULE 1: INTRODUCTION TO BIG DATA PROCESSING

· Introduction to Big Data Analytics. What is Big Data? What are the challenges?

· Introduction to Apache Hadoop and MapReduce. Apache Spark.

· Spark programming (Python and PySpark).

· Spark - Resilient Distributed Datasets (RDDs).
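
As a rough illustration of the MapReduce pattern introduced in this module, the sketch below counts words in plain Python, with no cluster involved. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example and are not taken from the course material.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data analytics", "big data processing with Spark"]
    print(word_count(sample))
```

In PySpark, the same computation is usually written as a chain of `flatMap`, `map`, and `reduceByKey` calls on an RDD, with the shuffle handled by the framework.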

MODULE 2: LARGE-SCALE DATA PROCESSING WITH PYSPARK

· Spark - RDDs, DataFrames, Spark SQL

· PySpark + NumPy + SciPy, Code Optimization, Cluster Configurations

· Linear Algebra Computation at Large Scale

· Distributed File Storage Systems

MODULE 3: DATA MODELING AND OPTIMIZATION PROBLEMS

· Introduction to modeling: numerical vs. probabilistic vs. Bayesian

· Introduction to Optimization Problems

· Batch and Stochastic Gradient Descent

· Newton’s Method

· Expectation-Maximization

· Markov Chain Monte Carlo (MCMC)
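
To make the difference between batch and stochastic gradient descent from this module concrete, here is a minimal sketch that fits a one-variable linear model with both methods in plain Python. The synthetic, noiseless data (`y = 3x + 2`), the learning rates, and all function names are assumptions chosen for illustration and are not part of the course material.

```python
import random


def make_data(n=200, seed=0):
    """Generate noiseless points on the line y = 3x + 2 (illustrative data)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 2.0 for x in xs]
    return xs, ys


def batch_gd(xs, ys, lr=0.1, steps=500):
    """Batch gradient descent: each update uses the gradient of the
    mean squared error over the full data set."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def sgd(xs, ys, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: each update uses the gradient of the
    squared error on a single example, visited in shuffled order."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = w * xs[i] + b - ys[i]
            w -= lr * 2.0 * error * xs[i]
            b -= lr * 2.0 * error
    return w, b


if __name__ == "__main__":
    xs, ys = make_data()
    print("batch GD:", batch_gd(xs, ys))
    print("SGD:     ", sgd(xs, ys))
```

Both variants should recover a slope near 3 and an intercept near 2 on this data. The batch version touches every example for each update, while the stochastic version makes one cheap update per example, which is the property that makes it attractive at large scale.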

MODULE 4: LARGE-SCALE SUPERVISED LEARNING

· Introduction to Supervised Learning

· Generalized Linear Models and Logistic Regression

· Regularization

· Support Vector Machine (SVM) and the kernel trick

· Outlier Detection

· Spark ML library

MODULE 5: LARGE-SCALE UNSUPERVISED LEARNING

· Introduction to Unsupervised Learning

· K-means / K-medoids

· Gaussian Mixture Models

· Dimensionality Reduction

· Spark MLlib for Unsupervised Learning

MODULE 6: LARGE-SCALE TEXT MINING

· Latent Semantic Indexing

· Topic models

· Latent Dirichlet Allocation

· Spark ML library for NLP
