### Data science with R programming

1. History and Overview of R

• What is R?
• What is S?
• The S Philosophy
• Back to R
• Basic Features of R
• Free Software
• Design of the R System
• Limitations of R
• R Resources

2. Getting Started with R

• Installation
• Getting started with the R interface

3. R Nuts and Bolts

• Entering Input
• Evaluation
• R Objects
• Numbers
• Attributes
• Creating Vectors
• Mixing Objects
• Explicit Coercion
• Matrices
• Lists
• Factors
• Missing Values
• Data Frames
• Names
• Summary

4. Data Input and Output, Subsetting, Vectorized Operations, and Dates and Times

• Getting Data In and Out of R
• Calculating Memory Requirements for R Objects
• Using Textual and Binary Formats for Storing Data
• Using dput() and dump()
• Binary Formats
• Interfaces to the Outside World
• File Connections
• Reading Lines of a Text File
• Reading From a URL Connection
• Subsetting R Objects
• Subsetting a Vector
• Subsetting a Matrix
• Subsetting Lists
• Subsetting Nested Elements of a List
• Extracting Multiple Elements of a List
• Partial Matching
• Removing NA Values
• Vectorized Operations
• Vectorized Matrix Operations
• Dates and Times
• Dates in R
• Times in R
• Operations on Dates and Times
• Summary

5. Managing Data Frames with the dplyr package

• Data Frames
• The dplyr Package
• dplyr Grammar
• Installing the dplyr package
• select()
• filter()
• arrange()
• rename()
• mutate()
• group_by()
• %>%
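
As an illustration of the dplyr grammar listed above, the following is a minimal sketch that chains the verbs with `%>%`. It assumes the dplyr package is installed and uses R's built-in `mtcars` dataset; the column names `weight` and `weight_kg` and the use of `summarise()` are choices made for this example, not part of the syllabus.

```r
library(dplyr)

# mtcars ships with R; wt is the car weight in units of 1000 lbs.
result <- mtcars %>%
  select(mpg, cyl, wt) %>%                        # keep three columns
  filter(wt < 4) %>%                              # keep cars lighter than 4000 lbs
  rename(weight = wt) %>%                         # give the column a clearer name
  mutate(weight_kg = weight * 1000 * 0.4536) %>%  # derive a new column (lbs to kg)
  group_by(cyl) %>%                               # one group per cylinder count
  summarise(mean_mpg = mean(mpg),                 # collapse each group to one row
            mean_weight_kg = mean(weight_kg)) %>%
  arrange(desc(mean_mpg))                         # most fuel-efficient group first

print(result)
```

Each verb takes a data frame and returns a data frame, which is what allows the steps to be read top to bottom as a pipeline.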

1. Probability and Statistical Methods:

Introduction to random variables, probability theory, conditional probability, Bayes' theorem.

• Central tendencies (Mean, Median, Mode); Measures of spread (Range, Variance, Standard Deviation); Basics of Probability Distributions; Expectation and Variance of a variable.
• Discrete probability distributions: Geometric, Poisson.
• Continuous probability distributions: Exponential, Normal distribution; t-distribution
• Central Limit Theorem; Sampling distributions; Confidence Intervals, Hypothesis Testing.
• Statistical hypothesis testing methods: chi-square test, t-test, z-test, F-test and ANOVA.
• Covariance and Correlation.
• Hands-on implementation of each of these methods will be conducted in R.

2. Statistics and Probability in Decision Modelling:

• Linear Regression and Logistic Regression, two techniques used to solve prediction and classification problems.
• A brief math refresher on calculus and gradient descent, and on arriving at an optimal or suboptimal solution.
• Relationship between multiple variables: Regression (Linear, Multivariate Linear Regression) in prediction.
• Least squares method.
• Identifying significant features, feature reduction using AIC, multi-collinearity check, observing influential points, etc.
• Checking and validating linear fit, model assumptions and taking actions.
• Hands-on R session on linear and logistic regression.

3. Algorithms in Machine Learning:

Unsupervised learning:

• Clustering (K-Means): discovering the inherent groupings in the data, such as grouping customers by purchasing behaviour.
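
A minimal sketch of the customer-grouping idea above, using `kmeans()` from base R. The `customers` data frame, its column names and all of its values are simulated for illustration only; a real session would load actual purchasing data.

```r
# Hypothetical customer data: two simulated purchasing patterns (values are made up).
set.seed(42)
customers <- data.frame(
  annual_spend     = c(rnorm(50, mean = 200, sd = 30),
                       rnorm(50, mean = 1000, sd = 30)),
  visits_per_month = c(rnorm(50, mean = 2, sd = 0.5),
                       rnorm(50, mean = 10, sd = 0.5))
)

# Standardise the columns so that neither feature dominates the distance measure.
features <- scale(customers)

# Fit K-Means with two clusters; nstart repeats the random initialisation
# several times and keeps the best solution.
fit <- kmeans(features, centers = 2, nstart = 25)

# Cluster sizes and the average profile of each discovered group.
print(fit$size)
print(aggregate(customers, by = list(cluster = fit$cluster), FUN = mean))
```

The number of clusters is an input to K-Means, so in practice it is chosen by inspecting the data or comparing fits for several values of `centers`.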

Supervised learning:

• Decision trees.
• Support vector machines
• Random Forest
• Ensemble modelling
• Bagging and boosting and their impact on bias and variance
• XGBoost

4. Text Mining and Natural Language Processing:

Introduction to the fundamentals of information retrieval; language modeling

• n-gram models of language
• Smoothing
• Probabilistic language models

Feature engineering:

• TF and IDF
• Bag-of-words (BoW) technique, word2vec.
• Thinking about the math behind text; Properties of words; Vector Space Model
• Evaluation Metrics for Ranking

Natural Language Processing

• Stemming, Phrase identification, word sense disambiguation
• POS tagging
• Parsing and semantic structures
• Coreference resolution

Topic Modelling using LDA

Course details:

• Course duration: 90 min/day
• No. of sessions: 45
• Weekend batch starting in the first week of August
• Course fee: Rs 10000/-