COURSE ID: D-EF36

AN INTRODUCTION TO MACHINE LEARNING USING STATA

Recent years have witnessed an unprecedented increase in the availability of information on social, economic and health-related phenomena. Researchers, professionals and policy makers therefore now have access to enormous databases (so-called Big Data) containing detailed information on individuals, companies and institutions, and on the use of mobile devices. Machine learning is a relatively new approach to data analytics, which lies at the intersection of statistics, computer science and artificial intelligence. Its primary objective is to turn information into knowledge and value by “letting the data speak”. In contrast to the more traditional approach to data analysis, which focuses on prior assumptions about data structure and the derivation of analytical solutions, Machine Learning techniques rely instead on a model-free philosophy: the development of algorithms, computational procedures, and graphical inspection of the data in order to predict outcomes more accurately. Computationally infeasible until very recently, Machine Learning is itself a product of the latest advancements in information technology: the computing power and learning capabilities of today’s computers, hardware development, and continuous software development.

This intensive introductory course therefore offers an introduction to the standard machine learning algorithms currently applied to social, economic and public health data. Using a series of both official and user-written Stata commands, it illustrates how Machine Learning techniques can be applied to search for patterns in large (often extremely “noisy”) databases, which can subsequently be used to make both decisions and predictions.

In common with TStat’s training philosophy, each individual session is composed of both a theoretical component (in which the techniques and underlying principles behind them are explained), and an extensive applied (hands-on) segment, during which participants have the opportunity to implement the techniques using real data under the watchful eye of the course tutor. Throughout the course, theoretical sessions are reinforced by case study examples, in which the course tutor discusses and highlights potential pitfalls and the advantages of individual techniques. The intuition behind the choice and implementation of a specific technique is of the utmost importance. In this manner, the course leader is able to bridge the “often difficult” gap between abstract theoretical methodologies, and the practical issues one encounters when dealing with real data.

At the end of the course, participants are expected to: i) be able to autonomously implement (with the help of the Stata routine templates developed for the course) the appropriate Machine Learning algorithms, given both the nature of their data and the analysis in hand, and ii) have mastered the concepts of factor-importance detection, signal-from-noise extraction, correct model specification and model-free classification, from both a data-mining and a causal perspective.

Researchers and professionals working in biostatistics, economics, epidemiology, social and political sciences and public health wishing to implement Machine Learning techniques in Stata.

Participants should be familiar with the statistical software Stata. An introductory knowledge of econometrics and/or statistics is also required.

DAY 1

SESSION I: THE BASICS OF MACHINE LEARNING

Machine Learning: definition, rationale, usefulness

Supervised vs. unsupervised learning
Regression vs. classification problems
Inference vs. prediction
Sampling vs. specification error

Coping with the fundamental non-identifiability of E(y|x)

Parametric vs. non-parametric models
The trade-off between prediction accuracy and model interpretability

Goodness-of-fit measures

Measuring the quality of fit: in-sample vs. out-of-sample prediction power
The bias-variance trade-off and the Mean Square Error (MSE) minimization
Training vs. test mean square error
The information criteria approach
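
As an illustration of the in-sample vs. out-of-sample distinction listed above, the following is a minimal Python sketch, not course material. It assumes simulated data (a sine curve plus Gaussian noise) and a hand-written nearest-neighbor regression; the function names `knn_predict`, `mse` and `simulate` are hypothetical. With a single neighbor the training error is zero by construction, while the error on fresh data is not, which is why training error alone is a poor guide to predictive power.

```python
import math
import random


def knn_predict(train_x, train_y, x0, k):
    """Predict y at x0 as the mean outcome of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, eval_x, eval_y, k):
    """Mean squared error of k-NN predictions on the evaluation sample."""
    errors = [
        (y - knn_predict(train_x, train_y, x, k)) ** 2
        for x, y in zip(eval_x, eval_y)
    ]
    return sum(errors) / len(errors)


def simulate(n, rng):
    """Simulated data (an assumption): y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(7)
train_x, train_y = simulate(200, rng)
test_x, test_y = simulate(200, rng)

results = {}
for k in (1, 10, 50):
    train_mse = mse(train_x, train_y, train_x, train_y, k)
    test_mse = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_mse, test_mse)
    print(f"k={k:2d}  training MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Small values of `k` give a flexible, high-variance fit and large values a smooth, higher-bias fit; comparing the two printed columns across `k` is one informal way to see the bias-variance trade-off.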

Machine Learning and Artificial Intelligence
The Stata/Python integration: an overview

 

SESSION II: RESAMPLING AND VALIDATION METHODS 

Estimating training and test error
Validation

The validation set approach
Training and test mean square error

Cross-Validation

K-fold cross-validation
Leave-one-out cross-validation
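
The two cross-validation schemes listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the course's Stata implementation: the data are simulated, the model is a one-regressor least-squares line, and the helper names `fit_line` and `kfold_mse` are hypothetical. Leave-one-out cross-validation is obtained as the special case in which the number of folds equals the sample size.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def kfold_mse(xs, ys, k, seed=0):
    """Average held-out mean squared error over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all points
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the k-1 training folds only, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k


# Simulated data (an assumption): y = 1 + 2x + standard normal noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

cv5 = kfold_mse(xs, ys, k=5)          # 5-fold cross-validation
loocv = kfold_mse(xs, ys, k=len(xs))  # leave-one-out: one observation per fold
print(f"5-fold CV MSE: {cv5:.3f}   LOOCV MSE: {loocv:.3f}")
```

Both numbers estimate the test error of the same model; they differ only in how the sample is split, with leave-one-out requiring as many model fits as there are observations.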

Bootstrap

The bootstrap algorithm
Bootstrap vs. cross-validation for validation purposes
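
The bootstrap algorithm named above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the course's own code: the sample is simulated, the statistic is the sample mean, and the function name `bootstrap_se` is hypothetical. The data are resampled with replacement many times, the statistic is recomputed on each resample, and the spread of those replicates serves as an estimate of the standard error.

```python
import random
import statistics


def bootstrap_se(data, stat, reps=1000, seed=0):
    """Bootstrap standard error of stat(data) from reps resamples."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(reps):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(stat(resample))
    return statistics.stdev(replicates)


# Simulated sample (an assumption): 100 draws from a normal distribution.
rng = random.Random(1)
sample = [rng.gauss(0, 2) for _ in range(100)]

se_boot = bootstrap_se(sample, statistics.mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE: {se_boot:.3f}   analytical SE: {se_formula:.3f}")
```

For the sample mean an analytical standard error is available, so the two figures can be compared directly; the appeal of the bootstrap is that the same loop works for statistics with no convenient formula.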

 

SESSION III: MODEL SELECTION AND REGULARIZATION

Model selection as a correct specification procedure
The information criteria approach
Subset Selection

Best subset selection
Backward stepwise selection
Forward stepwise selection

Shrinkage Methods

Lasso, Ridge, and Elastic Net regression
Adaptive Lasso
Information criteria and cross-validation for Lasso

Stata implementation

 

DAY 2

SESSION IV: DISCRIMINANT ANALYSIS AND NEAREST-NEIGHBOR CLASSIFICATION

The classification setting
Bayes optimal classifier and decision boundary
Misclassification error rate
Discriminant analysis

Linear and quadratic discriminant analysis
Naive Bayes classifier

The K-nearest neighbors classifier
Stata implementation

 

SESSION V: NONPARAMETRIC REGRESSION

Beyond parametric models: an overview
Local, semi-global, and global approaches
Local methods

Kernel-based regression
Nearest-neighbor regression

Semi-global methods

Constant step-function
Piecewise polynomials
Spline regression

Global methods

Polynomial and series estimators
Partially linear models
Generalized additive models

Stata implementation

 

DAY 3

SESSION VI: TREE-BASED REGRESSION

Regression and classification trees

Growing a tree via recursive binary splitting
Optimal tree pruning via cross-validation

Tree-based ensemble methods

Bagging, Random Forests, and Boosting

Stata implementation

 

SESSION VII: NEURAL NETWORKS

The neural network model

Neurons, hidden layers, and multi-outcomes

Training a neural network

Back-propagation via gradient descent
Fitting with high-dimensional data
Fitting remarks

Cross-validating neural network hyperparameters
Stata implementation

The training course will be held in Berlin from the 15th to the 17th June 2020.

Full-Time Student*: € 1155.00
Academic: € 1605.00
Commercial: € 2110.00

*To be eligible for student prices, participants must provide proof of their full-time student status for the current academic year.
Fees are subject to VAT (applied at the current Italian rate of 22%). Under current EU fiscal regulations, VAT will not, however, be applied to companies, institutions or universities providing a valid tax registration number.

Please note that a non-refundable deposit of €100.00 for students and €250.00 for Academic and Commercial participants is required to secure a place and is payable upon registration. The number of participants is limited to 10. Places will be allocated on a first-come, first-served basis.

Course fees cover: teaching materials (handouts, Stata do-files, program templates and datasets to use during the course), a temporary course licence of Stata valid for 30 days from the beginning of the course, light lunch and coffee breaks.

To maximize the usefulness of this course, we strongly recommend that participants bring their own laptops with them, to enable them to actively participate in the empirical sessions.

Individuals interested in attending this course must return their completed registration forms by email (training@tstat.eu) to TStat by the 26th May 2020.


