Télécom ParisTech

Data Science (SD)

The Data Science track covers all fields related to the exploitation, management, and analysis of large datasets, both structured and unstructured. Examples of jobs that this track naturally leads to are those of data scientists, engineering statisticians, database administrators, or research and R&D careers in machine learning, data management, data extraction, data mining and statistics.

From the second semester of the M1 year, the track is divided into two branches:

  • Machine Learning (at the intersection between computer science and mathematics)
  • Data Management (computer science)

In the second year, students follow six common course units, plus two elective units taken in Period 3 and Period 4 (the spring semester).

During the third year, they can choose between several programs in the two branches, either on the Paris-Saclay University campus or at Télécom ParisTech.

2nd year courses

SD 2nd year program (192 hours)

Fall semester, Period 1
  • Time slot C1: SD201 Mining of Large Datasets
  • Time slot C2: SD204 Statistics: linear models

Fall semester, Period 2
  • Time slot C1: SD203 Web Development
  • Time slot C2: SD202 Databases

Spring semester, Period 3
  • Time slot C1: SD210 Basics of Statistical Machine Learning
  • Time slot C2: SD205 Advanced Statistics (Option: Machine Learning) or SD206 Logic and Knowledge Representation (Option: Data Management)

Spring semester, Period 4
  • Time slot C1: SD211 Optimization for Machine Learning
  • Time slot C2: SD207 Statistical Machine Learning in practice (Option: Machine Learning) or SD208 Advanced Databases (Option: Data Management)

Details:

Fall semester, period 1

  • SD 201 Mining of Large Datasets (24 hrs)
    The course presents algorithms for data analysis and mining, with a focus on massive datasets such as large-scale network data. It covers both practical and theoretical aspects of data mining. During the course, students will become familiar with the most successful algorithms for clustering, ranking, frequent itemset mining, and recommendation, as well as community and event detection. Students will work on a small project in which they implement some of these algorithms in Hadoop (one of the most successful systems for processing massive amounts of data) and analyze real-world data.
  • SD 204 Statistics: linear models (24 hrs)
    In this course, we will first introduce the linear model for least squares regression and then present a more general framework that also includes logistic regression. We will consider estimation and hypothesis testing in these models. The last part of the course will be devoted to model selection, based on L1-norm regularization and on greedy selection approaches.

Fall semester, period 2

  • SD 202 Databases (24 hrs)
    Nowadays, databases are at the core of any information system. Appearing in the 1980s, relational database systems have evolved continuously and integrated more than 30 years of research and technology. The objective of this course is to understand the foundations of databases, their design, and their management. It focuses on relational systems, which represent the most sophisticated and successful technology in this field. The techniques presented highlight important concepts such as the relational model, relational algebra and SQL, database design, and consistency and integrity.
  • SD 203 Web Development (24 hrs)
    The objective of this module is to enable students to develop dynamic, modern, robust, and secure websites. Topics addressed are the Internet and the Web, basic Web languages (HTML, CSS, JavaScript), rich dynamic content, server-side programming and frameworks, client-side frameworks and AJAX, connecting to a database management system (MySQL), Web security, and Web ergonomics. The module is evaluated on practical assignments completed during lab sessions.

Spring semester, period 3

  • SD 210 Basics of Statistical Machine Learning (24 hrs)
    Statistical learning is concerned with the inference of models for pattern recognition, prediction, and diagnosis in a probabilistic and statistical setting. The course first presents how to set up a supervised learning problem in both classification and regression, how to solve the underlying optimization task, and how to evaluate the resulting estimates. In supervised classification, both discriminative and generative approaches will be presented. A short introduction to unsupervised learning will conclude the course.

+ one elective to choose from

  • SD 205 Advanced Statistics (24 hrs)
    In many situations, the data available for statistical tasks (estimation, prediction) are so complex that parametric modeling is futile, at least upon initial analysis. The purpose of this course is to present less rigid statistical techniques as well as the theoretical issues related to their practical implementation, the price of increased flexibility being the risk that the selected model overfits the data. Through examples, the course presents the “minimax” viewpoint for nonparametric estimation and the “bias/variance” trade-off, which depends on how the model complexity is tuned. It also introduces “Empirical Risk Minimization”, the main paradigm of statistical learning theory.
  • SD 206 Logic and Knowledge Representation (24 hrs)
    This module presents fundamental concepts and techniques which underlie the development of intelligent systems and knowledge representation: the Prolog programming language, formal logic, complexity, symbolic machine learning, natural language processing, and knowledge representation formalisms.

Spring semester, period 4

  • SD 211 Optimization for Machine Learning (24 hrs)
    A broad class of machine learning problems boils down to the minimization of a functional (typically an empirical risk). Optimization methods are therefore at the heart of practical machine learning. In this module, students will learn a set of theoretical foundations that build on the previous optimization course and will discover several techniques specifically suited to massive datasets.

+ one elective to choose from

  • SD 207 Statistical Machine Learning in practice (24 hrs)
    In this course, several advanced machine learning tasks will be introduced and addressed from a practical angle. Targeted problems such as multiclass classification, imbalanced classes, anomaly detection, sequence modeling, and source separation will be considered, and appropriate tools will be developed to solve them. Students will learn to tackle a realistic, advanced learning task, from data processing through to the evaluation of the results.
  • SD 208 Advanced Databases (24 hrs)
    This course provides an in-depth study of Database Management Systems, their architecture, and their evolution. It presents the essential components of such systems (storage, indexing, transactions, query evaluation, optimization, distribution, etc.), focusing particularly on relational database technology. It also presents some post-relational database systems designed for the management of heterogeneous, complex, or semi-structured data (XML DBMS, XPath, XQuery).



3rd year options

The SD track offers students a wide choice of third-year programs.

  • On the Télécom ParisTech campus
    For all students: specializations at Télécom ParisTech
    - courses from the M2 curriculum (120 hours)
    - a Research and Innovation Project (120 hours)
  • M2 on the Paris-Saclay campus
    Students have to apply to Paris-Saclay University for a double degree.

Machine Learning
Paris-Saclay University:
- M2 AIC: Machine Learning, Information and Content (Computer Science school)
- M2 Data Sciences (Mathematics and Applications school)

Data Management
Paris-Saclay University:
- M2 DataScale: Data Management in a Digital World (Computer Science school)
- M2 D&K: Data & Knowledge (Computer Science school)