Rada Menczel

The Secret Sauce of Data Preprocessing in Machine Learning

Il Makiage

Bio

Rada Menczel is Data Science Director at Il Makiage. She has extensive experience managing data scientists, leading data science projects, and researching algorithms and models in many domains, especially cyber and fintech. Rada holds an MSc in Information Systems Engineering from Ben-Gurion University, specializing in machine learning and recommender systems. She is enthusiastic about data science, machine learning, deep learning, and anything at all related to learning.

Abstract

When data scientists set out to train new models, they have a general idea of what their flow will look like. Assuming that the problem they need to solve is well defined, they need to explore the data, define labels, visualize, train, evaluate, tune, and test. The most time-consuming and often tedious part is data preprocessing and preparation. If you do not fully invest in this stage, you may still get a decent model – but is that good enough? What if I told you that by adding a small step, you can improve your model results and achieve greatness?

In this talk, I will present a problem that is often ignored – identical feature vectors with different labels. We will discuss why this happens and how you can solve it in different ways, whatever your domain. By the end of this discussion, you will wonder how you ever preprocessed without this phase.
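
To make the problem concrete, here is a minimal sketch of how one might detect identical feature vectors that carry different labels. It is an illustration only, not the method presented in the talk: the helper name `find_conflicting_rows` and the toy data are hypothetical, and the sketch only finds the conflicts; it does not choose how to resolve them.

```python
from collections import defaultdict


def find_conflicting_rows(features, labels):
    """Group rows by feature vector and return the groups whose labels disagree.

    Hypothetical helper for illustration. Returns a dict that maps each
    conflicting feature vector (as a tuple) to the sorted list of distinct
    labels observed for it.
    """
    labels_by_vector = defaultdict(set)
    for row, label in zip(features, labels):
        # Tuples are hashable, so identical vectors land in the same bucket.
        labels_by_vector[tuple(row)].add(label)
    return {
        vector: sorted(seen)
        for vector, seen in labels_by_vector.items()
        if len(seen) > 1  # more than one distinct label means a conflict
    }


# Toy data (assumed for this example): the vector (1, 0) appears twice,
# once with label 1 and once with label 0.
X = [(1, 0), (0, 1), (1, 0), (1, 1)]
y = [1, 0, 0, 1]

conflicts = find_conflicting_rows(X, y)
print(conflicts)  # {(1, 0): [0, 1]}
```

What to do with the conflicting groups (for example drop them, keep the majority label, or add features that separate them) depends on the data and the domain, so the sketch leaves that decision open.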
