Learning on messy tabular data
(July 3, 11:00 AM - 3:00 PM)
Many, if not most, data science projects are run on tabular data: data from one or several tables with columns of diverse nature. Tabular data comes with its own challenges: many entries are discrete (categories or entities), entries may be missing, and the data may need to be enriched by joining multiple tables. Additional data-integration challenges arise when the tables are assembled from different sources that follow different conventions. In this lecture I will present various machine-learning methods dedicated to such data. I will illustrate these methods with examples using the dirty-cat and scikit-learn Python packages.
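
To give a flavor of the discrete-entry and missing-value challenges, here is a minimal sketch in plain Python of one possible approach: encoding a "dirty" categorical column by character n-gram similarity to a few reference categories, with an extra indicator for missing entries. This is an illustration under stated assumptions, not the dirty-cat or scikit-learn API; the function names, the example job titles, and the choice of 3-grams with Jaccard similarity are all hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string (lowercased, space-padded)."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def similarity_encode(values, prototypes, n=3):
    """Encode each entry as its similarity to every prototype category.

    The last feature is a missing-value indicator: a missing entry (None)
    gets zero similarity to all prototypes and 1.0 in that last position.
    """
    rows = []
    for value in values:
        if value is None:
            rows.append([0.0] * len(prototypes) + [1.0])
        else:
            rows.append([ngram_similarity(value, p, n) for p in prototypes] + [0.0])
    return rows


# Hypothetical data: typos and variants of two reference categories, plus a gap.
prototypes = ["police officer", "fire fighter"]
dirty_column = ["Police Officer", "police oficer", "Fire fighter II", None]
encoded = similarity_encode(dirty_column, prototypes)

for raw, features in zip(dirty_column, encoded):
    print(raw, [round(x, 2) for x in features])
```

Unlike one-hot encoding, which would treat "police oficer" as an unrelated new category, this kind of encoding keeps misspelled or variant entries close to the category they resemble, and the resulting numeric features can be passed to any standard learner.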