[PAST EVENT] Data Enrichment for Data Science

February 18, 2019
8am - 9am
Location
McGlothlin-Street Hall, Room 020
251 Jamestown Rd
Williamsburg, VA 23185

Speaker: Fatemeh Nargesian, University of Toronto


Title: Data Enrichment for Data Science


Abstract:

Preparing data for advanced analytics is prohibitively time-consuming and computationally expensive. In this talk, I will discuss my research on the challenges of data preparation for data science. In particular, I will talk about the data discovery problem. In data science, it is increasingly the case that the main challenge is not integrating known data but discovering the right data to solve a given data science problem. I will discuss two paradigms of data discovery.

In the first paradigm, the query is a dataset, and the data scientist is interested in interactively finding datasets that can be integrated (e.g., unioned) with the query. I will introduce a probabilistic framework for searching for the top-k unionable tables and aligning them with a query table, and discuss the need for distribution-aware techniques for data discovery.

In the second paradigm, search does not start with a query; instead, it is data-driven. I will talk about the data lake organization problem, where the goal is to find a directory structure (a data lake organization) that allows a user to navigate data lakes most efficiently. I will present a probabilistic navigation model of how users interact with a directory structure and introduce a scalable local search algorithm for optimizing data lake organizations.


Bio:

Fatemeh Nargesian is a PhD candidate in the Data Curation Group of the Department of Computer Science at the University of Toronto. Her primary research interests are in the data management challenges of end-to-end data science. A paper she co-authored on data discovery received the Best Demonstration Award at VLDB 2017. While at the University of Toronto, Fatemeh was a joint research intern at IBM Research-NY. Prior to the University of Toronto, she worked on clinical data management at the Clinical Informatics Research Group at McGill University, and received M.Sc. degrees in Computer Science from the University of Ottawa and in Artificial Intelligence from Sharif University of Technology.

Contact

Pieter Peers