A&S Graduate Studies
[PAST EVENT] Jianing Zhao, Computer Science - Final defense
Abstract:
In recent years, machine learning methods such as deep learning have made it possible to predict with good precision when large training data sets are available. However, for many problems, we care more about causality than prediction. For example, instead of knowing that smoking is statistically associated with lung cancer, we are more interested in knowing whether smoking causes lung cancer. With causality, we can understand how the world progresses and affect an outcome by influencing its cause. This thesis explores how to quantify the causal effects of a treatment on an observable outcome in the presence of heterogeneity. We focus on investigating the causal impacts of World Bank projects on the change of environment. The high-dimensional World Bank data set includes covariates from various sources and of different types: time series data such as Normalized Difference Vegetation Index (NDVI) values, temperature and precipitation; spatial data such as longitude and latitude; and many other features such as distance to roads and distance to rivers.
We estimate the heterogeneous causal effect of World Bank projects on the change of NDVI values. Building on the causal tree and causal forest methods proposed by Athey, we describe the challenges we met and the lessons we learned when applying these two methods to a real World Bank data set. We present our observations of the heterogeneous causal effect of the World Bank projects on the change of environment. As we do not have ground truth for the World Bank data set, we validate the results on synthetic data through simulation studies.
The synthetic data is sampled from distributions fitted to the World Bank data set. We compare the results of various causal inference methods. We also observe that feature scaling is very important for generating meaningful data and results. In addition, we investigate the performance of causal forest under various parameters such as leaf size, number of confounders and data size.
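The idea of validating against synthetic data can be illustrated with a minimal sketch. It is not the thesis's simulation: the covariate values, the normal fit, the effect function `1.0 + 0.5 * x`, the randomized treatment assignment and the difference-in-means estimator (used here in place of causal forest) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one observed covariate (NOT World Bank data).
observed = rng.normal(loc=30.0, scale=8.0, size=500)

# Fit a normal distribution to the observed values, then sample from the fit.
mu, sigma = observed.mean(), observed.std(ddof=1)
n = 20_000
x_raw = rng.normal(mu, sigma, size=n)

# Feature scaling: standardize the covariate to zero mean and unit variance.
x = (x_raw - x_raw.mean()) / x_raw.std()

# Assumed setup: randomized treatment and a known heterogeneous effect.
treated = rng.integers(0, 2, size=n).astype(bool)
true_effect = 1.0 + 0.5 * x          # ground truth, known only in simulation
outcome = x + true_effect * treated + rng.normal(size=n)


def diff_in_means(y, t):
    """Mean outcome of treated units minus mean outcome of control units."""
    return y[t].mean() - y[~t].mean()


# Average effect, then effects in two subgroups split on the covariate.
ate_hat = diff_in_means(outcome, treated)
low = x < 0
cate_low_hat = diff_in_means(outcome[low], treated[low])
cate_high_hat = diff_in_means(outcome[~low], treated[~low])

print("estimated average effect:", ate_hat)
print("true average effect:     ", true_effect.mean())
print("estimated effect, x < 0: ", cate_low_hat)
print("estimated effect, x >= 0:", cate_high_hat)
```

Because the true effect is known by construction, any estimator can be scored against it, which is not possible on the real data set.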
Causal forest is a black-box model, and its results are hard for a human to interpret. To explain the result for an individual project, we take advantage of the tree structure to select the neighbors of that project. Weights are assigned to the neighbors according to dynamic distance metrics. We then learn a linear regression model on the weighted neighbors and use it to interpret the result.
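A minimal sketch of the weighted local regression step follows. It assumes the neighbors have already been selected and substitutes a Gaussian kernel on Euclidean distance for the dynamic distance metrics, which the abstract does not specify. The function name `local_linear_explanation` and the `bandwidth` parameter are hypothetical.

```python
import numpy as np


def local_linear_explanation(X_neighbors, effects, x_target, bandwidth=1.0):
    """Fit a distance-weighted linear model around x_target.

    Returns (intercept, slopes). The Gaussian kernel is an assumption,
    not the weighting scheme used in the thesis.
    """
    X_neighbors = np.asarray(X_neighbors, dtype=float)
    effects = np.asarray(effects, dtype=float)
    x_target = np.asarray(x_target, dtype=float)

    # Closer neighbors receive larger weights.
    dists = np.linalg.norm(X_neighbors - x_target, axis=1)
    weights = np.exp(-((dists / bandwidth) ** 2))

    # Weighted least squares: scale each row by the square root of its weight.
    design = np.column_stack([np.ones(len(X_neighbors)), X_neighbors])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], effects * sw, rcond=None)
    return coef[0], coef[1:]


# Toy neighbors whose estimated effects are exactly linear in two features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
est_effects = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

intercept, slopes = local_linear_explanation(X, est_effects, x_target=[0.0, 0.0])
print("intercept:", intercept)
print("slopes:   ", slopes)
```

The fitted slopes indicate how the estimated effect changes with each feature in the neighborhood of the project being explained.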
In summary, World Bank projects have small impacts on the change of environment, and the result for an individual project can be interpreted using a linear regression model learned from the projects close to it.
Bio:
Jianing Zhao is a Ph.D. candidate at William & Mary, working with Dr. Peter Kemper. His research interests are in causal inference, data mining and machine learning. He received his B.S. degree in EE and master's degree in Computer Engineering from China Agricultural University.