Title: Entity Resolution with Societal Impacts in Machine Learning
Very often information about social entities is scattered across multiple databases. Combining that information into one database can result in enormous benefits for analysis, resulting in richer and more reliable conclusions. Among the types of questions that have been, and can be, addressed by combining information include: How accurate are census enumerations for minority groups? How many of the elderly are at high risk for sepsis in different parts of the country? How many people were victims of war crimes in recent conflicts in El Salvador? In most practical applications, however, analysts cannot simply link records across databases based on unique identifiers, such as social security numbers, either because they are not a part of some databases or are not available due to privacy concerns. In such cases, analysts need to use methods from statistical and computational science known as entity resolution (record linkage or de-duplication) to proceed with analysis. Entity resolution is not only a crucial task for social science and industrial applications, but is a challenging statistical and computational problem itself. In this talk, we describe the past and present challenges with entity resolution, with an application to the El Salvadorian conflict. More specifically, I will discuss unsupervised Bayesian entity resolution models, which are able to identify duplicate records in the data, while quantifying uncertainty. I will highlight the importance of choosing flexible priors and in implementing scalable inference algorithms. I will present preliminary results from the El Salvadorian conflict.
Rebecca C. Steorts received her B.S. in Mathematics in 2005 from Davidson College, her MS in Mathematical Sciences in 2007 from Clemson University, and her PhD in 2012 from the Department of Statistics at the University of Florida under the supervision of Malay Ghosh, where she was a U.S. Census Dissertation Fellow and was a recepient for Honorable Mention (second place) for the 2012 Leonard J. Savage Thesis Award in Applied Methodology. Rebecca was a Visiting Assistant Professor in 2012--2015, where she worked closely with Stephen E. Fienberg.
Rebecca is currently an Assistant Professor in the Department of Statistical Science at Duke University. She is affliated faculty in the Departments of Computer Science and Biostatics and Bioinformatics, the information intiative at Duke (iiD), and the Social Science Research Institute.
Rebecca was named to MIT Technology Review's 35 Innovators Under 35 for 2015 as a humantarian in the field of software. Her work was profiled in the Septmember/October issue of MIT Technology Review and she was recognized with an invited talk at EmTech in November 2015. In addition, Rebecca is a recepient of a NSF CAREER award, a collaborative NSF award, a collaborative grant with the Laboratory of Analaytic (LAS) at NC State University, a Metaknowledge Network Templeton Foundation Grant, the University of Florida (UF) Graduate Alumni Fellowship Award, the U.S. Census Bureau Dissertation Fellowship Award, and the UF Innovation through Institutional Integration Program (I-Cubed) and NSF for development of an introductory Bayesian course for undergraduates. Her research interests are in large scale clustering, record linkage (entity resolution or de-duplication), privacy, network analysis, and machine learning for computational social science applications.