Meet the Team: CSDA data scientist Diana Benavides Prado

May 18, 2017

“The opportunity to apply my computer science background and my knowledge of machine learning methods to contribute to solving social challenges is really exciting.”

Data scientist Diana Benavides Prado joined CSDA last year. Her current focus is on improving the accuracy of the Allegheny Family Screening Tool, which has been implemented by Allegheny County’s Department of Human Services in the US.

Benavides Prado holds a Master of Computer Engineering and Computer Science and is completing her PhD in Computer Science at the University of Auckland.

She says the existing tool is already helping Allegheny’s child protection staff with screening and referral decisions by analysing risk factors and creating a ‘screening score’ which indicates the likelihood of maltreatment of the child in the future.

Staff use this information, in combination with their own experience and understanding of the context, to help determine whether to refer the case for further investigation and monitoring.

“What we are doing with predictive risk modelling in child welfare isn’t designed to replace human decisions, instead our model supports people to make better decisions by helping to analyse all the data that is available.”

The screening tool, developed by CSDA, is currently powered by a logistic regression model, which weighs the answers to a series of binary queries - such as ‘male or female’, ‘under 5 years or 5 years +’, ‘household of less than 6 people or 6 people +’ - and combines them to calculate the risk score.
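As a rough illustration of how a logistic regression model can turn yes/no answers like these into a score, here is a minimal Python sketch. The indicator names, weights and intercept are invented for the example and are not taken from the Allegheny tool, which is fitted to real data.

```python
import math

# Hypothetical weights, for illustration only; a real model learns
# these from historical data.
WEIGHTS = {
    "is_male": 0.1,
    "under_5_years": 0.8,
    "household_6_plus": 0.5,
}
INTERCEPT = -2.0


def risk_probability(indicators):
    """Combine 0/1 indicators into a probability using the logistic function."""
    # Weighted sum of the answers; a missing indicator counts as 0 ("no").
    z = INTERCEPT + sum(w * indicators.get(name, 0) for name, w in WEIGHTS.items())
    # The logistic function squeezes the sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


example = {"is_male": 1, "under_5_years": 1, "household_6_plus": 0}
print(round(risk_probability(example), 3))
```

Each answer shifts the score by a fixed amount, whatever the other answers are. That makes the model easy to read, but it cannot capture interactions between factors unless they are added by hand.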

Benavides Prado is currently investigating a more complex rule-based model - the random forest model - which combines the results of many ‘decision trees’, each built from a different sample of the data, and which could arrive at more accurate predictions about the likelihood of a child being at risk of abuse. Other rule-based models like XGBoost and statistical learning methods like support vector machines (SVM) are also being explored.
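The idea of pooling many decision trees can be sketched in a few lines of Python. The three trees below are written by hand purely for illustration, and `prior_referrals` is a hypothetical feature. A real random forest learns hundreds of trees automatically from random samples of the data.

```python
# Three toy "decision trees". Each one asks a few questions in sequence
# and returns 1 (flag) or 0 (do not flag). All rules are invented.
def tree_a(case):
    if case["under_5_years"]:
        return 1 if case["prior_referrals"] >= 2 else 0
    return 0


def tree_b(case):
    if case["household_6_plus"]:
        return 1 if case["under_5_years"] else 0
    return 1 if case["prior_referrals"] >= 3 else 0


def tree_c(case):
    return 1 if case["under_5_years"] and case["prior_referrals"] >= 1 else 0


FOREST = [tree_a, tree_b, tree_c]


def forest_score(case):
    """Return the share of trees that flag the case (0.0 to 1.0)."""
    votes = [tree(case) for tree in FOREST]
    return sum(votes) / len(votes)


case = {"under_5_years": True, "household_6_plus": False, "prior_referrals": 1}
print(forest_score(case))
```

Because each tree asks its questions in sequence, the answer to one question changes which question comes next. This is how such models pick up interactions between factors.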

“This sort of model can identify more complex patterns, more complex interactions in the data.”

Work to date has confirmed the accuracy of the more complex modelling tool using historical Allegheny County data, but the CSDA team wants to answer several questions before it can be implemented (alongside the earlier model).

“We have to identify where the models are in agreement and where they are in disagreement, and we have to answer why. It may be that one model is more accurate in most contexts, but in a few specific contexts the other model will give a more accurate prediction of the outcome.”
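One simple way to start on a comparison like this is to tally, for each context, how often two models disagree and how often each one matches the recorded outcome. The Python sketch below assumes each record is a tuple of (context, prediction from model A, prediction from model B, actual outcome). That layout and the function name are assumptions made for the example, not the team's actual method.

```python
from collections import defaultdict


def compare_by_context(records):
    """Tally disagreements and correct predictions for each context.

    records: iterable of (context, pred_a, pred_b, outcome) tuples,
    where predictions and outcomes are 0 or 1.
    """
    stats = defaultdict(lambda: {"n": 0, "disagree": 0, "a_correct": 0, "b_correct": 0})
    for context, pred_a, pred_b, outcome in records:
        s = stats[context]
        s["n"] += 1
        s["disagree"] += int(pred_a != pred_b)
        s["a_correct"] += int(pred_a == outcome)
        s["b_correct"] += int(pred_b == outcome)
    return dict(stats)


# Synthetic toy rows, for illustration only.
toy_records = [
    ("under_5", 1, 1, 1),
    ("under_5", 0, 1, 1),
    ("5_plus", 0, 0, 0),
    ("5_plus", 1, 0, 0),
]
for context, s in compare_by_context(toy_records).items():
    print(context, s)
```

A table like this shows where the models disagree and which one is right more often in each context. Working out why still requires looking at the individual cases.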

The more complex model is also harder to interpret: it is harder to see - and harder to explain - how it arrived at its conclusion.

“People using these models want to understand how the model has reached its answer and they are right to ask. In a situation where you are helping to determine decisions about a child’s welfare you want to be as clear as possible about your reasoning.”

Benavides Prado says that, in the field of computer science, a lot of time has been spent on making these advanced models very accurate, but much less time has been spent on explaining how they work.

She hopes to develop the model so it delivers a small summary explanation alongside its conclusion.

“Even with the challenges it is important that we work towards the most accurate models because, ultimately, improving the accuracy will mean better decisions are made about at-risk children.”