Data Science Ph.D. Day 2021


  • Workshop
  • 2020/2021

Monday, June 28, 2021 - 09:30 to 13:30



09:30 Bellomo Lorenzo
Title: Knowledge Extraction and Modelling via Knowledge Graphs with Applications to Biology and News Feed
Information often lies hidden in text. Knowledge extraction techniques have therefore been developed for many years to model the content of textual pieces. Initially, such technologies exploited only frequency-based and syntactic information about individual words, eventually advancing to orchestrate AI techniques able to derive more sophisticated relations among words and sentences. Unfortunately, these techniques still fail to capture some key information that is usually domain-specific or goes beyond the purely syntactic sphere, yet is mandatory to make modern AI systems more usable and effective.
For example, newspapers of opposing political leanings will document the same event in completely different lights. This means that topics alone are not enough to properly model information in news, and "bias" should also be taken into account to properly offer a pluralistic view in news access. Another example is the one emerging from the BioMed domain. Here, it is easy for machines to understand the bio-entities dealt with in a paper (i.e. genes, proteins, and drugs), but it is still hard for them to infer their biological relations and thus build networks that would make those tools effective for supporting scientists in drug discovery and analysis.
The goal of my PhD is to design novel algorithms and AI techniques able to extract, model and exploit the additional information present in unstructured texts (mainly from the BioMedical and News domains) in order to build rich information networks (a.k.a. knowledge graphs) that may be used to empower the clustering, recommendation and inference steps upon which several modern AI applications hinge.
In this first year of my PhD, I have achieved some preliminary results (appeared in the Proc. of COMPLEX NETWORKS 2020) showing that it is possible to automatically infer biological networks from PubMed publications by exploiting novel domain specific bio-entity annotators, NLP tools and word-entity embeddings.
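As a toy illustration of the network-construction step mentioned above, a weighted co-occurrence graph can be built from sentences whose bio-entities have already been annotated. The entity names and counts below are invented for the example; the actual pipeline relies on domain-specific bio-entity annotators and embeddings run over PubMed.

```python
from collections import Counter
from itertools import combinations

# Sentences whose bio-entities have already been annotated
# (hypothetical data standing in for annotator output).
annotated_sentences = [
    ["TP53", "MDM2"],
    ["TP53", "MDM2", "p21"],
    ["BRCA1", "TP53"],
]

# Weighted co-occurrence network: nodes are entities, and an edge's
# weight counts how many sentences mention both of its endpoints.
edges = Counter()
for entities in annotated_sentences:
    for a, b in combinations(sorted(set(entities)), 2):
        edges[(a, b)] += 1
```

Downstream steps (clustering, recommendation, inference) would then operate on this weighted graph rather than on the raw text.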

09:50 Pugnana Andrea
Title: Causality and fairness in AI-driven dynamical systems
The usage of Artificial Intelligence (AI) to support decision making has become pervasive in many contexts. While the benefits of AI models cannot be neglected, they may embed forms of bias that could harm individuals or social groups, especially in socially sensitive decision making. Fairness-aware AI approaches have been proposed in the last decade, tackling the problem from technical and multi-disciplinary perspectives. An open problem in this strand of research is how to tackle fairness in dynamical systems, particularly the study of the long-term impacts of AI models. The aim of this project is to address this issue from a causal reasoning perspective that would make it possible to simulate interventions and evaluate short- and long-term outcomes of (un)fair AI models.

10:10 Galli Filippo
Title: Justifications and Open Problems in Federated Learning and Differential Privacy
Modern data science and machine learning are data-hungry and pervasive technologies. Harvested information in the form of raw data or trained models can be problematic when it discloses more information than expected. In this context we are interested in decoupling two different kinds of information that are connected to the user’s data and the probabilistic models built from them: distributional and private information. Two lines of research adopting this principle are federated learning and differential privacy, which aim to provide privacy guarantees (i.e. filtering out private information) while maintaining utility (i.e. releasing distributional information).
Federated Learning is a paradigm for the distributed training of a global model with data localized at the edge, with a central node orchestrating the optimization without ever seeing the actual data. Data holders contribute by providing model updates directly, instead of giving their data away, thus obtaining some form of privacy guarantee. Training in this setting is hindered by the failure of the underlying assumption that data are i.i.d., so that convergence with Stochastic Gradient Descent is often problematic.
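The orchestration just described can be sketched with a minimal, hypothetical federated-averaging round (in the style of FedAvg) for a one-parameter linear model. The function names and toy client datasets are invented for illustration; real deployments average full model weight vectors.

```python
# One global parameter w for the 1-D linear model y ≈ w * x.

def local_step(w, data, lr=0.1, epochs=5):
    """Each client runs plain SGD on its own (x, y) pairs; raw data never leaves."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

def fedavg_round(w_global, client_datasets):
    """The server averages the returned client models, weighted by dataset size."""
    updates = [(local_step(w_global, data), len(data)) for data in client_datasets]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Toy non-identical client datasets, both roughly following y = 2x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.1)]]
w = 0.0
for _ in range(10):
    w = fedavg_round(w, clients)
```

After a few rounds w settles close to 2.0 even though no client shares its data: only model updates travel to the server, which is the privacy argument sketched in the abstract.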
Differential privacy is a mathematical framework for measuring, analysing and providing privacy guarantees to a user whose data is part of a statistical database. It provides plausible deniability of the values in the database by injecting carefully tuned noise into the query function. While there exist numerous instances of successful applications of differential privacy on machine learning tasks with non-relational data (such as images, videos, natural language), extending results to tasks on relational data, such as inference on graphs, is still an open problem.
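The noise-injection idea can be shown with the standard Laplace mechanism on a counting query; the dataset and query below are invented for illustration.

```python
import math
import random

def dp_count(records, predicate, epsilon, rng):
    """Differentially private count via the Laplace mechanism.
    A counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    # Inverse-CDF sampling of Laplace(0, 1/epsilon) from one uniform draw.
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38]
noisy = dp_count(ages, lambda a: a >= 30, epsilon=1.0, rng=rng)  # true answer: 4
```

A smaller epsilon injects more noise (stronger privacy, lower utility), which is precisely the privacy/utility trade-off mentioned above.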
10:30 PAUSA

10:50 Spinnato Francesco
Title: Explanation Methods for Sequential Data Models
In recent years, there has been an ever-growing interest in defining eXplainable AI (XAI) methods to describe the behavior of black-box models trained on sequential data. In fact, sequential data is used in numerous fields and is often relied on in crucial tasks and in high-stakes decision making. The main goal of my research is to study the interpretability problem for this kind of data, for the tasks of classification, regression and forecasting. Defining XAI solutions requires taking into account the various dimensions of the problem, such as the kind of sequential data, the type of prediction task and the explanation required, in order to build a trustworthy interaction between the human expert and the AI system.

11:10 Alvarez Jose Manuel
Title: Counterfactual reasoning for fair ranking algorithms in screening processes
Today, algorithms are increasingly being used in automated decision making (ADM) processes. Along with the gains from efficient, scalable, and consistent decision making, however, comes the risk of bias perpetuation, as these algorithms have repeatedly been shown not to be immune to the prejudices of the societies that train them. My PhD focuses on developing techniques for detecting bias in the data used for training ADMs, as well as in the data generated by ADMs, using causal discovery, inference, and reasoning. I am interested in screening processes where an algorithm helps a human decision maker (or augmented ADM) by providing a final ranking of candidates for her to choose from. Here, the human is faced with a problem of delegation: to what extent does she trust the algorithm’s ranking? In this context, I wish to explore this screening process and its components from the perspective of counterfactual fairness. Counterfactual reasoning comes down to answering the question “what if” for a given candidate, which in practice can be seen as a matching problem between similar candidates within the same list of applicants, or between similar hypothetical versions of the same candidate. Counterfactual fairness thus centers on checking whether similar candidates are treated the same by the algorithm. Here, however, the meaning of similarity is contentious. On the one hand, counterfactual reasoning has become a useful tool for examining fairness through its convenient ways of exploiting similarities across candidates; on the other, it has been heavily criticized for reducing complex questions such as gender or racial discrimination to mere “what if” statements that ignore the social meanings behind these labels. The goal of my PhD is to try to bridge these two camps by finding new methods for reweighting the value of the candidates’ signals, which are used for determining the ranking, based on some preconceived notion of intergenerational justice.
For this, I plan to explore notions of cumulative (un)fairness and welfare dynamics, signaling games from information economics, and EU discrimination law for algorithmic decision making. The end goal is to develop causal methods for detecting bias in data for and from ADMs that are both useful (as in deployable) and robust to dealing with the non-trivialities that come with working with protected attributes such as gender or race.

11:30 Andreani Mila
Title: Forecasting dynamic quantiles via Random Forests and mixed-frequency data
Quantile regression is a powerful technique that makes it possible to model the conditional distribution of the response variable instead of only its expected value, as in standard linear regression. In the machine learning field, Random Forests are among the most widely used algorithms for performing quantile regression. However, when data consist of dynamic observations, such as time series, this model cannot compute quantiles in a dynamic framework. Moreover, in many time series datasets, such as those in economics and finance, variables are collected at different frequencies and cannot be directly handled by the algorithm. The aim of the proposed methodology is therefore twofold. Firstly, the issue of temporal dynamics in time series is addressed by developing a new model, the Dynamic Quantile Regression Forest, which estimates the conditional quantile by considering the evolution of the quantile over time among the other covariates. Secondly, an innovative approach based on the Mixed Data Sampling model is proposed to include mixed-frequency data in the Random Forests algorithm, with the aim of forecasting conditional quantiles by exploiting the additional information coming from low-frequency variables.
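The underlying Quantile Regression Forest idea — estimating a conditional quantile from the empirical distribution of training targets that share a leaf with the query point — can be sketched with deliberately trivial one-split "trees". This toy is not the proposed Dynamic Quantile Regression Forest; all names and data are invented for illustration.

```python
import random

# Toy "quantile forest": each tree is a single random split on x; a query's
# conditional quantile is the empirical quantile of the training targets
# falling in the same leaf as the query, pooled across all trees.

def fit_forest(xs, n_trees, rng):
    lo, hi = min(xs), max(xs)
    return [rng.uniform(lo, hi) for _ in range(n_trees)]

def predict_quantile(forest, xs, ys, x_query, q):
    pooled = []
    for threshold in forest:
        side = x_query <= threshold          # which leaf the query falls in
        pooled.extend(y for x, y in zip(xs, ys) if (x <= threshold) == side)
    pooled.sort()
    return pooled[min(int(q * len(pooled)), len(pooled) - 1)]

rng = random.Random(42)
xs = [i / 100 for i in range(100)]
ys = [10 * x + rng.gauss(0, 1) for x in xs]  # y ≈ 10x plus noise

forest = fit_forest(xs, n_trees=50, rng=rng)
median_low = predict_quantile(forest, xs, ys, x_query=0.1, q=0.5)
median_high = predict_quantile(forest, xs, ys, x_query=0.9, q=0.5)
```

The point of keeping whole leaf distributions (rather than leaf means, as in standard regression forests) is exactly what makes any conditional quantile, not just the mean, recoverable.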


12:10 Nardi Mirko
Title: Decentralized and resource-efficient machine learning methods at the Edge of the Internet
Modern AI has become a breakthrough in a wider variety of fields than ever before, and machine learning makes up a huge part of this revolution. This is mainly due to the spread of connected devices and the resulting availability of huge amounts of data, which are the fuel of these modern techniques. At the same time, the data collection process on which centralized ML solutions are based is becoming more and more problematic. On the one hand, there are privacy issues with data generated at the edge of the Internet, i.e. users and Industry 4.0 factories might not be willing to share their data with third parties. On the other hand, there are technological challenges triggered by the exponential growth in the number of devices at the edge, which will further accelerate the rate at which data is generated. It is therefore of paramount importance to design new solutions for extracting knowledge from data within such complex and intrinsically distributed scenarios. Distributed/decentralized ML executed at the edge is one of the most promising approaches for addressing the issues that afflict centralized solutions. The aim of my research project is to tackle some of the open challenges at the intersection of machine learning, networking and distributed computing, an area that is very likely to become crucial in the current technological trend.

12:30 Tacchi Jack
Title: The sentiment of social ties in OSNs: The sentiment of ego network structures and its impact on patterns of information diffusion
The size of social groups that humans can successfully maintain has an upper limit determined by the size of our neocortex. Modern technologies, such as the internet, do not seem to have made it possible to increase the size of our social groups beyond these cognitively imposed limits, nor to have affected the ways in which we communicate. The ego network model has previously been applied to online contexts to model information diffusion in Online Social Networks, but in a limited way: the strength of ties is usually measured quantitatively, e.g. via the frequency or recency of interactions. Little effort has been devoted to characterising the interplay between such quantitative metrics and the "sentiment" of the social link. Specifically, the sign of the connections (i.e. whether they are positive or negative) has thus far rarely been considered, even though signed networks have been shown to provide additional information that can be incredibly useful for tasks such as community detection and the prediction of information diffusion. This thesis therefore aims to introduce natural language processing as a means of more accurately measuring the value, as well as the sign, of tie strength, to improve the accuracy of the ego network model, and to investigate how information diffusion patterns can be modelled more accurately by considering this richer source of information.
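The proposed combination of quantitative tie strength and sentiment could look like the following sketch. The function, data, and weighting scheme are hypothetical; sentiment scores are assumed to come from an NLP model and lie in [-1, 1].

```python
def signed_tie_strength(interactions):
    """interactions: list of (timestamp, sentiment) pairs with one alter;
    each sentiment is assumed to be an NLP-derived score in [-1, 1]."""
    if not interactions:
        return 0.0
    frequency = len(interactions)                       # quantitative component
    mean_sentiment = sum(s for _, s in interactions) / frequency
    return frequency * mean_sentiment                   # the sign carries valence

# Hypothetical ego: interactions with two alters.
ties = {
    "alice": [(1, 0.8), (2, 0.6), (3, 0.9)],  # frequent and positive
    "bob":   [(1, -0.7), (4, -0.4)],          # rarer and negative
}
strengths = {alter: signed_tie_strength(msgs) for alter, msgs in ties.items()}
```

A purely frequency-based measure would rank both alters as moderately strong ties; the signed version separates a supportive contact from an antagonistic one, which is the extra information signed networks provide.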

12:50 State Laura
Title: Logic-based Approaches to construct meaningful Explanations for Machine Learning Systems
Artificial Intelligence (AI) systems have a huge impact on our lives. As much as they positively shape the world, for example by supporting scientific discoveries, they bring responsibility, specifically when applied to social data. As shown in many application contexts (criminal justice, credit scoring, facial recognition, etc.), they are susceptible to social biases. Worse, they have the potential to increase and systematize the harm done to already marginalized societal groups.
Additionally, the most effective AI based systems are considered Black Boxes (BB), as their internal logic is not understandable to humans, and several approaches have been recently developed to explain their reasoning processes. These tools are also of great importance to better understand social biases in AI systems.
This PhD project investigates logic-based approaches to explain BB decision systems. Verifying (causal) reasoning chains in the explanation of a single data instance is one such example. In that case, the explanation provides a factual part (the actual outcome) and a counterfactual part (the opposite outcome, in the case of a binary decision problem). Counterfactuals, providing answers to “what-if questions”, play an important role in the explanation process, as confirmed by research in the social sciences. However, when generated, they run the danger of either changing features that are immutable or not actionable by the data subject, or invalidating causal relationships (e.g. lowering the age of a subject). By encoding these constraints into background knowledge, logic-based reasoning can circumvent this problem.
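The constraint-checking idea can be sketched as follows. Feature names and constraint sets are invented for the example, and a logic-based system would encode them as background knowledge rather than as Python conditionals.

```python
# Constraints over features: some are immutable, others may only move one way.
IMMUTABLE = {"birthplace"}
MONOTONE_UP = {"age"}  # age can only increase in a plausible counterfactual

def is_actionable(instance, counterfactual):
    """Reject candidate counterfactuals that violate the encoded constraints."""
    for feature, value in counterfactual.items():
        if feature in IMMUTABLE and value != instance[feature]:
            return False
        if feature in MONOTONE_UP and value < instance[feature]:
            return False
    return True

applicant = {"age": 40, "income": 30000, "birthplace": "Pisa"}
ok = is_actionable(applicant, {"age": 40, "income": 45000, "birthplace": "Pisa"})
bad = is_actionable(applicant, {"age": 30, "income": 45000, "birthplace": "Pisa"})
```

Here "raise your income" survives as an actionable counterfactual, while "be ten years younger" is filtered out, mirroring the age example in the text.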
Further, logic-based approaches can be used to compare explanations between single data instances, i.e. different data subjects concerned with the same decision pipeline, which is crucial when reasoning on individual fairness. Other main application areas concern reasoning over explanations and over time, making it possible to track changes of a BB classifier.
The PhD project will survey the legal background of explanation theories, relating to the “right of explanation” as stated in the GDPR, with the purpose of modeling them within the logic framework. Finally, it will evaluate the developed explanation system in an industrial context (insurance sector).

13:10 Pansanella Valentina
Title: Higher-order effects on opinion formation: networks, dynamics and biased agents
Understanding how opinions form and evolve is a complex yet crucial problem that our society needs to solve. Public and individual opinions are important, not only because they shape our culture, but also because they drive individual and – indirectly – collective actions, for example by influencing political decisions. The literature on opinion dynamics models is wide, going from models of binary opinions and pairwise interactions towards continuous opinions on time-evolving higher-order systems, trying to narrow the gap between the models and the real systems. Despite this rich set of mathematical studies, when it comes to the actual validation of such models on real data, there is a scarcity of works in the literature. Developing a unified framework to create a feedback loop between the creation of new opinion dynamics models and their validation on real data is an open and challenging task. The goal of this project is therefore to advance current opinion dynamics models by exploiting the latest developments in complex network theory, especially in the field of temporal networks and higher-order systems. Moreover, the project also aims to test the hypotheses of the main opinion dynamics models employed up until now – which draw from economics, sociology and psychology – on real data from OSNs, alongside fitting the proposed models to such data.
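A classical baseline in this literature, the Deffuant–Weisbuch bounded-confidence model with continuous opinions and pairwise interactions, can be sketched as follows (parameter values chosen purely for illustration):

```python
import random

def simulate(n_agents, steps, epsilon, mu, seed):
    """Bounded-confidence dynamics: two random agents interact only if their
    opinions differ by less than epsilon, then move toward each other by mu."""
    rng = random.Random(seed)
    opinions = [rng.random() for _ in range(n_agents)]
    for _ in range(steps):
        i, j = rng.sample(range(n_agents), 2)
        if abs(opinions[i] - opinions[j]) < epsilon:
            oi, oj = opinions[i], opinions[j]
            opinions[i] += mu * (oj - oi)
            opinions[j] += mu * (oi - oj)
    return opinions

# With epsilon = 1.0 every pair may interact and the population converges to
# consensus; smaller epsilon (e.g. 0.2) instead fragments opinions into clusters.
final = simulate(n_agents=50, steps=10000, epsilon=1.0, mu=0.5, seed=7)
spread = max(final) - min(final)
```

Extending such pairwise dynamics to temporal and higher-order (group) interactions, and fitting them to OSN data, is exactly the gap the project targets.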
