Data Science Ph.D. Day 2019

Type: 

  • Workshop

Series: 

  • 2018/2019

Date: 

Wednesday, June 12, 2019 - 15:00 to Friday, June 14, 2019 - 18:00

Location: 

Aula Tonelli, Scuola Normale Superiore, Palazzo della Carovana, piazza dei Cavalieri 7, Pisa

Day One - June 12th 2019

15:00 — Deep-learning based analyses of mammograms to improve the estimation of breast cancer risk
Francesca Lizzi, Data Science Ph.D. student

Abstract: Breast cancer is the most diagnosed cancer among women worldwide, and it is widely accepted that one woman in eight will develop breast cancer in her lifetime. Survival strongly depends on early diagnosis, and for this purpose mammographic screening is performed in developed countries. In my PhD project, we want to study and apply new Artificial Intelligence based techniques in order to include and quantify fibroglandular (or dense) parenchyma in breast cancer risk models. According to the latest American Cancer Statistics, breast cancer is the second leading cause of death among women. Despite the increase in incidence, breast cancer mortality is decreasing. This is mainly due to breast cancer screening programs, in which women aged 45-74 are invited to have a mammographic exam every two years. Even if mammography is still the most used screening method, it suffers from two inherent limitations: a low sensitivity (cancer detection rate) in women with dense breast parenchyma, and a low specificity causing unnecessary recalls. The low sensitivity in women with dense breasts is caused by the masking effect of overlying breast parenchyma. Furthermore, the summation of normal breast parenchyma on conventional mammography may occasionally simulate a cancer. In recent years, new imaging techniques have been developed: tomosynthesis, which can produce 3D and 2D synthetic images of the breast, new MRI techniques with contrast medium, and breast CT. However, thanks to screening programs, large numbers of mammographic images can be collected from hospitals to build datasets on which AI techniques can be explored. In the last few years, new methods for image analysis have been developed. In 2012, for the first time, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the most important image classification challenge worldwide, was won by a deep-learning based classifier named AlexNet. Starting from this result, the success of deep learning on visual perception problems is inspiring many scientific works, not only on natural images but on medical images too [2]. Deep learning based techniques have the advantage of very high accuracy and predictive power at the expense of interpretability. Furthermore, they usually need a huge amount of data and large computational power to be trained. At the Italian National Institute for Nuclear Physics (INFN), in the framework of the PhD in Data Science [L2] of Scuola Normale Superiore of Pisa, the University of Pisa and ISTI-CNR, we are working to apply deep learning models to find new image biomarkers, extracted from screening mammograms, able to support the early diagnosis of breast cancer. In a previous work [3], we trained and evaluated a breast parenchyma classifier in the BI-RADS standard, which consists of four qualitative density classes, using a deep convolutional neural network, and we obtained very good results compared to other works. My PhD project is a continuation of this research with a larger dataset and more ambitious objectives. In fact, we are collecting data from Tuscany screening programs. The dataset is growing and currently includes:

  • 2000 mammographic exams (8000 images, four per subject) of healthy women, labeled by the amount of fibroglandular tissue. These exams have been extracted from the Hospital of Pisa database.
  • 500 screen-detected cases of cancer, 90 interval cancer cases and 270 control exams, along with the histological reports and a questionnaire on the known breast cancer risk factors, such as parity, height, weight, family history and so on. For each woman, it is possible to access all the mammograms prior to diagnosis. These exams have been extracted from the North-West Tuscany screening database.

The goal of my PhD project is manifold and may be summarised as follows:

  • to explore the robustness of deep learning algorithms with respect to the use of different mammographic systems, which usually result in different imaging properties;
  • to define a deep learning model able to recognize the kind and nature of the malignant masses depicted in mammographic data, based on the related histologic reports;
  • to investigate the inclusion of the fibroglandular parenchyma in breast cancer risk models in order to increase the predictive power of current risk prediction models. In this respect, changes in dense parenchyma will be monitored over time through image registration techniques, to understand how their variation may influence cancer risk. Furthermore, we will investigate the role of dense tissue in the onset of interval cancers, and the correlation between both local and global fibroglandular tissue and other known risk factors, so as to quantify the risk of developing a breast cancer.
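
To make the deep-learning component concrete, below is a minimal sketch of a convolutional classifier for the four BI-RADS density classes. The architecture, input size and dummy forward pass are illustrative assumptions, not the network actually used in [3].

    # A small CNN for 4-class BI-RADS breast density classification (sketch).
    # Architecture and input size are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class DensityCNN(nn.Module):
        def __init__(self, n_classes=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(64, n_classes)

        def forward(self, x):  # x: (batch, 1, H, W) grayscale mammograms
            return self.classifier(self.features(x).flatten(1))

    # Dummy forward pass on a random "mammogram" batch.
    model = DensityCNN()
    logits = model(torch.randn(2, 1, 256, 256))
    print(logits.shape)  # torch.Size([2, 4]), one score per density class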

15:30 — Affective Computing by Multimodal Machine Learning
Benedetta Iavarone, Data Science Ph.D. student

Abstract: Affective computing is an emerging interdisciplinary field that has attracted increasing attention from various research areas, spanning from social science to cognitive science, from artificial intelligence to linguistics, from natural language processing to psychology. Through the union of all the different disciplines it encompasses, this field of research aims at enabling intelligent systems to recognize and interpret human emotions. Within the areas of sentiment analysis and emotion recognition, there already exist numerous attempts to give machines the capability to recognize, interpret and even express emotions or sentiments. However, the great majority of previous approaches involve the use of a single modality, and most of the multimodal approaches found in the literature almost always combine the same set of modalities (e.g. audiovisual systems or speech+text systems), without exploring different combinations that could possibly enhance the performance of systems that detect emotions and sentiments. Thus, among the challenges still open in the field of affective computing there is the need to explore more feature combinations and to find a univocal and effective method both for fusing the features and for using them to train and test affect detection systems. In recent years, it has been shown that deep machine learning models can enhance the performance of affect detectors, but there are still many challenges to be tackled. The aim of this work is to propose a machine learning system capable of detecting emotions in a real-world context, selecting modalities that have rarely, if ever, been fused together in this kind of study.
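
To make the fusion step concrete, here is a minimal sketch of early (feature-level) multimodal fusion, where per-sample feature vectors from two modalities are concatenated before training a single classifier. The modalities, feature dimensions and random placeholder data are illustrative assumptions, not the setup of this project.

    # Early (feature-level) fusion of two modalities, illustrated on random
    # placeholder features; real audio/text descriptors would replace them.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_samples = 200
    audio_features = rng.normal(size=(n_samples, 40))   # e.g. prosodic/spectral descriptors
    text_features = rng.normal(size=(n_samples, 300))   # e.g. sentence embeddings
    labels = rng.integers(0, 4, size=n_samples)         # four illustrative emotion classes

    # Concatenate the per-sample feature vectors of each modality.
    fused = np.concatenate([audio_features, text_features], axis=1)

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, fused, labels, cv=5).mean())

The usual alternative is late (decision-level) fusion, where a separate classifier is trained per modality and their outputs are combined afterwards.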

16:00 — Developing a multi-layer framework for social policy evaluation. An Italian case study: the Citizens’ Income
Giovanni Tonutti, Data Science Ph.D. student

Abstract: In April 2019, a new welfare and active labour market policy was introduced in Italy: the Reddito di Cittadinanza (RdC), or Citizens’ Income. This flagship measure is aimed at i) tackling poverty, ii) increasing participation in the labour market and iii) improving social inclusion. This study aims at developing a comprehensive framework for the monitoring and evaluation of this policy against its three main objectives, exploring the extent to which a mixed set of conventional and unconventional data sources can be employed for this type of assessment. The emphasis will be on measuring and assessing the policy impact at the regional, sub-regional and local level. Firstly, it will assess the extent to which the policy is successfully targeting support to families in poverty in Italy, and the impact of its redistributive effects on the incidence of poverty across and within regions. Small area estimation techniques will be employed to develop a set of regional and sub-regional benchmarks against which the policy can be assessed. Secondly, the study will attempt to develop a methodology for monitoring employers’ responses and changes in the demand for labour in near real time. An analysis of online job vacancies and their classification into ISCO-08 job categories through text analysis techniques will make it possible to map the availability of job offers and required skills against the distribution of RdC claimants subject to work conditionality. Finally, the study will focus on those measures of the policy aimed at improving the social inclusion of claimants with no work conditionality requirements. Administrative data on this group of households collected by the Comune di Torino (Turin City Council) will be analysed to identify and map levels of need against the offer of support available across the city, tracking local interventions and monitoring outcomes.
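
As an illustration of the vacancy-classification step, below is a minimal sketch of mapping job advert text to ISCO-08 occupation codes with TF-IDF features and a linear classifier. The example adverts are invented and a real pipeline would be trained on a large labelled corpus; this is a sketch, not the study's actual methodology.

    # Classifying job vacancy text into ISCO-08 occupation codes with TF-IDF
    # features and logistic regression (toy training data for illustration).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    adverts = [
        "waiter needed for busy restaurant, evening shifts",
        "software developer with python experience for fintech startup",
        "registered nurse for elderly care facility",
        "barista and counter staff for city centre cafe",
    ]
    isco_codes = ["5131", "2512", "2221", "5131"]  # illustrative ISCO-08 labels

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(adverts, isco_codes)
    print(model.predict(["night shift waiter for hotel restaurant"]))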

16:30 — Using data science to uncover cognitive constraints in human behavior beyond social interactions 
Kilian Ollivier, Data Science Ph.D. student

Abstract: Social interactions are ruled by a set of cognitive constraints. This is a very general trait of human social behaviour, as such constraints have been highlighted across many different means of social interaction. These constraints have a significant impact on various phenomena, such as information diffusion and social relationship turnover. The key research question of my PhD thesis is whether similar constraints emerge in, and impact on, other aspects of human behaviour beyond social interactions. First of all, we want to focus on the cognitive constraints and limits behind the use of language by humans, and how they manifest themselves in large-scale datasets such as Twitter. A final objective would be to connect the outcome of our data-driven research to existing neurolinguistic models that already explain how the brain tackles the complexity of language production.

17:00 — Discussion and networking with refreshments

Day Two - June 14th 2019

15:00 — Unveiling the underlying mechanisms for the conservation of biodiversity and ecosystem services inside Mexican Protected Areas using Machine Learning
Ivan Alejandro Ortiz-Rodríguez, Data Science Ph.D. student

Abstract: Protected areas (PAs) have been established by civil society and by governments to conserve biodiversity and ecosystem services, and as an instrument to mitigate the effects of the environmental crisis. Currently, there are more than 200,000 protected areas around the world, covering more than 20 million km2 (14.9% of the Earth's surface), and in recent years they have increased significantly in number and extension in order to meet international agreements such as the Aichi Biodiversity Targets. However, PAs are not always established following ecological conservation guidelines, their design and management strategies follow no standard criteria, and in some cases there is a significant human influence in their interior. Considering that ecosystems represent continuous habitats with flows of matter, energy, and non-static populations, recent studies have questioned the effectiveness of PAs for the conservation of biodiversity and ecosystem functioning, especially when these are embedded in landscapes threatened by human population growth, agricultural expansion, illegal logging, insecure property rights, and other invasive management strategies. The case of Mexico, one of the biodiversity hotspots in the world, is particular because its strategy for the establishment of PAs has included many different forms of design and management, allowing, in several cases, the economic development of the communities that inhabit them. In this sense, it is imperative to evaluate the effectiveness of Mexican PAs in conserving biodiversity and ecosystem services and to identify the best management strategies to achieve their objectives. In this project, I propose the use of machine learning algorithms to address the following questions: (i) What is the current state of biodiversity and ecosystem services inside Mexican PAs? (ii) Which are the most influential factors for predicting their levels of biodiversity and ecosystem services? (iii) What are the management strategies that facilitate the conservation of biodiversity and ecosystem services? (iv) Are PAs really helping to conserve biodiversity and ecosystem services? (v) Is there a need to redesign the Mexican PA system? The results of this study are essential to unravel the cascading effects of human pressures on protected ecosystems, as well as to provide the basis for the implementation of well-designed public policies for the conservation and management of ecosystems in Mexico and worldwide.
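
As a minimal sketch of question (ii), the snippet below ranks influential factors with a random forest on synthetic protected-area data; the feature names, the synthetic response and the model choice are illustrative assumptions, not the project's actual variables or method.

    # Relating protected-area attributes to a biodiversity indicator with a
    # random forest and ranking factors by feature importance (synthetic data).
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(42)
    n_pas = 300
    data = pd.DataFrame({
        "area_km2": rng.lognormal(4, 1, n_pas),
        "years_since_designation": rng.integers(1, 80, n_pas),
        "human_footprint_index": rng.uniform(0, 50, n_pas),
        "management_budget": rng.lognormal(10, 1, n_pas),
    })
    # Synthetic response standing in for a species-richness-like index.
    biodiversity_index = (0.3 * np.log(data["area_km2"])
                          - 0.02 * data["human_footprint_index"]
                          + rng.normal(0, 0.2, n_pas))

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(data, biodiversity_index)
    for name, importance in sorted(zip(data.columns, model.feature_importances_),
                                   key=lambda t: -t[1]):
        print(f"{name}: {importance:.2f}")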
     
15:30 — How can ensemble non-equivalence affect data compressibility, reconstruction and pattern detection?
Andrea Somazzi, Data Science Ph.D. student

Abstract: On one hand, recent progress in network theory has identified new scenarios in which the hard-constraints (microcanonical) ensemble and the soft-constraints (canonical) ensemble, commonly derived through Shannon entropy maximization, are not equivalent in the large-N limit. On the other hand, new studies have investigated how Shannon entropy was derived, finding that it is actually a particular case of a wider class of entropies that are more suitable to describe a variety of systems. The maximization of such entropies, enforcing soft constraints, gives generalized probability distributions. My idea is to apply this technique to complex networks, enforcing soft constraints such as, for example, the average degree sequence: the resulting probability distributions will be defined over ensembles that will be, in general, different from the ones obtained through Shannon entropy. I will then compare these new distributions with the ones describing the hard-constraints ensembles in order to explore their equivalence in the large-N limit. The key usefulness of this study comes from the fact that the soft-constraints ensemble is a powerful tool to deal with many problems that would be impossible to treat in terms of hard constraints: from this point of view, it is crucial to understand whether the two are equivalent or not. In fact, after this theoretical and general analysis, I will focus on applications in data compressibility, reconstruction, pattern detection and quantum information.
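
For reference, here is a compact sketch of the standard Shannon-entropy construction with a soft constraint on the expected degree sequence, i.e. the canonical ensemble that the generalized entropies mentioned above would modify; this is the textbook starting point, not the speaker's new result.

    % Canonical (soft-constraint) ensemble from Shannon entropy maximization,
    % fixing the degree sequence only on average.
    \[
    \begin{aligned}
      &\max_{P}\; S[P] = -\sum_{G} P(G)\,\ln P(G)
      \quad\text{subject to}\quad
      \sum_{G} P(G)\,k_i(G) = \langle k_i \rangle \;\;\forall i,
      \qquad \sum_{G} P(G) = 1,\\
      &\Longrightarrow\quad
      P(G) = \frac{e^{-\sum_i \theta_i\,k_i(G)}}{Z(\vec{\theta})},
      \qquad
      Z(\vec{\theta}) = \sum_{G} e^{-\sum_i \theta_i\,k_i(G)},
    \end{aligned}
    \]
    % where the Lagrange multipliers \theta_i are fixed by the constraints.
    % The hard-constraint (microcanonical) ensemble is instead uniform over the
    % graphs matching the degree sequence exactly; ensemble (non-)equivalence
    % asks whether the two give the same predictions as N grows.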
      
16:00 — Robust and transparent long short-term memory (LSTM) machine learning for time series prediction
German Rodikov, Data Science Ph.D. student

Abstract: Time series prediction is an important subject in economics, finance, and other fields. There are many approaches to forecasting the next point of a time series, for instance Exponential Smoothing, Moving Average, Last Value, and many variations of the Autoregressive Integrated Moving Average (ARIMA) model. In the last decades, many advanced machine learning algorithms have been developed to predict sequences; for example, LSTM (Long Short-Term Memory) networks have demonstrated superior precision and accuracy in time series prediction. One obvious issue with machine learning algorithms is their low transparency. Usually, work on neural networks for sequences focuses on the temporal aspects of the data, and there is little discussion of how to interpret RNNs. Yet with real datasets it is also important to understand which variables are taken into account in a particular decision of the RNN. Moreover, real datasets are often complicated by anomalies (outliers), which can lead the RNN to deviate from the underlying patterns of the time series and decrease the accuracy of prediction. This research investigates how and why machine learning algorithms are superior to traditional approaches, while also attempting to address the black-box transparency issue and to build a more robust algorithm for data containing outliers.
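
To fix ideas, here is a minimal sketch of one-step-ahead forecasting with an LSTM in PyTorch, trained on a synthetic noisy sine wave; the architecture, window length and training loop are illustrative assumptions, not the model studied in this project.

    # One-step-ahead forecasting with an LSTM on a synthetic noisy sine wave.
    import numpy as np
    import torch
    import torch.nn as nn

    # Build (window -> next value) training pairs from a synthetic series.
    series = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * np.random.randn(2000)
    window = 30
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    X = torch.tensor(X, dtype=torch.float32).unsqueeze(-1)  # (batch, time, 1 feature)
    y = torch.tensor(y, dtype=torch.float32).unsqueeze(-1)

    class LSTMForecaster(nn.Module):
        def __init__(self, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):
            out, _ = self.lstm(x)
            return self.head(out[:, -1, :])  # predict from the last hidden state

    model = LSTMForecaster()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for epoch in range(5):  # a few full-batch epochs, just to show the loop
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        print(epoch, loss.item())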

16:30 — Analysis and reconstruction of microscopic production networks
Leonardo Niccolò Ialongo, Data Science Ph.D. student

Abstract: The aim of the research project is to use the machinery of maximum entropy to investigate the relationship between aggregate economic variables. Using data from input-output matrices (e.g. the World Input-Output Database), from international trade (e.g. UN Comtrade), and from firm-level budget data (e.g. Orbis), we aim at identifying the heterogeneous constraints on production faced by firms in each sector. We will then attempt to use these constraints to build canonical network ensembles that allow us to explore how these constraints affect the relationship between macroeconomic variables at thermodynamic equilibrium. If successful, it should then be possible to reconstruct the network of interactions between economic agents at a microscopic level.

17:00 — Individual human mobility modeling and prediction
Agnese Bonavita, Data Science Ph.D. student

Abstract: Nowadays, understanding and predicting human mobility is essential to a large number of applications, ranging from recommendations to safety and urban service planning. In some travel applications, the ability to accurately predict the user's future trajectory is vital for delivering a high quality of service. The accurate prediction of detailed trajectories would empower location-based service providers with the ability to deliver more precise recommendations to users. Several questions are still open for investigation, from general ones such as: How to capture the multidimensional and multi-scale nature of individual mobility in a single model?, to specific ones such as: What are the frequent patterns of people's travels? How do big data attractors and extraordinary events influence mobility? How to predict car accident risk in the near future? Starting from Global Positioning System (GPS) tracks only, the goal of my research project is to find new answers to those questions, focusing on the study of individual mobility. The objective is to model the mobility of the single individual as a whole, creating a unique, complete picture of it by adding semantics to the raw data. In later steps, additional data sources, such as geospatial context and other mobility data types, like WiFi sensors, GSM mobile phone towers and users' check-in data on social networks (e.g. extracted from Foursquare, Twitter or Facebook), will be used to obtain richer results. The human mobility framework considered here presents two very important and interesting open challenges: transfer learning for mobility data models and the explainability aspects behind risk event recognition. The former aims to find a way to exploit a pre-trained trajectory prediction model in another geographical setting: building a universal algorithm able to predict the future tracks of users would be very useful in those contexts where data availability is low. The latter aims to understand how the black box works: usually, machine learning algorithms map user features into classes or scores without explaining why and how, because the decision model is either not comprehensible to stakeholders, or secret. This is worrying not only for the lack of transparency, but also for the possible hidden biases. In the field of mobility and accident prediction there is still a lot of work to do and a wide range of open scenarios to explore. One of the most interesting challenges is trying to transform predictions into "prescriptive rules" to prevent risk phenomena (for example car accidents in the insurance industry or heart attacks in the medical field).
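
As a simple illustration of next-location prediction, here is a minimal sketch of a first-order Markov baseline over discretised locations; the location labels are hypothetical and this baseline is only a point of comparison, not the method proposed in this project.

    # First-order Markov baseline for next-location prediction on a
    # hypothetical sequence of discretised locations.
    from collections import Counter, defaultdict

    trajectory = ["home", "work", "gym", "home", "work", "cafe",
                  "home", "work", "gym", "home"]

    # Count transitions between consecutive locations.
    transitions = defaultdict(Counter)
    for current, nxt in zip(trajectory, trajectory[1:]):
        transitions[current][nxt] += 1

    def predict_next(location):
        """Return the most frequent successor of `location` seen so far."""
        if location not in transitions:
            return None
        return transitions[location].most_common(1)[0][0]

    print(predict_next("work"))  # -> "gym", the most frequent observed successor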

17:30 — Discussion and networking with refreshments
