A Multi-Representation Re-Ranking Model for Personalized Product Search
E. Bassani, G. Pasi
In recent years, a multitude of e-commerce websites has emerged.
Product Search is a fundamental part of these websites, which is often managed as a traditional retrieval task.
However, Product Search has the ultimate goal of satisfying specific and personal user needs, leading users to find and purchase what they are looking for, based on their preferences. To maximize users’ satisfaction, Product Search should be treated as a personalized task.
In this paper, we propose and evaluate a simple yet effective personalized results re-ranking approach based on the fusion of the relevance score computed by a well-known ranking model, namely BM25, with the scores deriving from multiple user/item representations.
Our main contributions are:
1) we propose a score fusion-based approach for personalized re-ranking that leverages multiple user/item representations,
2) our approach accounts for both content-based features and collaborative information (i.e. features extracted from the user-item interactions graph),
3) the proposed approach is fast and scalable, can be easily added on top of any search engine, and can be extended to include additional features.
The performed comparative evaluations show that our model can significantly increase the retrieval effectiveness of the underlying retrieval model and, in the great majority of cases, outperforms modern Neural Network-based personalized retrieval models for Product Search.
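The score fusion idea described above can be sketched as follows. This is only an illustration: the abstract does not state the fusion function, so the min-max normalization, the weighted sum, the function names `min_max` and `fuse_and_rerank`, and all scores and weights are assumptions.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_and_rerank(
    bm25: Dict[str, float],
    personal: List[Dict[str, float]],
    weights: List[float],
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized BM25 and personalization scores.

    weights[0] applies to BM25, weights[1:] to each entry of `personal`.
    Assumed fusion rule, not necessarily the one used in the paper.
    """
    if len(weights) != len(personal) + 1:
        raise ValueError("need one weight per score source")
    sources = [min_max(bm25)] + [min_max(p) for p in personal]
    fused = {
        doc: sum(w * src.get(doc, 0.0) for w, src in zip(weights, sources))
        for doc in bm25
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores, invented for illustration only.
bm25_scores = {"p1": 12.0, "p2": 9.0, "p3": 6.0}
content_sim = {"p1": 0.1, "p2": 0.9, "p3": 0.5}  # content-based user/item similarity
collab_sim = {"p1": 0.2, "p2": 0.8, "p3": 0.4}   # similarity from the interaction graph

ranking = fuse_and_rerank(bm25_scores, [content_sim, collab_sim], [0.4, 0.3, 0.3])
```

Because the fusion only needs the candidate list and its scores, a step like this can sit on top of any first-stage retriever, which matches the third contribution listed above.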
On Building Benchmark Datasets for Understudied Information Retrieval Tasks: the Case of Semantic Query Labeling
E. Bassani, G. Pasi
In this manuscript, we review the work we undertook to build a large-scale benchmark dataset for an understudied Information Retrieval task called Semantic Query Labeling.
This task is particularly relevant for search tasks that involve structured documents, such as Vertical Search, and consists of automatically recognizing the parts that compose a query and unfolding the relations between the query terms and the documents’ fields.
We first motivate the importance of building novel evaluation datasets for less popular Information Retrieval tasks.
Then, we give an in-depth description of the procedure we followed to build our dataset.
Consumer Health Search at CLEF eHealth 2021
L. Goeuriot, H. Suominen, G. Pasi, E. Bassani, N. Brew-Sam, G. González-Sáez, L. Kelly, P. Mulhem, S. Seneviratne, R. Upadhyay, M. Viviani & C. Xu
This paper details materials, methods, results, and analyses of the Consumer Health Search Task of the CLEF eHealth 2021 Evaluation Lab.
This task investigates the effectiveness of information retrieval (IR) approaches in providing access to medical information to laypeople. For this, a TREC-style evaluation methodology was applied: a shared collection of documents and queries is distributed, participants’ runs received, relevance assessments generated, and participants’ submissions evaluated.
The task generated a new representative web corpus including web pages acquired from a 2021 CommonCrawl and social media content from Twitter and Reddit, along with a new collection of 55 manually generated layperson medical queries and their respective credibility, understandability, and topicality assessments for returned documents.
This year’s task focused on three subtasks: (i) ad-hoc IR, (ii) weakly supervised IR, and (iii) document credibility prediction.
In total, 15 runs were submitted to the three subtasks: eight addressed the ad-hoc IR task, three the weakly supervised IR challenge, and four the document credibility prediction challenge.
As in previous years, the organizers have made data and tools associated with the task available for future research and development.
Semantic Query Labeling Through Synthetic Query Generation
E. Bassani, G. Pasi
Searching in a domain-specific corpus of structured documents (e.g., e-commerce, media streaming services, job-seeking platforms) is often managed as a traditional retrieval task or through faceted search. Semantic Query Labeling – the task of locating the constituent parts of a query and assigning domain-specific predefined semantic labels to each of them – allows leveraging the structure of documents during retrieval while leaving unaltered the keyword-based query formulation.
Due to both the lack of a publicly available dataset and the high cost of producing one, there have been few published works in this regard.
In this paper, based on the assumption that a corpus already contains the information users search for, we propose a method for the automatic generation of semantically labeled queries and show that a semantic tagger (based on BERT, gazetteer-based features, and Conditional Random Fields) trained on our synthetic queries achieves results comparable to those obtained by the same model trained on real-world data.
We also provide a large dataset of manually annotated queries in the movie domain suitable for studying Semantic Query Labeling.
We hope that the public availability of this dataset will stimulate future research in this area.
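To make the idea of generating labeled queries from a structured corpus concrete, here is a minimal sketch. It is not the authors' procedure: the field names, the sample document, the BIO labeling scheme, and the helper `synth_query` are all assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical structured document; field names are illustrative only.
movie = {
    "TITLE": "The Matrix",
    "ACTOR": "Keanu Reeves",
    "GENRE": "science fiction",
    "YEAR": "1999",
}


def synth_query(
    doc: Dict[str, str], n_fields: int, rng: random.Random
) -> List[Tuple[str, str]]:
    """Sample fields from a document and emit (token, BIO-label) pairs."""
    fields = rng.sample(sorted(doc), k=n_fields)
    labeled = []
    for field in fields:
        tokens = doc[field].lower().split()
        for i, tok in enumerate(tokens):
            # First token of a field value opens the span, the rest continue it.
            prefix = "B" if i == 0 else "I"
            labeled.append((tok, f"{prefix}-{field}"))
    return labeled


rng = random.Random(0)
query = synth_query(movie, n_fields=2, rng=rng)
query_text = " ".join(tok for tok, _ in query)
```

Each generated query comes with token-level labels for free, which is what makes such data usable for training a sequence tagger without manual annotation.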
A Language Model based Approach for PhD candidates Profiling in a Recruiting Setting
A. Azzini, S. Marrara, N. Cortesi, A. Topalovic
In the last decade, PhD students in Europe have faced serious difficulties in obtaining a permanent position in academia. The situation gets worse when PhD graduates have to move to public or private organisations that are not always ready to understand and build on their research experience. In such a situation, one of the most critical aspects is encountered immediately in the recruitment phase, since the keywords used in job offer portals are based on the employers’ vocabulary and usually do not match the words that a researcher would use to describe her/his experience. Therefore, it is widely recognised that there is a need to define a system that can support a recruiting team in hiring PhDs. The approach presented in this paper aims at designing a decision support tool able to guide the choices of recruiters at any company in the evaluation of the profiles of candidates with a PhD.
Advances in Data Management in the Big Data Era
Azzini, A., Barbon Jr, S., Bellandi, V., Catarci, T., Ceravolo, P., Cudré-Mauroux, P., ... & Wrembel, R.
Highly heterogeneous and fast-arriving large amounts of data, otherwise known as Big Data, induced the development of novel Data Management technologies. In this paper, the members of the IFIP Working Group 2.6 share their expertise in some of these technologies: recent advancements in data integration, metadata management, data quality, graph management, as well as data stream and fog computing are discussed.
Rules-based process mining to discover PLM system processes
A. Azzini, P. Ceravolo, A. Corallo, E. Damiani, M. Lazoi, M. Marra
The value of product lifecycle management systems (PLMS) is increasingly recognised by companies, and their use has increased enormously. They are mainly used during product design, when different roles collaborate to share models, take review decisions, and approve or reject preliminary results. Often, companies have a general ‘picture’ of the processes involving PLMS (who performs an activity, when it is performed, what is done), but this knowledge can be reinforced, improved and modified using process mining. Here the knowledge is extracted from the event logs, and model-aware analytics are generated to evaluate the modelled, known and executed process. The business rules filter the logs and verify the impact on the process mining metrics to minimise the divergences between modelled and actual processes and improve the resulting quality metrics. The results help business users to identify lines of investigation for deviations from expected behaviour and propose improvement measures.
Data Mining Applications in SMEs: An Italian Perspective
A. Azzini, A. Topalovic
Background: Over the last decade, data mining techniques, employed in particular in customer relationship management, have assumed a key role in the profitability and operations of companies. To support small and medium-sized enterprises (SMEs), several innovative and continuously improving tools have been developed that allow SMEs to utilize internal and external data sources to increase their competitiveness. Objectives: In this paper, an analysis of the impact of digitalization, and in particular data mining techniques, in the context of SME development is presented. Methods/Approach: A review of various sources has been conducted, with the focus on open source tools, since in the context of the Italian economy they are the ones SMEs use the most. Results: First, the analysis presents a brief review of the data mining techniques available and shows how they are practically employed in small companies. Second, an economic review of investments in data mining projects in Italy is presented. Conclusions: The review indicates that data mining techniques can boost a company in the market. However, the awareness of data mining as a company asset is still not strong in Italian SMEs and most investments in Italy are still carried out by large companies.
Knowledge Management in the Italian SMEs, the role of ICT
A. Azzini, S. Marrara, A. Topalovic
In this work an analysis of the role of the latest ICT trends in the Italian SMEs is reported. The starting point, and main source, of this analysis is the Assintel Report 2019 provided by the Italian Association of ICT enterprises, Assintel. In this report, the ICT market and the digital evolution are analysed in terms of trends, money volumes and global scenarios; starting from this point we analyse how knowledge management is changing in the Italian SMEs, due to the increasing spread of ICT techniques. As shown in the paper, Italian companies see ICT as a carrier of innovation and optimisation inside the organisation and outside, to provide new services and products, but they are still very “prudent” in their investments, owing to the difficulty of finding personnel with the right skills both internally and in the recruitment market. Moreover, the perception of the role of IT Security has always been particularly problematic in Italy. The analysis shows that the development of a specific culture requires a radical evolution in the understanding of the IT risk that companies of all levels face when they connect to a network.
Hyperledger, una famiglia di piattaforme per la blockchain nell’industria (Hyperledger, a family of blockchain platforms for industry)
A. Azzini, N. Cortesi
The continuous development of the market and the ever more established digital innovation have led Italian companies to face, over the last decade, challenges that are increasingly attractive but also demanding.
Among these, Blockchain technology is one of the most interesting on the market, with an expansion that involves more and more sectors beyond banking and finance. “Literally defined as a ‘chain of blocks’, the Blockchain exploits the characteristics of a computer network of nodes, making it possible to manage and update, securely and unambiguously, a ledger containing data and information (for example, transactions) in an open, shared and distributed manner, without the need for a central entity for control and verification.”
Transparency of transactions, traceability of operations, immutability of procedures, and effective supply management (for example, in an industrial supply chain) are some of the main advantages offered by this technology.
L’AI nell’anatomia patologica (AI in anatomic pathology)
A. Belfatto, C. Spreafico
Diapath SpA and the Consorzio per il Trasferimento Tecnologico C2T have joined forces to tackle one of the latest frontiers of artificial intelligence in medicine, devising an IoT platform able to create an intelligent network among pathology analysis machines, ensuring a safer diagnosis.
Automatically assessing the quality of Wikipedia contents
E. Bassani, M. Viviani
With the development of Web 2.0 technologies, people have gone from being mere content users to content generators. In this context, the evaluation of the quality of (potential) information available online has become a crucial issue. Nowadays, one of the biggest online resources that users rely on as a knowledge base is Wikipedia. The collaborative aspect at the basis of Wikipedia can lead to the creation of low-quality articles or even misinformation if the process of monitoring the generation and the revision of articles is not performed in a precise and timely way. For this reason, in this paper, the problem of automatically evaluating the quality of Wikipedia contents is considered, by proposing a supervised approach based on Machine Learning to perform the classification of articles on qualitative bases. With respect to prior literature, a wider set of features connected to Wikipedia articles has been taken into account, as well as previously unconsidered aspects connected to the generation of a labeled dataset to train the model, and the use of Gradient Boosting, which produced encouraging results.
Quality of Wikipedia Articles: Analyzing Features and Building a Ground Truth for Supervised Classification
E. Bassani, M. Viviani
Wikipedia is nowadays one of the biggest online resources on which users rely as a source of information. The amount of collaboratively generated content that is sent to the online encyclopedia every day can lead to the creation of low-quality articles (and, consequently, misinformation) if not properly monitored and revised. For this reason, in this paper, the problem of automatically assessing the quality of Wikipedia articles is considered. In particular, the focus is (i) on the analysis of groups of hand-crafted features that can be employed by supervised machine learning techniques to classify Wikipedia articles on qualitative bases, and (ii) on the analysis of some issues behind the construction of a suitable ground truth. Evaluations are performed, on the analyzed features and on a specifically built labeled dataset, by implementing different supervised classifiers based on distinct machine learning algorithms, which produced promising results.
Radar: A Framework for automated Reporting
A. Azzini, N. Cortesi, A. Topalovic, G. Psaila
Large companies and organizations periodically feed their information systems with large data flows. Apart from the classical operational activities, they are called to prepare aggregated reports to send to institutions and rating agencies. Unfortunately, organizations typically suffer from the lack of integrated data and of a standard data dictionary. The presented approach aims to tackle this problem by building a bridge between employees, who need to specify how to generate reports (on the basis of concepts and terms typical of the application domain), and the information system that stores the data to query and aggregate in order to automatically produce reports. The implemented framework, RADAR (Rich Advanced Design Approach for Reporting), moves from the notion of Operational Data Store and is positioned between an ontology (of concepts and terms) and the actual operational (and relational) schema of source data. The defined schema then gives a high-level view of such source data, based on concepts described in the ontology for a specific application domain.
Promoting the employability of PhDs in Organizations
A. Azzini, S. Marrara, A. Topalovic
Find Your Doctor (FYD) is the first Job-placement agency in Italy dedicated to PhDs who are leaving the Academia to continue their professional path in companies and organizations. The mission of FYD is to outline the value of the research background as an asset for the development of companies and society as a whole. In this tutorial we provide a survey of the activities that European Organizations are currently proposing to promote PhDs’ careers. Moreover, a description of the techniques that are currently employed in recruitment software is presented.
Evolving Fuzzy Membership Functions for Soft Skills Assessment Optimization
A. Azzini, S. Marrara, A. Topalovic
This work proposes the design of a decision support tool able to guide the choices of any company HR manager in the evaluation of the profiles of PhD candidates. This paper is part of ongoing research in the field of PhD profiling. The novelty here is an evolutionary fuzzy model, based on the optimization of Membership Functions (MFs), used to obtain the candidates’ soft skills profiles. The general aim of the project is the definition of a set of fuzzy rules that are very similar to those that an HR expert would otherwise have to calculate each time for each selected profile and for each individual skill.
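As a rough illustration of what evolving a membership function can mean, the sketch below tunes the three parameters of a triangular MF with a simple (1+1) evolution strategy. The abstract does not specify the MF shape, the evolutionary algorithm, or the fitness function, so the triangular form, the squared-error fitness, the names `triangular`, `error` and `evolve`, and the data points are all assumptions.

```python
import random
from typing import List, Tuple

Params = Tuple[float, float, float]


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet a and c and peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def error(params: Params, data: List[Tuple[float, float]]) -> float:
    """Sum of squared differences between MF output and target degrees."""
    a, b, c = params
    return sum((triangular(x, a, b, c) - y) ** 2 for x, y in data)


def evolve(data, start: Params, steps: int = 500, sigma: float = 0.2, seed: int = 0):
    """(1+1) evolution strategy: keep a mutated child only if it is no worse."""
    rng = random.Random(seed)
    best, best_err = start, error(start, data)
    for _ in range(steps):
        child = tuple(sorted(p + rng.gauss(0.0, sigma) for p in best))
        if not child[0] < child[1] < child[2]:
            continue  # degenerate triangle, skip this mutation
        err = error(child, data)
        if err <= best_err:
            best, best_err = child, err
    return best, best_err


# Invented (raw score, expert-assigned degree) pairs, for illustration only.
data = [(2.0, 0.0), (4.0, 0.5), (6.0, 1.0), (8.0, 0.5), (10.0, 0.0)]
start = (0.0, 5.0, 10.0)
best, best_err = evolve(data, start)
```

The fitness here compares the MF output with degrees an expert might assign, which mirrors the stated goal of obtaining rules close to those an HR expert would otherwise compute by hand.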
Performances of OLAP Operations in Graph and Relational Databases
A. Azzini, P. Ceravolo, M. Colella
The increasing volume of data created and exchanged in distributed architectures has made databases a critical asset to ensure availability and reliability of business operations. For this reason, a new family of databases, called NoSQL, has been proposed. To better understand the impact this evolution can have on organizations it is useful to focus on the notion of Online Analytical Processing (OLAP). This approach identifies techniques to interactively analyze multidimensional data from multiple perspectives and is today essential for supporting Business Intelligence.
The objective of this paper is to benchmark OLAP queries on relational and graph databases containing the same sample of data. In particular, the relational model has been implemented by using MySQL while the graph model has been realized with the Neo4j graph database. Our results confirm previous experiments that registered better performance for graph databases when re-aggregation of data is required.
A Multi-Label Machine Learning Approach to Support Pathologist’s Histological Analysis
A. Azzini, S. Marrara, N. Cortesi, A. Topalovic
This paper proposes a new tool in the field of telemedicine, defined as a specific branch where IT supports medicine when distance impairs the proper care to be delivered to a patient. All the information contained in medical texts, if properly extracted, may be suitable for searching, classification, or statistical analysis. For this reason, in order to reduce errors and improve quality control, a proper information extraction tool may be useful. In this direction, this work presents a Machine Learning Multi-Label approach for the classification of the information extracted from pathology reports into relevant categories. The aim is to integrate automatic classifiers to improve the current workflow of medical experts, by defining a Multi-Label approach able to consider all the features of a model, together with their relationships.
Feature Analysis for Assessing the Quality of Wikipedia Articles through Supervised Classification
Nowadays, thanks to Web 2.0 technologies, people can generate and spread content on different social media very easily. In this context, the evaluation of the quality of the information that is available online is becoming an increasingly crucial issue. In fact, a constant flow of content is generated every day by often unknown sources, which are not certified by traditional authoritative entities. This requires the development of appropriate methodologies that can evaluate this content in a systematic way, based on ‘objective’ aspects connected with it. This would help individuals, who nowadays tend to increasingly form their opinions based on what they read online and on social media, to come into contact with information that is actually useful and verified. Wikipedia is nowadays one of the biggest online resources on which users rely as a source of information. The amount of collaboratively generated content that is sent to the online encyclopedia every day can lead to the creation of low-quality articles (and, consequently, misinformation) if not properly monitored and revised. For this reason, in this paper, the problem of automatically assessing the quality of Wikipedia articles is considered. In particular, the focus is on the analysis of hand-crafted features that can be employed by supervised machine learning techniques to perform the classification of Wikipedia articles on qualitative bases. With respect to prior literature, a wider set of characteristics connected to Wikipedia articles is taken into account and illustrated in detail. Evaluations are performed by considering a labeled dataset provided in a prior work, and different supervised machine learning algorithms, which produced encouraging results with respect to the considered features.
New Trends of Fuzzy Systems: Fintech Applications
A. Azzini, S. Marrara, A. Topalovic
In recent years, the term Financial Technology (FinTech) has been adopted in the literature to describe a wide range of services for enterprises or organizations, aided by several financial technologies, which mainly address the improvement of service quality by using Information Technology (IT) applications.
Big Data Semantics
P. Ceravolo, A. Azzini, M. Angelini, T. Catarci, P. Cudré-Mauroux, E. Damiani, A. Mazak, M. Van Keulen, M. Jarrar, G. Santucci, K.U. Sattler, M. Scannapieco, M. Wimmer, R. Wrembel & F. Zaraket
Big Data technology has discarded traditional data modeling approaches as no longer applicable to distributed data processing. It is, however, largely recognized that Big Data imposes novel challenges in data and infrastructure management. Indeed, multiple components and procedures must be coordinated to ensure a high level of data quality and accessibility for the application layers, e.g., data analytics and reporting. In this paper, the third of its kind co-authored by members of IFIP WG 2.6 on Data Semantics, we propose a review of the literature addressing these topics and discuss relevant challenges for future research. Based on our literature review, we argue that methods, principles, and perspectives developed by the Data Semantics community can significantly contribute to addressing Big Data challenges.
A Neuro-Fuzzy Approach to assess the soft skills profile of a PhD
A. Azzini, S. Marrara, A. Topalovic
In this paper a framework aimed at representing the soft skills profile of a job seeker by means of a 2-tuple fuzzy linguistic approach and a Neuro fuzzy controller is presented.
The framework can be used in many contexts; in this work it is employed for designing a recommender system of candidates for recruiting agencies. The recommender system’s Neuro fuzzy controller simulates the decision of a Human Resource (HR) manager in evaluating the soft skills profile of a candidate and proposes only the best profiles w.r.t. a set of preferences. The framework has been developed in the context of the Find Your Doctor (FYD) start-up and applied to the PhD recruiting task, but it is easily applicable to any recruiting activity.
Overview of the CLEF 2018 Personalised Information Retrieval Lab (PIR-CLEF 2018)
G. Pasi, G. J. F. Jones, K. Curtis, S. Marrara, C. Sanvitto, D. Ganguly, P. Sen
At CLEF 2018, the Personalised Information Retrieval Lab (PIR-CLEF 2018) was conceived as an initiative aimed at both providing and critically analysing a new approach to the evaluation of Personalised Information Retrieval (PIR). PIR-CLEF 2018 is the first edition of this Lab after the successful Pilot lab organised at CLEF 2017. PIR-CLEF 2018 provided registered participants with the data sets originally developed for the PIR-CLEF 2017 Pilot task; the data were gathered, using a novel methodology, during real search sessions over a subset of the ClueWeb12 collection undertaken by 10 volunteer searchers. Activities during these search sessions included relevance assessment of retrieved documents by the searchers. Sixteen groups registered to participate in PIR-CLEF 2018 and were provided with the data set to allow them to work on PIR-related tasks and to provide feedback on our proposed PIR evaluation methodology, with the aim of creating an effective evaluation task.
Automated Monitoring of Collaborative Working Environments for Supporting Open Innovation
M. M. Khani, P. Ceravolo, A. Azzini, E. Damiani
Open Innovation is a complex procedure that requires effective management and control along the different stages of the overall process. The automated monitoring of Open Innovation is the aim of a collaborative working environment designed, developed and tested in our research labs. This paper illustrates our solution and provides an assessment of the monitoring capabilities implemented. In particular, we propose a data model with a general approach for defining metrics and a list of metrics enabling automated monitoring with an evaluation of their informative power.
SOON: Supporting the Evaluation of Researchers’ Profiles
A. Azzini, A. Galimberti, S. Marrara, E. Ratti
Find Your Doctor (FYD) is the first Job-placement agency in Italy dedicated to PhDs who are leaving the Academia to continue their professional path in companies and organizations. The mission of FYD is to outline the value of the research background as an asset for the development of companies and society as a whole. For this reason we started a research project aimed at building SOON, Skills Out Of Narrative, an HR support tool able to extract a set of well-defined skills, both soft and hard, from the text in which a person describes his/her experience, and so create a profile. The final aim of the project is to produce a list of candidates ranked on the basis of the degree of similarity of their profile w.r.t. the profile required for a certain job position or activity. This paper describes the full architecture of SOON and the idea at the basis of the FYD mission.
A Classifier to Identify Soft Skills in a Researcher Textual Description
A. Azzini, A. Galimberti, S. Marrara, E. Ratti
Find Your Doctor (FYD) aims at becoming the first Job-placement agency in Italy dedicated to PhDs who are undergoing the transition outside Academia. To support the FYD Human Resources team we started a research project aimed at extracting a set of well-defined soft skills from texts (questionnaires) in which a person describes his/her experience. The final aim of the project is to produce a list of researchers ranked w.r.t. their degree of soft skills ownership. In the context of this project, this paper presents an approach employing machine learning techniques aimed at classifying the researchers’ questionnaires w.r.t. a pre-defined soft skills taxonomy. This paper also presents some preliminary results obtained in the “communication” area of the taxonomy, which are promising and worthy of further research in this direction.
MMBR: A Report-driven Approach for the Design of Multidimensional Models
A. Azzini, S. Marrara, A. Maurino, A. Topalovic
Nowadays, large organizations and regulated markets are subject to the control activity of external audit associations that require huge amounts of information to be submitted in the form of predefined and rigidly structured reports. Compiling these reports requires one to extract, transform and integrate data from several heterogeneous operational databases. This task is usually performed by developing different ad hoc and complex software for each report. Another solution involves the adoption of a data warehouse and related tools, which are today well-established technologies. Unfortunately, the data warehousing process is notoriously long and error-prone, and is therefore particularly inefficient when the output of the data warehouse is a limited number of reports. This article presents MMBR, an approach able to generate a multidimensional model starting from the structure of the reports expected as output of the data warehouse. The approach is able to generate the multidimensional model, and to populate the data warehouse by defining a domain-specific knowledge base. Although using semantic information in data warehousing is not new, the novel contribution of our approach is the idea of simplifying the design phase of the data warehouse, and making it more efficient, by using a domain-specific knowledge base and a report-driven approach.