Aritz Pérez

Postdoc Fellow

Machine Learning

T +34 946 567 842
F +34 946 567 842
E aperez@bcamath.org

Information of interest

Orcid: 0000-0002-8128-1099

Postdoc Fellow at BCAM. The main methodological research lines include probabilistic graphical models, supervised classification, information theory, density estimation and feature subset selection. The methodological contributions have been applied to the fields of bioinformatics (genetics and epigenetics) and ecological modelling (fisheries).

Unsupervised learning approaches for disease progression modeling

Zaballa, O.; Pérez, A.; Lozano, J.A. (2024-01-12)

Speeding-Up Evolutionary Algorithms to Solve Black-Box Optimization Problems

Echevarrieta, J.; Arza, E.; Pérez, A. (2024-01-10)

Population-based evolutionary algorithms are often considered when approaching computationally expensive black-box optimization problems. They employ a selection mechanism to choose the best solutions from a given population ...

A Probabilistic Generative Model to Discover the Treatments of Coexisting Diseases with Missing Data

Zaballa, O.; Pérez, A.; Lozano, J.A. (2024-01-01)

Large-scale unsupervised spatio-temporal semantic analysis of vast regions from satellite images sequences

Echegoyen, C.; Santafé, G.; Pérez-Goya, U.; Ugarte, M.D.; Pérez, A. (2024)

Temporal sequences of satellite images constitute a highly valuable and abundant resource for analyzing regions of interest. However, the automatic acquisition of knowledge on a large scale is a challenging task due to ...

Efficient Learning of Minimax Risk Classifiers in High Dimensions

Bondugula, K.R.; Mazuelas, S.; Pérez, A. (2023-08-01)

High-dimensional data is common in multiple areas, such as health care and genomics, where the number of features can be tens of thousands. In such scenarios, the large number of features often leads to inefficient ...

Fast K-Medoids With the l_1-Norm

Capó, M.; Pérez, A.; Lozano, J.A. (2023-07-26)

K-medoids clustering is one of the most popular techniques in exploratory data analysis. The most commonly used algorithms to deal with this problem are quadratic on the number of instances, n, and usually the quality of ...

Fast Computation of Cluster Validity Measures for Bregman Divergences and Benefits

Capó, M.; Pérez, A.; Lozano, J.A. (2023)

Partitional clustering is one of the most relevant unsupervised learning and pattern recognition techniques. Unfortunately, one of the main drawbacks of these methodologies is that the number of clusters is ...

Learning the progression patterns of treatments using a probabilistic generative model

Zaballa, O.; Pérez, A.; Gómez-Inhiesto, E.; Acaiturri-Ayesta, T.; Lozano, J.A. (2022-12-15)

Modeling a disease or the treatment of a patient has drawn much attention in recent years due to the vast amount of information that Electronic Health Records contain. This paper presents a probabilistic generative model ...

Implementing the Cumulative Difference Plot in the IOHanalyzer

Arza, E.; Ceberio, J.; Irurozki, E.; Pérez, A. (2022-07)

The IOHanalyzer is a web-based framework that enables an easy visualization and comparison of the quality of stochastic optimization algorithms. IOHanalyzer offers several graphical and statistical tools to analyze the results ...

An active adaptation strategy for streaming time series classification based on elastic similarity measures

Oregi, I.; Pérez, A.; Del Ser, J.; Lozano, J.A. (2022-05-21)

In streaming time series classification problems, the goal is to predict the label associated with the most recently received observations over the stream according to a set of categorized reference patterns. In on-line ...

Generalized Maximum Entropy for Supervised Classification

Mazuelas, S.; Shen, Y.; Pérez, A. (2022-04)

The maximum entropy principle advocates evaluating events’ probabilities using a distribution that maximizes entropy among those that satisfy certain expectations’ constraints. Such a principle can be generalized for ...

Rank aggregation for non-stationary data streams

Irurozki, E.; Pérez, A.; Lobo, J.L.; Del Ser, J. (2022)

The problem of learning over non-stationary ranking streams arises naturally, particularly in recommender systems. The rankings represent the preferences of a population, and the non-stationarity means that the distribution ...

Comparing Two Samples Through Stochastic Dominance: A Graphical Approach

Arza, E.; Ceberio, J.; Irurozki, E.; Pérez, A. (2022)

Nondeterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples ...

On the relative value of weak information of supervision for learning generative models: An empirical study

Hernández, J.; Pérez, A. (2022)

Weakly supervised learning aims to learn predictive models from partially supervised data, an easy-to-collect alternative to the costly standard full supervision. During the last decade, the research community has ...

On the use of the descriptive variable for enhancing the aggregation of crowdsourced labels

Beñaran-Muñoz, I.; Hernández, J.; Pérez, A. (2022)

The use of crowdsourcing for annotating data has become a popular and cheap alternative to expert labelling. As a consequence, an aggregation task is required to combine the different labels provided and agree on a single ...

Machine learning from crowds using candidate set-based labelling

Beñaran-Muñoz, I.; Hernández, J.; Pérez, A. (2022)

Crowdsourcing is a popular cheap alternative in machine learning for gathering information from a set of annotators. Learning from crowd-labelled data involves dealing with its inherent uncertainty and inconsistencies. In ...

Dirichlet process mixture models for non-stationary data streams

Casado, I.; Pérez, A. (2022)

In recent years, we have seen a handful of work on inference algorithms over non-stationary data streams. Given their flexibility, Bayesian non-parametric models are a good candidate for these scenarios. However, reliable ...

Non-parametric discretization for probabilistic labeled data

Flores, J.L.; Calvo, B.; Pérez, A. (2022)

Probabilistic label learning is a challenging task that arises from recent real-world problems within the weakly supervised classification framework. In this task algorithms have to deal with datasets where each instance ...

LASSO for streaming data with adaptative filtering

Capó, M.; Pérez, A.; Lozano, J.A. (2022)

Streaming data is ubiquitous in modern machine learning, and so the development of scalable algorithms to analyze this sort of information is a topic of current interest. On the other hand, the problem of l1-penalized ...

Are the statistical tests the best way to deal with the biomarker selection problem?

Urkullu, A.; Pérez, A.; Calvo, B. (2022)

Statistical tests are a powerful set of tools when applied correctly, but unfortunately the extended misuse of them has caused great concern. Among many other applications, they are used in the detection of biomarkers so ...

Statistical assessment of experimental results: a graphical approach for comparing algorithms

Arza, E.; Ceberio, J.; Irurozki, E.; Pérez, A. (2021-08-25)

Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples ...

A cheap feature selection approach for the K-means algorithm

Capó, M.; Pérez, A.; Lozano, J.A. (2021-05)

The increase in the number of features that need to be analyzed in a wide variety of areas, such as genome sequencing, computer vision or sensor networks, represents a challenge for the K-means algorithm. In this regard, ...

K-means for Evolving Data Streams

Bidaurrazaga, A.; Pérez, A.; Capó, M. (2021-01-01)

Nowadays, streaming data analysis has become a relevant area of research in machine learning. Most of the data streams available are unlabeled, and thus it is necessary to develop specific clustering techniques that take ...

A Machine Learning Approach to Predict Healthcare Cost of Breast Cancer Patients

Rakshit, P.; Zaballa, O.; Pérez, A.; Gómez-Inhiesto, E.; Acaiturri-Ayesta, M.T.; Lozano, J.A. (2021)

This paper presents a novel machine learning approach to perform an early prediction of the healthcare cost of breast cancer patients. The learning phase of our prediction method considers the following two steps: i) in ...

On the fair comparison of optimization algorithms in different machines

Arza, E.; Pérez, A.; Ceberio, J.; Irurozki, E. (2021)

An experimental comparison of two or more optimization algorithms requires the same computational resources to be assigned to each algorithm. When a maximum runtime is set as the stopping criterion, all algorithms need to ...

Identifying common treatments from Electronic Health Records with missing information. An application to breast cancer.

Zaballa, O.; Pérez, A.; Gómez-Inhiesto, E.; Acaiturri-Ayesta, T.; Lozano, J.A. (2020-12-29)

The aim of this paper is to analyze the sequence of actions in the health system associated with a particular disease. In order to do that, using Electronic Health Records, we define a general methodology that allows us ...

Minimax Classification with 0-1 Loss and Performance Guarantees

Mazuelas, S.; Zanoni, A.; Pérez, A. (2020-12-01)

Supervised classification techniques use training samples to find classification rules with small expected 0-1 loss. Conventional methods achieve efficient learning and out-of-sample generalization by minimizing surrogate ...

Statistical model for reproducibility in ranking-based feature selection

Urkullu, A.; Pérez, A.; Calvo, B. (2020-11-05)

The stability of feature subset selection algorithms has become crucial in real-world problems due to the need for consistent experimental results across different replicates. Specifically, in this paper, we analyze the ...

General supervision via probabilistic transformations

Mazuelas, S.; Pérez, A. (2020-08-01)

Different types of training data have led to numerous schemes for supervised classification. Current learning techniques are tailored to one specific scheme and cannot handle general ensembles of training samples. This ...

Kernels of Mallows Models under the Hamming Distance for solving the Quadratic Assignment Problem

Arza, E.; Pérez, A.; Irurozki, E.; Ceberio, J. (2020-07)

The Quadratic Assignment Problem (QAP) is a well-known permutation-based combinatorial optimization problem with real applications in industrial and logistics environments. Motivated by the challenge that this NP-hard ...

An efficient K-means clustering algorithm for tall data

Capó, M.; Pérez, A.; Lozano, J.A. (2020)

The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial ...

An adaptive neuroevolution-based hyperheuristic

Arza, E.; Ceberio, J.; Pérez, A.; Irurozki, E. (2020)

According to the No-Free-Lunch theorem, an algorithm that performs efficiently on any type of problem does not exist. In this sense, algorithms that exploit problem-specific knowledge usually outperform more generic ...

Supervised non-parametric discretization based on Kernel density estimation

Flores, J.L.; Calvo, B.; Pérez, A. (2019-12-19)

Nowadays, machine learning algorithms can be found in many applications where the classifiers play a key role. In this context, discretizing continuous attributes is a common step prior to classification tasks, the main ...

Approaching the Quadratic Assignment Problem with Kernels of Mallows Models under the Hamming Distance

Arza, E.; Ceberio, J.; Pérez, A.; Irurozki, E. (2019-07)

The Quadratic Assignment Problem (QAP) is an especially challenging permutation-based NP-hard combinatorial optimization problem, since instances of size $n>40$ are seldom solved using exact methods. In this sense, many ...

On-line Elastic Similarity Measures for time series

Oregui, I.; Pérez, A.; Del Ser, J.; Lozano, J.A. (2019-04)

The way similarity is measured among time series is of paramount importance in many data mining and machine learning tasks. For instance, Elastic Similarity Measures are widely used to determine whether two time series are ...

On the evaluation and selection of classifier learning algorithms with crowdsourced data

Urkullu, A.; Pérez, A.; Calvo, B. (2019-02-16)

In many current problems, the actual class of the instances, the ground truth, is unavailable. Instead, with the intention of learning a model, the labels can be crowdsourced by harvesting them from different annotators. ...

Predictive engineering and optimization of tryptophan metabolism in yeast through a combination of mechanistic and machine learning models

Zhang, J.; Petersen, S.; Radivojevic, T.; Ramirez, A.; Pérez, A.; Abeliuk, E.; Sánchez, B.; Costello, Z.; Chen, Y.; Fero, M.; Garcia Martin, H.; Nielsen, J.; Keasling, J.; Jensen, M. (2019)

In combination with advanced mechanistic modeling and the generation of high-quality multi-dimensional data sets, machine learning is becoming an integral part of understanding and engineering living systems. Here we show ...

Crowd Learning with Candidate Labeling: an EM-based Solution

Beñaran-Muñoz, I.; Hernández-González, J.; Pérez, A. (2018-09-27)

Crowdsourcing is widely used nowadays in machine learning for data labeling. Although in the traditional case annotators are asked to provide a single label for each instance, novel approaches allow annotators, in case ...

Are the artificially generated instances uniform in terms of difficulty?

Pérez, A.; Ceberio, J.; Lozano, J.A. (2018-06)

In the field of evolutionary computation, it is usual to generate artificial benchmarks of instances that are used as a test-bed to determine the performance of the algorithms at hand. In this context, a recent work on ...

On-Line Dynamic Time Warping for Streaming Time Series

Oregui, I.; Pérez, A.; Del Ser, J.; Lozano, J.A. (2017-09)

Dynamic Time Warping is a well-known measure of dissimilarity between time series. Due to its flexibility to deal with non-linear distortions along the time axis, this measure has been widely utilized in machine learning ...

Nature-inspired approaches for distance metric learning in multivariate time series classification

Oregui, I.; Del Ser, J.; Pérez, A.; Lozano, J.A. (2017-07)

The applicability of time series data mining in many different fields has motivated the scientific community to focus on the development of new methods towards improving the performance of the classifiers over this particular ...

An efficient approximation to the K-means clustering for Massive Data

Capó, M.; Pérez, A.; Lozano, J.A. (2017-02-01)

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. In spite of its dependency on the initial ...

Nature-inspired approaches for distance metric learning in multivariate time series classification

Oregui, I.; Del Ser, J.; Pérez, A.; Lozano, J.A. (2017)

The applicability of time series data mining in many different fields has motivated the scientific community to focus on the development of new methods towards improving the performance of the classifiers over this particular ...

Efficient approximation of probability distributions with k-order decomposable models

Pérez, A.; Inza, I.; Lozano, J.A. (2016-07)

During the last decades several learning algorithms have been proposed to learn probability distributions based on decomposable models. Some of these algorithms can be used to search for a maximum likelihood decomposable ...

An efficient approximation to the K-means clustering for Massive Data

Capó, M.; Pérez, A.; Lozano, J.A. (2016-06-28)

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. In spite of its dependency on the initial ...

Efficient approximation of probability distributions with k-order decomposable models

Pérez, A.; Inza, I.; Lozano, J.A. (2016-01-01)

During the last decades several learning algorithms have been proposed to learn probability distributions based on decomposable models. Some of these algorithms can be used to search for a maximum likelihood decomposable ...

LANA

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: Regional Project

The LANA project aims to contribute to the implementation of the Smart Factory concept in the Basque Country through research and development of the scientific-technological lines in the ICT field that support it.

BCAM principal investigator: Aritz Pérez

LANA II

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: Regional Project

The present project is a continuation of the LANA project.

BCAM principal investigator: Aritz Pérez

SEKUTEK

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: Regional Project

This project is the embryo from which a cybersecurity ecosystem will be generated, prioritising the development of doctoral theses, the presentation of articles in leading journals, the presentation of papers in technological forums, the generation of patents that protect the IP (Intellectual Property) ...

BCAM principal investigator: Aritz Pérez

CyberPrest

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: Regional Project

The SEKUTEK (Sekurtasun Teknologiak) project was launched in 2017, with the aim of coordinating the research activity in industrial cybersecurity of the different Basque agents to obtain the necessary degree of technological training to enable the transfer of this knowledge to Basque industry in the ...

BCAM principal investigator: Aritz Pérez

LangileOK

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: Regional Project

The LangileOK project aims to advance in the development and implementation of the Smart Factory concept in the Basque Country, recognising the fundamental position that factory workers occupy in the production process.

BCAM principal investigator: Aritz Pérez

DIGITAL

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: Regional Project

DIGITAL aims to enable a servitisation model that can provide a generic response to the needs detected in various industrial sectors in the Basque Country. To support this servitisation model, DIGITAL identifies a series of necessary base technologies.

BCAM principal investigator: Aritz Pérez

SONETO

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: Regional Project

BCAM principal investigator: Aritz Pérez

SARA

Data Science & Artificial Intelligence (DS)

Machine Learning (ML)

Type: National Project

Status: Ongoing Project

SARA aims to contribute to the development of a new industrial ecosystem capable of acquiring the collective knowledge that is currently dispersed among different people and agents in industry, and of structuring it according to manufacturing traits and attributes that allow, through the application ...

BCAM principal investigator: Aritz Pérez

MRCpy: a library for Minimax Risk Classifiers

The MRCpy library implements minimax risk classifiers (MRCs), which are based on robust risk minimization and can use the 0-1 loss.

Authors: Kartheek Reddy, Claudia Guerrero, Aritz Pérez, Santiago Mazuelas

License: free and open source software

Download from

https://github.com/MachineLearningBCAM/MRCpy

OPTECOT - Optimal Evaluation Cost Tracking

This repository contains supplementary material for the paper Speeding-up Evolutionary Algorithms to solve Black-Box Optimization Problems. The paper presents OPTECOT (Optimal Evaluation Cost Tracking): a technique to reduce the cost of solving a computationally expensive black-box optimization problem using population-based algorithms, avoiding loss of solution quality. OPTECOT requires a set of approximate objective functions of different costs and accuracies, obtained by modifying a strategic parameter in the definition of the original function. During the execution of the algorithm, the proposal selects in real time the lowest-cost approximation according to the trade-off between cost and accuracy. The repository also contains a library for applying OPTECOT with the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) optimization algorithm to optimization problems other than those addressed in the paper.

Authors: Judith Echevarrieta, Etor Arza, Aritz Pérez

License: free and open source software

Download from

https://github.com/JudithEtxebarrieta/OPTECOT.git

TransfHH

A multi-domain methodology to analyze an optimization problem set

Authors: Etor Arza, Ekhiñe Irurozki, Josu Ceberio, Aritz Pérez

License: free and open source software

Download from

https://github.com/EtorArza/TransfHH

FractalTree

Implementation of the procedures presented in A. Pérez, I. Inza and J.A. Lozano (2016). Efficient approximation of probability distributions with k-order decomposable models. International Journal of Approximate Reasoning 74, 58-87.

Authors: Aritz Pérez

License: free and open source software

Download from

https://bitbucket.org/AritzPerez/fractaltree/src/master/

MixtureDecModels

Learning mixture of decomposable models with hidden variables

Authors: Aritz Pérez

License: free and open source software

Placement

Local

BayesianTree

Approximating probability distributions with mixtures of decomposable models

Authors: Aritz Pérez

License: free and open source software

Placement

Local

KmeansLandscape

Study the k-means problem from a local optimization perspective

Authors: Aritz Pérez

License: free and open source software

Placement

Local

PGM

Procedures for learning probabilistic graphical models

Authors: Aritz Pérez

License: free and open source software

Placement

Local

On-line Elastic Similarity Measures

Adaptation of the most frequently used elastic similarity measures, Dynamic Time Warping (DTW), Edit Distance (Edit), Edit Distance for Real Sequences (EDR) and Edit Distance with Real Penalty (ERP), to the on-line setting.

Authors: Izaskun Oregi, Aritz Pérez, Javier Del Ser, Jose A. Lozano

License: free and open source software

Download from

https://bitbucket.org/izaskun_oregui/esm/src/master/