M: +34 946 567 842
F: +34 946 567 842
E: aperez@bcamath.org
Information of interest
ORCID: 0000-0002-8128-1099
Postdoctoral Fellow at BCAM. His main methodological research lines include probabilistic graphical models, supervised classification, information theory, density estimation and feature subset selection. These methodological contributions have been applied to bioinformatics (genetics and epigenetics) and ecological modelling (fisheries).

Speeding-Up Evolutionary Algorithms to Solve Black-Box Optimization Problems
(2024-01-10) Population-based evolutionary algorithms are often considered when approaching computationally expensive black-box optimization problems. They employ a selection mechanism to choose the best solutions from a given population ...

Large-scale unsupervised spatio-temporal semantic analysis of vast regions from satellite images sequences
(2024) Temporal sequences of satellite images constitute a highly valuable and abundant resource for analyzing regions of interest. However, the automatic acquisition of knowledge on a large scale is a challenging task due to ...

Efficient Learning of Minimax Risk Classifiers in High Dimensions
(2023-08-01) High-dimensional data is common in multiple areas, such as health care and genomics, where the number of features can be tens of thousands. In such scenarios, the large number of features often leads to inefficient ...

Fast K-Medoids With the l_1-Norm
(2023-07-26) K-medoids clustering is one of the most popular techniques in exploratory data analysis. The most commonly used algorithms for this problem are quadratic in the number of instances, n, and usually the quality of ...
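For reference, the quadratic baseline the abstract alludes to can be sketched as a plain alternating K-medoids procedure under the l1 norm. This is a generic illustration, not the paper's accelerated method; all names here are made up for the sketch.

```python
import random

def l1(a, b):
    # l1 (Manhattan) distance between two points
    return sum(abs(x - y) for x, y in zip(a, b))

def k_medoids_l1(points, k, iters=20, seed=0):
    """Plain alternating K-medoids under the l1 norm: assign each point to
    its nearest medoid, then re-pick each cluster's medoid as the member
    minimizing the total l1 distance to the rest of the cluster. The medoid
    update is quadratic in the cluster size, which is the cost the paper's
    faster methods target."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: l1(p, medoids[j]))
            clusters[j].append(p)
        new_medoids = [
            min(c, key=lambda m: sum(l1(m, q) for q in c)) if c else medoids[j]
            for j, c in enumerate(clusters)
        ]
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(k_medoids_l1(pts, 2)))  # one medoid per natural group
```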

Fast Computation of Cluster Validity Measures for Bregman Divergences and Benefits
(2023) Partitional clustering is one of the most relevant unsupervised learning and pattern recognition techniques. Unfortunately, one of the main drawbacks of these methodologies is that the number of clusters is ...

Learning the progression patterns of treatments using a probabilistic generative model
(2022-12-15) Modeling a disease or the treatment of a patient has drawn much attention in recent years due to the vast amount of information that Electronic Health Records contain. This paper presents a probabilistic generative model ...

Implementing the Cumulative Difference Plot in the IOHanalyzer
(2022-07) The IOHanalyzer is a web-based framework that enables easy visualization and comparison of the quality of stochastic optimization algorithms. IOHanalyzer offers several graphical and statistical tools to analyze the results ...

An active adaptation strategy for streaming time series classification based on elastic similarity measures
(2022-05-21) In streaming time series classification problems, the goal is to predict the label associated with the most recently received observations over the stream according to a set of categorized reference patterns. In online ...

Generalized Maximum Entropy for Supervised Classification
(2022-04) The maximum entropy principle advocates evaluating events’ probabilities using a distribution that maximizes entropy among those satisfying certain expectation constraints. Such a principle can be generalized for ...
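As a toy illustration of the principle (not the paper's generalization), the maximum-entropy distribution over die faces with a fixed mean is a Gibbs distribution; its multiplier can be found by simple bisection, since the tilted mean is increasing in the parameter. The function name and setup below are invented for the sketch.

```python
import math

def max_entropy_die(target_mean):
    """Maximum-entropy distribution over die faces 1..6 subject to the
    constraint E[X] = target_mean. The solution has the Gibbs form
    p_i proportional to exp(lam * i); lam is found by bisection because
    the constrained mean is monotonically increasing in lam."""
    faces = list(range(1, 7))

    def mean_for(lam):
        w = [math.exp(lam * i) for i in faces]
        z = sum(w)
        return sum(i * wi for i, wi in zip(faces, w)) / z

    lo, hi = -10.0, 10.0  # bracket for the multiplier
    while hi - lo > 1e-12:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    w = [math.exp(lo * i) for i in faces]
    z = sum(w)
    return [wi / z for wi in w]

p = max_entropy_die(4.5)
print([round(pi, 4) for pi in p])  # probabilities increase toward face 6
```

With no constraint beyond normalization the same principle recovers the uniform distribution; the mean constraint tilts it exponentially.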

Rank aggregation for non-stationary data streams
(2022) The problem of learning over non-stationary ranking streams arises naturally, particularly in recommender systems. The rankings represent the preferences of a population, and the non-stationarity means that the distribution ...

On the relative value of weak information of supervision for learning generative models: An empirical study
(2022) Weakly supervised learning aims to learn predictive models from partially supervised data, an easy-to-collect alternative to the costly standard full supervision. During the last decade, the research community has ...

LASSO for streaming data with adaptative filtering
(2022) Streaming data is ubiquitous in modern machine learning, and so the development of scalable algorithms to analyze this sort of information is a topic of current interest. On the other hand, the problem of l1-penalized ...

Are the statistical tests the best way to deal with the biomarker selection problem?
(2022) Statistical tests are a powerful set of tools when applied correctly, but unfortunately their widespread misuse has caused great concern. Among many other applications, they are used in the detection of biomarkers, so ...

On the use of the descriptive variable for enhancing the aggregation of crowdsourced labels
(2022)The use of crowdsourcing for annotating data has become a popular and cheap alternative to expert labelling. As a consequence, an aggregation task is required to combine the different labels provided and agree on a single ...

Machine learning from crowds using candidate set-based labelling
(2022) Crowdsourcing is a popular cheap alternative in machine learning for gathering information from a set of annotators. Learning from crowd-labelled data involves dealing with its inherent uncertainty and inconsistencies. In ...

Dirichlet process mixture models for non-stationary data streams
(2022) In recent years, we have seen a handful of works on inference algorithms over non-stationary data streams. Given their flexibility, Bayesian non-parametric models are a good candidate for these scenarios. However, reliable ...

Non-parametric discretization for probabilistic labeled data
(2022) Probabilistic label learning is a challenging task that arises from recent real-world problems within the weakly supervised classification framework. In this task, algorithms have to deal with datasets where each instance ...

Comparing Two Samples Through Stochastic Dominance: A Graphical Approach
(2022) Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples ...
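As background, a basic first-order stochastic dominance check between two samples compares their empirical CDFs; the paper's graphical approach is more refined, and this sketch only shows the textbook criterion with invented names.

```python
import bisect

def ecdf(sample):
    """Empirical CDF of a sample, returned as a step function."""
    xs = sorted(sample)
    n = len(xs)
    # F(x) = fraction of observations <= x
    return lambda x: bisect.bisect_right(xs, x) / n

def dominates(a, b):
    """First-order stochastic dominance (larger values better): sample a
    dominates sample b iff F_a(x) <= F_b(x) at every observed value,
    i.e. a's mass is never shifted toward smaller values relative to b."""
    Fa, Fb = ecdf(a), ecdf(b)
    return all(Fa(x) <= Fb(x) for x in sorted(set(a) | set(b)))

rewards_a = [3, 5, 6, 8]  # e.g. rewards of algorithm A over four runs
rewards_b = [1, 2, 4, 6]
print(dominates(rewards_a, rewards_b))  # True: A's ECDF never exceeds B's
```

When neither sample dominates the other, the ECDFs cross, which is precisely the situation where a purely scalar summary such as the mean can mislead.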

Statistical assessment of experimental results: a graphical approach for comparing algorithms
(2021-08-25) Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples ...

A cheap feature selection approach for the K-means algorithm
(2021-05) The increase in the number of features that need to be analyzed in a wide variety of areas, such as genome sequencing, computer vision or sensor networks, represents a challenge for the K-means algorithm. In this regard, ...

K-means for Evolving Data Streams
(2021-01-01) Nowadays, streaming data analysis has become a relevant area of research in machine learning. Most of the data streams available are unlabeled, and thus it is necessary to develop specific clustering techniques that take ...

On the fair comparison of optimization algorithms in different machines
(2021)An experimental comparison of two or more optimization algorithms requires the same computational resources to be assigned to each algorithm. When a maximum runtime is set as the stopping criterion, all algorithms need to ...

A Machine Learning Approach to Predict Healthcare Cost of Breast Cancer Patients
(2021) This paper presents a novel machine learning approach to perform an early prediction of the healthcare cost of breast cancer patients. The learning phase of our prediction method considers the following two steps: i) in ...

Identifying common treatments from Electronic Health Records with missing information. An application to breast cancer.
(2020-12-29) The aim of this paper is to analyze the sequence of actions in the health system associated with a particular disease. In order to do that, using Electronic Health Records, we define a general methodology that allows us ...

Minimax Classification with 0-1 Loss and Performance Guarantees
(2020-12-01) Supervised classification techniques use training samples to find classification rules with small expected 0-1 loss. Conventional methods achieve efficient learning and out-of-sample generalization by minimizing surrogate ...

Statistical model for reproducibility in ranking-based feature selection
(2020-11-05) The stability of feature subset selection algorithms has become crucial in real-world problems due to the need for consistent experimental results across different replicates. Specifically, in this paper, we analyze the ...

General supervision via probabilistic transformations
(2020-08-01) Different types of training data have led to numerous schemes for supervised classification. Current learning techniques are tailored to one specific scheme and cannot handle general ensembles of training samples. This ...

Kernels of Mallows Models under the Hamming Distance for solving the Quadratic Assignment Problem
(2020-07) The Quadratic Assignment Problem (QAP) is a well-known permutation-based combinatorial optimization problem with real applications in industrial and logistics environments. Motivated by the challenge that this NP-hard ...

An efficient K-means clustering algorithm for tall data
(2020) The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial ...
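For context, the baseline that such approximations accelerate is the textbook Lloyd's algorithm for K-means, sketched here in plain Python; this is not the paper's parallel method, and the names are invented for the sketch.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Textbook Lloyd's algorithm: alternate between assigning each point
    to its nearest centroid (squared Euclidean distance) and recomputing
    each centroid as the mean of its cluster. Each iteration touches every
    point, which is why large datasets motivate faster approximations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        for j, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(c) for col in zip(*c)]
    return centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(kmeans(pts, 2)))  # one centroid near each group
```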

An adaptive neuroevolution-based hyper-heuristic
(2020) According to the No-Free-Lunch theorem, an algorithm that performs efficiently on any type of problem does not exist. In this sense, algorithms that exploit problem-specific knowledge usually outperform more generic ...

Supervised non-parametric discretization based on Kernel density estimation
(2019-12-19) Nowadays, machine learning algorithms can be found in many applications where classifiers play a key role. In this context, discretizing continuous attributes is a common step prior to classification tasks, the main ...

Approaching the Quadratic Assignment Problem with Kernels of Mallows Models under the Hamming Distance
(2019-07) The Quadratic Assignment Problem (QAP) is an especially challenging permutation-based NP-hard combinatorial optimization problem, since instances of size $n>40$ are seldom solved using exact methods. In this sense, many ...

Online Elastic Similarity Measures for time series
(2019-04) The way similarity is measured among time series is of paramount importance in many data mining and machine learning tasks. For instance, Elastic Similarity Measures are widely used to determine whether two time series are ...

On the evaluation and selection of classifier learning algorithms with crowdsourced data
(2019-02-16) In many current problems, the actual class of the instances, the ground truth, is unavailable. Instead, with the intention of learning a model, the labels can be crowdsourced by harvesting them from different annotators. ...

Predictive engineering and optimization of tryptophan metabolism in yeast through a combination of mechanistic and machine learning models
(2019) In combination with advanced mechanistic modeling and the generation of high-quality multidimensional data sets, machine learning is becoming an integral part of understanding and engineering living systems. Here we show ...

Crowd Learning with Candidate Labeling: an EM-based Solution
(2018-09-27) Crowdsourcing is widely used nowadays in machine learning for data labeling. Although in the traditional case annotators are asked to provide a single label for each instance, novel approaches allow annotators, in case ...

Are the artificially generated instances uniform in terms of difficulty?
(2018-06) In the field of evolutionary computation, it is usual to generate artificial benchmarks of instances that are used as a testbed to determine the performance of the algorithms at hand. In this context, a recent work on ...

On-Line Dynamic Time Warping for Streaming Time Series
(2017-09) Dynamic Time Warping is a well-known measure of dissimilarity between time series. Due to its flexibility to deal with non-linear distortions along the time axis, this measure has been widely utilized in machine learning ...

Nature-inspired approaches for distance metric learning in multivariate time series classification
(2017-07) The applicability of time series data mining in many different fields has motivated the scientific community to focus on the development of new methods towards improving the performance of the classifiers over this particular ...

An efficient approximation to the K-means clustering for Massive Data
(2017-02-01) Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. In spite of its dependency on the initial ...

Efficient approximation of probability distributions with k-order decomposable models
(2016-07) During the last decades, several learning algorithms have been proposed to learn probability distributions based on decomposable models. Some of these algorithms can be used to search for a maximum likelihood decomposable ...

MRCpy: a library for Minimax Risk Classifiers
The MRCpy library implements minimax risk classifiers (MRCs), which are based on robust risk minimization and can use the 0-1 loss.
Authors: Kartheek Reddy, Claudia Guerrero, Aritz Perez, Santiago Mazuelas
License: free and open source software
OPTECOT - Optimal Evaluation Cost Tracking
This repository contains supplementary material for the paper Speeding-up Evolutionary Algorithms to Solve Black-Box Optimization Problems. In this work, we have presented OPTECOT (Optimal Evaluation Cost Tracking): a technique to reduce the cost of solving a computationally expensive black-box optimization problem with population-based algorithms while avoiding loss of solution quality. OPTECOT requires a set of approximate objective functions of different costs and accuracies, obtained by modifying a strategic parameter in the definition of the original function. The proposal allows selecting the lowest-cost approximation with a suitable trade-off between cost and accuracy in real time during the algorithm execution. To solve optimization problems different from those addressed in the paper, the repository also contains a library for applying OPTECOT with the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) optimization algorithm.
Authors: Judith Echevarrieta, Etor Arza, Aritz Pérez
License: free and open source software
TransfHH
A multi-domain methodology to analyze an optimization problem set
Authors: Etor Arza, Ekhiñe Irurozki, Josu Ceberio, Aritz Perez
License: free and open source software
FractalTree
Implementation of the procedures presented in A. Pérez, I. Inza and J.A. Lozano (2016). Efficient approximation of probability distributions with k-order decomposable models. International Journal of Approximate Reasoning 74, 58-87.
Authors: Aritz Pérez
License: free and open source software
MixtureDecModels
Learning mixture of decomposable models with hidden variables
Authors: Aritz Pérez
License: free and open source software
Placement: Local
BayesianTree
Approximating probability distributions with mixtures of decomposable models
Authors: Aritz Pérez
License: free and open source software
Placement: Local
KmeansLandscape
Study of the K-means problem from a local optimization perspective
Authors: Aritz Pérez
License: free and open source software
Placement: Local
PGM
Procedures for learning probabilistic graphical models
Authors: Aritz Pérez
License: free and open source software
Placement: Local
Online Elastic Similarity Measures
Adaptation to the online setting of the most frequently used elastic similarity measures: Dynamic Time Warping (DTW), Edit Distance (Edit), Edit Distance for Real Sequences (EDR) and Edit Distance with Real Penalty (ERP).
Authors: Izaskun Oregi, Aritz Perez, Javier Del Ser, Jose A. Lozano
License: free and open source software
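For reference, the offline DTW recurrence that such online adaptations build on is the standard dynamic program below; this sketch is not the repository's online variant, and the function name is invented for illustration.

```python
def dtw(x, y):
    """Classic O(len(x) * len(y)) dynamic-programming computation of the
    Dynamic Time Warping distance between two univariate series, using
    absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(x), len(y)
    # D[i][j] = minimal cost of aligning x[:i] with y[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # step in x only
                                 D[i][j - 1],      # step in y only
                                 D[i - 1][j - 1])  # step in both
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the warping absorbs the repeated 2
```

The challenge the repository addresses is that this recurrence assumes both series are fully available, which does not hold when observations arrive one at a time over a stream.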