**Introduction**

The ever-increasing energy needs of modern society, households, companies, and the economy itself, alongside the imperative of developing a sustainable energy consumption model, have led individuals and companies to reconsider the way they consume energy. The quest for a more sustainable and eco-friendly approach that preserves energy for future needs has made energy efficiency a pressing issue that will remain relevant for years to come.

In the past, inconsistent results in measuring energy consumption have hindered attempts to improve energy efficiency and reduce waste. However, the emergence of Machine Learning (ML) algorithms has opened up new possibilities in energy consumption analysis. ML can identify patterns, correlations, and trends in energy consumption data. Integrating ML algorithms into energy management systems therefore enables us to gain a deeper understanding of usage patterns and to optimise energy usage for greater cost savings and sustainability.

**Energy efficiency and data breakdown**

Within the EVIDENT project, we analyse the energy consumption and energy behaviour of each household. An online digital platform is used to gather energy data on a regular (monthly) basis. Our dataset consists of numerical (e.g., household consumption in a given period of time) and demographic (e.g., number and age of household members) data. Since households may differ significantly from one another, it is crucial to obtain a large volume of data before proceeding with any analysis. The dataset thus provides a wealth of information that is used to build each user’s profile, as shown in detail in Figure 1.

**Gaining insights from energy data**

Collecting the appropriate energy and demographic information is the first step in using energy data to draw scientific conclusions through statistical analysis and ML models. This information, often called “raw data,” needs to be processed and analysed before our algorithms and models can use it.

Data analysis allows us to examine the relationship between the different data categories, the correlation between demographic characteristics and energy consumption, and how each category impacts energy efficiency. Understanding these relationships increases the likelihood of generating robust energy efficiency suggestions that take into account all demographic and consumption data. Energy data analysis can also provide insight into how much energy each of the user’s electrical appliances consumes, which is usually called a “disaggregated consumption profile.” In this way, the suggestions become more targeted, allowing users to adjust comfortably through minor behavioural changes that can be sustained in the long run.

The process from data collection to generating energy efficiency suggestions through ML tools is a systematic procedure that requires time and effort. The data collected is first archived and stored. The data cleaning process that follows removes incorrect, corrupt, or incomplete data from the dataset. After cleaning the dataset, we group the data into subsets. In this phase, maintaining the quality of our data is of foremost importance. This process results in a smaller dataset (the compact dataset) of comparable quality to the original one. It is this compact dataset that is used to test the performance and accuracy of the analytics tools we develop.
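
The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the field names (`household_id`, `month`, `kwh`) and the validity rule (a finite, non-negative consumption value) are assumptions made for the example.

```python
import math

# Hypothetical monthly records; the field names are illustrative only.
records = [
    {"household_id": "H1", "month": "2023-01", "kwh": 310.5},
    {"household_id": "H2", "month": "2023-01", "kwh": None},          # incomplete
    {"household_id": "H3", "month": "2023-01", "kwh": -42.0},         # incorrect
    {"household_id": "H4", "month": "2023-01", "kwh": float("nan")},  # corrupt
    {"household_id": "H5", "month": "2023-01", "kwh": 275.0},
]


def is_valid(record):
    """Assumed rule: consumption must be a finite, non-negative number."""
    kwh = record.get("kwh")
    if not isinstance(kwh, (int, float)):
        return False
    return math.isfinite(kwh) and kwh >= 0


# Keep only the records that pass the validity check.
cleaned = [r for r in records if is_valid(r)]
print([r["household_id"] for r in cleaned])  # ['H1', 'H5']
```

In practice the validity rules would depend on how the platform records its data, for example plausible upper bounds per household or checks for duplicate months.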

**Energy consumption data in machine learning**

To generate energy efficiency suggestions from the compact dataset, we use statistical analysis and suggestion models that belong to the field of ML. In this step of the process, we focus on selecting the appropriate statistical models to accurately define the relationship between the categories of our data as well as their impact on energy consumption. The connections between the variables of the dataset can be established with ARIMA [1], cross-correlation [2], and Linear Discriminant Analysis (LDA) [3], alongside visualisation tools for comparison and observation purposes, such as histograms and scatter plots. These methods identify which parameters in the dataset affect energy consumption the most. These are called “key parameters”. Determining the key parameters allows us to disregard low-significance features that have little effect on energy consumption. In turn, this reduces the volume of the dataset and simplifies the analysis without sacrificing its quality.
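
As a rough illustration of how key parameters might be singled out, the sketch below ranks features by the absolute value of their Pearson correlation with consumption. Pearson correlation is used here as a simple stand-in for the methods named above, not as the project's method. The feature names, the synthetic values, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Synthetic, illustrative data for six households.
features = {
    "household_size": [1, 2, 3, 4, 5, 2],
    "floor_area_m2": [45, 60, 85, 100, 120, 70],
    "survey_day_of_month": [25, 3, 17, 9, 12, 9],
}
consumption_kwh = [150, 230, 320, 410, 500, 260]

THRESHOLD = 0.5  # assumed cut-off for "key" parameters

scores = {
    name: abs(pearson(values, consumption_kwh))
    for name, values in features.items()
}
key_parameters = sorted(
    (name for name, score in scores.items() if score >= THRESHOLD),
    key=lambda name: scores[name],
    reverse=True,
)
print(key_parameters)
```

With this toy data, household size and floor area pass the threshold while the survey day does not, so the latter would be dropped from further analysis.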

The dataset includes demographic information in a categorical format. A categorical variable (sometimes called a nominal variable) is one that has two or more categories with no intrinsic ordering (e.g., race or sex; variables such as age group or educational level are also categorical, although they carry a natural order). One of the main challenges of working with categorical data is that most ML algorithms are designed to work with numerical data. They do not handle non-numeric values efficiently, which makes it difficult to build accurate and versatile procedures. Numerical data, by contrast, supports arithmetic operations and can be used in any statistical analysis. Thus, categorical data must be transformed into a numerical format before it can be used as input for the ML models.
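
One common way to perform this transformation is one-hot encoding, where each category becomes its own 0/1 column. The sketch below is a minimal, library-free version. The `dwelling_type` variable and its values are hypothetical, and the source does not state which encoding the project uses.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical feature with no intrinsic ordering.
dwelling_type = ["apartment", "detached", "apartment", "terraced"]

categories, encoded = one_hot_encode(dwelling_type)
print(categories)  # ['apartment', 'detached', 'terraced']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

One-hot encoding avoids implying an order between categories, which a plain integer code (0, 1, 2) would do. For ordered categories, an integer code that respects the order can be a reasonable alternative.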

Subsequently, the dataset is divided into two groups, the “train subset” and the “test subset”, which are basic inputs for ML predictive analysis. The main difference between the two subsets is that training data is used to train the ML model, whereas testing data is used to check the accuracy of the model. The training dataset is generally larger than the testing dataset. The next step is the prediction, which can be based on various methodologies such as Random Forest [4], Linear Regression [5], and Artificial Neural Networks (ANNs) [6]. The output of the model is a prediction of the consumption for each consumer, which determines how energy efficient they are. The resulting consumption value, along with the demographics of each user, defines what we call a “consumption profile”. These consumption profiles are categorised in X different groups on the basis of their energy efficiency. Depending on which group each consumption profile belongs to, personalised energy suggestions and tips are proposed.
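
The train/test workflow can be sketched with a single-feature linear regression fitted by ordinary least squares. Everything below is an assumption made for illustration: the data is synthetic, the 80/20 split ratio is a common convention rather than the project's documented choice, and a real model would use many more features.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic households: (household_size, monthly consumption in kWh).
samples = []
for _ in range(50):
    size = rng.randint(1, 6)
    kwh = 70.0 + 80.0 * size + rng.uniform(-15.0, 15.0)
    samples.append((size, kwh))

# Shuffle, then keep 80% for training and 20% for testing (assumed ratio).
rng.shuffle(samples)
split = int(0.8 * len(samples))
train, test = samples[:split], samples[split:]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Train on the training subset only.
slope, intercept = fit_line(train)

# Check accuracy on data the model has not seen: mean absolute error in kWh.
mae = sum(abs(y - (slope * x + intercept)) for x, y in test) / len(test)
print(f"slope={slope:.1f}, intercept={intercept:.1f}, test MAE={mae:.1f} kWh")
```

Evaluating on the held-out test subset gives an estimate of how the model would perform on new households, which is the purpose of keeping the two subsets separate.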

**Challenges and future work**

Although ML applications seem promising for improving energy efficiency in the residential sector, they are not without challenges. The first relates to the energy behaviour of users, which does not always follow predetermined or expected patterns. This confounds ML models and limits their performance. The second challenge relates to the data itself. Even though the volume of data on total energy consumption per household has been constantly increasing, there is a shortage of data that accurately measures consumption per individual appliance.

These two challenges have led to the development of iterative research approaches and new hardware concepts. These include the use of new technologies (e.g., smart metering, electricity usage monitors) that can provide more granular data to better understand energy consumption patterns, and the development of new ML techniques that are better able to decompose complex data.

The deployment of the approaches described here, together with the emergence of ML tools, promises to deepen our understanding of residential energy consumption habits. Analysing these habits makes it possible to develop feasible, sustainable, and energy-efficient practices.

**References**

[1] S. L. Ho, M. Xie, and T. N. Goh, “A comparative study of neural network and Box–Jenkins ARIMA modeling in time series prediction,” Comput. Ind. Eng., vol. 42, no. 2, pp. 371–375, 2002.

[2] N. K. Verma, K. Piyush, R. K. Sevakula, S. Dixit, and A. Salour, “Ranking of sensitive positions based on statistical parameters and cross correlation analysis,” in 2012 Sixth International Conference on Sensing Technology (ICST), 2012, pp. 815–821.

[3] P. Xanthopoulos, P. M. Pardalos, and T. B. Trafalis, “Linear Discriminant Analysis,” in Robust Data Mining, New York, NY: Springer New York, 2013, pp. 27–33.

[4] M. Belgiu and L. Drăguţ, “Random forest in remote sensing: A review of applications and future directions,” ISPRS J. Photogramm. Remote Sens., vol. 114, pp. 24–31, 2016.

[5] X. Su, X. Yan, and C.-L. Tsai, “Linear regression,” WIREs Comput. Stat., vol. 4, no. 3, pp. 275–294, 2012.

[6] J. J. Hopfield, “Artificial neural networks,” IEEE Circuits Devices Mag., vol. 4, no. 5, pp. 3–10, 1988.