WP3 Report 3 1



Executive Summary

This report summarizes Task 4, Future perspectives, carried out within work package WP3 Smart meters of the ESSNet Big Data project. The objective of Task 4 was to suggest other potential statistical products in the domain of energy consumption or in other statistical domains, relying on the results of Task 3. In addition to electricity smart meters, the potential use of other smart meters was discussed. Finally, an overview was given of the different ways to aggregate the raw data and to produce statistics from data at different levels of aggregation (e.g. yearly, monthly, hourly).

The aim of this task was to use the linked data obtained from previous tasks and to find potential uses for it. Electricity data has great potential as a complementary data source for different tasks. The data can be related to several social and economic areas and thereby provide new insight into a phenomenon. With this in mind, we evaluated different categorization methods and used classification methods to gain better insight into the behavior and properties of electricity consumers. We also used the data to reveal the dynamics of energy use by economic activity and regional consumption patterns.

During the study we made the following findings:

  • When high-quality dependent variables describing households are available, a supervised learning approach can be used. Unsupervised methods are used if the cluster structure has to be discovered from the data.
  • A random forest model estimated for the nominal variable urban/rural classified about 77 percent of the smart meters correctly.
  • For identifying vacant or seasonally vacant dwellings two approaches can be used. One is to use electricity consumption data and identify zero or close to zero consumption over a certain period of time; the other is to apply classification methods to the electricity data.
  • Depending on the aggregation level, different goals can be achieved. With yearly data only empty dwellings with zero consumption can be identified, and dwellings can be categorized as empty by setting a certain consumption threshold. With monthly aggregated data seasonal patterns can be identified, and with hourly or more frequent data patterns of living can be identified and the chances of classifying households correctly are higher.
  • Electricity data has great potential for producing regional statistics, for showing the main business activities in regions and for identifying the main centers of different activity areas.
  • Smart meter data could be used for identifying consumption patterns and for giving background knowledge on some everyday phenomena. We studied energy consumption around the switch between summer time and winter time and did not find clear evidence that the daylight saving time policy saves electricity.
  • Several kinds of smart meters in use have the potential to provide data for producing statistics or to serve as a supplementary data source: water and gas meters, heating meters, weather stations, power monitors and waste bin trackers. The number of IoT (internet of things) devices is increasing rapidly, and there are probably many more devices that could provide data to replace existing data sources or to serve as a new or additional data source.
  • Feasibility of the use of data at different levels of aggregation: from the study of the aggregated data we concluded that the aggregation level and the availability of metadata about a metering point determine which statistical products can be produced from the smart meter data.
  • The more detailed the available information, the more products can be produced, but the more storage space and computing power are needed to handle the data. To speed up calculations it is recommended to use temporary tables of aggregated data even when the raw metering data is available.

The report contains three main sections:

  • Section 3. Potential usages of smart meters data
    • 3.1 - 3.4 Clustering and classifications
    • 3.5 - 3.8 Electricity Business activity and regional electricity statistics
    • 3.9 Identifying consumption patterns
  • Section 4. Other smart meters
  • Section 5. Feasibility of the use of data at different levels of aggregation


Electricity smart meters are planned to be installed in many countries by the end of 2020. On the one hand this provides a detailed overview of the electricity market in European countries; on the other hand it is a valuable source of data for statistics. The main goal of ESSNet Big Data Work package 3 was to demonstrate the potential use of data from smart electricity meters for the production of official statistics. The pilot had three goals with regard to expected outputs: first, to assess whether current survey-based business statistics can be replaced by statistics produced from electricity smart meter data; second, to produce new household statistics; and third, to identify vacant or seasonally vacant dwellings. The aim of the study conducted for SGA-2 was to look beyond the goals set for the work package and to identify potential statistical products in the domain of energy consumption or in other statistical domains by relying on the data produced during the previous task. As a result, during SGA-2 we focused on classification and on providing different use cases for smart meter data. The smart meter data was also used to produce regional electricity statistics, and it was evaluated whether the data can be used to produce tourism statistics. In addition, potential uses for other kinds of smart meters (e.g. natural gas, water) were proposed from a theoretical viewpoint. We also evaluated the impact of aggregation on producing different statistical products from the smart meter data. Six National Statistical Institutes participated in ESSNet Big Data Work package 3: Statistics Austria, Statistics Denmark, Statistics Estonia, Statistics Italy, Statistics Portugal, and Statistics Sweden.

In the previous reports of the work package, access to the data and the possibilities of linking electricity data with other sources were evaluated. The aim of this report is to suggest potential new uses of the electricity data. In addition, a brief overview of other smart meters and sensors is given and their possible usefulness for official statistics is evaluated. As different countries have access to data at different aggregation levels, the report also analyzes which outputs can be computed from the existing data and compares the cost and quality of producing the outputs from different kinds of input data.

Potential usages of smart meters data

Usage: classification methods

Input data

Based on a sample data set of smart meters provided by Statistics Estonia in their safe center, several classification algorithms were tested to identify characteristics of the smart meter user. All code is available on Github [1] in the subfolder classification. The sample consists of 1000 businesses and 1000 households, sampled from the linked smart meter data, with one year of readings for each. It was split into two parts: the actual reading data set (with the columns ID, time and consumption) and the background information, which differed between households and businesses. For households the following background variables were available:

  • household type,
  • number of rooms,
  • year of construction of the building,
  • size in square meters,
  • household size and
  • urban/rural.

For businesses the following characteristics of the enterprises were available:

  • turnover,
  • number of employees,
  • NACE classification and
  • region.

Feature definition

Some of these background variables were used as target variables for training different models and evaluating their goodness-of-fit. The first step was to build features/variables that are meaningful for the different target variables. To this end the data was first aggregated to days per smart meter. The following aggregated variables were computed in this step:

  • mean consumption,
  • ratio of total consumption and daytime consumption (day time was defined as 8:00 to 19:59, which could be refined at a later stage),
  • variance of the hourly readings and
  • the standard deviation of the differenced time series for the specific day.

After this step the data set has four variables for each day of each smart meter.
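This first aggregation step can be sketched in Python as follows. The sketch is illustrative and is not the code from the repository: the input layout (tuples of meter ID, date, hour and consumption), the function name `daily_features`, the use of population variance and standard deviation, and the reading of the ratio as total divided by daytime consumption are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev, pvariance


def daily_features(readings):
    """readings: iterable of (meter_id, date, hour, consumption), hourly values.

    Returns {(meter_id, date): dict with the four daily variables}.
    """
    by_day = defaultdict(list)
    for meter_id, date, hour, value in readings:
        by_day[(meter_id, date)].append((hour, value))

    features = {}
    for key, rows in by_day.items():
        rows.sort()  # order by hour so that differencing is meaningful
        values = [v for _, v in rows]
        total = sum(values)
        daytime = sum(v for h, v in rows if 8 <= h <= 19)  # 8:00 to 19:59
        diffs = [b - a for a, b in zip(values, values[1:])]
        features[key] = {
            "mean": mean(values),
            # assumed direction of the ratio: total / daytime
            "total_to_day_ratio": total / daytime if daytime > 0 else None,
            "variance": pvariance(values),
            "sd_diff": pstdev(diffs) if diffs else 0.0,
        }
    return features
```

Sample rather than population statistics could be used equally well; the source does not say which were applied.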

The next step was to generate a data set in which each line corresponds to a single smart meter, that is, to aggregate over days for each smart meter ID. In this step the following variables were computed:

  • mean of the daily mean consumption,
  • standard deviation of the daily mean consumption,
  • mean of the ratio of total consumption and daytime consumption,
  • standard deviation of the ratio of total consumption and daytime consumption,
  • mean of the variance of the hourly readings,
  • standard deviation of the variance of the hourly readings,
  • mean of the standard deviation of the differenced time series,
  • standard deviation of the standard deviation of the differenced time series and
  • the ratio between the mean of consumption on week days and the mean consumption on weekend days.

Classification methods and results

To evaluate the goodness-of-fit, repeated k-fold cross-validation was applied. In this procedure the data set is divided b times into k random subsamples. Within each repetition, k-1 subsamples are used as the training data set for the model and the one remaining subsample is used for evaluation, with each subsample serving once as the evaluation set.
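The splitting logic of repeated k-fold cross-validation can be sketched as follows. This is a generic Python illustration, not the implementation used in the study; the function name `repeated_kfold` and the seed parameter are hypothetical.

```python
import random


def repeated_kfold(n, k=10, b=3, seed=0):
    """Yield (train_idx, test_idx) for b repetitions of k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(b):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random partition per repetition
        folds = [idx[i::k] for i in range(k)]  # k subsamples of near-equal size
        for i, test in enumerate(folds):
            # the remaining k-1 subsamples form the training set
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test
```

With b repetitions and k folds the model is fitted b times k times, and the reported classification rate would be an average over these fits.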

In a realistic scenario with no auxiliary information for a specific smart meter, the first step would be to classify it as either a business or a household and then to estimate the additional characteristics that are necessary for further aggregation steps.

The settings b=3 and k=10 were chosen. Several different modelling / machine learning algorithms were tested. The first set of results shows the confusion matrix for estimating whether a smart meter belongs to a household (H) or a business (B). The first cell of each table shows the proportion of correctly classified observations.

Logistic regression

A natural choice for a baseline model is the logistic regression model, which already achieves a classification rate of almost 70 percent.

69.6 % B H
B 29.1 % 11.1 %
H 19.3 % 40.4 %
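The relation between the first cell and the rest of each table can be checked by summing the diagonal cells. The short Python sketch below does this for the logistic regression table; the function name is hypothetical and the code is an illustration only.

```python
def classification_rate(matrix):
    """Sum of the diagonal of a confusion matrix given in percent of all observations."""
    return sum(matrix[i][i] for i in range(len(matrix)))


logistic = [[29.1, 11.1],
            [19.3, 40.4]]
print(round(classification_rate(logistic), 1))  # 69.5
```

The small gap to the reported 69.6 percent is consistent with the cells having been rounded to one decimal.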

Boosted logistic regression

Boosting describes the technique of estimating the residuals of the first model (in this case a logistic regression) with a further model. This can be done in several iterations, so the number of boosting iterations is an important tuning parameter for this method. Boosting increases the classification rate of the logistic regression by about 12 percentage points, to about 82 percent.

81.7 % B H
B 37.7 % 7.6 %
H 10.8 % 44.0 %

k nearest neighbour

Using the Euclidean distance and the 10 nearest neighbors, about 80 percent of the smart meters can be classified into the right category.

79.4 % B H
B 33.5 % 5.7 %
H 15.0 % 45.9 %
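The k nearest neighbour rule with Euclidean distance can be sketched in plain Python as below. This is a generic illustration of the technique, not the library implementation used for the results above; the function name `knn_predict` and the tuple-based feature layout are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=10):
    """Majority vote among the k training points nearest to x (Euclidean distance)."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], x),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because the distance depends on the scale of each feature, the features would normally be standardized before such a rule is applied.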

Bayesian Generalized Linear Model

79.3 % B H
B 31.6 % 3.9 %
H 16.8 % 47.7 %

Bagged CART

Classification and regression trees (CART) are tree-based methods from machine learning. Bagging describes the procedure of averaging different models, each estimated on a bootstrapped input data set. The classification rate of 85.5 percent is about one percentage point below the best classification rate of 86.6 percent.

85.5 % B H
B 39.9 % 5.9 %
H 8.6 % 45.6 %

Support vector machine

79.1 % B H
B 29.9 % 2.3 %
H 18.6 % 49.2 %

Neural net

83.0 % B H
B 36.4 % 4.9 %
H 12.1 % 46.7 %

Random forest

The best result among the tested classification methods is reached by random forest, with a classification rate of 86.6 percent. A fairly high number of trees was used; the number of trees can be treated as a tuning parameter.

86.6 % B H
B 39.8 % 4.8 %
H 8.6 % 46.8 %

Random forest model for urban/rural classification of households

A random forest model is estimated for the nominal variable urban/rural and about 77 percent of the smart meters can be classified correctly.

76.9 % Rural Urban
Rural 16.5 % 8.8 %
Urban 14.3 % 60.3 %

Random forest model for household size

For this experiment the household size is capped at a maximum of 5. The classification rate for estimating the household size exactly is 44 percent, but in over 80 percent of cases the estimation error is smaller than one.

Random forest model for NACE code classification

The 14 NACE groups are used for training and testing the random forest model. The resulting classification rate is 36 percent. When aggregating the results into the service and production industry, 66 percent are classified correctly.

Conclusion - Classification methods

One of the crucial points in classification, and in modelling in general, is feature definition, since the quality of the results depends heavily on it. In the author's experience, random forest (which gave the best results in this small study) is a very versatile method that can often be applied directly without much time spent on regularization or tuning of the model.

Classification case study: Portugal data


The National Dwelling Register (FNA) was recently created from Census 2011 micro data, and administrative data is used to update the register. The main problem now is to determine dwelling status in terms of occupation (primary, secondary or vacant). The administrative source under study is the electricity consumption file, from which a threshold could be derived to ascertain dwelling occupation. Big Data, namely smart meter technology, offers more detailed data on electricity consumption. The file provided by the electricity company is just a sample file with no background information: the consumer is anonymized and the zip code is the only geographical information. Since there is no link between the national dwelling file and the smart meter electricity consumption file, it is not possible at the moment to determine the dwelling occupation status of each unit. We will therefore use this information for quality and validation purposes. First, we will implement a methodology that maps each unit as primary, secondary or vacant depending on its behavior. Second, that information will be grouped by zip code and compared with the corresponding data in the National Dwelling Register. Based on the results achieved, we may decide to focus on the areas with low correlation and try to improve the information with other studies or on-site investigation.


The ideal situation for determining dwelling status in terms of occupation (primary, secondary or vacant) would be to have access to all data captured by smart meters in Portugal. However, not all of the territory is covered by smart meters, and there is not yet any protocol with the electricity company for providing the information needed.

For that reason informal contacts were established and a first sample file with 138 installations was provided.

The information sent is anonymized, and for that reason the main purpose had to be shifted from determining dwelling occupation in order to update the dwelling register to using this information only for validation and quality purposes.

After analyzing these 2015 electricity consumption readings, we concluded that the work could be done if more data and complete sets of installations for some postal-code areas were available. The electricity company could not satisfy this request, but it delivered a more complete file that is used for the company's own studies.

This new file, like the previous one, has no metadata, and the smart meter identification is sequential (1..n).

It contains readings of electricity consumption at 15-minute intervals for the year 2016 and half of 2017.

For this reason, the record structure of the file is very simple:

  • ID – Sequential number from 1 to 5757
  • CP1 – Postal Code of 4 digits
  • CP2 – Postal Code of 3 digits
  • Timestamp – Complete date format YYYY-MM-DD HH:MM
  • Consumption – Consumption of electricity

Since this information is just a sample, it was decided to use a relational database to store the data. We used an Oracle database and the SQL*Loader tool to import the data.

 Table 1 summarizes some information about the data.

Table 1. Key information of the data.

Start                  01-01-2016 00:00    01-01-2016 00:00    01-01-2017 00:00
End                    01-07-2017 23:45    31-12-2016 23:45    01-07-2017 23:45
Nº Meters
Nº Records
Size (Gb)
Different ZIP Codes

  • Half year

Some transformations were made and two new columns were created: date (yyyy-mm-dd) and hour (HH:MM). The postal code was used to determine NUTS2, which was added to the table as a new variable.

Aggregations were made to support the data analysis. The data was aggregated by day, with one column for each day (547 day columns), which makes it possible to build a “matrix” with the days in columns, the smart meters (with NUTS2 and postal code) in rows and electricity consumption as the measure (Table 2). The same process was used to map the data by year and by month.

Table 2. Daily matrix consumption.
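The construction of the daily matrix can be illustrated with a short Python sketch. The study itself used an Oracle database, so this is a stand-in for the aggregation logic rather than the actual implementation; the function name `daily_matrix` and the tuple layout of the readings are assumptions.

```python
from collections import defaultdict


def daily_matrix(readings):
    """readings: iterable of (meter_id, 'YYYY-MM-DD HH:MM', consumption).

    Returns (sorted list of days, {meter_id: daily totals, None where no reading}).
    """
    totals = defaultdict(float)
    for meter_id, timestamp, value in readings:
        day = timestamp[:10]  # date part of the timestamp
        totals[(meter_id, day)] += value
    days = sorted({day for _, day in totals})
    meters = sorted({meter for meter, _ in totals})
    # one row per smart meter, one column per day
    matrix = {m: [totals.get((m, d)) for d in days] for m in meters}
    return days, matrix
```

Days without any reading are kept as `None` so that missing periods remain visible for the later imputation step.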


The same structure was also created with the number of readings per day in one table and the variance of consumption per day in another table.

In this phase the data was explored to detect possible problems and to learn more about the data.

The next graphic shows the number of smart meters and the electricity consumption by month (Figure 1).


Figure 1. Number of Smart Meters and consumption by month.

It can be observed that the number of smart meters is consistent over the months with the exception of March 2017. The electricity consumption appears to behave normally.

Since the postal code is very important in our study, the number of smart meters by postal code was analyzed (Figure 2).


Figure 2. Number of Smart Meter by Zip-Code.

Some postal codes are poorly represented while others seem to be well represented. In future work, the best-represented postal codes could receive special focus.

Another analysis to consider is whether each smart meter behaves similarly across its electricity readings or whether there are inconsistencies. Dividing the total consumption by the number of readings for each smart meter shows that some smart meters behave abnormally (Figure 3).


Figure 3. Average consumption of each smart meter.

These smart meters with very high electricity consumption are probably not households (Figure 4). Although we asked only for household meters, the file probably includes other buildings. These IDs therefore have to be removed from the study.

With the preliminary study of the data done, it is possible to proceed to data mining and to try to determine dwelling occupation based on the behavior of each unit.


Figure 4. Consumption behaviour of 3 standard smart meters.




The implementation of smart meter technology in Portugal is still in development. For that reason, several smart meters have missing information in some periods (mainly the early periods). Since it is vital that every smart meter can be classified by occupation status, the missing values have to be filled by some imputation method. Each missing value was imputed with the mean of the whole series of the respective smart meter.
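Mean imputation per smart meter can be sketched as follows. The function name and the use of `None` for a missing value are assumptions made for this illustration.

```python
def impute_with_series_mean(series):
    """Replace None entries with the mean of the observed values of the same meter."""
    observed = [v for v in series if v is not None]
    if not observed:
        return list(series)  # nothing to impute from
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in series]
```

This choice keeps the mean of each series unchanged but reduces its variance, which is worth keeping in mind when variance-based features are used later.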


Method 1

A cluster analysis was carried out to group smart meters with similar behavior. This required choosing a proximity measure and a grouping method. The Minkowski distance was chosen as the proximity measure: since the variables (consumption per day) are correlated, we could not use a Euclidean distance.
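For reference, the Minkowski distance of order p between two consumption vectors x and y is the p-th root of the sum of |x_i - y_i|^p. A minimal Python sketch follows; the order p used in the study is not stated, so it is left as a parameter, and the function name is hypothetical.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

With p = 1 this is the Manhattan distance and with p = 2 it coincides with the Euclidean distance, so the choice of p determines how the measure differs from the Euclidean case.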

Ten clusters were set up, with the following number of smart meters and average consumption per cluster (Table 3).

Table 3. Clusters.


From these results it was concluded that the dwellings in cluster 1 are definitely Vacant (V) and the others Occupied (O) (Table 4).

Table 4. Classification.


Having identified the vacant and occupied dwellings based on their consumption behavior, it is necessary to find the ones that are secondary.

For that, the electricity consumption behavior of each smart meter in each month was studied. First the summer months were analyzed and compared with the rest of the year. For those whose summer average is much higher, the status was changed to Secondary (S).

Then a month-by-month analysis was made: if a dwelling's consumption in one month is much higher than in all the other months, the status is changed to secondary. Finally, the behavior on weekends and on weekdays was studied: if the average consumption on weekends is more than three times the consumption on weekdays, the dwelling is also assumed to be secondary (Table 5).

Table 5. Defining secondary dwellings
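The three rules for secondary dwellings can be sketched in Python as below. Only the weekend rule (more than three times the weekday consumption) comes from the text; the definition of summer as June to August and the factor used for "much higher" are assumptions made for illustration, and the function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def is_secondary(monthly, weekend_mean, weekday_mean, factor=3.0):
    """monthly: 12 average consumptions, January first. True if any rule fires."""
    summer = monthly[5:8]                 # June to August (assumed definition)
    rest = monthly[:5] + monthly[8:]
    if _mean(summer) > factor * _mean(rest):   # summer much higher than rest of year
        return True
    ordered = sorted(monthly)
    if ordered[-1] > factor * ordered[-2]:     # one month much higher than all others
        return True
    return weekend_mean > 3.0 * weekday_mean   # weekend rule stated in the text
```

In practice the factor would have to be calibrated against dwellings whose status is known.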


The following results were obtained (Table 6).

Table 6. Final results.


Method 2

The intention is to apply the K-means method with k = 3, because this has an interpretation in terms of the type of housing: main, secondary or vacant. Due to the high heterogeneity of the records, the next step is to create some binary variables relevant to the definition of the groups and to limit the information to the first 7 days of the month.

Thus, for each of the 7 days, we associate the following variables:

• H0_H8: 1 if ID has consumption between 0h and 8h; 0 otherwise.

• H8_H16: 1 if ID has consumption between 8h and 16h; 0 otherwise.

• H16_H24: 1 if ID has consumption between 16h and 24h; 0 otherwise.

• S1X: 1 if ID has higher than average daily consumption; 0 otherwise.

• S2X: 1 if ID has consumption greater than 2 times the daily average; 0 otherwise.

• SDS: 1 if ID shows consumption at the weekend; 0 otherwise.

• FDS1X: 1 if ID has weekend consumption higher than the daily average; 0 otherwise.

The average consumption will take into account information on electricity consumption by NUTS II and total electricity consumers by NUTS II in 2015, which is available on the Statistics Portugal website.
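The binary variables for one ID and one day can be derived as in the following Python sketch. It assumes that readings are given as (hour, consumption) pairs, that "has consumption" means a value above zero, and that the reference daily average is supplied from outside (for example per NUTS II); the function name is hypothetical.

```python
def day_indicators(readings, daily_average, is_weekend):
    """readings: list of (hour, consumption) for one ID on one day."""

    def used(start, end):
        # 1 if any positive consumption falls in the window [start, end)
        return int(any(v > 0 for h, v in readings if start <= h < end))

    total = sum(v for _, v in readings)
    return {
        "H0_H8": used(0, 8),
        "H8_H16": used(8, 16),
        "H16_H24": used(16, 24),
        "S1X": int(total > daily_average),
        "S2X": int(total > 2 * daily_average),
        "SDS": int(is_weekend and total > 0),
        "FDS1X": int(is_weekend and total > daily_average),
    }
```

Stacking these dictionaries over the 7 days gives one row of the data matrix per ID.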

The following variables may also be included if they are considered relevant to the study.

• NUT2_11: 1 if ID belongs to the North region (11); 0 otherwise.

• NUT2_15: 1 if ID belongs to the Algarve region (15); 0 otherwise.

• NUT2_16: 1 if ID belongs to the Center region (16); 0 otherwise.

• NUT2_17: 1 if ID belongs to the Lisbon region (17); 0 otherwise.

• NUT2_18: 1 if ID belongs to the Alentejo region (18); 0 otherwise.

• NUT2_20: 1 if ID belongs to the Autonomous Region of the Azores; 0 otherwise.

• NUT2_30: 1 if ID belongs to the Autonomous Region of Madeira; 0 otherwise

From the data matrix, with x rows (one for each ID) and y columns (one for each of the variables above), the K-means algorithm will be applied:

Step 1: Initial partition of the subjects in k groups defined at the beginning;

Step 2: Calculate the centroids for each of the k groups and calculate the Euclidean distance from each centroid to each ID in the database;

Step 3: Assign each ID to the group whose centroid is closest and repeat the previous step until the minimum distance of each ID to the centroids of the k groups no longer changes significantly.
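The steps above can be sketched in plain Python as follows. This is a generic k-means illustration under two simplifications that differ from the description: the initial groups are defined by k randomly chosen IDs used as starting centroids, and iteration stops when the assignments no longer change. The function name and parameters are hypothetical.

```python
import math
import random


def kmeans(points, k=3, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples.

    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # Step 1: starting centroids (random IDs)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each ID to the group with the closest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:        # assignments stable: stop
            break
        labels = new_labels
        # Step 2: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # keep the old centroid if a group is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

Since the result depends on the starting centroids, several runs with different seeds are usually compared.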

Data aggregation

The classification results of method 1 were grouped by postal code, as shown in Table 7.

Table 7. Dwelling occupation status of smart meters using method 1.


The same aggregation was made for the dwellings of the national register. In this case there are four possible classifications of occupation: Occupied (O), Secondary (S), Vacant (V) and Unknown (U).

Table 8. Dwelling occupation status of national dwelling register (FNA).


Combining data

The two sets of information were combined by postal code, and for each region the differences found in each category were calculated (Table 9).

Table 9. Combining data and validating the differences.



Using smart meter data without any background information or metadata is possible, but it is very difficult to obtain good results. To achieve better results some requirements should be met:

• Have access to all of the data for the proposed study

• Have a channel of communication with the data owner to clarify doubts that could emerge

• Have the necessary technology and equipment to deal with huge amount of data

• Study the best methods to determine the behavior of consumption

The results obtained by this study were not conclusive because of the lack of information. It is necessary to have all the smart meter readings for the regions under study. More time should also be invested in investigating other methods for classifying the behavior of electricity consumption.

Nevertheless, this study allowed us to see that it is possible to use smart meter information to help improve our national dwelling register. We will therefore try in the future to gain access to all the information needed and to incorporate this process in our statistical production.

Classification case study: Danish data

In this section we use 2015 data to test whether we can predict the sector connected to a metering point. We select only a sample of the data and only the sectors public, industry, agriculture and households. The purpose of the classification using Danish data is to show how one can visualize the relationship between the variables one wishes to use as background variables in the classification model. To perform the analysis we use the following variables:

  • The average of the total daytime/nighttime consumption, distinguishing between summer and winter.
  • The average of the standard deviation of the consumption during daytime and nighttime respectively, again distinguishing between summer and winter.

One could create a number of additional variables describing the variation and consumption pattern. To be able to construct visualizations we use only two variables in our model. In all examples the average of the total daytime and the average of the total nighttime consumption are used in the model.

Discriminant analysis

In this section both a linear discriminant analysis and a quadratic discriminant analysis are performed. The first assumes equal covariance matrices in all groups; the second allows for group-specific covariance matrices.

Linear discriminant analysis


Figure 5. The output of linear discriminant analysis. The plot shows the actual sectors in the upper part and the predicted sectors in the lower part.

The diagonal shows the percent correctly predicted per sector.

                              Agriculture Household Industry Public


Quadratic discriminant analysis


Figure 6. The output of quadratic discriminant analysis. The plot shows the actual sectors in the upper part and the predicted sectors in the lower part.

The quadratic analysis predicts the public sector more accurately but agriculture and industry less so. In total, more meters are correctly classified.


K nearest neighbours


Figure 7. The output of K nearest neighbors.

The dots in the plot above show the actual sectors and the borders show the area in which the knn would predict a given sector. In this model k=5 (Figure 7). 

Clustering case study: Vacant or seasonally vacant dwellings

The aim of the study is to evaluate whether vacant or seasonally vacant dwellings can be identified by clustering methods. We use hourly and monthly electricity consumption data, and for labeling we use data from the register of people and the register of buildings. As the quality of the labeling data is not very good, we use an unsupervised clustering method to identify clusters that we expect to correspond mainly to empty dwellings, and we use the cluster centers to classify households and dwellings. Data preparation and handling were performed in Hadoop Hive and Spark. For clustering, the k-means implementation in Spark was used.


Electricity data

Statistics Estonia has access to the data in the Estonian electricity data hub for the years 2013-2015. As the share of installed smart meters was highest in 2015, we used only the 2015 readings. We used electricity consumption data at different levels of aggregation: yearly consumption data for identifying households not consuming electricity, monthly consumption for seasonal patterns, and hourly smart meter data for finding patterns of life. To make the consumption patterns comparable, the consumption data was normalized by the maximum or average consumption of a metering point, as the distribution of electricity consumption per day across households is heavily skewed to the right, as shown in Report 2.
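The normalization of one metering point's series by its maximum or its average can be sketched as follows. The function name and the handling of an all-zero series are assumptions made for this illustration.

```python
def normalize(series, by="max"):
    """Scale one metering point's consumption series by its maximum or its mean."""
    if by not in ("max", "mean"):
        raise ValueError("by must be 'max' or 'mean'")
    scale = max(series) if by == "max" else sum(series) / len(series)
    if scale == 0:
        return [0.0 for _ in series]  # no consumption at all: keep the series at zero
    return [v / scale for v in series]
```

After this step metering points with very different consumption levels can be compared by the shape of their consumption pattern alone.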

Where no smart meter is installed, the hourly consumption is modeled by the service provider (Table 10). Whether a smart meter is installed is indicated by whether the monthly consumption is recorded with Wh or kWh accuracy, but this is not a 100% accurate method for identifying the smart meter installation time.

Table 10. The number of metering points of private customers. Share of manual and remote reading of metering points.

Year Metering Points Manual Remote
2013 623162 87% 13%
2014 624354 63% 37%
2015 625937 60% 40%

Register of people

From the register of people 1315944 residents were selected and by using a common address 538092 households were formed.

Register of buildings

From the register of buildings 730876 dwellings were selected.


The table of households was linked with the table of dwellings, and for each household the following features were available: dwelling type (H - house or K - apartment); number of rooms in the dwelling (10 corresponds to 10+ rooms); construction year of the building (bins: 0 - missing values, 1 - before 1918, 2 - 1918-1940, 3 - 1940-1991, 4 - 1991+); dwelling size in m2 (bins: 0 - 0-37.7, 1 - 37.7-45.2, 2 - 45.2-49.9, 3 - 49.9-60.9, 4 - 60.9-69.3, 5 - 69.3-96.8, 6 - 96.8+); number of people registered at that address according to the register of people (not the actual number; 10 corresponds to 10+ people in the household); and location in an urban area or the countryside.
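The binning of construction year and dwelling size can be sketched in Python with the bin edges listed above. The source does not say to which bin a value exactly on an edge belongs; the sketch assigns it to the upper bin, which is an assumption, as are the function and constant names.

```python
import math
from bisect import bisect_right

YEAR_EDGES = [1918, 1940, 1991]                    # separate bins 1 to 4
SIZE_EDGES = [37.7, 45.2, 49.9, 60.9, 69.3, 96.8]  # separate bins 0 to 6


def year_bin(year):
    """0 for missing values, otherwise 1 (before 1918) to 4 (1991 and later)."""
    if year is None or (isinstance(year, float) and math.isnan(year)):
        return 0
    return 1 + bisect_right(YEAR_EDGES, year)


def size_bin(square_metres):
    """Bin 0 (below 37.7 m2) to bin 6 (96.8 m2 and above)."""
    return bisect_right(SIZE_EDGES, square_metres)
```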

For linking with the metering data, households were linked with electricity data if the registry codes in the agreement table and the household table matched and the household and the metering point had the same address. It was possible to link 252566 metering points with households. Of those, 70 thousand metering points had smart readers. This linking strategy was used for the hourly data.

With the monthly data we used an alternative strategy: the metering data was linked by a left join, so that household data was set to NaN where there was no linkage. For linking we used only the features household size and dwelling type (house 1 or apartment 0).


Figure 8. Distribution of low consumption apartments by local municipality. Legend (left) number of apartments, (right) pie chart of share of low consumption.

Methods, methodology and analysis

The aim of the study is to evaluate whether a clustering method can be used for identifying empty dwellings. For identifying empty or seasonally empty dwellings two approaches can be used. One is to use electricity consumption data and identify zero or close to zero consumption over a certain period of time; the other is to apply clustering or classification methods to the electricity data. Empty dwellings can be identified from zero or low consumption (e.g., Figure 8). Classification is needed, however, when consumption is above zero because of some device (e.g., electric heating) or when steady consumption makes it difficult to tell whether the dwelling is empty. Depending on the aggregation level, different goals can be achieved. With yearly data only empty dwellings with zero consumption can be identified, and dwellings can be categorized as empty by setting a certain consumption threshold. With monthly aggregated data seasonal patterns can be identified, and with hourly or more frequent data patterns of living can be identified and the chances of classifying households correctly are higher.

Machine learning techniques are classified as:

  • Supervised learning - when the training data contain labels (e.g., the corresponding household size is known for a metering point)
  • Unsupervised learning - when the training data do not contain labels (e.g., the corresponding household size is unknown for a metering point and must be identified from the data)
  • Semi-supervised learning - when the training data contain partially labeled data (a model is trained by using labelled data and later on used for classifying unlabeled data)

When high quality dependent variables describing households are available, a supervised learning approach can be used [1]. In our case the quality of the labeling data was not very good and no high quality information about dwelling occupancy was available, so we had to use an unsupervised method. The k-means method, as implemented in the Spark ml library, was used.

In addition to k-means, hierarchical clustering was applied to the cluster centers to reveal the relationships between them. In the figures that follow, the upper part presents the dendrogram and the lower part the matrix of cluster center values or label values, where each column corresponds to a cluster center.

"The dendogram illustrates how each cluster is composed by drawing a U-shaped link between a non-singleton cluster and its children. 
The top of the U-link indicates a cluster merge. The two legs of the U-link indicate which clusters were merged.  
The length of the two legs of the U-link represents the distance between the child clusters. 
It is also the cophenetic distance between original observations in the two children clusters." 
Description from the scipy manual

Two datasets were formed: one containing monthly data and the other hourly readings. Only data for the year 2015 was used. Both datasets were normalized. To reduce the computational load, only the first 7 days of each month were selected for the hourly data.
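The data preparation above can be sketched as follows. The sketch assumes normalization by the maximum value (the report also mentions normalization by the average) and a made-up hourly series; the function names are illustrative.

```python
from datetime import datetime, timedelta

def first_week_hours(readings):
    """Keep only readings taken on days 1-7 of each month.

    readings: list of (datetime, kWh) tuples for one metering point.
    """
    return [(ts, value) for ts, value in readings if ts.day <= 7]

def normalize_by_max(values):
    """Divide a non-negative series by its maximum; an all-zero series stays zero."""
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]
    return [value / peak for value in values]

# Made-up hourly series covering the year 2015
start = datetime(2015, 1, 1)
year = [(start + timedelta(hours=h), 1.0 + (h % 24) / 24) for h in range(365 * 24)]
subset = first_week_hours(year)
profile = normalize_by_max([value for _, value in subset])
```

Selecting 7 days per month reduces each metering point from 8760 hourly values to 12 * 7 * 24 = 2016.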

The elbow method was used for identifying the number of clusters (Figure 9). A subset of the data was used for finding the optimal number of clusters. As we can see, the steep decrease of the error ended close to 50 clusters.

WP3 Est kmeans plot legal elbow.png

Figure 9. Elbow method for identifying optimal number of clusters.
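The elbow method can be sketched with a small pure-Python k-means. This is a stand-in for illustration only: the study used the k-means implementation of the Spark ml library, and the data, seeds and iteration count below are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its members; empty clusters stay in place.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    error = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, error

def elbow_curve(points, ks):
    """Within-cluster error for each candidate number of clusters."""
    return {k: kmeans(points, k)[1] for k in ks}

# Made-up data: three well separated groups of points
rng = random.Random(1)
blobs = [(cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1))
         for cx, cy in [(0, 0), (5, 5), (10, 0)] for _ in range(30)]
curve = elbow_curve(blobs, range(1, 7))
```

Plotting the error against k and looking for the point where the decrease flattens gives the elbow.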

Cluster analysis of monthly electricity consumption data

First, we study monthly aggregated data to see whether seasonally empty dwellings can be identified. For identifying the general patterns 16 cluster centers were formed.

With monthly data, seasonal patterns of electricity consumption became visible (Figure 10). The most dominant cluster center corresponds to steady consumption. There are also cluster centers corresponding to clear winter and summer seasonality.

WP3 Report3 Est monthly max kmeans 16 centres private.png

Figure 10. 16 cluster centers of normalized monthly data. Label on the left - number of cluster center, number of instances in the cluster center.

All the consumption data was matched with cluster centers and sample data corresponding to cluster centers was visualized in Figure 11.

WP3 Report3 Est monthly max kmeans 16 centres private with data.png

Figure 11. 16 cluster centers of normalized monthly data (red) and samples from the data (gray).

To visualize the similarity between the cluster centers we used an approach where we connected a hierarchical dendrogram with the cluster centers. There are three main patterns: first, sudden jumps in consumption in certain months; second, steady consumption with small seasonality; and third, seasonal patterns (Figure 12).

WP3 Report3 Est monthly max kmeans 16 centres private dendogram .png

Figure 12. Hierarchy dendrogram of 16 cluster centers and a table of consumption patterns. Monthly normalized data.

To see which feature values are related to the cluster centers, the consumption data was linked to feature values (household size and dwelling type) where possible (feature count); metering points that could not be linked were counted separately (feature count_nan). To highlight the dominance of a feature we used different normalization strategies. When normalizing by the maximum value of a feature, we can see that cluster center 0 is more related to households which are not labeled (Figure 13). Cluster center 13 is mainly related to small families of 1-2 persons living mainly in apartments. Cluster center 10 represents larger households, and its consumption pattern also shows visible seasonality. Cluster centers 0 and 1 are related to houses and their consumption pattern is influenced by seasonality.

WP3 Report3 Est monthly max kmeans 16 centres private dendogram column.png

Figure 13. Hierarchy dendrogram of cluster centers and a table of connected features normalized by maximum value of feature.

When the number of clusters was increased to 52, more complex patterns emerged (Figure 14). In this case the consumption data was normalized by the average value. The patterns remained much the same as with 16 cluster centers, with 3 main groups, but the view is now more detailed.

WP3 Report3 Est monthly avg kmeans 52 centres private with data.png

Figure 14. 52 cluster centers of monthly normalized data.

WP3 Report3 Est monthly avg kmeans 52 centres private dendogram .png

Figure 15. Hierarchy dendrogram of 52 cluster centers and a table of consumption patterns. Monthly normalized data.

Cluster analysis of hourly electricity consumption data

We used hourly consumption data to see whether it is possible to classify households. We expect to see that cluster centers can be related to certain household size and dwelling type.

Using hourly data for clustering revealed more complex consumption patterns (Figure 16).

WP3 Est kmeans 60 centres private.png

Figure 16. 60 cluster centers of hourly normalized data ordered by frequency.

Cluster centers 56 and 54 correspond to weekday and weekend consumption patterns (Figure 17). Centers 38 and 13 are seasonal patterns corresponding to winter and summer seasonality. A number of clusters probably reflect adjusted values for a certain month of the year.

WP3 Est kmeans 60 centres private dendogram consumption.png

Figure 17. Hierarchy dendrogram of cluster centers and a table of consumption patterns.

WP3 Est kmeans 60 centers private dendogram norm 0 labels.png

Figure 18. Hierarchy dendrogram of cluster centers and a table of connected features without normalization.

There are 6 groups of features describing households (Figure 18). For each input data vector (household consumption pattern) the closest cluster center was found. Thereafter, the frequency of occurrence of each feature value was counted. For example, cluster center 3 has 595 urban values and 646 rural values.

The values were normalized by the maximum value of each group. It is possible to identify which cluster center is connected to the highest value of a feature group and which value in the feature group is dominant (Figure 19). Each column in the matrix corresponds to a cluster center. The most dominant cluster was cluster 22, which was linked to more than 7 thousand households. It also represents the most typical household, which tends to be located in an urban area, in a two-room apartment of 50 m2 or less, in a house built between 1940 and 1991. The consumption pattern is stable and slightly increasing towards the end of the year.

WP3 Est kmeans 60 centres private dendogram labels.png

Figure 19. Hierarchy dendrogram of cluster centers and a table of connected features normalized by maximum value of feature group.
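The counting and normalization steps behind these feature tables can be sketched as follows. The cluster centers, consumption vectors and urban/rural labels are made up, and the function names are illustrative rather than taken from the study.

```python
from collections import Counter, defaultdict

def nearest_center(vector, centers):
    """Index of the cluster center closest to vector (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, centers[i])))

def feature_counts(vectors, labels, centers):
    """Count how often each feature value occurs among the vectors nearest to each center."""
    counts = defaultdict(Counter)
    for vector, label in zip(vectors, labels):
        counts[nearest_center(vector, centers)][label] += 1
    return counts

def normalize_by_group_max(counts):
    """Divide every count by the largest count found anywhere in the feature group."""
    top = max(n for c in counts.values() for n in c.values())
    return {k: {label: n / top for label, n in c.items()} for k, c in counts.items()}

# Made-up example with two cluster centers and one feature group (urban/rural)
centers = [(0.2, 0.2), (0.9, 0.8)]
vectors = [(0.1, 0.2), (0.3, 0.1), (0.8, 0.9), (1.0, 0.7), (0.9, 0.9)]
urban = ["urban", "rural", "urban", "urban", "rural"]
counts = feature_counts(vectors, urban, centers)
shares = normalize_by_group_max(counts)
```

Normalizing within each cluster center instead (dividing by the largest count of that center) would show which value dominates inside a center rather than across the whole feature group.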

Depending on the normalization, the feature tables give additional insight into the data. Next, the feature values were normalized by the maximum value of a feature group within a cluster center (Figure 20). The figure indicates which value of the feature group dominates in a cluster center. For example, for cluster center 3 the more dominant value is urban Yes and for cluster center 40 urban No. Cluster center 40 has relatively higher values for more spacious dwellings.

WP3 Est kmeans 60 centers private dendogram norm 2 labels.png

Figure 20. Hierarchy dendrogram of cluster centers and a table of connected features normalized by the maximum value within a feature group.

When the feature values are normalized by the maximum value of a feature, the relative dominance of a feature can be seen (Figure 21). For example, cluster center 40 is connected with a relatively high number of people in a household.

WP3 Est kmeans 60 centers private dendogram norm 3 labels.png

Figure 21. Hierarchy dendrogram of cluster centers and a table of connected features normalized by maximum value of feature.


Evaluation of the linking data

There are many reasons why the linking quality was not the best and only about 10% of metering points with smart meters could be used for classification. First, address quality: the free text address field reduced the linking rate with the registers. Second, only 40% of metering points had smart meters in 2015.

Evaluation of classification quality

Classification would benefit from more smart meter readings and from more metering points linked with adequate household data, so that supervised methods could be used.


Two datasets were used for classification: one of monthly data and the other of hourly data. Cluster centers of the monthly data indicated clear seasonality in the dataset. It was possible to identify stable consumption patterns and winter or summer seasonality. Some metering points had rising or decreasing consumption. When the number of clusters was increased to 52, some artifacts of one-month high consumption became visible, probably related to adjustments of monthly consumption. The same patterns were also visible when the hourly data was analyzed. The hourly data also indicated distinct working day and weekend consumption patterns. It was possible to identify cluster centers which were more related to bigger household size. We can conclude that clustering can be used in identifying seasonally vacant dwellings.

Tourism statistics

The aim is to evaluate whether tourism statistics indicators can be extracted from electricity data by finding correlations between the electricity consumption of businesses active in tourism and current tourism statistics. To achieve this, regional tourism statistics need to be extracted (number of visitors, number of tourists using accommodation) and the companies providing tourism services identified. Thereafter, the metering points related to those companies have to be found, by linking on the registry code of the contract owner or on the address. Aggregations can then be done on the linked data: companies active in the accommodation business are extracted and their electricity consumption is aggregated at regional level. The outcome will be a table for comparing tourism statistics and electricity consumption and a map of the electricity consumption of businesses active in tourism.


The sample of tourism statistics includes 1559 companies. Of those, 873 companies could be linked with electricity data, and they were connected with 2097 unique metering points. Monthly electricity consumption data was used.

Tourism statistics indicators were extracted from the web page of Statistics Estonia. There were 13 indicators available for analysis, of which the number of accommodation establishments and the number of accommodated tourists by county (monthly data) were extracted. The dynamics of electricity consumption and the tourism indicators are visualized in Figure 22. Electricity consumption and tourism activity have clearly opposite seasonal components. As we have shown in Report 2, there is no clear seasonality in the electricity consumption of businesses when all businesses are aggregated. But the energy consumption of small companies follows a seasonal pattern, as more electricity is used for heating and lighting in winter, while the high season for tourism is in summer. We can only see that the minimum of electricity consumption does not coincide with the warmest and best-lit time of the year in Estonia, so the high tourism season probably has some effect on electricity consumption.

WP3 Est Tourism y.png

Figure 22. Monthly aggregates of electricity consumption, the number of accommodation establishments and accommodated tourists normalized by maximum value.

Methods, methodology and analysis


For the analysis, monthly data on electricity consumption, the number of accommodation establishments and accommodated tourists was extracted at country level. JDemetra+ software was used for processing, and the seasonal component, trend and residuals were extracted using the multiplicative method (y = S * T * I).
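To illustrate what a multiplicative decomposition y = S * T * I produces, the sketch below uses a classical moving-average decomposition. This is a much simpler technique than the methods available in JDemetra+ and is not the procedure used in the study; the monthly series is made up.

```python
def centered_ma12(y):
    """2x12 centered moving average; None where the window does not fit."""
    trend = [None] * len(y)
    for t in range(6, len(y) - 6):
        window = y[t - 6:t + 7]
        trend[t] = (0.5 * window[0] + sum(window[1:12]) + 0.5 * window[12]) / 12
    return trend

def decompose_multiplicative(y):
    """Split a monthly series (at least 24 values) into S, T and I with y = S * T * I."""
    trend = centered_ma12(y)
    ratios = [[] for _ in range(12)]
    for t, (obs, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            ratios[t % 12].append(obs / tr)
    raw = [sum(r) / len(r) for r in ratios]
    scale = sum(raw) / 12  # rescale so that the seasonal factors average to 1
    seasonal = [raw[t % 12] / scale for t in range(len(y))]
    irregular = [None if tr is None else obs / (tr * s)
                 for obs, tr, s in zip(y, trend, seasonal)]
    return seasonal, trend, irregular

# Made-up series: rising trend with a winter peak and a summer low
factors = [1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
y = [100 * (1 + 0.01 * t) * factors[t % 12] for t in range(36)]
seasonal, trend, irregular = decompose_multiplicative(y)
```

The trend and irregular components are undefined for the first and last six months, which is one reason why dedicated seasonal adjustment software is preferable for real series.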

Figure 23 presents the seasonal component of the data. We can see clearly opposite seasonality in the electricity data (blue line) and the tourism related data.

WP3 Est Tourism s.png

Figure 23. Seasonal component of electricity and tourism data.

As the year starts in winter, when the usage of electricity is highest, the electricity usage trend (Figure 24, blue line) tends to be decreasing. The tourism related indicators have a rising trend.

WP3 Est Tourism t.png

Figure 24. Trend component of electricity and tourism data.

Figure 25 presents the residual component of electricity consumption and the tourism indicators.

WP3 Est Tourism i.png

Figure 25. The residual component of electricity and tourism data.

Currently the tourism statistics are for the whole country, but regional dynamics might give some additional insight into the subject (Figure 26).

WP3 Est tourism el county.png

Figure 26. Regional electricity consumption of businesses active in tourism. (draft image - will be replaced)


It was possible to link 56% of the companies included in the tourism survey sample by using the registry key. No additional analysis was done of whether all the electricity consumption at the metering points of those companies is related to providing tourism services.

As can be seen in Figure 27, there is a rather strong negative correlation between electricity consumption and the tourism indicators.

WP3 Est Tourism corr.png

Figure 27. Correlations between electricity consumption and tourism indicators. Accommodation establishments (accomm), accommodated tourists (tourist) and average electricity consumption (cons_avg).
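A correlation of this kind can be computed with the Pearson coefficient, sketched below. The monthly values are made up to mimic a winter peak in consumption and a summer peak in tourism; they are not the data behind Figure 27.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up monthly values, January to December
cons_avg = [9.0, 8.5, 7.8, 6.9, 6.1, 5.8, 5.9, 6.0, 6.6, 7.4, 8.3, 9.1]
tourist = [1.0, 1.1, 1.4, 1.9, 2.6, 3.3, 3.9, 3.6, 2.4, 1.8, 1.2, 1.1]
r = pearson(cons_avg, tourist)  # negative, because the two peaks are half a year apart
```

A negative coefficient obtained this way mostly reflects the opposite seasonality of the two series, so it says little about tourism-driven consumption until the seasonal components are removed.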


There is no clear answer as to whether electricity data can be used as an indicator of tourism, as the model is currently too simple to take into account the seasonal effects of electricity consumption. Consumption is slightly higher in summer during the high tourism season, and a more advanced method for extracting the seasonality components from the electricity data is needed. Further study at regional level will be performed when the data becomes available.

Electricity consumption of a local municipality: processing the data and potential outputs

Available data

The Swedish data hub is under development and will be running from 2020. Until then, we have access to a restricted set of test data, received from the systems of two main grid operators in Sweden. The data comprise two municipalities (Täby and Högsby) and cover two years, 2015 and 2016. Data contain monthly aggregates of consumption and production, and include both businesses and households.

The units are metering points. Each point has a unique identification number (EAN code). If the metering point pertains to a business, it has an organization number (business level). Each metering point has an address, which may or may not be the same address as the business or the household, depending on where the measurement of used or produced electricity takes place.

If a metering point does not have an organization number, we assume that it belongs to a household. The Täby data further contain apartment identification, so that the data can be divided into households in apartments and households in detached family houses.

Businesses or households may produce their own electricity. The surplus is transferred into the grid and measured at the metering point. Thus, the production variable is a measure of surplus production, but there is no measurement of the amount of own produced electricity used by the business or household. For the Täby data, we also have information on type of production (solar cells, etc). The net production of solar energy by households in Täby was 52230 kWh.

Table 11 shows the number of observations (months) and the number of units (metering points) in the data.

Table 11. Description of the data.

# observations (# units)

Year   Täby                                                Högsby
       Businesses      Households        Municipality      Businesses      Households
2015   37005 (3176)    315869 (26446)    6632 (555)        7066 (591)      49364 (4126)
2016   #8112 (3176)    317364 (26446)    6648 (555)        7063 (590)      49547 (4138)

We use the test data for two purposes:

  • To identify future possibilities for improved or new statistics, (together with subject matter specialists on energy, environment, and buildings) and investigate the possibilities to produce them. This includes for example how to link data to the registers kept at Statistics Sweden.
  • To investigate data quality and gain insights into the content and structure of the data. It is important to remember that the data are not real hub data, and that the hub will contain more attributes than we currently have access to, but the test data are still valuable for avoiding future pitfalls and becoming aware of limitations.

Outputs of interest

Data at local unit level

Today, data for the survey Yearly energy statistics are collected by a web questionnaire, which is difficult for the respondents (grid operators) to answer and does not match their systems. The quality is poor. Utilising hub data to estimate electricity use by NACE would lower the response burden, improve quality, and save time. It would also allow for estimates on new breakdowns and subgroups of interest.

The final hub data will include organization number but not local unit identification number. Other ways of linking to available registers have to be considered, using the available information and coordinates.

Possible outputs using test data:

  • Consumption measured at local unit level by NACE:
    • The relevant main table for Yearly electricity statistics
    • For all local units, in particular the service sector
  • Production measured at local unit level by NACE
    • Identifying energy producing units that are not local units (i.e. no employees)

Building characteristics

With address information and coordinates, it would also be possible to link to the building register.

Possible outputs using test data:

  • Consumption by relevant building characteristics
  • Production by relevant building characteristics

Sustainable energy production

By linking to other registers, characteristics of the producers of energy are available.

Possible outputs using test data:

  • Producers by business or household characteristics

Identifying local units renting in a building

Local units without metering points cannot be measured directly, but their consumption could possibly be estimated with a model.


Business register

In order to classify energy consumption on NACE and test if relevant statistics can be produced, it is crucial to link the metering points with businesses at local unit level. We have tested linking to the Swedish Business Register (BR). The register contains a lot of information, including NACE code for local units and unique identification of businesses and local units. Information in the test data that could be useful for linking is:

  • Organization number of the businesses for which energy consumption is measured
  • Street name and number of the metering point
  • Coordinates of the metering point

A local unit may have several metering points, all of them having the same organization number, but different addresses. Matching on organization number alone revealed that some organization numbers could not be found in the BR. This probably indicates that they are invalid in the test data since it is unlikely that businesses are missing in BR. The organization number is not a crucial piece of information for the grid operators, and there is reason to doubt the quality.

Further variables are necessary in order to identify local units. Matching on street name and number is less useful since the address of the metering point may not be the same as the address of the local unit.

The Täby data for 2016 contain coordinates of the metering point, measured with centimeter precision. The BR has coordinates for the addresses of local units, measured with a precision of one meter. We tested whether these coordinates could be used to match metering points to local units.

The distances between all metering points and all addresses of local units in Täby were calculated. Distance is measured by the Euclidean distance. Let d be the distance between a metering point with coordinates (xm,ym) and a local unit address with coordinates (xe,ye). Then

$d = \sqrt{(x_m - x_e)^2 + (y_m - y_e)^2}$

We assume that a metering point measures the consumption of the local unit closest to the metering point, and that the nearest local unit must be within a certain distance. Thus, the local unit closest to the metering point is chosen, and its NACE code will be used to classify the consumption and the production at the metering point.
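The minimum distance rule can be sketched as below. The data structures, identifiers, coordinates and NACE codes are made up for illustration; the default cutoff of 100 meters follows the rule applied for Table 12.

```python
from math import hypot

def match_to_nearest_unit(metering_points, local_units, max_distance=100.0):
    """Assign each metering point the NACE code of its nearest local unit.

    metering_points: dict id -> (x, y) in meters.
    local_units: dict id -> (x, y, nace_code).
    Points farther than max_distance from every unit are left unclassified (None).
    """
    result = {}
    for mp_id, (xm, ym) in metering_points.items():
        unit_id, (xe, ye, nace) = min(
            local_units.items(),
            key=lambda item: hypot(xm - item[1][0], ym - item[1][1]),
        )
        d = hypot(xm - xe, ym - ye)
        result[mp_id] = (unit_id, nace, d) if d <= max_distance else None
    return result

# Made-up coordinates and NACE codes
points = {"mp1": (10.0, 10.0), "mp2": (500.0, 500.0)}
units = {"u1": (13.0, 14.0, "47.11"), "u2": (90.0, 10.0, "64.19")}
matches = match_to_nearest_unit(points, units)
```

This brute-force search compares every metering point with every local unit, which is workable for one municipality; a spatial index would be a natural refinement for larger areas.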

The metering points belonging to the municipality of Täby were not included. They are identified by the organization number of the municipality.

For each metering point, the distances to all local units in Täby were measured. This would be a very time-consuming exercise if it had to be carried out for the whole country, but it is reasonable to assume that a local unit and its metering point in most cases are located within the same municipality, so that runs can be parallelized between municipalities.

Table 12 summarizes the shortest distances between metering points and local units. We assume (on advice from the grid operators) that a metering point should be within a distance of 100 meters from the local unit it belongs to. 216 metering points are too far away from any local unit and cannot be classified when this rule is applied. Of those, only 55 are more than 200 meters from the nearest local unit. More than 70 percent of the metering points are within 50 meters from the nearest local unit. A local unit may have more than one metering point.

Table 12. Shortest distance between metering points and the nearest local unit.

Distance       Frequency   Percent   Cum. percent
< 10 meters    432         14.7      14.7
10 - 19.9      560         19.0      33.8
20 - 29.9      426         14.5      48.3
30 - 39.9      382         13.0      61.3
40 - 49.9      333         11.3      72.6
50 - 59.9      202         6.9       79.5
60 - 69.9      323         11.0      90.5
70 - 79.9      110         3.8       94.3
80 - 89.9      88          3.0       97.3
90 - 99.9      80          2.7       100

Even though the quality of the organization numbers in the test data can be questioned, it is worth checking if the assigned local unit has the same organization number as the metering point. This check revealed that large businesses are missing in the Täby data. Täby has a large shopping mall with 50 metering points that are difficult to match. For example, a bank office in the shopping mall has unreasonably large consumption. It is probably close to one of the metering points in the mall, but it is not the local unit using the electricity measured at the point. Thus, large businesses cannot be distinguished because they are included in the consumption of the shopping mall. The minimum distance method clearly has many flaws, but could possibly be refined using more sophisticated tools for analyzing geo data.

Population register

The Population register includes an apartment key for each individual, which makes it possible to distinguish households and facilitates matching of metering points to households. The Täby data includes apartment identification as part of the address for metering points pertaining to households. However, buildings with rented apartments usually only have one metering point for the whole building which makes it impossible to measure the energy used by the households, unless additional information is available.

Data quality

There is no obvious way to evaluate the above results with only the test data, but the main purpose was to gain knowledge of importance for future work. In particular, Statistics Sweden is currently in the final phase of collecting its needs and discussing them with the authority responsible for constructing and maintaining the data hub, Svenska Kraftnät (SvK). Some major quality issues under discussion are summarized below:

  • It is crucial that addresses are correct. The addresses will be of key value for linking to the BR since local unit identification is not available. The addresses are also important since they connect buildings, properties and households in the registers at Statistics Sweden. More than 300 grid operators and electricity suppliers will upload their data in the hub. When data are migrated, the addresses will be checked against the register at the Land surveying authority. However, there will not be any continuous checks as new addresses are added to the hub.
  • The quality of organization numbers will be carefully checked.
  • An attribute for type of metering point will be added to the hub. The purpose of the attribute is twofold. It will keep track of “multiple use”, i.e. whether a metering point measures the electricity use of more than one user (apartment buildings, gallerias, etc). It will also give a detailed classification of buildings and property, and further classify objects without an address, for example traffic lights and charging devices for electric cars.
  • The test data are monthly aggregates, but in the future Statistics Sweden is free to require any level of detail. The choice of detail depends on the statistics we wish to produce, but also on our ability to handle large amounts of data.

Hourly electricity consumption in municipalities by economic activity

Motivation: in order to help plan the production and import of electricity, Statistics Denmark is calculating the consumption for every hour of the year 2015 for each of the 99 municipalities in Denmark. This calculation is done for households, private service, public service, agriculture, industry and others. This gives us in total roughly 700 time series with roughly 8000 observations in each.


The task is somewhat difficult as the number of smart meters in Denmark is still low. Only the smart meters that are billed according to hourly consumption are reliable enough for calculating the hourly consumption. In addition, there are meters that are billed periodically but have hourly observations too. Over the course of 2015 a great number of smart meters were added to the population. The number of periodically read meters is around 2.3 million.

The time series belonging to one smart meter is complete in a given month if the number of observations corresponds to the number of days in the month multiplied by the number of hours in a day. For instance, the January time series is complete if the number of observations is 31*24=744. Put differently, a meter that is present 100% of the time has 1*31*24=744 observations, while a meter that has been present 80% of the time during the month has 0.8*31*24≈595 observations. One can therefore check, for varying values of theta, how many smart meters are present in the data each month and comply with theta*number_of_days*hours. The number of meters present in each month is shown in Figure 28.

Figure 28. The number of meters present in each month, i.e. meters satisfying theta * number_of_days * hours.
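The completeness check can be sketched as follows. The sketch assumes that a meter counts as present when it has at least theta * number_of_days * 24 observations and ignores daylight saving time shifts; the meter identifiers and counts are made up.

```python
import calendar

def is_complete(n_observations, year, month, theta=1.0):
    """True if a meter has at least theta * days_in_month * 24 hourly observations."""
    days = calendar.monthrange(year, month)[1]
    return n_observations >= theta * days * 24

def meters_present(obs_counts, year, month, theta):
    """Count meters meeting the rule; obs_counts maps meter id -> number of observations."""
    return sum(is_complete(n, year, month, theta) for n in obs_counts.values())

# Made-up observation counts for January 2015
january = {"m1": 744, "m2": 600, "m3": 300}
strict = meters_present(january, 2015, 1, theta=1.0)   # only m1 qualifies
relaxed = meters_present(january, 2015, 1, theta=0.8)  # m1 and m2 qualify
```

Lowering theta admits meters that were installed or started reporting part-way through the month.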

Data preparation

Address linking

As described in a previous report, we have linked the background information to our registers. After some additional checking and cleaning of the data, this leaves us with around 4 million meters and 3 million addresses. Around 200,000 meters do not have an address. About 370,000 of the meters have been assigned 2 addresses or more. To make sure there is one address per meter, we select the address-meter combination that was valid in the first quarter of 2015. For this exercise we keep only meters that are valid in 2015.

Connecting a sector to the meter through the address_id

In the building register we have information about the building and which sector it belongs to. There are 100 sub-sectors that can be divided into 7 overall sectors: Households, Industry, Agriculture, Other, Unknown, Privately owned business and Publicly owned business. As a supplementary check we add the rule that a meter using more than 10,000 kWh a year is not a household.

Linking subscription information to the meters

Since a meter can appear both in the hourly read consumption dataset and in the periodically read dataset, it is important to know the billing contract of a meter in order to know how reliable the hourly time series is for that meter. A meter that is billed according to its hourly consumption is definitely an hourly read meter. The meters that are billed according to their manually read consumption are checked to see if they also appear in the hourly read consumption dataset. If they have measurements every day and the annual consumption is the same in both datasets, the meter's profile from the hourly dataset is used. Meters with infrequent measurements in the hourly consumption dataset are not used to make the hourly profiles. If a meter is in the periodically read dataset half of the time and in the hourly consumption dataset the other half, the meter's consumption is summed over both datasets and the hourly profile is not used.
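These rules can be written as a small decision function. The parameter names and the 1% tolerance used to decide that two annual totals are "the same" are assumptions made for this sketch.

```python
def use_hourly_profile(billed_hourly, days_with_readings, days_in_year,
                       annual_kwh_hourly, annual_kwh_periodic, rel_tol=0.01):
    """Decide whether a meter's hourly series is used for building hourly profiles.

    rel_tol is a hypothetical tolerance for treating the two annual totals as equal.
    """
    if billed_hourly:
        return True  # billed by hourly consumption: definitely an hourly read meter
    if days_with_readings < days_in_year:
        return False  # infrequent readings, or only part of the year in the hourly data
    return abs(annual_kwh_hourly - annual_kwh_periodic) <= rel_tol * annual_kwh_periodic

# Made-up examples
hourly_billed = use_hourly_profile(True, 365, 365, 4000.0, 0.0)
consistent = use_hourly_profile(False, 365, 365, 4000.0, 4000.0)
half_year = use_hourly_profile(False, 180, 365, 2000.0, 4000.0)
mismatch = use_hourly_profile(False, 365, 365, 3000.0, 4000.0)
```

Meters rejected by this check still contribute their annual consumption to the sector totals; only their hourly profile is left out.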


To create hourly profiles for all sectors, we used the hourly profiles of the meters that were actually read on an hourly basis to estimate what the hourly time series would have looked like had all meters been hourly meters. First, we identified the meters which had an hourly profile through the entire year (n = 60,000). Second, we found the annual sum per sector and municipality for those meters. Third, we found the annual sum for the rest of the meters (>3 million) per sector and municipality. Fourth, as we now had the amount spent by sector and municipality both for the meters that are hourly read and usable and for the ones that are read manually, we could find the relationship between the annual amount spent in sector a in municipality b by all meters and by the hourly read meters. For instance, sector a in municipality b might be spending x kWh in total, of which y kWh is spent by the hourly meters. The estimated total for a given hour in that municipality and sector is then z * x / y, where z is the amount spent by the hourly meters in that hour.
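The scaling step can be sketched as follows for one sector and municipality. The sketch assumes the scaling factor is x / y (all meters over hourly read meters), so that the estimated hours add up to the annual total x; the values are made up.

```python
def scale_hourly_profile(hourly_kwh, annual_total_all_meters):
    """Scale the summed profile of the hourly read meters up to the total of all meters.

    hourly_kwh: consumption z of the hourly read meters for each hour.
    annual_total_all_meters: x, the annual consumption of all meters.
    """
    y = sum(hourly_kwh)  # annual consumption of the hourly read meters
    if y == 0:
        raise ValueError("no hourly read consumption to build a profile from")
    factor = annual_total_all_meters / y
    return [z * factor for z in hourly_kwh]

# Made-up example with four "hours": the hourly meters account for a quarter of the total
hourly = [2.0, 1.0, 3.0, 4.0]
estimated = scale_hourly_profile(hourly, 40.0)
```

The approach assumes that the meters without hourly readings follow the same hourly pattern as the hourly read meters in the same sector and municipality.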



Figure 29. Hourly consumption of 6 sectors.

The figure shows the estimated total consumption over the year for all sectors by the hour. This is a really good use of the data, as it gives a clear picture of the demand for energy by sector and possibly by municipality. The total consumption of 2015 is around 31 million kWh and half of it is accounted for by the hourly meters. The consumption of all sectors is generally higher during quarters 1 and 4, that is, during winter. Unfortunately, the sector other accounts for most of the consumption, taking up nearly half of the total. The private sector is the second biggest consumer, households seem to be the third and the public sector the fourth, while the sector that consumes the least is agriculture. There is a clear drop during weekends.

Total annual consumption per sector per municipality - 2015



Figure 30. Total annual consumption per sector per municipality in 2015.

The total yearly consumption per sector is shown in the maps above (Figure 30). The consumption is given in 1000 kWh. The upper left map shows the consumption of businesses. There is not much variation in consumption from region to region. In the center of the peninsula of Jutland one can notice small hints of a darker shade, meaning that electricity use is higher there. In the upper right map one will notice that agricultural activity is either low or unaccounted for in many areas; the map is based on zip codes, and perhaps not all zip codes have agriculture. There is a big agricultural spot in the center right part of Jutland, shown as a dark shaded spot. The lower left map gives the total consumption of industry. Except for one shaded spot, it is evenly spread in Zealand.

Mean annual consumption per sector per municipality - 2015



Figure 31. Mean annual consumption per sector per municipality in 2015.

Besides the total consumption, the mean consumption per sector per municipality is described (Figure 31). The findings are shown in the maps above. The lower right map shows that in areas around the capital region of Denmark the average annual consumption is lower. The consumption shown in the map is not adjusted for the number of people in the household. It is possible that the number of people per household is lower in areas around bigger cities, where young single people live. The mean consumption of the other sectors is more evenly distributed.

Multipurpose usages: Italian data

In Italy there are many energy providers on the market, and there is an overall provider, Acquirente Unico (AU), which is a public authority controlling the energy market.

Strict privacy laws on data access make it difficult to obtain data on energy consumption. In Italy, however, the law on data access has changed: the recent Budget Law (Law no. 205/2017) of December 2017 introduced free access to smart meter data (provided by Acquirente Unico) for Permanent Census purposes.

Consequently, AU is expected in the near future to provide Istat with information derived from billing summaries for households and low power consumption businesses, as well as information for the data control and correction procedures.

Accurate data on the energy and gas consumption of resident households (in physical quantities per unit and in monetary value) would serve two main objectives:

• to support the correctness of expenses recorded in the data of the Italian round of the European energy consumption survey;

• to support the study of vacant dwellings and the correctness of information on population register for Permanent Census purposes.

Due to the lack of access to Italian smart meter data, the analysis of multipurpose usage was not performed.

Identifying consumption patterns

The aim of the two following studies is to show potential usages of smart meter data. We expect to demonstrate that the data can be used to gain a better understanding of consumer behavior in a specific time period and of the impact of certain policy changes on energy consumption. The first case study shows the consumption patterns around holidays; the second shows the impact of daylight saving time and of switching between summer and winter time.

Consumption on New Year's Eve compared to a usual evening

The aim of the study was to identify different daily consumption patterns around New Year's Eve 2015.

The study used hourly electricity consumption data of private and business customers that have a smart meter. It was possible to extract 167800 distinct private customers with 182944 smart meters and 6014 distinct business customers with 13831 smart meters, together with their consumption data for one week before and one week after New Year's Eve in 2015.

The hourly aggregated consumption of private and business customers is visualized in Figure 32. Consumption is normalized by the consumption value of the last hour of 2014 (the hourly consumption measured at 2015-01-01 00:00), so that this value corresponds to 1. It is possible to identify a clear pattern of working days for both businesses and private consumers. December 25 and 26 and January 1 are holidays, December 27 and 28 and January 3 and 4 are weekends, December 31 is a partial working day and all the rest are working days. There is a peak of consumption before the end of the year and very low consumption at night after the turn of the year, which is difficult to explain. Although December 29 and 30 are working days, just like January 5, 6 and 7, their consumption pattern is different, which indicates that many institutions and enterprises are closed between the holidays.


Figure 32. Electricity consumption of private and business entities around New Year Eve normalized by the end of year value. Red - holidays, blue - working days.

Figure 32 shows clear patterns that correspond to specific types of working days and holidays.
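The normalization described for Figure 32 can be sketched as follows. This is a minimal illustration rather than the code used in the study; the function name and the sample values are hypothetical.

```python
def normalize_by_reference(series_kwh, reference_key):
    """Divide every value by the value at the reference timestamp,
    so that the reference hour corresponds to 1."""
    reference = series_kwh[reference_key]
    if reference == 0:
        raise ValueError("reference consumption must be non-zero")
    return {key: value / reference for key, value in series_kwh.items()}


# Hypothetical aggregated hourly consumption (kWh) around the turn of the year.
hourly_kwh = {
    "2014-12-31 23:00": 900.0,
    "2015-01-01 00:00": 1000.0,
    "2015-01-01 01:00": 850.0,
}
normalized = normalize_by_reference(hourly_kwh, "2015-01-01 00:00")
print(normalized)
```

Normalizing both customer groups by their own reference hour makes their curves comparable even though their absolute consumption levels differ.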

Daylight saving time (DST) consumption a week before and after the switch

To analyze the consumption before and after the time adjustment, we select a sample of the available units. We randomly select a share of the household metering points and a share of the business metering points within each municipality. We wish to analyze whether adjusting the time when going from winter to summer time has an effect. Daylight saving time was introduced to reduce consumption: the idea is that a longer day reduces, for instance, the demand for lighting. Since the switch to summer time occurs right around Easter, it makes little sense to investigate its effect, as it would be difficult to attribute any change in consumption to the change in time rather than to the holiday. In the graphical examples below we therefore investigate what happens when moving from summer time back to winter time, as this switch does not happen around a holiday.
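The sampling step can be sketched as a stratified random sample, with one stratum per municipality and customer type. This is only an illustration under assumptions: the field names, the sampling shares and the fixed seed are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def sample_within_municipalities(points, shares, seed=0):
    """Draw a random share of metering points per (municipality, customer type).

    points: list of dicts with the keys "municipality" and "customer_type".
    shares: sampling share per customer type, e.g. {"household": 0.2}.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for point in points:
        strata[(point["municipality"], point["customer_type"])].append(point)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        share = shares[key[1]]
        # Take at least one point per stratum, never more than it contains.
        k = min(len(members), max(1, round(share * len(members))))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 2 municipalities x 2 customer types x 10 points.
points = [
    {"id": f"{m}-{t}-{i}", "municipality": m, "customer_type": t}
    for m in ("A", "B")
    for t in ("household", "business")
    for i in range(10)
]
sample = sample_within_municipalities(
    points, shares={"household": 0.2, "business": 0.5}
)
print(len(sample))
```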

It is possible to analyze the problem with a regression discontinuity model as provided below.

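$$\log(\text{elec\_cons}_{dy}) = \beta_0 + \beta_1 \text{DST}_{hdy} + \theta X_{hdy} + f(\text{days}_{dy}) \cdot 1[\text{DST}_{hdy} = 0] + g(\text{days}_{dy}) \cdot 1[\text{DST}_{hdy} = 1] + \varepsilon_{hdy}$$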

The advantage of using a regression discontinuity design is that the same units are followed over time, so the problem of the control group and the treatment group differing in any way is eliminated: there is no confounding between groups because the control group is the treatment group. Instead, one has to deal with confounding factors due to the different time periods. In this case the switch to summer time happens in the week before the Easter holiday. The first official public holiday is Thursday the 2nd of April, and the last is the 6th of April. It is then only really possible to compare Monday through Wednesday in the Easter week. Another problem is that weather conditions could differ between the time before and the time after the switch. In this section we do not explore the model, but merely point out that it exists.

Graphical representations of the time switch from summer- to wintertime


Figure 33. Electricity consumption around switch.

In Figure 33 one can see the consumption in the week from the 18th of October, marked with the red graph, and in the week after the switch, marked with the blue graph. It is difficult to conclude anything about the behavior of this particular consumer of electricity.


Figure 34. Consumption pattern around switch.

In Figure 34 one can follow the daily consumption in the week before and after the switch, together with the total daily consumption. The purpose of DST is to save electricity, but it is difficult to conclude anything from this one metering point.

From the study we conclude that there is no clear evidence of electricity savings from the daylight saving time policy.
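A simple descriptive comparison of the week before and the week after the switch can be sketched as below. It is not the regression discontinuity model and does not control for weather or weekday effects; the function name and the daily values are hypothetical.

```python
def compare_weeks(before_daily_kwh, after_daily_kwh):
    """Return day-by-day differences and the relative change of the weekly
    total between the week before and the week after the time switch."""
    if len(before_daily_kwh) != len(after_daily_kwh):
        raise ValueError("both weeks must cover the same number of days")
    differences = [a - b for b, a in zip(before_daily_kwh, after_daily_kwh)]
    total_before = sum(before_daily_kwh)
    total_after = sum(after_daily_kwh)
    relative_change = (total_after - total_before) / total_before
    return differences, relative_change


# Hypothetical daily consumption (kWh) of one metering point, Monday to Sunday.
before = [12.0, 11.0, 11.0, 12.0, 12.0, 11.0, 11.0]
after = [12.0, 12.0, 12.0, 12.0, 12.0, 12.0, 12.0]
differences, relative_change = compare_weeks(before, after)
print(differences, relative_change)
```

A change computed for a single metering point is dominated by individual behavior, which is why a conclusion about the policy would require many units and a model that accounts for confounding factors.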

Summary of potential usages

The electricity data has great potential as a complementary data source for different tasks. During the project, only two countries, Estonia and Denmark, had access to the data. Austria, Sweden and Portugal got access to sample data, and Italy did not get access to the data. Based on the study we drew the following main conclusions.

  • Classification and clustering
    • Supervised learning can be used if high quality data is available for labeling the training data.
    • Zero or low consumption information can be used for identifying empty dwellings.
    • By using unsupervised learning we can find cluster centers, which correspond to dwellings with specific characteristics.
  • Business activity and regional electricity statistics
    • The attempt to use electricity data in tourism statistics did not show good results: due to the climate the highest energy consumption is in winter, whereas the high season for tourism is summer. Further work with other methods is needed to remove the seasonality from the data.
    • The Danish dataset was used to produce regional statistics by economic activity, and the regional statistics that could potentially be produced from the Swedish dataset were evaluated.
  • Identifying consumption patterns
    • Electricity consumption data can be used to assess the effectiveness of certain policies.
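The zero or low consumption rule for identifying empty dwellings can be sketched as a simple threshold classification. The threshold of 100 kWh per year and the dwelling identifiers are hypothetical values chosen only for illustration; an appropriate threshold would have to be determined from the data.

```python
def classify_vacancy(annual_kwh_by_dwelling, low_threshold_kwh=100.0):
    """Label each dwelling by its annual consumption.

    The threshold is an assumption for illustration, not a value from the study.
    """
    labels = {}
    for dwelling_id, kwh in annual_kwh_by_dwelling.items():
        if kwh == 0:
            labels[dwelling_id] = "vacant (zero consumption)"
        elif kwh < low_threshold_kwh:
            labels[dwelling_id] = "possibly vacant (low consumption)"
        else:
            labels[dwelling_id] = "occupied"
    return labels


# Hypothetical annual consumption per dwelling in kWh.
annual_kwh = {"d1": 0.0, "d2": 35.0, "d3": 2800.0}
labels = classify_vacancy(annual_kwh)
print(labels)
```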

Other smart meters

In addition to electricity smart meters, the potential usage of other smart meters is discussed in this chapter. The fine granularity of smart meter data opens up new possibilities for data usage.

Water and gas meters

The situation for gas smart meters is quite diverse and there are no EU-wide roll-out plans [2]. Five member states (Ireland, Italy, Luxembourg, Netherlands and the United Kingdom) have a roll-out plan for gas meters by 2020, and in two more countries (Austria and France) the decision on the roll-out plan is pending. In countries with a wide-scale roll-out, consumption aggregates could be computed in a similar way as described for electricity smart meters. The data could also be used for identifying the amounts of gas imported and exported.

There is no EU-wide plan for the roll-out of water smart meters. The device itself measures the volume of water by measuring the flow, similarly to a gas smart meter, and can be installed directly at the customer or at different points in the water grid.

For statistical purposes, consumption is again an obvious estimation domain, but vacancy estimation (tested in this project for electricity smart meters) could also be an interesting field, since the 'automated' usage of water in homes should be very limited and in apartments almost non-existent.

Heating/cooling smart meters

A heat meter is used in a district heating installation and it measures thermal energy provided by a source or delivered to a sink, by measuring the flow rate of the heat transfer fluid and the change in its temperature between the outflow and return legs of the system. Quite often the same meter can measure both heating and cooling.

A heat meter consists of a flow sensor/meter; a temperature sensor pair for measuring the temperature difference between the outflow and the inflow; and a calculator, which computes the accumulated volume and energy.
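The calculation performed by the calculator unit can be sketched as follows. This is a simplified illustration that assumes water as the heat transfer fluid with constant approximate density and specific heat; real meters use temperature-dependent coefficients. All names and sample values are hypothetical.

```python
WATER_DENSITY_KG_PER_M3 = 1000.0         # approximate, assumed constant
WATER_SPECIFIC_HEAT_KJ_PER_KG_K = 4.186  # approximate, assumed constant


def accumulate_heat(samples, interval_h):
    """Accumulate volume (m3) and thermal energy (kWh) from meter samples.

    samples: iterable of (flow_m3_per_h, outflow_temp_c, return_temp_c),
             one tuple per measurement interval of length interval_h hours.
    """
    volume_m3 = 0.0
    energy_kj = 0.0
    for flow_m3_per_h, outflow_temp_c, return_temp_c in samples:
        interval_volume = flow_m3_per_h * interval_h
        volume_m3 += interval_volume
        # energy = mass * specific heat * temperature difference
        energy_kj += (
            interval_volume
            * WATER_DENSITY_KG_PER_M3
            * WATER_SPECIFIC_HEAT_KJ_PER_KG_K
            * (outflow_temp_c - return_temp_c)
        )
    return volume_m3, energy_kj / 3600.0  # 1 kWh = 3600 kJ


# Two hypothetical hourly samples: 0.5 m3/h, 70 C outflow, 40 C return.
volume, energy_kwh = accumulate_heat([(0.5, 70.0, 40.0), (0.5, 70.0, 40.0)], 1.0)
print(volume, round(energy_kwh, 2))
```

A negative temperature difference yields negative energy in this sketch, which corresponds to the cooling case in which heat is delivered to a sink.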

At present we could not find any references to the use of smart heat meters in statistics, but it has been noted that the use of smart thermal meters could reduce costs. Smart thermal metering is seen to have several benefits [3]: accurate accounting, billing, and end-user management; optimization and control of energetic systems; and fault detection. The Energy Performance of Buildings Directive [4] notes that billing the occupants of buildings for heating costs in proportion to actual consumption could contribute towards energy saving. The installation of 56,000 smart heat meters by AffaldVarme Aarhus, the district heating supplier of Denmark’s second largest city [5], helped to reduce water loss by more than one third.

Weather stations

Automatic weather sensors are widely spread across Europe and the world. Keeping climate records is an important task, since all countries need to manage several activities related to weather conditions. In recent years, the way countries measure weather elements has been shifting from manual readings to automated ones. Countries increasingly use automatic sensors that can send weather information at quite short intervals.

In Portugal, the automatic weather network has been operating since 2002 with 93 stations (in the mainland, the Madeira Islands and the Azores). Each station records the main weather elements at intervals of 10 minutes. This information is automatically transformed into coded messages that are transferred to the main station at 60-minute intervals. The weather elements recorded are:

  • atmospheric pressure;
  • temperature;
  • relative humidity of the air;
  • wind direction;
  • wind velocity;
  • precipitation level and duration;
  • temperature at soil level;
  • temperature at 5cm above soil;
  • global solar radiation.

The community has been using this information for several studies and for practical uses mainly in:

  • Agriculture – To improve efficiency; to improve productivity; to determine the best moments for the various farming practices;
  • Transport – To determine the best schedules; in aviation, to determine the best routes;
  • Services – In some weather-dependent businesses (sports, drive-through, parks,…) to establish the best schedules and to allocate the necessary human resources.

Statistical offices can use these rich weather smart meters to improve statistics in several areas, such as:

  • Environment indicators - direct use;
  • Electricity consumption – Input for estimates on electricity consumption; studies on relation between climate and electricity consumption;
  • Agriculture - Input for estimates on agriculture production;
  • Tourism - Input for estimates on tourism;
  • Transport - Input for estimates on mobility.

Device-specific consumption, e.g. fridge, TV or smart plug

Plug-in Power Monitors

There are many devices on the market for monitoring power consumption. Some can be attached directly to the main counter or to the electrical control panel of the house. Some devices can save the consumption history in order to give a clear picture of the domestic use of energy. These devices can help users to know their consumption in real time and to choose the best billing rates.

For example, the Wiser S counter has two main sensors: one 80 ampere sensor to be connected to the main counter and one 50 ampere sensor for the electrical panel. Other sensors can be added.

There are also many device-specific counters, which are connected between the power socket and the device whose consumption is to be measured. These devices are simpler and generally cheaper. A digital display shows the instantaneous power consumption and, in some cases, the peak consumption. The more expensive the device is, the more precise the measurement will be. Some can be programmed with the cost per kWh charged by the energy provider.

There are some more sophisticated (and expensive) devices that still do not require complex installation and must be attached to the main counter. They save the power consumption by recording the signals generated by the counter. The data is saved in internal storage and can be downloaded with a simple USB stick. Software is provided to view and analyze this data in order to show power consumption in various time slots, which is useful when consumers can choose billing rates based on the time of day.
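How consumption per time slot can support the choice of a billing rate is sketched below. The tariffs, the day period and the consumption values are hypothetical and serve only as an illustration.

```python
def cost_flat(kwh_by_hour, price_per_kwh):
    """Cost under a single flat price per kWh."""
    return sum(kwh_by_hour.values()) * price_per_kwh


def cost_time_of_use(kwh_by_hour, day_price, night_price, day_hours=range(7, 23)):
    """Cost under a two-rate tariff; hours outside day_hours use the night price."""
    return sum(
        kwh * (day_price if hour in day_hours else night_price)
        for hour, kwh in kwh_by_hour.items()
    )


# Hypothetical daily consumption (kWh) by hour of day, as a monitor might record it.
kwh_by_hour = {3: 2.0, 12: 1.0, 20: 1.0}
flat = cost_flat(kwh_by_hour, price_per_kwh=0.15)
time_of_use = cost_time_of_use(kwh_by_hour, day_price=0.20, night_price=0.08)
print(round(flat, 2), round(time_of_use, 2))
```

In this toy example half of the consumption falls in the night period, so the two-rate tariff is cheaper than the flat one.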

Automatic tracking of waste collection

Radio-frequency identification (RFID) uses electromagnetic fields to automatically identify and track tags attached to objects. In this setting, RFID is used to identify waste bins, and the weight of the waste in each bin is measured. Every waste bin has an RFID tag, and every waste collection vehicle has a reader. As the bin is emptied (not manually), the vehicle reads the tag and registers the identification data and the weight. The data are transmitted to a database at short intervals. The data are used for billing purposes, but there is potential for improving statistics on waste and for defining indicators for environmental goals.
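How such collection records could be turned into simple waste statistics is sketched below. The record layout (tag identifier, municipality, weight in kg) and all values are hypothetical; the actual data model of the system is not described here.

```python
from collections import defaultdict


def waste_totals(collection_events):
    """Sum collected weight (kg) per bin and per municipality.

    collection_events: iterable of (tag_id, municipality, weight_kg) tuples,
    one per emptying of a bin (hypothetical record layout).
    """
    per_bin = defaultdict(float)
    per_municipality = defaultdict(float)
    for tag_id, municipality, weight_kg in collection_events:
        per_bin[tag_id] += weight_kg
        per_municipality[municipality] += weight_kg
    return dict(per_bin), dict(per_municipality)


events = [
    ("bin-001", "municipality A", 12.5),
    ("bin-002", "municipality A", 8.0),
    ("bin-001", "municipality A", 10.0),
    ("bin-003", "municipality B", 7.5),
]
per_bin, per_municipality = waste_totals(events)
print(per_bin, per_municipality)
```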

Currently the system exists in 30 municipalities in Sweden. Statistics Sweden is running a pilot project where data from RFID tags will be delivered by an API and evaluated.

Summary of other smart meters

Several different smart meters are in use that have the potential to provide data for producing statistics or to serve as a supplementary data source:

  • the water and gas meters can be used for identifying vacant dwellings and for energy statistics;
  • the heating meters could be used in energy statistics;
  • the weather station information can be used as a supplementary source to evaluate the impact of weather on energy consumption;
  • the data of power monitors can be used for tracking energy usage, allowing customers to better understand their energy consumption and to choose billing rates wisely;
  • the waste bin trackers can be used for environmental statistics.

Feasibility of using data aggregated at different levels

Electricity smart meter data describes a metering point (an observed unit), whereas statistics are produced about a business, household or dwelling (a statistical unit). Smart meter data is a time series of recordings of the consumption of electric energy at intervals of an hour or less. Raw smart meter data usually contains a metering point id, the timestamp of the recording and the consumption recorded in a fixed time period. In addition, there is information about the location of the smart meter and about the customer who has signed a contract with an energy provider. A smart meter is usually located in a building or in a part of a building that has an address. Address information is relevant for identifying the end consumer, as the contract owner may not be the actual end consumer.

The aggregation level and the additional metadata are relevant questions when an NSI gets access to smart metering data: together they determine which statistical products and analyses are possible. Data can be aggregated on a time scale, on a space scale or, when the end consumer is identified, on some other scale (e.g., area of activity of businesses). Using a proper aggregation level is also relevant for processing the data, as it is much more efficient to produce monthly statistics from monthly aggregated data than from the raw hourly data.
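The distinction between the observed unit and the address-level metadata can be illustrated with a small data structure sketch. The class and field names are hypothetical and do not describe the schema of any of the datasets used in the project.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class MeterReading:
    """One raw recording: metering point id, timestamp and consumption."""
    metering_point_id: str
    timestamp: datetime
    consumption_kwh: float


@dataclass(frozen=True)
class MeteringPoint:
    """Metadata about a metering point; any of it may be unavailable."""
    metering_point_id: str
    address: Optional[str] = None
    contract_holder: Optional[str] = None


def consumption_by_address(readings, points):
    """Link readings to addresses; readings without an address go to 'unknown'."""
    by_id = {p.metering_point_id: p for p in points}
    totals = {}
    for reading in readings:
        point = by_id.get(reading.metering_point_id)
        address = point.address if point and point.address else "unknown"
        totals[address] = totals.get(address, 0.0) + reading.consumption_kwh
    return totals


points = [MeteringPoint("mp1", address="Main Street 1"), MeteringPoint("mp2")]
readings = [
    MeterReading("mp1", datetime(2016, 1, 1, 0), 0.5),
    MeterReading("mp1", datetime(2016, 1, 1, 1), 0.25),
    MeterReading("mp2", datetime(2016, 1, 1, 0), 1.0),
]
totals = consumption_by_address(readings, points)
print(totals)
```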

Aggregation methodology

As the smart meter data is held in database tables, simple aggregation methods are used. Aggregation uses the GROUP BY clause, which gathers together all rows that share the same value in the specified column(s) and allows aggregate functions to be applied to the other columns. In the following examples, column1 is a spatial (address, municipality) or temporal (day, month) feature by which column2 (consumption) is aggregated.

Syntax in SQL:

SELECT column1, SUM(column2) AS Sum FROM table GROUP BY column1 

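Equivalent syntax in PySpark, assuming a DataFrame df and pyspark.sql.functions imported as F:

df.groupBy("column1").agg(F.sum("column2").alias("Sum"))

When the amount of data is small, pivot tables can be used instead.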


Either a space or a time dimension can be used for aggregation. Raw smart metering data without any additional metadata contains only the metering point identifier, timestamp and recording; this data can be used for classification, for finding general consumption patterns and for calculating total consumption at different time scales. When address information of the metering point or consumer information is available, regional and end-consumer-based statistics can be produced. When the address information is available only at a regional level, it is impossible to identify the end consumer, but it is still possible to calculate the regional consumption of electricity. The following table indicates at which level of aggregation a certain type of product can be produced (e.g., seasonally vacant dwellings can be identified if the aggregation level is quarterly or finer, whereas it is difficult to identify seasonality from yearly data).
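Aggregation along the time dimension and along the space dimension can be tried out with a small self-contained sketch that runs the GROUP BY pattern in an in-memory SQLite database. The table name, column names and values are hypothetical; the project data itself was held in other database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "metering_id TEXT, municipality TEXT, month INTEGER, consumption REAL)"
)
rows = [
    ("m1", "A", 1, 10.0),
    ("m1", "A", 2, 12.0),
    ("m2", "B", 1, 5.0),
    ("m2", "B", 2, 7.0),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)

# Time dimension: total consumption per month.
monthly = conn.execute(
    "SELECT month, SUM(consumption) AS total "
    "FROM readings GROUP BY month ORDER BY month"
).fetchall()

# Space dimension: total consumption per municipality.
regional = conn.execute(
    "SELECT municipality, SUM(consumption) AS total "
    "FROM readings GROUP BY municipality ORDER BY municipality"
).fetchall()

print(monthly)
print(regional)
conn.close()
```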

List of possible products

  1. Total consumption in time scale
  2. Classification
  3. Vacant dwellings
  4. Vacant dwellings with zero consumption
  5. Statistics by business activity area
  6. Regional statistics
  7. Seasonal statistics

Table 13 indicates which statistics can be produced from data at different aggregation levels. For example, if monthly aggregated data with address information is available, this data can be used to identify seasonally vacant dwellings and to produce regional statistics of energy consumption in different business areas, but it is not possible to classify dwellings by their hourly consumption patterns.

Table 13. Aggregation levels of smart metering data and possible statistical outputs

Aggregation level | Metering point | Address | Region
Raw data (hourly or less) | 1, 2, 7 | 1, 2, 3, 4, 5, 6, 7 | 1, 6, 7
Daily | 1, 7 | 1, 3, 4, 5, 6, 7 | 1, 6, 7
Monthly | 1, 7 | 1, 4, 5, 6, 7 | 1, 6, 7
Yearly | 1 | 1, 4, 5, 6 | 1, 6
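Table 13 can also be encoded as a lookup structure, so that the feasible products for a given dataset can be queried programmatically. The sketch below simply transcribes the table; the key names and the function name are hypothetical.

```python
PRODUCTS = {
    1: "Total consumption in time scale",
    2: "Classification",
    3: "Vacant dwellings",
    4: "Vacant dwellings with zero consumption",
    5: "Statistics by business activity area",
    6: "Regional statistics",
    7: "Seasonal statistics",
}

# (aggregation level, available metadata) -> feasible product numbers (Table 13)
FEASIBLE = {
    ("raw", "metering point"): {1, 2, 7},
    ("raw", "address"): {1, 2, 3, 4, 5, 6, 7},
    ("raw", "region"): {1, 6, 7},
    ("daily", "metering point"): {1, 7},
    ("daily", "address"): {1, 3, 4, 5, 6, 7},
    ("daily", "region"): {1, 6, 7},
    ("monthly", "metering point"): {1, 7},
    ("monthly", "address"): {1, 4, 5, 6, 7},
    ("monthly", "region"): {1, 6, 7},
    ("yearly", "metering point"): {1},
    ("yearly", "address"): {1, 4, 5, 6},
    ("yearly", "region"): {1, 6},
}


def feasible_products(level, metadata):
    """Return the names of the products feasible for the given combination."""
    return [PRODUCTS[number] for number in sorted(FEASIBLE[(level, metadata)])]


yearly_regional = feasible_products("yearly", "region")
print(yearly_regional)
```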

Aggregation quantification

To quantify the effect of aggregation, 1000 metering points were randomly selected from the 2016 metering data stored in an Oracle Database, and the extracted data was saved as a csv file. The columns are year, month, day, hour, metering_id, prod and consumption. Table 14 presents the table sizes, file sizes and processing times at the different aggregation levels.

Table 14. Aggregation levels

Aggregation level | Table Size (KB) | File Size (KB) | Processing time (s)
Quarter minute | 1.447.035 | 1.138.681 | 4
Hourly | 335.872 | 265.985 | 1
Daily | 14.336 | 10.683 | 0,178
Monthly | 448 | 311 | 0,094
Yearly | 128 | 18 | 0,089


The aggregation level and the availability of metadata about a metering point determine what kind of products can be produced from the electricity data. The more detailed the available information is, the more products can be produced, but at the same time more storage space and computing power are needed to handle the data. To speed up the calculations, it is recommended to use temporary tables of aggregated data even when the raw metering data is available.
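The recommendation to work from temporary tables of aggregated data can be sketched with an in-memory SQLite database, in which the daily table is built once from the hourly data and the monthly table is then built from the much smaller daily table. The table names and values are hypothetical, and SQLite is used only to keep the sketch self-contained; the column names follow the extract described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hourly (metering_id TEXT, year INTEGER, month INTEGER, "
    "day INTEGER, hour INTEGER, consumption REAL)"
)
# Hypothetical data: two full days in January and one in February.
rows = [("m1", 2016, 1, d, h, 1.0) for d in (1, 2) for h in range(24)]
rows += [("m1", 2016, 2, 1, h, 0.5) for h in range(24)]
conn.executemany("INSERT INTO hourly VALUES (?, ?, ?, ?, ?, ?)", rows)

# Aggregate once from hourly to daily and keep the result as a temporary table.
conn.execute(
    "CREATE TEMP TABLE daily AS "
    "SELECT metering_id, year, month, day, SUM(consumption) AS consumption "
    "FROM hourly GROUP BY metering_id, year, month, day"
)
# Monthly figures are then computed from the daily table, not from the raw data.
conn.execute(
    "CREATE TEMP TABLE monthly AS "
    "SELECT metering_id, year, month, SUM(consumption) AS consumption "
    "FROM daily GROUP BY metering_id, year, month"
)

daily_rows = conn.execute("SELECT COUNT(*) FROM daily").fetchone()[0]
monthly = conn.execute(
    "SELECT month, consumption FROM monthly ORDER BY month"
).fetchall()
print(daily_rows, monthly)
conn.close()
```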


Main findings

The aim of the study was to introduce new potential uses of smart electricity data for producing statistics, to provide a list of other smart meters and to analyze how the aggregation level affects the potential usage of the electricity data.

The electricity data has been used as an additional source for tourism statistics. At the country level it was possible to identify opposite seasonality in electricity consumption and tourism activity. A more detailed analysis can be done at the regional level when the data becomes available.

Several pilots use classification methods to identify vacant dwellings and seasonally used dwellings and to discriminate whether electricity is consumed by a household or a company. The results are promising and would improve as more smart metering data becomes available. There is a comprehensive overview of different methods and their performance in the classification task.

Regional electricity consumption by different sectors of the economy gives insight into regional activity in different areas.

The other smart meters could also serve as additional data sources (e.g., heating meters for measuring consumed energy).

The summary of aggregation levels gives other statistical offices hints about which products can be produced from data aggregated at a certain level. The resources needed to process the data depend on the aggregation level of the data.

Problems encountered

The electricity data has great potential for producing a new kind of statistics, but the implementations also had some setbacks. There were several legal, procedural and technical issues that made access to the data complicated. In many cases in our studies only a sample of the data was available, or the proportion of smart meter data was not high enough.

Future work

The results could be improved if more countries get access to the data and if the quality of the current datasets improves.


  1. Carroll P. et al. (2018): Household classification using smart meter data. Journal of Official Statistics, 34(1), pp. 1–25. DOI: http://dx.doi.org/10.1515/JOS-2018-0001
  2. European Commission (2014): Commission staff working document SWD/2014/0189, Cost-benefit analyses & state of play of smart metering deployment in the EU-27, accompanying the document Report from the Commission, Benchmarking smart metering deployment in the EU-27 with a focus on electricity.
  3. Celenza L., Dell’Isola M., D’Alessio R., Ficco G., Vigo P., Viola A. (2013): Metrological analysis of smart heat meter. Flomeko 2013, Paris.
  4. European Parliament, Council of the European Union (2002): Directive 2002/91/EC of the European Parliament and of the Council of 16 December 2002 on the energy performance of buildings.
  5. Kamstrup (2017): New possibilities with smart metering. Kamstrup A/S.