We are dealing with a lot of data

Since its creation in 2007, the European Research Council has been supporting the broad principle of open access to the published output of research as a fundamental part of its mission. ENERGYA has opted for the most open solution, following the Open Research Data pilot, which means that the project will make its data, metadata and tools all available for free, if possible. This ERC protocol specifies that we will:

  1. deposit in a research data repository […]  for third parties to access, mine, exploit, reproduce and disseminate — free of charge for any user […] : the data […] and other data, including associated metadata, as specified in the data management plan;
  2. provide information […] and — where possible — provide the tools and instruments themselves.

Our Data Management Plan describes the types of data that we will be using and which of them will be made available in our Database under Energy and Climate data.

While our intention is to make as many outputs of our research as possible freely available, we still have to respect the property rights attached to the data we use as input, which depend on the policy of the institution that maintains each dataset. For example, ODYSSEE, EnerDemand, the National Sample Survey Organization (NSSO) survey, and SUSENAS are very expensive datasets maintained either by a specific company (ENERDATA, for ODYSSEE and EnerDemand) or by national statistical offices (India for NSSO and Indonesia for SUSENAS). When raw data cannot be made open access through our website, we still can and will report some aggregate patterns.

Another interesting point is that some of the data we plan to use have been around for a while, yet they have not been used to understand adaptation.

For example, the Environmental Policy and Individual Behaviour Change (EPIC) survey was conducted by the OECD in 2011. It is a very rich survey about households’ energy consumption behaviors, including the adoption of technologies relevant for adaptation, such as air conditioners or thermal insulation. The ODYSSEE and EnerDemand databases from ENERDATA contain information on the use of specific energy services, including space heating and cooling, and can shed light on the use of AC in relation to weather conditions.

ENERGYA will combine these data sources on energy with climate and weather data to identify patterns and causal effects, as depicted in our nice infographic.

On the left part of our infographic you can see all the data we are collecting in this early phase of the project, while we work on the definition of a framework.

Our main data source for climate and weather data will be GLDAS-2. The challenge with climate data is storage and processing speed: specific climate indices relevant for health and energy have to be computed starting from the raw climatic variables. Climatic data have a very high temporal (3-hourly) and spatial (grid cells of approximately 25×25 km) resolution. Even if you end up working with aggregated data, either over time or spatially to the country level, you need a huge amount of storage and computing capacity. For this reason we rely on our collaboration with CMCC and its supercomputer Athena, located in Lecce, which has almost 8,000 cores in total, a peak performance of 160 TFlops, and a total storage capacity of 600 TB of disk space.
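To give a sense of scale, here is a rough back-of-the-envelope sketch in Python. The grid dimensions (a full global 0.25° grid of 1440 × 720 cells, which roughly matches the 25 km resolution), the 4-byte value size, and the function name are illustrative assumptions, not the actual GLDAS-2 file layout.

```python
def raw_volume_bytes(n_lon=1440, n_lat=720, steps_per_day=8,
                     days=365, bytes_per_value=4, n_variables=1):
    """Uncompressed size of gridded 3-hourly data.

    All defaults are assumptions made for illustration:
    a 0.25-degree global grid, 8 time steps per day (3-hourly),
    one non-leap year, and 32-bit floating point values.
    """
    return n_lon * n_lat * steps_per_day * days * bytes_per_value * n_variables


if __name__ == "__main__":
    one_var_one_year = raw_volume_bytes()
    # 1440 * 720 * 8 * 365 * 4 = 12,109,824,000 bytes
    print(f"one variable, one year: {one_var_one_year / 1e9:.1f} GB")

    ten_vars_forty_years = raw_volume_bytes(days=365 * 40, n_variables=10)
    print(f"ten variables, forty years: {ten_vars_forty_years / 1e12:.1f} TB")
```

Under these assumptions a single variable takes about 12 GB per year before compression, so a handful of variables over several decades already reaches the terabyte range before any index is computed.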

ATHENA: The IBM iDataPlex supercomputer, based on the Intel E5-2670 multicore architecture and an InfiniBand FDR interconnect, is integrated with two DDN SFA10000 storage subsystems capable of offering a storage capacity of up to 840 TB in total and an I/O performance of 6 GB/s per disk array.

Our plan is to develop a dataset covering a range of climate indices relevant for health and energy, for each 25×25 km grid cell globally, over both the historical past and future time scales. This database will be the first of its kind at high resolution and global scale, and once it is published it will be made publicly available to the scientific community.
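As an illustration of what computing a climate index from raw 3-hourly data can look like for a single grid cell, here is a minimal Python sketch of cooling and heating degree days. The choice of index, the 18 °C base temperature, and the function names are assumptions made for this example; they are not the project's final definitions.

```python
from statistics import fmean


def daily_means(temps_3h, steps_per_day=8):
    """Average 3-hourly temperatures (in °C) into daily means."""
    if len(temps_3h) % steps_per_day != 0:
        raise ValueError("series must contain whole days")
    return [
        fmean(temps_3h[i:i + steps_per_day])
        for i in range(0, len(temps_3h), steps_per_day)
    ]


def cooling_degree_days(temps_3h, base_c=18.0):
    """Sum of daily-mean excess above the base temperature (assumed 18 °C)."""
    return sum(max(0.0, t - base_c) for t in daily_means(temps_3h))


def heating_degree_days(temps_3h, base_c=18.0):
    """Sum of daily-mean shortfall below the base temperature (assumed 18 °C)."""
    return sum(max(0.0, base_c - t) for t in daily_means(temps_3h))


if __name__ == "__main__":
    # Two made-up days for one grid cell: a warm day at 20 °C, a cool day at 15 °C.
    series = [20.0] * 8 + [15.0] * 8
    print("CDD:", cooling_degree_days(series))
    print("HDD:", heating_degree_days(series))
```

Repeating this kind of calculation for every grid cell and every year, for historical and projected data alike, is what drives the computing requirements described above.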