Data Hub processing: spreadsheet data

Goal: Technical processing of spreadsheet datasets on flow, stock and demographic data into a machine-readable format on the CityLoops Data Hub.

Note: The processing in this chapter refers to the technical processing on the Data Hub. If you first need to process your raw data to bring it into a common unit, please refer to the chapters on “Conversion”, “Aggregation and disaggregation” and “Downscaling”.

Approach

  1. Prepare: Ensure that you have prepared your data sufficiently, so that it is in the correct weight-based unit for your reference year. This also means that you need to sort out the type of information:
    • For example, one file may contain both (1) the location of actors and (2) their employee numbers. Now is the time to separate the information into as many files as needed, instead of keeping it bundled up. You can keep the same source, but you need to format the files in different ways.
  2. Check that reference spaces exist: Check if you have already added the reference spaces for which you will upload data. The reference space needs to exist FIRST, before you can add data for that reference space. For example, if you have data on Eastern Finland, but there is no such reference space yet with the name “Eastern Finland”, then the system cannot connect this data. You should then go back and add and process the shapefile that will produce “Eastern Finland”.
    • Important: Be sure to use the correct spelling for the reference space. In the video, it was accidentally stated as “Hoje Taastrup”. It should be Høje-Taastrup in this case.
  3. Bring into correct format: Bring the data into a machine-readable format, in the form of spreadsheets, with the correct way to input data in each of the columns (units, time, quantities). The format (and therefore the template) depends on the type of dataset that you have. There are four general types of queues (a minimal sketch of a prepared file follows the chapter links below):
    • Flow
    • Stock
    • Demographic (people as stock)
    • Economic
    Sometimes it is easier to copy the raw data into the template. In other instances, you may want to clean up your file first; see below.
  4. Correct order: Ensure that the columns are in the correct order, or at least that all the information is there (the order of the columns can be changed on the Data Hub during processing).
  5. Use the uploading process: For each type of data (flow, stock and demographic data), check in the respective chapter where you need to upload the data and what the exact processing steps are:

Flow data processing

Stock data processing

Population data processing
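
As a rough illustration of step 3, the sketch below shows what a prepared flow dataset might look like before upload. This is a minimal sketch only: the column names, their order and the values are assumptions for illustration, and the authoritative layout comes from the template spreadsheet on the Data Hub.

```python
# Minimal sketch only: column names, order and values are illustrative assumptions;
# the authoritative layout comes from the Data Hub template spreadsheet.
import pandas as pd

flows = pd.DataFrame(
    [
        # date (YYYY-MM-DD), reference space, material, quantity (placeholder), unit
        ["2016-01-01", "Høje-Taastrup", "Mixed municipal waste", 123.0, "t"],
        ["2017-01-01", "Høje-Taastrup", "Mixed municipal waste", 456.0, "t"],
    ],
    columns=["date", "reference_space", "material", "quantity", "unit"],
)

# Save as a plain, machine-readable file that can be copied into the template.
flows.to_csv("flows_hoje_taastrup.csv", index=False)
```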

Important note: Ensure that the data in column A is in the correct format of YYYY-MM-DD. This was not clearly shown in the video. Instead of just the year (e.g. 2016), it should be a specific date, written for example as 2016-01-01 if the data is from 1 January 2016, or 2019-12-31 if it is from 31 December 2019. See also the example here.
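
If your raw data only lists the year, a small script can expand it into the required YYYY-MM-DD form. A minimal sketch, assuming the year sits in a hypothetical column called "year":

```python
import pandas as pd

# Hypothetical input: a "year" column holding plain years such as 2016.
df = pd.DataFrame({"year": [2016, 2019], "quantity": [1.0, 2.0]})

# Expand each year into a full YYYY-MM-DD date string,
# here pinned to 1 January of that year.
df["date"] = df["year"].astype(int).astype(str) + "-01-01"

# Sanity check: this raises an error if any string is not a valid date.
pd.to_datetime(df["date"], format="%Y-%m-%d")

print(df[["date", "quantity"]])
```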

How to clean up a file

Your raw data may come in various files, formats, languages and text encoding standards, with built-in formatting, linked cells, formulas, etc. Therefore, you will need to clean up your file (ideally a copy of it); a scripted sketch of a few of these steps follows the list below:

  • Work through your file tab by tab and page by page.
  • Remove all the extra info or notes at the top and bottom of the file.
  • Remove extra (empty) tabs.
  • Remove unnecessary info (but note it, for example, in the description of the file on the Data Hub).
  • Unmerge merged cells.
  • Remove any title rows so that headers are in row 1.
  • Convert your numbers so that decimal separators are points and not commas.
  • etc.
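
A minimal scripted sketch of a few of these steps, assuming the raw file is an Excel workbook with two note rows above the header and decimal commas (the file, sheet and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical raw workbook: two note rows above the real header and
# numbers written with decimal commas (e.g. "1.234,5").
raw = pd.read_excel(
    "raw_waste_data.xlsx",   # work on a copy of the original file
    sheet_name="2016",       # go tab by tab
    skiprows=2,              # drop title/note rows so the headers end up in row 1
)

# Drop fully empty rows and columns left over from the original formatting.
raw = raw.dropna(how="all").dropna(axis=1, how="all")

# Convert decimal commas to points in a (hypothetical) "quantity" column.
raw["quantity"] = (
    raw["quantity"]
    .astype(str)
    .str.replace(".", "", regex=False)   # strip thousands separators
    .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    .astype(float)
)

# Save a cleaned copy and keep the original untouched.
raw.to_csv("cleaned_waste_data.csv", index=False)
```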

Do you want to test cleaning up a file?

You can get started on this rather easy one.

Test your skills on this rather complex one - those exist too.

Tools

Outline of the video

  • What does it mean to process data? It is one of 3 major steps of the SCA and also of the Metabolism of Cities Data Hub.
  • M2:33: Processing means to convert a document that has been attached or is part of the library into a machine-readable format that our website can understand and can then visualise in a number of different ways instead of just a static image.
    • Convert raw data into data and information points that our system can use, which can then be presented, manipulated and then used for modelling and analysis in the third phase of the Data Hub.
  • The main goal of this step is to put the info into a machine-readable format.
  • Other important goals are to make visualisations and link data to locations (to a house or a waste treatment facility, for example).
  • M4:26, It also helps us to extract the data for Sankey visualisations and indicator calculations. Once that data has been processed, then it can be analysed.
  • M4:31, Examples of visualised data (flows, stocks, population) from the Data Hub
  • M11:14, What needs to be done to get those visualisations? The data needs to be put into spreadsheets, with the correct way to input data in each of the columns (units, time, quantities).
  • The format depends on the type of dataset that you have. There are four general types of queues:
    • Flow
    • Stock
    • Demographic (people as stock)
    • Economic
  • Each has its own template that is used to put the information into the correct order, or at least to ensure that all the info is there (the order of the columns can be changed on the Data Hub during processing)
  • M12:43, We’ve made a single spreadsheet with all the templates needed for the various spreadsheet types.
    • You can also get the files from the links in the overview tab.
    • There is also an overview tab with filled examples (especially the one on segments, which is a drill-down option for materials)
    • You can only VIEW this file, but should be able to download it, so that you can sort your data accordingly.
  • M16:18, Remember the importance of reference spaces: The reference space needs to exist FIRST, before you can add data for that reference space. For example, if you have data on Eastern Finland, but there is no such reference space yet with the name “Eastern Finland”, then the system cannot connect this data. You should then go back and add and process the shapefile that will produce “Eastern Finland”, as you have learned in Module 1. (Overview of NUTS 2 and NUTS 3 that already exist as reference spaces for the cities)
  • M18:35, Something else that you need to sort out is the type of information:
    • For example, one file contains both (1) the location of actors and (2) their employee numbers. Now is the time to separate the information into as many files as needed, instead of keeping it bundled up. You can keep the same source, but you need to format the files in different ways.
    • In this case, the actors’ locations are needed as GPS coordinates in one file, and the employee numbers as a dataset for the economic data queue in another. The system can later match them again through reference spaces (a minimal sketch of such a split follows at the end of this outline).
  • Main takeaway: we will focus on reshaping data so that the system can understand it.
  • In other videos we will dive deeper into the different types and give examples for each.
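
To make the file-splitting step above concrete, here is a minimal sketch of such a split, assuming a single combined source file with hypothetical columns for actor name, coordinates and employee count:

```python
import pandas as pd

# Hypothetical combined source file: one row per actor with both the
# location (GPS coordinates) and the employee information.
actors = pd.read_csv("actors_combined.csv")

# File 1: actor locations, to be processed as GPS coordinates.
actors[["actor_name", "latitude", "longitude"]].to_csv("actor_locations.csv", index=False)

# File 2: employee counts, to be processed in the economic data queue.
actors[["actor_name", "employees"]].to_csv("actor_employees.csv", index=False)
```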