Data Collection & Cleaning


A project needs clean, reliable data before any research or analysis can happen. In many cases the data we use has already been created and collected by other parties, often government departments. These public datasets still need to be refined and cleaned before they are useful. This is a crucial stage of most research work, but it’s often overlooked because it is unglamorous, tedious and time-consuming.

We are gigantic nerds with years of back-office experience, so we don’t mind this part of the work. We are old hands when it comes to cleaning data in Excel and presenting it in a format that you can use, whether it’s filtering and sorting data, building pivot tables, or even automating your workflows through VBA macros. 

We have also become adept at creating our own data sources and at working with our partners to transform their data into useful information and valuable insights. If the information doesn’t exist we can find a way to create it or model it for you. One easy way to collect data is by creating targeted online surveys. We have created surveys using a variety of software packages that range from freeware (Google Forms) to specialised web-based application software (Caspio and Zoho).

Case Studies

Renewable Energy Independent Power Producer Procurement (REIPPP) Programme supplier database

ED Platform provides advisory, monitoring and reporting services in the energy and infrastructure spheres. The company consults to a number of independent power producers (IPPs). We worked with ED Platform and provided data analysis as part of the company’s overall service offering.

As part of its monitoring and evaluation work, ED Platform evaluates the supplier and procurement data of its clients. In partnership with ED Platform we created a supplier database using procurement data from more than one million transactions across 20 IPP construction projects.

Cleaning such a big dataset required the use of different tools, including Excel and OpenRefine. The cleaning included the removal of duplicate data, fixing data that had been incorrectly captured, and a bit of extrapolation in cases where the data fields were incomplete or missing.
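The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the Excel and OpenRefine workflow used on the project: the field names (supplier, invoice, amount, industry) and the rule for filling a blank industry from the same supplier's other records are assumptions made for the example.

```python
from collections import Counter, defaultdict


def normalise_name(name):
    """Collapse whitespace and case so variants of one supplier name compare equal."""
    return " ".join(name.split()).casefold()


def clean_transactions(rows):
    """Drop duplicate transactions, tidy supplier names, fill blank industries.

    `rows` is a list of dicts with the hypothetical keys
    'supplier', 'invoice', 'amount' and 'industry'.
    """
    seen = set()
    unique = []
    for row in rows:
        # Two rows count as duplicates if supplier, invoice and amount match.
        key = (normalise_name(row["supplier"]), row["invoice"], row["amount"])
        if key in seen:
            continue
        seen.add(key)
        # Fix a common capture error: stray or repeated spaces in the name.
        unique.append(dict(row, supplier=" ".join(row["supplier"].split())))

    # Count the non-blank industries recorded for each supplier.
    industries = defaultdict(Counter)
    for row in unique:
        if row["industry"]:
            industries[normalise_name(row["supplier"])][row["industry"]] += 1

    # Fill a blank industry with the supplier's most common recorded one.
    for row in unique:
        if not row["industry"]:
            counts = industries.get(normalise_name(row["supplier"]))
            if counts:
                row["industry"] = counts.most_common(1)[0][0]
    return unique
```

Records for a supplier with no industry recorded anywhere are left blank rather than guessed, so they can be reviewed by hand.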

Once the dataset was cleaned it was visualised in Microsoft Power BI and CARTO. We created a geospatial tool that allows users to locate suppliers across South Africa by industry and B-BBEE empowerment rating.

Community newspaper survey

TRi Facts is the commercial training and research division of Africa Check, Africa’s first fact-checking organisation. We worked with TRi Facts to create a survey for community newspapers across southern Africa.

We built the survey in Caspio and deployed it through an online link, so respondents could click the link and start the survey immediately. Caspio allows for the creation of dynamic surveys, in which certain sections stay hidden unless the user responds in particular ways to earlier questions and sections. This reduces confusion and directs each user through the survey.
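The idea behind a dynamic survey can be illustrated with a small Python sketch. It does not use Caspio or reproduce the actual survey: the question ids and the "show_if" rule format are hypothetical, chosen only to show how an earlier answer can decide which later questions appear.

```python
# Each question may carry a rule: show it only if an earlier
# question (first item) received a given answer (second item).
QUESTIONS = [
    {"id": "has_website", "show_if": None},
    {"id": "website_visitors", "show_if": ("has_website", "yes")},
    {"id": "staff_count", "show_if": None},
]


def visible_questions(questions, answers):
    """Return the ids of the questions a respondent should see,
    given the answers collected so far."""
    visible = []
    for question in questions:
        rule = question.get("show_if")
        if rule is None or answers.get(rule[0]) == rule[1]:
            visible.append(question["id"])
    return visible
```

A respondent who has not said "yes" to has_website never sees website_visitors, so the survey stays short and relevant for each paper.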

The responses from the survey users can be exported to Excel or any database for analysis. We gathered responses from about 70 community newspapers and analysed them with the TRi Facts team. From this analysis we were able to tell a story of community newspapers in the region: how many staff were employed, what a typical skill set within a community paper looked like, and so on.