Using Python to automate a complicated analytics data flow
A large yearly survey (circa 60,000 respondents and 1,500 fields) was causing major data management and analysis headaches due to the wide variety of software tools being used to process and analyse the data. We streamlined the entire process, removing the reliance on previous tools, whilst reducing time, space requirement and errors. Furthermore, our code can be applied to future surveys with minimal alterations required.
Our client was conducting fairly complicated analysis for their main yearly survey using a combination of SPSS, MS Access and MS Excel. Despite the various processes being streamlined over many years (so there were very few amendments ever to the analysis spec), the whole procedure still involved many manual imports, copy and paste operations, and complicated Excel look-up functions. This proved cumbersome for the analysts to replicate each year, and prone to error.
We used Python to construct a normalised SQLite database, extract the data from the original SPSS file, clean it and upload to said database. We then constructed generic Python functions and procedures to mimic the existing analysis and export the results into a workbook in exactly the same format as for previous years. Since many of the queries and results were similar cuts of the data, we could use many of the same routines repeatedly.
This new methodology had multiple advantages:
- Space: the SQLite database was around a quarter of the size of the Access database and a sixth of the size of the SPSS file
- Manageability: although Python doesn’t have the advantages of an easy to build query GUI, it is much easier to construct multiple queries and similar pieces of analysis using the same piece of code
- Separation of analysis from data: one of the main criticisms of Excel, say, is that often data and the analysis are stored in the same workbook. Here we had complete separation of data, analysis and output, making alterations and debugging much simpler (see ExcelProblems blog post for more details on this)
- No copy and pasting: we had a complete data pipeline from SPSS dataset to SQLite database to output, thus removing the need to manually move data from one place to another manually
- No hard coding of formulae: all analysis was written in code, making errors easier to spot
- Repeatability: the analysis can be almost entirely repeated in future years with minimal changes to the code required. A priority since this was a yearly survey