Spreadsheets, in particular, Excel are probably the most widely used types of data analysis software. Everyone uses them with varying degrees of success. The more advanced data analysts, those proficient in say, SAS, SQL, R or Python may look down on them; but they have many virtues. Not least in their simplicity of use and visibility of both data and analysis (although see below).

I recently gave a lecture discussing some of the pros, but mainly, cons of spreadsheets (mainly Excel) from a fairly experienced data analyst point of view. It was mainly concerned with their use as a pseudo database / data analysis tool; here are some of the main points.

Problems

  • They are too easy to use – so people use them badly

One of their main pros is indeed arguably their main con. They are so easy to use that people who don’t really understand simple formulae end up massively over complicating things. They further encourage lazy behaviour from users. For example, it is easy to hard-code simple formulae without thinking of the long-term consequences. Furthermore, their lack of rules on data types and structures can also result in it being far too easy to make mistakes. A list of some famous Excel mistakes can be found here, for example.

  • No separation from data / analysis

A main gripe from genuine computer architects is the lack of separation between data and analysis. With Excel and indeed all spreadsheets, all the data, queries, graphs, reports etc. is usually held in the same workbook. This goes against much the principles of Data Science and Management teachings. Data should be protected and stored away from any analysis / reporting and kept immutable in this way

A further issue is the fact that all formulae are coded inside of cells instead of being “virtual” as they are in a piece of code. This can lead workbooks becoming far too large and difficult to manage. Error checking also becomes difficult.

  • There is little in-built data integrity

The flexibility of spreadsheets again can be their downfall. For there is nothing to stop multiple field types being in any column / row. This can obviously lead to data integrity problems. Now you can protect cells, put in error catches etc. but there is nothing really to mimic the field definition, constraint restrictions that exist in a structured dataset, like a SQL database.

  • It is difficult to join data sets together across worksheets / workbooks

One of the advantages of databases is their ability to join tables easily. The only real equivalent in Excel and others is Match, Index and V and Hlookups, and the more generic Lookup – although the latter three I don’t like but everyone has their own prejudices.

(See, https://www.exceltip.com/lookup-formulas/6-formulas-to-lookup-in-excel.html for example usage). However, these are quite cumbersome to use and can become very slow.

More problematic is that to use these lookup functions, in order to match datasets across the workbook, or produce summary statistics and the like, you normally have to ensure that the field / column values are repeated in each relevant row. So, you can end up with many cells unnecessarily populated, and a vastly bloated workbook. Contrast this with a normalised database, with explicit aim of saving space.

  • Chains of calculations across worksheets and workbooks

Both novice and advanced spreadsheet users fall into this trap: building workbooks where there are formulae upon formulae across workbooks. When this happens, workbooks become then very difficult to navigate around. Moreover, these chains make error checking very difficult – you can end up searching through multiple worksheets to find what the source is. Even with built in tools this can be very laborious. Just as problematic is how calculations begin to slow down as some of the formulae continually recalculate (to their credit, Excel developers have dealt with some of the issues with volatile functions and so on, but look-up functions just seem to stop working after a while – for more detail on volatility see, for example https://www.fm-magazine.com/news/2019/jun/microsoft-excel-volatile-functions-201920646.html).

  • Simple SQL-like queries are often hard to construct

Trying to build simple GroupBy queries with criteria in Excel can be very difficult. There has been a huge improvement in later version of Excel with the gradual replacement of array formulae with CountIfs, AverageIfs etc., (see, for example, https://exceljet.net/excels-racon-functions) but it’s still difficult to use. Furthermore, it’s very slow on larger datasets, and we still suffer from the problem that you can’t easily join datasets (see above).

  • Version control issues

A big bugbear is when you have multiple copies of the same spreadsheet being used by different people in the same organisation. Now, with OneDrive etc., collaboration is encouraged but this leads to its own problems. So, people just tend to copy the “main” spreadsheet and use that. Version control becomes difficult to manage and the data is definitely not protected.

  • Referential integrity issues

In a well-constructed database, deletion of records will cascade, so there won’t be “hanging” records (see, for example https://database.guide/what-is-referential-integrity). However, unless much care is taken, this is not the case with spreadsheets. E.g., you may accidentally delete a cell from one worksheet, or simply just change its value, this may cause errors in any cells that point at it. And tracking back the cause of an error when you see #REF! is no fun.

Obviously, you can put data validation on various cells, but it is very easy to forget to do this, since it’s not an inbuilt requirement. Contrast this with strict database table definitions.

Benefits

  • Visibility, visibility, visibility

In my opinion, their greatest benefit. Because everything is open with easy-to-use GUIs, you can always see what you are doing. You don’t have to scramble through pages of complicated code. I almost always try things out in Excel before opening a Python text pad.

  • Ease of use

Again, often cited as its main pro, it is incredibly easy-to-use. It has opened up simple data management and analysis to all. It is flexible with regards to type of data it can read, and doesn’t require the import of various module to conduct forms of analysis

  • Excel is faster and has a greater range of functions than you may think

With carefully designed workbooks (and the minimal number of volatile functions), Excel can process data very rapidly. Furthermore, the range of its statistical functions is rapidly growing

  • It’s can be used to construct syntax for other languages

Excel’s text functions, coupled with its easy autofill make it perfect for constructing syntax for other scripting languages. It is easy to put together a few lines of SQL, or SPSS syntax in different cells to construct an entire script.

  • They are ubiquitous

Not sure this is a definite benefit, but whether you like it or not, spreadsheets are used everywhere and by everyone. So, any good data analyst MUST be fluent in spreadsheet use.

As you can see, there seem to be far more cons than pros. However (and, as I told my students), in many respects, some of the criticisms levelled at spreadsheets are a little unfair. In my experience, most problems seem to occur due to a lack of experience or aptitude of the user. Generally, if care is taken with their construction, they can be used effectively as a data management and analysis tool, and, for small datasets, they can mimic some of actions of relational database. However, if you require a long-term protected data repository, it’s probably best to look elsewhere.

Leave a Reply

Your email address will not be published.