Over the past few months, we have developed a system for a client to automate the downloading and cleaning of data from Google, YouTube and Facebook Analytics using the relevant APIs. (The system also uploaded the cleaned data to a cloud database for them.) It saved the client from either using something like Supermetrics to get the relevant data (which still involves a fair amount of user input), or manually downloading from the various web-analytics back-ends. Since what we have built is easily customisable to other users’ needs, we have decided to give a short explanation of how we constructed it (the project itself can be found here).

We will only give a top-line explanation of the processes involved in developing the system, for those who are thinking of getting a similar system designed (or are thinking of building one themselves!). For more expert users, there are a few technical guides available on Medium, for example, plus extensive documentation on each developer’s site. (We would, of course, answer any questions that people send to us.)

We developed the whole system in Python, so we’ll be explaining the process from that standpoint. Other languages may have their own quirks. We assume that the reader is familiar with some web-analytics terms, e.g., metrics and dimensions. We note that we could only achieve any of this because we had administrator rights for the site; these allowed us to access the various APIs we required to report on the various web-analytics platforms.

For each of the reports we created, the general process can be summarised as:

  1. Get an API Key / page token / secrets file

  2. Create a generic reporting module to produce the required report using said API

  3. Create a look-up dictionary of the metrics / dimensions you wish to download

  4. Iterate over the dictionary / .json file to download all the required data and turn it into a DataFrame (or other data-set)

  5. Clean and reformat the DataFrame, to either export it or upload it to a database

Get an API Key / page token / secrets file

The initial set-up involved registering a new application on the relevant developer’s site (using the relevant user name and password), then generating an API key and a “secrets” file for the particular application. This allowed us to use the particular APIs and download the data we required. We needed to make sure that our application could access all the APIs it needed, which wasn’t always clear up-front; e.g., the YouTube API contains some of the information on the videos we were interested in, whereas the YouTube Analytics API contains the various metrics.

For GA and YouTube, the secrets file alone sufficed. However, with Facebook we also needed a “page token” for each page we intended to access data from. This can become quite unwieldy if you administer many pages, and contrasts sharply with GA, where you can access data for multiple views (i.e., sites) with the same secrets file.
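
As an illustration, here is roughly how the Google side of the set-up looks in Python. This is a sketch only, assuming the google-api-python-client and google-auth-oauthlib libraries and the OAuth installed-app flow; the file name and scopes are placeholders. (The Facebook page token, by contrast, is simply a string passed with each Graph API request, as shown further below.)

```python
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

# Placeholder file name and scopes: request only the APIs your application
# has been granted access to.
SECRETS_FILE = "client_secrets.json"
SCOPES = [
    "https://www.googleapis.com/auth/analytics.readonly",
    "https://www.googleapis.com/auth/youtube.readonly",
    "https://www.googleapis.com/auth/yt-analytics.readonly",
]

# OAuth installed-app flow: opens a browser the first time so the
# administrator can grant access, then returns reusable credentials.
flow = InstalledAppFlow.from_client_secrets_file(SECRETS_FILE, SCOPES)
credentials = flow.run_local_server(port=0)

# The same credentials can back several services.
ga_reporting = build("analyticsreporting", "v4", credentials=credentials)
yt_data = build("youtube", "v3", credentials=credentials)
yt_analytics = build("youtubeAnalytics", "v2", credentials=credentials)
```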

On each developer’s page, there is a simple(ish) GUI to test the metrics / dimensions you wish to download. This is particularly useful with the YouTube API as many of the metric / dimension combinations are not permitted. We made extensive use of this as a testing facility.

One piece of good news we discovered was that you can make thousands of API calls a day to most of these analytics services. The exception is Facebook Ads, which bizarrely limits your usage quite severely, thus hindering development (we don’t discuss it here).

Create a generic API module

Each of the APIs has various routines that you can use to call its different reporting mechanisms, be they Facebook page or post insight metrics, or YouTube viewing data. Our first step was therefore to create a generic module that could call the relevant reporting routine. The construction of such modules is fairly well documented on the developers’ sites, often with relevant code on GitHub pages.

We felt it easier to separate the creation of an API object from that of a reporting object, although this was not strictly necessary; indeed, most of the guides simply show you how to create the reporting object in one go.

In all cases, the key parameters required were the start and end dates for the report, plus the relevant metrics (and dimensions) we wished to download. Some of the API calls also required other, more specific parameters: e.g., Facebook Page Insights requires the period for which to return the data, be it day, week or 28 days; Google Analytics requires a view ID. Each routine returns the data for the dates you specified as a dictionary / .json file for you to inspect and reformat.
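
As a concrete example, a minimal generic reporting routine for Google Analytics might look something like the sketch below. It is illustrative rather than our exact code: ga_reporting is the service object built earlier, and the view ID, dates, metrics and dimensions are placeholders.

```python
def get_ga_report(ga_reporting, view_id, start_date, end_date, metrics, dimensions):
    """Call the GA Reporting API v4 and return the raw dictionary response."""
    body = {
        "reportRequests": [
            {
                "viewId": view_id,
                "dateRanges": [{"startDate": start_date, "endDate": end_date}],
                "metrics": [{"expression": m} for m in metrics],
                "dimensions": [{"name": d} for d in dimensions],
            }
        ]
    }
    return ga_reporting.reports().batchGet(body=body).execute()


# Example call with placeholder values.
response = get_ga_report(
    ga_reporting,
    view_id="XXXXXXXX",
    start_date="2019-01-01",
    end_date="2019-01-31",
    metrics=["ga:sessions", "ga:users"],
    dimensions=["ga:date", "ga:deviceCategory"],
)
```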

(To be honest, Facebook doesn’t really have dimensions to pass per se; the period parameter replaces them to some degree.)

We mentioned above that not all combinations of dimensions / metrics are permitted in the YouTube API. Furthermore, some metrics required filters or sorts before they would be returned: e.g., for many of the viewing metrics, we needed to include a sort on the estimatedMinutesWatched metric. Quirks such as this made working with the YouTube API far more difficult than it should have been.
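
For instance, a per-video viewing report from the YouTube Analytics API needs the sort on estimatedMinutesWatched mentioned above. A rough sketch, with yt_analytics being the service object built earlier and the metric and dimension choices purely illustrative:

```python
def get_youtube_report(yt_analytics, start_date, end_date, metrics, dimensions):
    """Query the YouTube Analytics API. Note the sort parameter, which some
    metric / dimension combinations require before data is returned."""
    return yt_analytics.reports().query(
        ids="channel==MINE",
        startDate=start_date,
        endDate=end_date,
        metrics=",".join(metrics),
        dimensions=",".join(dimensions),
        sort="-estimatedMinutesWatched",
        maxResults=25,
    ).execute()


response = get_youtube_report(
    yt_analytics,
    start_date="2019-01-01",
    end_date="2019-01-31",
    metrics=["views", "estimatedMinutesWatched", "averageViewDuration"],
    dimensions=["video"],
)
```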

Having said that, the creation of the routines for the Facebook Post Insights was by far the most involved. If you ever wish to construct a similar system yourselves, we suggest becoming familiar with the others before tackling this one. (Another point of note is that Facebook Page Insights only allows you to return 90 days of data in one hit, which can cause problems.)
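
To illustrate the 90-day restriction, here is one possible way to chunk a longer date range into separate Graph API calls using the requests library and a page token. The page ID, token and API version are placeholders, and this is a sketch rather than our production code:

```python
from datetime import date, timedelta

import requests

PAGE_ID = "YOUR_PAGE_ID"        # placeholder
PAGE_TOKEN = "YOUR_PAGE_TOKEN"  # placeholder; one token per page
GRAPH_VERSION = "v3.3"          # placeholder; use a current Graph API version
GRAPH_URL = f"https://graph.facebook.com/{GRAPH_VERSION}/{PAGE_ID}/insights"


def get_page_insights(metrics, start, end, period="day"):
    """Download Page Insights in chunks of at most 90 days per request."""
    results = []
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=89), end)
        response = requests.get(
            GRAPH_URL,
            params={
                "metric": ",".join(metrics),
                "period": period,
                "since": chunk_start.isoformat(),
                "until": chunk_end.isoformat(),
                "access_token": PAGE_TOKEN,
            },
        )
        response.raise_for_status()
        results.extend(response.json().get("data", []))
        chunk_start = chunk_end + timedelta(days=1)
    return results


insights = get_page_insights(
    ["page_impressions_unique", "page_consumptions"],
    start=date(2019, 1, 1),
    end=date(2019, 6, 30),
)
```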

Create look-up dictionary of metrics / dimensions you wish to download

One thing we quickly noticed is that the names of the metrics in the API do not match what you see in the GUI you are used to using: e.g., Facebook’s Total Reach is page_impressions_unique; Total Clicks is page_consumptions. In addition, some metrics may have been deprecated, others may not be viewable at all via the API (this was particularly true for YouTube), and others we had to construct ourselves (e.g., all the percentage measures). Luckily, most of the actual definitions could be found on the various developers’ websites.

The other major problem with the Facebook metrics is that, when you export the data via the API, they are not all held at the same “level” of the dictionary. E.g., page_impressions_unique and page_consumptions sit at one level, whereas both ‘Page likes’ and ‘Page shares’ are held inside a “sub-level”: they both live in a dictionary called page_positive_feedback_by_type, keyed by “like” and “link” respectively, and it is this dictionary that sits at the same level as page_impressions_unique et al.
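
Schematically, one day’s worth of returned Facebook data therefore looks something like the following (values made up, and the real response has more wrapping around it):

```python
# Illustrative shape only: values are made up and the real response has
# more wrapping (metric names, periods, end_time keys, paging, etc.).
facebook_day = {
    "page_impressions_unique": 1520,
    "page_consumptions": 340,
    "page_positive_feedback_by_type": {  # the "sub-level"
        "like": 25,                      # -> 'Page likes'
        "link": 7,                       # -> 'Page shares'
    },
}
```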

To counter all these issues, we created a dictionary look-up in an external text file whose keys were the field names we are used to seeing, e.g., Total Reach, and whose values were a list containing the name as it appears in the API and a pointer if it was one of the “sub-level” metrics. This gave us maximum flexibility to deal with all outcomes. It also meant that to add new metrics / dimensions, or change the names or the order of the export, we simply had to alter this text file.
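
For illustration, the external look-up file might contain something like the following (the exact field names and layout here are invented for the example; the real file was longer):

```python
import json

# fields.json -- hypothetical contents:
# {
#     "Total Reach":  ["page_impressions_unique", null],
#     "Total Clicks": ["page_consumptions", null],
#     "Page Likes":   ["page_positive_feedback_by_type", "like"],
#     "Page Shares":  ["page_positive_feedback_by_type", "link"]
# }

with open("fields.json") as f:
    FIELD_LOOKUP = json.load(f)

# The keys give the display names (and their order); the first list element
# is the API name and the second, if not null, points into the sub-level.
api_metrics = [api_name for api_name, _ in FIELD_LOOKUP.values()]
```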

With the exception of the work untangling the Facebook Post Insights data, this was probably our most important insight, for it enabled us to create the relevant DataFrame with all the fields in the correct order and with the correct names with a minimum of effort.

Iterate over dictionary / .json file to download all required data and turn into DataFrame

As we pointed out above, data is returned from the API reporter in a dictionary / .json format. We used the Python pandas library to read in this data, although other software would suffice.

For each of the various reports we wished to create, we wrote similar code that followed the same general procedure.

First, we dynamically built the metrics and dimensions required for the report from the relevant external dictionary (see above). We then passed these, plus the start and end dates, into the reporting module. For the problematic Facebook metrics, we split them out of the sub-dictionary so that all the data was “flattened”. We then read the returned data into a pandas DataFrame.
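
A simplified version of the flattening and DataFrame-building step, written against the illustrative look-up file above, might look like this:

```python
import pandas as pd


def flatten_record(record, field_lookup):
    """Pull the required metrics out of one day's dictionary, descending
    into the sub-level where the look-up provides a pointer."""
    flat = {}
    for api_name, sub_key in field_lookup.values():
        if sub_key is None:
            flat[api_name] = record.get(api_name)
        else:
            flat[f"{api_name}_{sub_key}"] = record.get(api_name, {}).get(sub_key)
    return flat


# `records` stands in for the list of per-day dictionaries returned by the
# reporting module (here, just the illustrative record shown earlier).
records = [facebook_day]
df = pd.DataFrame([flatten_record(r, FIELD_LOOKUP) for r in records])
```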

This sounds simple, but it only appears so because we were able to construct the relevant metrics outside of the call to the API. Even with that caveat, it was vitally important that we exported the dictionary as a .json file and inspected it manually at each stage of the process. It is only by such manual inspection that quirks like the Facebook sub-dictionary become apparent. (Indeed, in the Facebook Ads API there are sub-sub-dictionaries!)

Clean DataFrame and reformat, either to export or upload to database

Once we had constructed the above DataFrame, our program changed the column names to the keys from the look-up dictionary, e.g., so we had a header of Total Reach instead of page_impressions_unique. This gave us our base DataFrame. If there were any calculated fields, e.g., Percentage Paid Reach, we then computed them from this base DataFrame. Finally, we re-ordered the columns according to the order in the external look-up dictionary.
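
Continuing the same illustrative example, the renaming, calculated-field and re-ordering steps reduce to a few lines of pandas (Percentage Paid Reach here assumes hypothetical Paid Reach and Total Reach columns in the look-up):

```python
# Rename the (flattened) API names back to the display names in the look-up.
rename_map = {
    api if sub is None else f"{api}_{sub}": display
    for display, (api, sub) in FIELD_LOOKUP.items()
}
df = df.rename(columns=rename_map)

# Hypothetical calculated field: only an illustration, assuming the look-up
# also contained "Paid Reach" and "Total Reach" columns.
if {"Paid Reach", "Total Reach"}.issubset(df.columns):
    df["Percentage Paid Reach"] = 100 * df["Paid Reach"] / df["Total Reach"]

# Re-order columns to match the order of the external look-up file,
# appending any calculated fields at the end.
ordered = [name for name in FIELD_LOOKUP if name in df.columns]
extras = [col for col in df.columns if col not in ordered]
df = df[ordered + extras]
```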

Note again that the key was the use of the external look-up dictionary, for it gave us the actual names we wished to display, their order, and whether a field was a raw metric or a calculated one.

We now had a clean data-set to either export to .csv, say, or upload to a database, containing all the metrics, dimensions and other relevant information required. To upload to the database, we used pandas with SQLAlchemy, although other modules would suffice. Finally, we put it on a simple scheduler to run all the relevant downloads and uploads on a daily basis.
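
The upload itself is essentially a one-liner with pandas and SQLAlchemy; the connection string and table name below are placeholders, and the scheduling can be as simple as a daily cron job:

```python
from sqlalchemy import create_engine

# Placeholder connection string: any SQLAlchemy-supported database works
# (e.g. "sqlite:///analytics.db" is handy for local testing).
engine = create_engine("postgresql://user:password@host:5432/analytics")

# Append the cleaned rows to the relevant table.
df.to_sql("facebook_page_insights", engine, if_exists="append", index=False)
```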

The system operates smoothly with no user intervention required.
