Custom Dimensions in Google Analytics - An Approach in R

by Stephon Beaufort Wednesday, December 19, 2018

Businesses leverage analytics tools like Google Analytics to get important insights about their websites and make more informed decisions. Sometimes, though, the stock reports aren't good enough, and outside information needs to be brought in to add value to the data that GA collects.

Google Analytics Custom Dimensions

Google Analytics has custom dimensions for exactly this purpose. As a refresher, dimensions are attributes of your web data. They are typically non-numeric; common ones include 'landing page', 'source', and 'browser'. Custom dimensions, then, are dimensions that you add to Google Analytics yourself and view alongside the stock data in the UI.

It often happens that a business' objectives are represented in their site structure. At Alloy, we wanted to know how each author's blogs were performing. This is not a unique concern; blogging is a common lead-generation tactic, and keeping authors aware of how their content is performing in the wild can be great for boosting morale.

Equipped with a goal in mind, it's easy to be eager about quickly adding your custom data to GA. However:

[Image: Google Analytics screenshot]

If you don't have easy access to a developer with the knowledge to implement the tracking changes you want, setting up outside data to be integrated into Google Analytics could potentially become expensive and time-consuming.

Using Google Analytics with R

Enter R. If you're not using R to analyze Google Analytics data, there are quite a few tutorials out there for getting started. It's particularly worth querying GA in R if you find yourself frequently exporting reports. There are a number of useful tools for web scraping in R, which is just what we needed to grab author information from our blog pages.

library(tidyverse)
library(rvest)
library(googleAnalyticsR)
ga_auth()

We start by querying the Google Analytics data we want. Here, the only metric I want to see is sessions for each blog post. Luckily, our URL structure is such that every blog post has "blog/" in its URL.

# Built-in organic traffic segment
organic <- segment_ga4("organic", segment_id = "gaid::-5")
# Keep only landing pages whose path matches "blog/"
blog_filter <- dim_filter("landingPagePath", "REGEXP", "blog/") %>% list() %>% filter_clause_ga4()
pages <- google_analytics(ga_id, # your view ID here
                 date_range = range, # your date range
                 dimensions = c("landingPagePath"),
                 metrics = c("sessions"),
                 dim_filters = blog_filter,
                 segments = organic)
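
Before running the query, it can help to check locally what the pattern will and won't pick up. The following is a minimal sketch in base R, assuming the REGEXP operator does a partial match (which the query above relies on). The paths are made up for illustration.

```r
# Hypothetical landing page paths, for illustration only
paths <- c("/blog/custom-dimensions-in-r",
           "/services",
           "/blog/",
           "/about/blog-team")

# Partial regex match on "blog/", mirroring the pattern passed to dim_filter()
is_blog <- grepl("blog/", paths)

# Paths that would be kept by the filter
blog_paths <- paths[is_blog]
print(blog_paths)
```

Note that the blog index page itself ("/blog/") also matches. If your site has other paths containing "blog/", a tighter pattern such as "^/blog/" may be a better fit.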

Google Analytics will return a list of URL paths. We append them to the host name to build full URLs, so that we can fetch each page and pull the author name from it.

pages <- pages %>% mutate(page = paste0("https://www.alloymagnetic.com",landingPagePath))

To retrieve the author names, the rvest package comes in handy. The author byline can be identified by its CSS selector.

pages <- pages %>% mutate(author = map(page,~try(read_html(.) %>% 
                                           html_node("span.author") %>% 
                                           html_text())) %>% unlist())
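
To see what the selector step does without hitting a live site, here is a small sketch that runs the same rvest calls against an inline HTML string. The markup is hypothetical; your blog template will likely use a different element or class, so inspect a real page to find the right selector.

```r
library(rvest)

# Hypothetical markup standing in for a blog page; the real template may differ
html <- read_html('<article><h1>Post title</h1><span class="author"> by Jane Doe</span></article>')

# Same selector and extraction steps as in the pipeline above
author <- html %>%
  html_node("span.author") %>%
  html_text()

print(author)
```

The text comes back exactly as it appears in the page, including any leading " by ", which is why the final summary strips that prefix.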

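It's likely that you'll get status errors for some of the pages, particularly if some of your blog posts have been unpublished in your date range. Because each request is wrapped in try(), those pages come back with the error message in the author column instead of stopping the whole run.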
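
If you would rather have a clean missing value than an error message for failed pages, one option is to wrap the scraper in tryCatch(). This is a sketch of that alternative, not what the code above does. The names safe_author and fake_fetch are hypothetical, and fake_fetch is a stand-in for the real scraping function so the example runs offline.

```r
# Wrap any scraping function so a failed request yields NA instead of an error
safe_author <- function(url, fetch) {
  tryCatch(fetch(url), error = function(e) NA_character_)
}

# Stand-in for the real scraper (hypothetical): errors on "unpublished" URLs
fake_fetch <- function(url) {
  if (grepl("unpublished", url)) stop("HTTP error 404.")
  " by Jane Doe"
}

urls <- c("https://example.com/blog/live-post",
          "https://example.com/blog/unpublished-post")

# One author string per URL; failures become NA
authors <- vapply(urls, safe_author, character(1),
                  fetch = fake_fetch, USE.NAMES = FALSE)
print(authors)
```

With this approach, failed pages can be dropped with filter(!is.na(author)) rather than by searching the author column for "404".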

Finally, we group the landing page traffic by author to get our list of writers. With a few lines of code, we were able to bring in author name as our own 'custom dimension' and make better use of our Google Analytics data.

pages %>% group_by(author) %>% summarise(sessions = sum(sessions),
                                         `# of blogs` = n()) %>%
  filter(!str_detect(author, "404")) %>%
  mutate(author = str_remove(author, " by ")) %>%
  arrange(desc(sessions))
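
To make the result concrete, here is the same kind of summary run on a few made-up rows. The data frame pages_demo and its values are invented for illustration, and the last row imitates the error text that a failed request might leave in the author column. This sketch cleans the author column before grouping, a small variation on the code above, so that each author is grouped under the cleaned name.

```r
library(dplyr)
library(stringr)

# Made-up rows standing in for the scraped `pages` data frame
pages_demo <- tibble(
  landingPagePath = c("/blog/a", "/blog/b", "/blog/c", "/blog/d"),
  sessions = c(120, 80, 45, 10),
  author = c(" by Jane Doe", " by John Roe", " by Jane Doe",
             "Error in open.connection(x, \"rb\") : HTTP error 404.\n")
)

by_author <- pages_demo %>%
  filter(!str_detect(author, "404")) %>%        # drop pages that failed to load
  mutate(author = str_remove(author, " by ")) %>% # strip the byline prefix
  group_by(author) %>%
  summarise(sessions = sum(sessions), posts = n()) %>%
  arrange(desc(sessions))

print(by_author)
```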