Chapter 9 of 12:
Well hello again!
In this week’s blog post, I am going to take a slightly different approach to what I learned in my CXL Digital Analytics mini-degree program. This topic was mentioned and lightly touched on, rather than actually explored, in Chris Mercer’s Google Analytics and Google Tag Manager classes, and I think its importance to a digital analyst is vital enough that it should have its very own blog post. This topic, or language actually, is the language of Regular Expression (more commonly known as “RegEx”).
What is Regular Expression? Well, RegEx is a language that describes or recognizes patterns in strings of text. It is commonly used for pattern matching and for search-and-replace operations on strings. Let me give you an example of when RegEx might come in handy in your digital analytics endeavors. Say you are asked by a stakeholder to gather metrics on several product web pages that have very little in common aside from a “/products/” page path.
There are a handful of ways that you could fulfill this request. You could look up each page individually and then manually stitch the results together (this is a VERY time-consuming and tedious solution that I wouldn’t recommend even to my worst enemy). You could filter on the common /products/ page path, but that would return every products page, and you would still have to filter down to the specific list of product pages requested and then stitch that together. This method is less time-consuming and a little less tedious than the first, but still cumbersome. What I would recommend to a newbie learning Regular Expression would be to use these three specific characters:
- The caret “^”
- The bar “|”
- The dollar sign “$”
You’re probably wondering what’s so special about a caret, bar, and dollar sign, right? Well, in RegEx speak the caret “^” anchors the pattern to the start of the text: whatever follows the ^ must appear at the very beginning. The bar “|” character is handy in that it provides a bit of logic to what you’re looking for, and in Regular Expression translates to “or”. Lastly, the dollar sign “$” anchors the pattern to the end of the text: whatever comes before the $ must appear at the very end. Clear as mud, still? Allow me to demonstrate what I mean. Imagine that this is the list of product pages that the stakeholder requested:
The simplest way to put this list together using the “^”, “|”, and “$” would be something like this:
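As a minimal sketch in Python, here is how those three characters could work together. The URLs below are hypothetical, invented purely for illustration, and the variable names are my own assumptions.

```python
import re

# Hypothetical URLs, invented for this sketch (not a real stakeholder list).
requested_pages = [
    "www.domain.com/products/product-a",
    "www.domain.com/products/abc-product",
    "www.domain.com/products/xyz-product",
]
other_pages = [
    "www.domain.com/products/some-other-item",    # a products page nobody asked for
    "www.domain.com/products/product-a/reviews",  # extra path after the slug
    "blog.domain.com/products/product-a",         # different hostname
]

# One alternative per page: ^ pins the start, $ pins the end, | means "or".
# The dots are left unescaped here, so each one loosely matches any character.
pattern = re.compile(
    r"^www.domain.com/products/product-a$"
    r"|^www.domain.com/products/abc-product$"
    r"|^www.domain.com/products/xyz-product$"
)

# Keep only the URLs that match one of the three alternatives.
matched = [url for url in requested_pages + other_pages if pattern.search(url)]
print(matched)
```

Only the three requested pages should survive the filter, because every alternative has to match the whole URL from start to end.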
Not super elegant, but those three characters saved you about 30 to 60 minutes of work in gathering individual pages and stitching them all together to present to your stakeholder. Even if those three RegEx characters (“^”, “|”, “$”) were all that you ever cared to learn, they will make your job significantly easier and less time-consuming than if you didn’t know them.
Now, if you’re interested in taking your Regular Expression skills to the next level, allow me to introduce the next four characters that, when combined with what you learned above, will make you look like a RegEx rockstar! Enter…
- The period “.”
- The asterisk “*”
- The parentheses “()”
- The backslash “\”
The period “.” is a metacharacter that matches any single character: a letter, a digit, a space, or a symbol (in most RegEx flavors, anything except a line break). The asterisk “*” is what’s called a RegEx quantifier, and it indicates that the character or group just before it occurs zero or more times. The parentheses operate how you might think they would, in that they group parts of a pattern together, similar to how they operate in mathematical expressions. But say that you have a RegEx character that you need to include in the pattern as itself (e.g., the “.” that occurs twice in the hostname of each URL). How do you include that in the string pattern sans its RegEx functionality? This is where the backslash “\” comes into play. The backslash in RegEx is used to “escape” the character that follows it, negating that character’s special pattern recognition powers and making it just another string character. Using these new characters along with the caret, bar, and dollar sign, let’s see how we can further distill the patterns in the six pages above:
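To see each of the four characters in isolation, here is a small Python sketch. The helper name `is_match` and the sample strings are my own assumptions for illustration; only Python's standard `re` module is used.

```python
import re

def is_match(pattern: str, text: str) -> bool:
    """Return True when the pattern is found in the text."""
    return re.search(pattern, text) is not None

# "." stands in for exactly one character of any kind.
dot_one_char = is_match(r"^product-.$", "product-a")       # one character after the hyphen
dot_three_chars = is_match(r"^product-.$", "product-def")  # three characters: too many

# "*" repeats whatever comes right before it zero or more times.
star_many = is_match(r"^product-.*$", "product-def")
star_zero = is_match(r"^product-.*$", "product-")

# "()" limits the reach of "|" to the alternatives inside the group.
group_hit = is_match(r"^product-(a|def)$", "product-def")
group_miss = is_match(r"^product-(a|def)$", "product-b")

# "\" turns the special character that follows it into a literal one.
loose_dot = is_match(r"^www.domain.com$", "wwwXdomainYcom")
strict_dot = is_match(r"^www\.domain\.com$", "wwwXdomainYcom")

print(dot_one_char, dot_three_chars, star_many, star_zero)
print(group_hit, group_miss, loose_dot, strict_dot)
```

The last pair is the one worth remembering: without the backslash, the period happily accepts the “X” and “Y”, while the escaped version only accepts real periods.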
Do you see what I did there? It’s a lot, so let me break it down. The “www.domain.com” hostname begins all six URLs. You only need to say once that this is where the pattern starts and escape the “.” that occurs twice within the hostname, resulting in: ^www\.domain\.com. Next we have the common page path “/products/” which, like the hostname, occurs six times. There’s no need to repeat the sextuplet; RegEx allows you to state it just once.
Here comes the more complex part: the product names in the next segment of the page path. Two of the URLs have a product name starting with the word “product” and the other four have “product” at the end of the product name. Using parentheses you can create two product name groups: (product-a and product-def) and (abc-product, xyz-product, qrs-product, and lmn-product). Do you see the next pattern emerging? In the first group, “product” occurs twice and the “a” and “def” are the differing parts. Using parentheses and the “|”, you can create the nice and tidy (product-(a|def)) group; the hyphen isn’t a special character here, so it doesn’t need a backslash. The next pattern you’re seeing is that “product” occurs at the end of the second group four times and the differing parts occur at the beginning. Let’s use those parentheses and bars to tidy that pattern up like so: ((abc|xyz|qrs|lmn)-product).
Now, to deal with the trial, demo, and contact-us slugs…you guessed it, parentheses and bars to the rescue: (trial|demo|contact-us). Then slap a “$” on that bad boy and call it a day!
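Putting the pieces from the breakdown together might look like the Python sketch below. The exact layout of the six URLs is an assumption on my part (hostname, then “/products/”, then a product name, then one trailing slug), and the sample URLs are hypothetical stand-ins.

```python
import re

# Assumed layout: hostname, "/products/", a product name, then one trailing slug.
# These six URLs are hypothetical stand-ins.
pages = [
    "www.domain.com/products/product-a/trial",
    "www.domain.com/products/product-def/demo",
    "www.domain.com/products/abc-product/contact-us",
    "www.domain.com/products/xyz-product/trial",
    "www.domain.com/products/qrs-product/demo",
    "www.domain.com/products/lmn-product/contact-us",
]

compact = re.compile(
    r"^www\.domain\.com"              # escaped hostname, stated once
    r"/products/"                     # shared page path, stated once
    r"((product-(a|def))"             # names that start with "product"
    r"|((abc|xyz|qrs|lmn)-product))"  # names that end with "product"
    r"/(trial|demo|contact-us)$"      # final slug, then the end of the URL
)

all_match = all(compact.search(url) for url in pages)
rejects_unknown = compact.search("www.domain.com/products/product-b/trial") is None
# The compact form also accepts name/slug pairings that are not in the list above.
accepts_other_combo = compact.search("www.domain.com/products/product-a/demo") is not None

print(all_match, rejects_unknown, accepts_other_combo)
```

One design trade-off to keep in mind: because the names and slugs are grouped separately, the compact pattern accepts every name and slug combination, not just the six listed. If that is too broad for your report, the longer one-alternative-per-page version is the stricter choice.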
There’s a sort of neat and tidy elegance to RegEx that just makes sense once you get the hang of the functionality of each special Regular Expression character. The coolest thing about RegEx is the diversity in its utility. You can use it to identify phone number patterns, SSNs, email addresses, and other PII that you will need to identify and redact for data privacy compliance. You can use it to simplify triggers in Google Tag Manager, or create filters in Google Analytics. You can use it to identify and modify strings in Excel or Google Sheets. You can use it to customize your dimensions and data visualizations in Google Data Studio. The uses of this language are boundless…I promise, if you’re willing to take the time to learn it and practice it, you will never regret it!
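As one example of the redaction use case, here is a rough Python sketch that masks anything shaped like a US Social Security number. It assumes the common 3-2-4 digit layout with hyphens, and it leans on two bits of syntax beyond the seven characters covered above: “[0-9]” for a single digit and “{n}” for an exact repeat count. The function name `redact_ssns` and the sample sentence are made up for illustration.

```python
import re

# Assumed layout for a US SSN: three digits, two digits, four digits, with hyphens.
# [0-9] matches a single digit; {3} repeats the preceding element exactly three times.
ssn_pattern = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

def redact_ssns(text: str) -> str:
    """Replace anything shaped like an SSN with a placeholder."""
    return ssn_pattern.sub("[REDACTED]", text)

print(redact_ssns("Customer note: SSN 123-45-6789 on file."))
```

A pattern this simple can over-match (for instance, inside a longer run of digits and hyphens), so treat it as a starting point and test it against your own data before relying on it for compliance work.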
Here are a few RegEx references for you to get started: