Pattern Matching

Using Regexes For Pattern Matching In Strings

Aman Jamshed
TheLeanProgrammer

--

Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern that you give to a regex processor along with some source data. The processor then parses the source data using that pattern and returns chunks of text for further manipulation.

There are three main reasons you would want to do this:

  • To check whether a pattern exists within some source data
  • To get all the instances of a complex pattern from some source data
  • To clean your source data using patterns, generally through string splitting

Regexes are a foundational technique for data cleaning in data science, and a solid understanding of them will help you manipulate text data quickly and efficiently for further data science applications.

Let’s see how regex works.

First we’ll import the ‘re’ module, which is where Python keeps its regular expression support. It has several main processing functions. match() checks for a match at the beginning of the string and returns a Match object if it finds one, or None otherwise. Similarly, search() checks for a match anywhere in the string and returns a Match object or None. Because a Match object is truthy and None is falsy, either result can be used directly in an if statement.

The split() function uses a pattern to split the given string and returns a list of substrings, while findall() looks for a pattern and returns all of its non-overlapping occurrences as a list.
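As a rough illustration of these four functions, here is a minimal sketch. The sample sentence and the variable names are made up for this example and are not from any particular dataset.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# match() only looks at the start of the string; it returns a Match object or None
starts_with_amy = re.match("Amy", text)

# search() scans the whole string and stops at the first occurrence
has_good_grades = re.search("good grades", text)

# split() cuts the string wherever the pattern matches
pieces = re.split("Amy", text)

# findall() returns every non-overlapping match as a list of strings
all_amys = re.findall("Amy", text)

print(bool(starts_with_amy))   # True
print(bool(has_good_grades))   # True
print(pieces)
print(len(all_amys))           # 3
```

Note that split() keeps an empty string at the front of the list here, because the text begins with the pattern.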

Now let’s see some more complex examples. Regex syntax is a small pattern language for describing text. The caret character ‘^’ means start and the dollar sign ‘$’ means end. If we put ‘^’ before a string, it means that the text the regex processor retrieves must start with the string we specify. Similarly, when we put ‘$’ after the string, it means that the text the regex processor retrieves must end with the string we specify.
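The two anchors can be sketched like this. The sample sentence is again a made-up example, and the final period is escaped because an unescaped ‘.’ matches any character.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# ^ anchors the pattern to the start of the string
starts_with_amy = re.search(r"^Amy", text)

# $ anchors the pattern to the end of the string
ends_with_successful = re.search(r"successful\.$", text)

# "Our" occurs in the text, but not at the start, so this is None
starts_with_our = re.search(r"^Our", text)

print(bool(starts_with_amy))        # True
print(bool(ends_with_successful))   # True
print(starts_with_our)              # None
```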

When there is a match, re.search() returns an re.Match object. It always evaluates to True in a boolean context, and its printed representation shows the text that was matched and its location in the string as the span.
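To make the Match object a little more concrete, here is a small sketch that inspects one. The short sample sentence is an assumption made for this example; span() and group() are standard methods of the Match object.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "Amy gets good grades."

m = re.search("good", text)

if m:  # a Match object is truthy; a failed search would give None
    print(m)          # <re.Match object; span=(9, 13), match='good'>
    print(m.span())   # (9, 13)
    print(m.group())  # good
```

The span is a pair of start and end indices, so text[9:13] gives back the matched text.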

Next, let’s look at character classes. Take a string of a single learner’s grades over a semester, i.e. grades = “ACAAAABCBCBAA”. If we want to count the number of A’s and B’s in the string, we can use the set operator ‘[]’: the pattern [AB] matches any single character that is either an A or a B.
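Counting the A’s and B’s could look something like the sketch below. It uses the grades string from the text; the variable names are just illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# [AB] matches one character that is either A or B
a_or_b = re.findall("[AB]", grades)

# A range inside the brackets is equivalent here: every character from A to B
a_to_b = re.findall("[A-B]", grades)

print(len(a_or_b))        # 10
print(a_or_b == a_to_b)   # True
```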

Suppose we want all instances where this student receives an A followed by a B or a C. We can write this with the set operator ‘[]’, as A[BC], or with the pipe operator ‘|’, which means OR, as AB|AC.
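Both spellings can be tried side by side on the grades string from above. This is only a sketch, but it shows that the two patterns find the same matches here.

```python
import re

grades = "ACAAAABCBCBAA"

# Set operator: an A followed by one character that is B or C
with_set = re.findall("A[BC]", grades)

# Pipe operator: either the sequence AB or the sequence AC
with_pipe = re.findall("AB|AC", grades)

print(with_set)    # ['AC', 'AB']
print(with_pipe)   # ['AC', 'AB']
```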

Now let’s move on to quantifiers. A quantifier specifies how many times a pattern must repeat in order to match. The most basic quantifier is expressed as e{m,n}, where ‘e’ is the expression or character we want to match, ‘m’ is the minimum number of times it must be matched, and ‘n’ is the maximum number of times it can be matched. In Python the braces must be written without spaces; a pattern such as A{2, 10} is treated as literal text rather than as a quantifier.

Let’s use the grades above as an example. How many times has this student been on a streak of back-to-back A’s? A pattern such as A{2,10} answers that. Quantifiers can also be chained to look for a decreasing trend in the student’s grades, for example one or more A’s followed by one or more B’s and then one or more C’s.

There are other quantifiers that serve as shorthand: an asterisk ‘*’ matches 0 or more times, a plus sign ‘+’ matches 1 or more times, and a question mark ‘?’ matches 0 or 1 times.
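Here is one way these two questions might be answered on the grades string. The upper bound of 10 is an arbitrary choice for this sketch; any number at least as large as the longest possible streak would do.

```python
import re

grades = "ACAAAABCBCBAA"

# Streaks of back-to-back A's: at least 2 and at most 10 A's in a row
streaks = re.findall("A{2,10}", grades)

# A decreasing trend: one or more A's, then one or more B's, then one or more C's
decreasing = re.findall("A{1,10}B{1,10}C{1,10}", grades)

print(streaks)       # ['AAAA', 'AA']
print(len(streaks))  # 2
print(decreasing)    # ['AAAABC']
```

Quantifiers are greedy by default, which is why the first streak comes back as one run of four A’s rather than two runs of two.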
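The shorthand quantifiers can be compared on a small test string. The string below is a made-up example; in standard regex syntax ‘*’ means zero or more, ‘+’ means one or more, and ‘?’ means zero or one.

```python
import re

# Hypothetical sample: an A followed by zero, one, and two B's
sample = "A AB ABB"

zero_or_more = re.findall("AB*", sample)  # A followed by any number of B's
one_or_more = re.findall("AB+", sample)   # A followed by at least one B
zero_or_one = re.findall("AB?", sample)   # A followed by at most one B

print(zero_or_more)  # ['A', 'AB', 'ABB']
print(one_or_more)   # ['AB', 'ABB']
print(zero_or_one)   # ['A', 'AB', 'AB']
```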

This is just an overview of regular expressions, and really we’ve just scratched the surface of what we can do with regexes. They’re incredibly powerful. If you want to learn more about them, you can refer to the Python documentation for the ‘re’ module.

Thank you for reading!

If you find this blog useful, give it a clap : )

Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer
