Pattern Matching in Strings Using Regexes

Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern that you give to a regex processor along with some source data. The processor then parses the source data using that pattern and returns chunks of text for further manipulation.

There are three main reasons you would want to do this:

  • To check whether a pattern exists within some source data
  • To get all the instances of a complex pattern from some source data
  • To clean your source data using a pattern, generally through string splitting

Regexes are a foundational technique for data cleaning in data science, and a solid understanding of them will help you manipulate text data quickly and efficiently for further data science applications.

Let’s see how regexes work.

First we’ll import the ‘re’ module, which is Python’s built-in library for regular expressions. There are several main processing functions in ‘re’. match() checks for a match at the beginning of the string, while search() checks for a match anywhere in the string. Both return a match object if the pattern is found and None otherwise, so the result can be tested like a boolean.
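Here is a minimal sketch of the difference between the two functions. The sample sentence and the variable names are made up for illustration.

```python
import re

# Sample sentence chosen for illustration only.
text = "This is a good day."

# search() scans the whole string and returns a Match object or None,
# so the result can be used directly in an if statement.
if re.search("good", text):
    print("Wonderful!")
else:
    print("Alas :(")

# match() only succeeds when the pattern matches at the very start.
found_anywhere = re.search("good", text) is not None
good_at_start = re.match("good", text) is not None
this_at_start = re.match("This", text) is not None
print(found_anywhere, good_at_start, this_at_start)
```

The word “good” is found by search() but not by match(), because it does not sit at the beginning of the string.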

The split() function uses a pattern to split the given string and returns a list of substrings, while findall() looks for a pattern and returns all of its occurrences as a list.
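As a quick sketch, here are both functions applied to the same text. The sentence about “Amy” is an invented example, not real data.

```python
import re

# Invented example text.
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# split() cuts the string at every occurrence of the pattern and
# returns the pieces in between (the first piece is empty because
# the text starts with "Amy").
pieces = re.split("Amy", text)
print(pieces)

# findall() returns every non-overlapping match as a list of strings.
mentions = re.findall("Amy", text)
print(len(mentions))
```

Counting the length of the findall() result is a simple way to see how many times a pattern occurs.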

Now let’s see some more complex examples. The regex specification defines a markup language to describe patterns in text. The caret character ‘^’ means start and the dollar sign ‘$’ means end. If we put ‘^’ before a string, it means that the text the regex processor retrieves must start with the string we specify. Similarly, when we put ‘$’ after the string, it means that the text the regex processor retrieves must end with the string we specify.
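A small sketch of the two anchors in action, again using an invented sentence. Note that the period is escaped in the second pattern, since an unescaped ‘.’ matches any character.

```python
import re

# Invented example text.
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# '^' anchors the pattern to the start of the string.
starts_with_amy = re.search(r"^Amy", text)

# '$' anchors the pattern to the end of the string.
ends_with_successful = re.search(r"successful\.$", text)

# "Our" appears in the text, but not at the start, so this is None.
starts_with_our = re.search(r"^Our", text)

print(starts_with_amy is not None)
print(ends_with_successful is not None)
print(starts_with_our is None)
```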

When it finds a match, re.search() returns an re.Match object. This object is truthy, so it can be used as a boolean, and its printed representation also tells us what text was matched and the location of the match as the span.
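To make this concrete, here is a short sketch that inspects a match object. The sentence is made up, and the exact printed form of the object can vary between Python versions, so the code reads the individual fields instead of relying on it.

```python
import re

# Made-up sentence for illustration.
text = "Amy gets good grades."

m = re.search("good", text)

# Printing the object shows the span and the matched text.
print(m)

# The same information is available through methods on the object.
matched_text = m.group()   # the text that matched
location = m.span()        # (start, end) indexes of the match
print(bool(m), matched_text, location)
```

The span can be used to slice the original string, so text[9:13] gives back the matched word here.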

Let’s look at character classes. Take a string of a single learner’s grades over a semester, i.e. grades = “ACAAAABCBCBAA”. If we want to count the number of A’s and B’s in the string, we can use the set operator “[]”: the pattern “[AB]” matches any single character that is either A or B.
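A minimal sketch of this count, using the grades string from the text (the variable names are my own):

```python
import re

grades = "ACAAAABCBCBAA"

# A plain pattern matches one specific character.
b_grades = re.findall("B", grades)

# A set in square brackets matches any one of the listed characters.
a_or_b_grades = re.findall("[AB]", grades)

print(len(b_grades))        # number of B's
print(len(a_or_b_grades))   # number of A's and B's together
```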

Suppose we want all instances where this student received an A followed by a B or a C. We can write this using the set operator “[]”, as in “A[BC]”, or by using the pipe operator “|”, which means OR, as in “AB|AC”.
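Both spellings can be tried side by side. This is only a sketch on the same grades string, with variable names chosen for illustration:

```python
import re

grades = "ACAAAABCBCBAA"

# An A followed by one character from the set {B, C}.
with_set = re.findall("A[BC]", grades)

# The same idea written as two full alternatives joined by OR.
with_pipe = re.findall("AB|AC", grades)

print(with_set)
print(with_pipe)
```

On this string the two patterns return the same list, so the choice between them is mostly a matter of readability.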

Now let’s move on to quantifiers. A quantifier specifies the number of times a pattern must repeat in order to match. The most basic quantifier is expressed as e{m,n}, where ‘e’ is the expression or character we want to match, ‘m’ is the minimum number of times it must be matched, and ‘n’ is the maximum number of times it can be matched. Note that there must be no space inside the braces: Python treats e{m, n} as literal text rather than as a quantifier.
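Here is a small sketch of the brace syntax on a grades string. The string and variable names are just for illustration, and the last pattern deliberately includes a space to show the pitfall.

```python
import re

grades = "ACAAAABCBCBAA"

# Between 2 and 10 A's in a row; the match is greedy, so each run
# is returned as long as possible.
runs = re.findall("A{2,10}", grades)

# A single number means "exactly this many times".
pairs = re.findall("A{2}", grades)

# With a space inside the braces, Python's re module no longer sees
# a quantifier and looks for the literal text "A{2, 10}" instead.
with_space = re.findall("A{2, 10}", grades)

print(runs)
print(pairs)
print(with_space)
```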

Let’s use the grades above as an example. How many times has this student been on a streak of back-to-back A’s? Or what if we want to find a decreasing trend in the student’s grades?
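One possible way to answer both questions is sketched below. It assumes a “streak” means two or more A’s in a row and a “decreasing trend” means one or more A’s, then B’s, then C’s. Those definitions and the upper bound of 10 are my own choices for illustration.

```python
import re

grades = "ACAAAABCBCBAA"

# Two or more A's in a row count as one streak.
streaks = re.findall("A{2,}", grades)
print(len(streaks))

# A run of A's, then a run of B's, then a run of C's.
decreasing = re.findall("A{1,10}B{1,10}C{1,10}", grades)
print(decreasing)
```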

There are other quantifiers that are used as shorthand: an asterisk ‘*’ matches zero or more times, a question mark ‘?’ matches zero or one time, and a plus sign ‘+’ matches one or more times.
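A short sketch of the shorthand quantifiers in Python’s ‘re’ module, where ‘*’ means zero or more, ‘?’ means zero or one, and ‘+’ means one or more. The grades string and variable names are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# '*': zero or more A's, followed by a B.
star = re.findall("A*B", grades)

# '?': a B, optionally followed by a single C.
question = re.findall("BC?", grades)

# '+': one or more A's in a row.
plus = re.findall("A+", grades)

print(star)
print(question)
print(plus)
```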

This is just an overview of regular expressions, and really we’ve just scratched the surface of what we can do with regexes. They’re incredibly powerful. If you want to learn more about them, you can refer to the Python documentation for the ‘re’ module.

Thank you for reading!

If you find this blog useful, give it a clap : )

If you're trying, you're already winning!🥂
