Regex detect codes in R


I have a set of codes I want to check in my dataframe, and if they exist I want to create a column to indicate TRUE or FALSE.

  • String beginning with OO, MM, AB, HIB, POL
  • Followed by 5 to 9 digits

For example, some of the codes I have in my dataframe: OO14562, MM156789076, AB1234674, HIB00000, POL112310

The dataframe is here:

df<-structure(list(Codes = structure(c(5L, 4L, 1L, 3L, 7L, 8L, 2L, 
6L), .Label = c("AB1234674", "AB13", "HIB00000", "MM156789076", 
"OO14562", "POL1123", "POL112310", "TY543"), class = "factor")), .Names = "Codes", row.names = c(NA, 
-8L), class = "data.frame")

According to the dataframe, the first 5 should return a TRUE, and the next three should be FALSE.

My code is here

gsub([OO|MM|AB|HIB|POL[0-9]{5-9})

But that is not taking me anywhere.


How about grepl("(OO|MM|AB|HIB|POL)\\d{5,9}", df$Codes)

That worked perfectly. What the hell was I doing omg.

I'll write up a quick answer

1 Answer


One, we need to use parentheses, not brackets, to group the letter prefixes. Brackets define a character class, which matches any single one of the characters listed inside it, so a pipe inside brackets is just another literal character rather than an "or".
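To make the bracket/parenthesis difference concrete, here is a small sketch. The test strings are made up for illustration and are not from the question's data.

```r
# Hypothetical test strings (not from the question's dataframe)
x <- c("O12345", "OO12345", "|12345")

# Brackets: a character class, i.e. ONE character out of O, M, A, B, H, I, P, L or |
in_class <- grepl("^[OO|MM|AB|HIB|POL]\\d{5}$", x)

# Parentheses: a group of whole alternatives, i.e. OO, MM, AB, HIB or POL
in_group <- grepl("^(OO|MM|AB|HIB|POL)\\d{5}$", x)

print(data.frame(x, in_class, in_group))
```

The bracket version accepts a single O or even a literal pipe followed by five digits, and rejects the real prefix OO, which is the opposite of what was intended.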

Two, we'll use grepl because it returns a logical vector. gsub is for substitution, so there's no need for it here.

Three, the quantity to match is specified in curly braces { }, with min and max separated by a comma, not a dash.

You could also use [0-9] instead of \\d (any digit), but I like \\d for brevity.
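As a quick check of the quantifier and the two digit spellings, something like the following should do. The digit strings are hypothetical and chosen only to sit on either side of the 5 and 9 limits.

```r
# Hypothetical digit strings of length 4, 5, 9 and 10
digits <- c("1234", "12345", "123456789", "1234567890")

# {5,9} means between 5 and 9 repetitions, inclusive
with_d     <- grepl("^\\d{5,9}$", digits)
with_range <- grepl("^[0-9]{5,9}$", digits)

print(with_d)                         # FALSE TRUE TRUE FALSE
print(identical(with_d, with_range))  # TRUE: both spellings agree here
```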

And for completeness, I added ^ and $ to anchor the pattern to the beginning and end of the string, so a code with extra characters before or after it doesn't match.

This gives us:

df$check <- grepl("^(OO|MM|AB|HIB|POL)\\d{5,9}$", df$Codes)


        Codes check
1     OO14562  TRUE
2 MM156789076  TRUE
3   AB1234674  TRUE
4    HIB00000  TRUE
5   POL112310  TRUE
6       TY543 FALSE
7        AB13 FALSE
8     POL1123 FALSE
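To see why the ^ and $ anchors matter, compare the pattern with and without them. This is only a sketch: the strings below are invented, each embedding an otherwise valid code in something longer.

```r
# Hypothetical strings: one clean code, three with extra characters or digits
x <- c("OO14562", "XXOO14562", "OO14562-old", "AB1234567890")

pattern <- "(OO|MM|AB|HIB|POL)\\d{5,9}"

# Without anchors, grepl only needs the pattern to occur somewhere in the string
unanchored <- grepl(pattern, x)

# With anchors, the whole string has to be exactly prefix + 5 to 9 digits
anchored <- grepl(paste0("^", pattern, "$"), x)

print(data.frame(x, unanchored, anchored))
```

Without anchors all four strings match, including the one with ten digits, because a valid code occurs somewhere inside each. With anchors only the first one does.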
