Keep the middle words in a phrase separated by dashes in R using gsub


I have the following:

x <- c("Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

I want to keep "Paulista", "Mineiro", "Carioca"

I'm trying gsub like

y <- gsub("\\$-*","",x)

but it is not working.

2 Answers

8

We can do this with a single call to sub:

x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ")

sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", x)
[1] "Paulista" "Mineiro"  "Carioca"

The idea is to capture whatever occurs between the two dashes in each string.

^.*-\\s+   from the start, consume everything up to and including the first dash and the whitespace after it
(.*?)      then lazily match and capture everything up to the whitespace before the second dash
\\s+-.*$   consume that whitespace, the second dash, and everything after it
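If you would rather get NA than the untouched input when a string does not match, the same pattern can be used with regexec and regmatches from base R. This is only a sketch: the names m and middle are mine, and perl = TRUE is my choice so that the lazy .*? follows PCRE rules.

```r
x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ")

# regexec() records the position of the full match and of each capture group
m <- regexec("^.*-\\s+(.*?)\\s+-.*$", x, perl = TRUE)

# regmatches() returns one character vector per input string:
# element 1 is the whole match, element 2 is the first capture group.
# A string that does not match yields character(0), mapped to NA here.
middle <- vapply(regmatches(x, m), function(parts) {
  if (length(parts) >= 2) parts[2] else NA_character_
}, character(1))

print(unname(middle))
```

For the three strings above this should give the same "Paulista", "Mineiro", "Carioca" as the sub call.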

1

Two quick methods:

x <- c(" Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

The first is the standard sub solution; strings with fewer than two hyphens do not match the pattern, so sub returns them unmodified.

trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x))
# [1] "Paulista" "Mineiro"  "Carioca" 
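A quick way to see the no-match behaviour, as a sketch; the vector z and its second element are made up for illustration:

```r
# Hypothetical input: the second element contains no hyphens
z <- c("Sao Paulo - Paulista - SP", "No hyphens here")

# sub() leaves a string untouched when the pattern does not match
out <- trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", z))
print(out)
# [1] "Paulista"        "No hyphens here"
```

So check for unmatched inputs yourself if a silently unchanged string would be a problem downstream.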

The next one splits each string on "-" into a list and then takes the second element of each list entry. If a string contains no hyphen, this fails with a "subscript out of bounds" error.

trimws(sapply(strsplit(x, "-"), `[[`, 2))
# [1] "Paulista" "Mineiro"  "Carioca" 
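If the error is a concern, one possible guard is to check the number of pieces before indexing. This is a sketch under my own assumptions: z is made-up input, and returning NA for short strings is just one reasonable choice.

```r
# Hypothetical input: the second element contains no hyphens
z <- c("Sao Paulo - Paulista - SP", "No hyphens here")

# fixed = TRUE treats "-" as a literal string rather than a regex
parts <- strsplit(z, "-", fixed = TRUE)

# Take the second piece when it exists, otherwise return NA
second <- vapply(parts, function(p) {
  if (length(p) >= 2) trimws(p[2]) else NA_character_
}, character(1))

print(second)
# [1] "Paulista" NA
```

Note that this only checks for at least two pieces, so a string with a single hyphen would return whatever follows that hyphen.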

Both solutions use trimws to strip the leading and trailing whitespace from the result.
