How to clean OCR'd text with Regex

893 views python
-2

I have text that has been extracted from PDFs. It looks like the following:

If employees can’t find      
the time to learn, reduce  
the friction. Manager involvement is a critical ingredient to 

increase employee engagement with learning.

Amplify your manager  
relationships. 

66% 66% 66%

4 5

As you can see, sentences are split by line breaks. There are also many lines that were extracted from tables and contain only numbers and special characters.

How can I join these sentences back together with regex, and then get rid of the other lines?
Solutions in Python or a shell tool such as awk would be great.
Thanks a lot.

Chris

Why do you want to do this "with Regex"? Not every string manipulation problem is a regular expression problem.

Chris

Also, what have you tried? How did it fail? We're not here to write code for you. Please read about what's on-topic in the help center. This is far too broad. Please read How to Ask.

1 Answer

1

This is just a start, but the following might help:

grep -E '[a-zA-Z]' file.txt | xargs

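It discards every line that doesn't contain at least one alphabetic character, then joins the remaining lines with single spaces. Be aware that xargs treats quotes and backslashes specially, so input containing a straight apostrophe or double quote may make it fail.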
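Since the question also asks for Python, here is a minimal sketch of the same idea. The function name `clean_ocr_text` and the pattern name `HAS_LETTER` are my own, and it assumes that "a line worth keeping" simply means "a line with at least one ASCII letter", which may be too crude for your real data:

```python
import re

# Hypothetical names for illustration; they mirror the grep filter above.
HAS_LETTER = re.compile(r"[A-Za-z]")


def clean_ocr_text(text: str) -> str:
    """Drop lines without any ASCII letter and join the rest with single spaces."""
    # Keep only lines that contain at least one letter. Blank lines and
    # table debris such as "66% 66% 66%" or "4 5" are dropped here.
    kept = [line for line in text.splitlines() if HAS_LETTER.search(line)]
    # str.split() with no arguments collapses every run of whitespace,
    # which also removes the trailing spaces left by the extraction.
    return " ".join(" ".join(kept).split())


sample = "the time to learn, reduce  \nthe friction.\n\n66% 66% 66%\n\n4 5\n"
print(clean_ocr_text(sample))  # the time to learn, reduce the friction.
```

Like the shell pipeline, this merges everything into one long line, so paragraph boundaries are lost. If the text contains non-English letters, `[A-Za-z]` will miss them; checking `any(ch.isalpha() for ch in line)` would be one alternative.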
