I have text that has been extracted from PDFs. It looks like the following:
If employees can't find the time to learn, reduce the friction. Manager involvement is a critical ingredient to increase employee engagement with learning. Amplify your manager relationships. 66% 66% 66% 4 5
As you can see, sentences are split by line breaks. There are also many lines that were extracted from tables and contain only numbers and special characters.
How can I join these sentences back together with a regex and, secondly, get rid of the other lines?
A solution in Python or a shell tool like awk would be great.
Thanks a lot
This is just a start, but the following might help:
grep -E '[a-zA-Z]' file.txt | paste -sd ' ' -
It throws away any lines that don't contain at least one alphabetic character, and then joins the remaining lines with single spaces.
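Since the question also mentions Python, here is a minimal sketch of the same idea there. The function name join_text_lines and the sample string are made up for illustration, and the line breaks in the sample are only a guess at what the extracted text looks like. It assumes that "has at least one ASCII letter" is a good enough test for a real text line, which may not hold for every PDF.

```python
import re

# A line counts as text if it has at least one ASCII letter,
# mirroring the [a-zA-Z] test used in the grep command.
HAS_LETTER = re.compile(r"[A-Za-z]")


def join_text_lines(text: str) -> str:
    """Drop lines without letters and join the rest with single spaces."""
    kept = [line.strip() for line in text.splitlines() if HAS_LETTER.search(line)]
    return " ".join(kept)


# Hypothetical sample; the real line breaks in the PDF output are assumed.
sample = (
    "If employees can't find the time\n"
    "to learn, reduce the friction.\n"
    "66% 66% 66%\n"
    "4 5\n"
    "Amplify your manager relationships.\n"
)

if __name__ == "__main__":
    print(join_text_lines(sample))
```

Like the shell version, this joins everything into one long line, so paragraph boundaries are lost. If they matter, one option would be to treat empty lines as paragraph separators before filtering.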