How to replace several URLs in text with python?

3832 views python
6

I have XML with many weblinks which are URL encoded. I can't to use this XML before I decode all weblinks in it.

I have written such code in python:

import re
from urllib.parse import unquote
from transliterate import translit, get_available_language_codes

myString = """><tr><td ><a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" ><img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" >;<br /><a name='more'></a><br /><br /><div align="center"><script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""
b = re.findall("(?P<url>https?://[^\s]+)", myString)
c = unquote(unquote(b))
d = translit(c, 'ru', reversed=True)

Now I can: 1. Decode any of link separately 2. Create an array of decoded links

But I have no ideas how can I replace in myString all encoded links (default one) by those which where decoded by me.

answered question

Do you need a regex here? html.unescape(myString) should get you most of the way...

Thank you for a comment. It doesn't give proper links unfortunately.

Nope - there's still more work to be done but I'm guessing that's getting closer to what you're after?

1 Answer

7

You can use html.unescape to get your string more readily parseable, then use BeautifulSoup4 (pip install bs4) to find all the anchor tags and double unquote the href attribute for those, eg:

from html import unescape
from urllib.parse import unquote
from bs4 import BeautifulSoup

myString = """&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div align="center"&gt;&lt;script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""

decoded_urls = [
    unquote(unquote(a['href']))
    for a in BeautifulSoup(unescape(myString), 'html.parser').find_all('a', href=True)
]

Gives you:

['https://somewebsite.com/s1600/????????+??%?.jpg']

You'll probably have to tweak things here to suit your data/extract requirements, but that's the general principle.

posted this

Have an answer?

JD

Please login first before posting an answer.