#### URL Manipulation with Regex in Python

I wanted to move all the image files used in my blog from Imgur to my own Amazon S3 bucket, so I wrote a Python script that does this with rclone. I've written about rclone before. It is a really nifty command-line utility for cloud services. It can copy a file from a URL directly to Amazon S3, Google Drive, and other cloud services without having to download the file locally first.

URL manipulation turned out to be trickier than I expected, so I am saving the steps involved in this post for posterity.

My blog is written in Markdown, so the first step was to extract all the lines containing Imgur URLs and save them to a file.

    grep -rnHs https://i.imgur.com *.md > imglist

The next step was to extract the URLs from the imglist text. Extracting URLs is hard to get right because of the large number of variations possible, so a regular expression that matches a URL is rather complex. There is a site that compares the performance of different regexes on various test cases. The regex that I ended up using is from a site dedicated to finding the perfect URL regex.

    import re

    # Read the whole grep output into one string for findall to search.
    with open("imglist") as f:
        lines = f.read()

    urls = re.findall(r'https://i.imgur.com/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', lines)

Even with that complex regex, I kept getting an annoying trailing `)` on a few of the URLs. It turns out that a matched URL may still include "weird" leading or trailing characters, which need to be trimmed off before use.

    urls = [url.strip('!"#$%&\'()*+,-./@:;<=>[\\]^_{|}~') for url in urls]
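To see where the stray `)` comes from, here is a small self-contained sketch. The sample line and the `extract_urls` helper name are made up for illustration; the regex and the strip set are the ones used above. The range `$-_` inside the character class happens to cover `)`, so the closing parenthesis of a Markdown image link is swallowed by the match.

```python
import re

# Same pattern as above; the range $-_ inside the class also covers ")".
URL_RE = re.compile(
    r'https://i.imgur.com/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
STRIP_CHARS = '!"#$%&\'()*+,-./@:;<=>[\\]^_{|}~'


def extract_urls(text):
    """Hypothetical helper: find Imgur URLs, then trim stray punctuation."""
    return [url.strip(STRIP_CHARS) for url in URL_RE.findall(text)]


# Made-up line in the shape that grep -n writes to imglist.
sample = 'post.md:12:![screenshot](https://i.imgur.com/abc123.png)'

print(URL_RE.findall(sample))  # ['https://i.imgur.com/abc123.png)']
print(extract_urls(sample))    # ['https://i.imgur.com/abc123.png']
```

Note that `strip` only trims the two ends of the string, so punctuation inside the URL is left alone.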

The next step is needed because of a quirk of rclone. When the regular copy command copies a file from a local drive to a remote, rclone keeps the file name as is; in fact, copy does not allow setting a new name on the remote. So one only needs to specify the remote directory or bucket to copy the file to. However, the copyurl command, which copies a web-hosted file directly to a remote, needs the destination file name to be specified, not just the remote directory.

I also wanted to retain the same file names on S3 as on Imgur, so that I would have to modify only the base of the URLs in the blog.

    files = [re.findall('https://i.imgur.com/(.*)', url)[0] for url in urls]

re.findall returns a list of the strings captured by the group in the pattern, while we need a plain string to pass to rclone. Since we know that each URL yields exactly one file name, the list comprehension takes element [0] of each re.findall result and adds it to files.
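As an alternative to a second regex, the file name can also be taken from the URL path with the standard library. This is only a sketch, not what the script above does: `filename_from_url` and the sample URLs are hypothetical, and unlike the `(.*)` capture it drops any query string.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Hypothetical helper: return the last path component of a URL."""
    # urlsplit separates the path from any query string or fragment.
    path = urlsplit(url).path
    return posixpath.basename(path)


# Made-up sample URLs; the second one carries a query string.
sample_urls = [
    'https://i.imgur.com/abc123.png',
    'https://i.imgur.com/xyz789.jpg?1',
]
sample_files = [filename_from_url(url) for url in sample_urls]
print(sample_files)  # ['abc123.png', 'xyz789.jpg']
```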

Finally, we use the URLs and file names to construct a command line to pass to the system for execution. This needs the subprocess module if we would like to receive information about the result of the command. Apparently os.system is no longer the recommended way of executing commands.

Make sure that rclone is properly configured to send and receive files at the cloud endpoint of interest.

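    import subprocess

    for url, file in zip(urls, files):
        cmdline = "rclone copyurl {} s3:<bucket-name>/img/{}".format(url, file)
        # Pipe stderr as well; otherwise error is always None.
        process = subprocess.Popen(cmdline.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output, error = process.communicate()
        # A non-zero exit code means the command failed.
        if process.returncode != 0:
            print(output, error)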
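The same result check can be written with `subprocess.run`, which waits for the command and returns its exit code and captured output in one object. The sketch below is an assumption-laden illustration rather than part of the script: `run_command` is a made-up helper, and the Python interpreter stands in for rclone so that it runs anywhere.

```python
import subprocess
import sys


def run_command(args):
    """Hypothetical helper: run an argument list, return (exit code, stdout, stderr)."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Stand-in commands so the sketch runs without rclone installed.
ok_cmd = [sys.executable, "-c", "print('copied')"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit('upload failed')"]

for args in (ok_cmd, bad_cmd):
    code, out, err = run_command(args)
    # Report only the commands that exited with a non-zero code.
    if code != 0:
        print("failed:", err.strip())
```

With rclone, `args` would be the list built from the copyurl command line.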

Using format() to build the command line is the safest bet for making sure that the command is correctly formed. The command line is then converted into a list using split(). This is important because subprocess.Popen() and subprocess.call() expect the program and its arguments as a list when the command is not run through a shell. An even better way is to use the split() function from the shlex module, which is intended for exactly this purpose and respects shell-style quoting.
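The difference between the two ways of splitting shows up as soon as an argument contains a space. Here is a minimal sketch; the bucket name, the file name with a space, and the `build_copyurl_args` helper are all made up for illustration, and nothing is executed.

```python
import shlex


def build_copyurl_args(url, remote_path):
    """Hypothetical helper: build the argument list directly, with no splitting."""
    return ["rclone", "copyurl", url, remote_path]


url = "https://i.imgur.com/abc123.png"
dest = "s3:my-bucket/img/my file.png"  # made-up destination containing a space

# shlex.quote adds quotes only where the shell would need them.
cmdline = "rclone copyurl {} {}".format(shlex.quote(url), shlex.quote(dest))

print(cmdline.split())        # str.split breaks the quoted destination in two
print(shlex.split(cmdline))   # shlex.split keeps it as a single argument
print(build_copyurl_args(url, dest))
```

Building the list directly avoids the quoting question altogether, since each argument is passed through as a single list element.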

Finally, we have all the images uploaded to the S3 bucket. Use the search and replace function of your favourite editor to change `https://i.imgur.com/` to `https://<bucket-name>.s3.amazonaws.com/<folder-name>/` in all the files.
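If you would rather script the replacement too, a plain string replace is enough because only the base of each URL changes. This is a rough sketch: the helper names and the `example-bucket` base URL are placeholders, and it assumes the Markdown files sit in a single directory.

```python
from pathlib import Path

OLD_BASE = "https://i.imgur.com/"


def rewrite_image_base(text, new_base):
    """Hypothetical helper: swap the Imgur base URL for the new one."""
    return text.replace(OLD_BASE, new_base)


def rewrite_markdown_files(directory, new_base):
    """Rewrite every *.md file in directory; return the names of changed files."""
    changed = []
    for path in sorted(Path(directory).glob("*.md")):
        original = path.read_text(encoding="utf-8")
        updated = rewrite_image_base(original, new_base)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed


new_base = "https://example-bucket.s3.amazonaws.com/img/"  # placeholder base URL
print(rewrite_image_base("![shot](https://i.imgur.com/abc123.png)", new_base))
# ![shot](https://example-bucket.s3.amazonaws.com/img/abc123.png)
```

It is worth running this on a copy of the blog first, since the files are rewritten in place.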