Lab22 - Asyncio website image scraper

We would like to use asyncio to efficiently download all images found on a given list of websites and save them locally, optionally converting them to grayscale beforehand.

The goal of this program is to run a single-threaded event loop as seen in the lecture. Use asyncio-compatible libraries wherever they apply (e.g. aiohttp, aiofiles) so that the event loop and your process resources are used efficiently.

The Python scraper script - what to do

Your Python script must run with Python 3.12 or higher.

Python in Docker container

Instead of upgrading the Python version on your laptop, you can also run your script in a Docker container with the appropriate version (in this case I recommend going straight to 3.13).

docker run -it --rm --name my-running-script \
    -v "$PWD":/usr/src/myapp \
    -w /usr/src/myapp python:3.13-alpine \
    python your-daemon-or-script.py
It should also work with a Python 3.12 version.

Of course, you must create your own Dockerfile if you want to use additional libraries, as this lab requires.

Follow these steps:

  1. The script reads a file containing a list of URLs.
  2. All these URLs must be visited (asynchronously) by your script.
  3. The HTML of each page must be searched for all images.
  4. All images found must then be downloaded.
  5. The downloaded images are then converted to grayscale (if requested).
  6. Finally, the images are saved (asynchronously) to your local disk, keeping the file name that was used for the download.

Once step 1 is done, ALL other steps should run asynchronously to ensure a high-performance solution.
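Steps 3, 4 and 6 depend on finding the image URLs in a page and deriving a local file name from each of them. The following is only a minimal sketch of that part, assuming the images of interest are `<img>` tags with a `src` attribute. To stay free of dependencies it uses the standard-library `html.parser` rather than a dedicated parsing library, and the names `ImageSrcParser`, `find_image_urls` and `file_name_for` are hypothetical helpers chosen for illustration.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import unquote, urljoin, urlsplit


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tags without src are skipped
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_urls(html: str, page_url: str) -> list[str]:
    """Return the absolute URL of every image referenced in the page."""
    parser = ImageSrcParser()
    parser.feed(html)
    # relative sources are resolved against the URL of the page itself
    return [urljoin(page_url, src) for src in parser.sources]


def file_name_for(image_url: str) -> str:
    """Derive the local file name from the last path segment of the URL."""
    return PurePosixPath(unquote(urlsplit(image_url).path)).name
```

Resolving against the page URL matters when redirections are followed: in that case `page_url` should be the final URL of the response, not the one read from the list.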

The script arguments

All arguments are optional. The table below describes each possible argument with its default value.

| Option | Description | Type | Default |
|---|---|---|---|
| --URLlist | File that contains the URL list (the path must be handled correctly) | string | ./urls.txt |
| --nc | A switch: if the --nc flag is given, the images won’t be converted to grayscale | n/a | n/a |
| --dest | Destination directory for the downloaded images | string | ../images |

The URL list’s format

The file containing the URL list (UTF-8 encoded text file) must have one URL per line. The URLs can be wrong, the file can be empty.

https://regex101.com/
https://docs.python.org/3/this-url-will-404.html
https://www.sbb.ch
http://fr-cybersecurity.ch/

There is an example in the template repository of lab22.
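Reading the list is step 1 and happens before the asynchronous part starts, so a plain synchronous read is sufficient. The sketch below assumes that blank lines and surrounding whitespace should be ignored; the name `read_url_list` is hypothetical. It does not validate the URLs, since wrong URLs only show up when they are visited.

```python
from pathlib import Path


def read_url_list(path: str) -> list[str]:
    """Return the non-empty lines of a UTF-8 text file, one URL per line."""
    text = Path(path).read_text(encoding="utf-8")
    # strip surrounding whitespace and drop blank lines;
    # an empty file simply yields an empty list
    return [line.strip() for line in text.splitlines() if line.strip()]
```

A missing file raises `FileNotFoundError` here, which the caller still has to handle.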

Invocation examples

python3 image_scraper.py

Important

All arguments are optional. The image scraper script is called image_scraper.py.

You must use this name!

Here are some example invocations:

# default arguments: conversion to grayscale, input file is ./urls.txt, results in the ../images directory
python3 image_scraper.py

# same as above, but no grayscale conversion
python3 image_scraper.py --nc

# scraping the URLs in file toto.txt 
python3 image_scraper.py --URLlist=toto.txt

# scraping the URLs in file toto.txt and writing to /tmp
python3 image_scraper.py --dest=/tmp --URLlist=toto.txt

Important

The test battery will run the script directly from the src/ directory. All imports from additional files in that same directory must be written without directory prefixes.

Additional specifications

  • All additional libraries (e.g. aiohttp) must be listed in the file pyproject.toml. With this file, an automatic installation of the libraries with poetry (poetry install) should be possible.
  • Consider using an existing library (e.g. Beautiful Soup) for parsing the HTML and finding the links to the images.
  • The same applies to the grayscale conversion (e.g. OpenCV or Pillow).
  • You should follow redirections (e.g. https://cutt.ly/concurp2-2425 redirects to https://concurp2-lecture.kube-ext.isc.heia-fr.ch) to get the images of the final destination. However, you should not follow links on the web page to other pages. Such redirections are signaled by a 301 Moved Permanently response.
  • You must provide correct exception handling. Try to stop only the coroutines that raise exceptions and let the other coroutines continue their work.
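The exception-handling bullet above can be approached in several ways. One of them, sketched here, is `asyncio.gather` with `return_exceptions=True`, which hands each exception back as a result instead of propagating it. The coroutine `process_page` is a hypothetical stand-in that simulates the network access with `asyncio.sleep` and fails for any URL containing "404"; a real implementation would fetch and process the page instead.

```python
import asyncio


async def process_page(url: str) -> int:
    """Stand-in for fetching and processing one page (hypothetical helper)."""
    await asyncio.sleep(0.01)  # simulates network I/O
    if "404" in url:
        raise ValueError(f"cannot fetch {url}")
    return len(url)  # placeholder result


async def process_all(urls: list[str]) -> tuple[dict[str, int], dict[str, BaseException]]:
    """Run one coroutine per URL and separate successes from failures."""
    # return_exceptions=True: a failing coroutine yields its exception as its
    # result, so the remaining coroutines run to completion
    results = await asyncio.gather(
        *(process_page(url) for url in urls), return_exceptions=True
    )
    done: dict[str, int] = {}
    failed: dict[str, BaseException] = {}
    # gather returns the results in the same order as the input coroutines
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failed[url] = result
        else:
            done[url] = result
    return done, failed
```

Calling `asyncio.run(process_all(urls))` returns both dictionaries, so the failures can be reported after all other pages have finished. Note that `asyncio.TaskGroup` behaves differently: it cancels the remaining tasks as soon as one of them fails.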

Submission and deadline

The code must be committed to the main branch of your group git repository. The teacher will not search for more recent versions in other branches.

Commit and comments

Your comments in the code and your commit messages form your report! The expectation is that these comments are self-explanatory and in-depth. Commit often and describe precisely what you were doing.

10 coding commandments

Don’t forget the coding commandments. The grading will also be based quite strongly on them. Here again is the link: 10 coding commandments

Teams of Two and Deadline

The last accepted commit must be made by Wed, 10.06.2026 - 23h59, P16.

Teams of Two