Lab22 - Asyncio website image scraper
We would like to use asyncio to efficiently download all images found on a given list of websites and save them locally, optionally converting them to grayscale beforehand.
The goal of this program is to run a single-threaded event loop as seen in the lecture. You should use asyncio-compatible libraries wherever possible (e.g. aiohttp, aiofiles, etc.) to make efficient use of the event loop and of your process resources.
The Python scraper script - what to do
Your Python script must run with Python 3.12 or higher.
Python in Docker container
Instead of upgrading the Python version on your laptop, you can also run your script in a Docker container with the appropriate version (in this case I recommend going directly to 3.13).
docker run -it --rm --name my-running-script \
-v "$PWD":/usr/src/myapp \
-w /usr/src/myapp python:3.13-alpine \
python your-daemon-or-script.py
Of course, you must create your own Dockerfile if you want to use additional libraries, as in this lab.
Follow these steps:
1. The script reads a file containing a list of URLs.
2. Your script must visit all these URLs asynchronously.
3. The HTML of each page must be searched for all images.
4. All images found are then downloaded.
5. The downloaded images are converted to grayscale (if requested).
6. Finally, the images are saved asynchronously to your local disk, using the file name they had in the download URL.
Once step 1 is done, ALL other steps should run asynchronously to ensure a high performance solution.
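The steps above can be sketched as a chain of coroutines. This is only a minimal illustration that runs offline: `FAKE_PAGES`, `fetch_html`, `process_image`, `process_page` and `scrape` are hypothetical names, the network and disk access are simulated with `asyncio.sleep`, and the standard library `html.parser` stands in for a parsing library such as Beautiful Soup. Error handling is left out to keep the flow visible.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs offline; a real script
# would fetch these pages with an async HTTP client such as aiohttp.
FAKE_PAGES = {
    "https://example.org/": '<html><img src="/a.png"><img src="b.jpg"></html>',
    "https://example.net/": "<html><p>no images here</p></html>",
}


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


async def fetch_html(url: str) -> str:
    """Stand-in for an asynchronous GET request (step 2)."""
    await asyncio.sleep(0.01)  # simulates network latency
    return FAKE_PAGES[url]


def find_image_urls(page_url: str, html: str) -> list[str]:
    """Step 3: return the absolute URLs of the images found in the page."""
    collector = ImageCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


async def process_image(image_url: str, dest: str, grayscale: bool) -> str:
    """Stand-in for steps 4 to 6: download, optional conversion, save."""
    await asyncio.sleep(0.01)  # simulates download and disk write
    name = image_url.rsplit("/", 1)[-1]
    return f"{dest}/{name}"


async def process_page(url: str, dest: str, grayscale: bool) -> list[str]:
    html = await fetch_html(url)
    image_urls = find_image_urls(url, html)
    # All images of one page are handled concurrently.
    return await asyncio.gather(
        *(process_image(u, dest, grayscale) for u in image_urls)
    )


async def scrape(urls: list[str], dest: str, grayscale: bool = True) -> list[str]:
    # All pages are handled concurrently on the single event loop.
    per_page = await asyncio.gather(
        *(process_page(u, dest, grayscale) for u in urls)
    )
    return [path for paths in per_page for path in paths]


if __name__ == "__main__":
    print(asyncio.run(scrape(list(FAKE_PAGES), "../images")))
```

Because each page and each image gets its own coroutine, a slow server only delays its own chain. A CPU-bound step such as the grayscale conversion would block the loop if called directly; one option is to hand it to `asyncio.to_thread`.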
The script arguments
All arguments are optional. The table below describes each argument with its default value.
| option | description | type | default |
|---|---|---|---|
| --URLlist | file (including correct management of path) that contains the URL list | string | ./urls.txt |
| --nc | A switch: if the --nc flag is given, the images won’t be converted to grayscale. | n/a | n/a |
| --dest | Destination directory for the downloaded images | string | ../images |
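One possible way to declare these three options is with `argparse` from the standard library. The sketch below is an assumption about the implementation, not a required design: the function name `build_parser` is hypothetical, and the two string arguments are converted to `pathlib.Path` for convenient path handling.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three optional arguments of the table."""
    parser = argparse.ArgumentParser(description="Asyncio website image scraper")
    parser.add_argument(
        "--URLlist",
        type=Path,
        default=Path("./urls.txt"),
        help="file containing one URL per line",
    )
    parser.add_argument(
        "--nc",
        action="store_true",  # a switch: present means "no conversion"
        help="do not convert the images to grayscale",
    )
    parser.add_argument(
        "--dest",
        type=Path,
        default=Path("../images"),
        help="destination directory for the downloaded images",
    )
    return parser


# Parsing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--dest=/tmp", "--URLlist=toto.txt"])
print(args.URLlist, args.nc, args.dest)
```

In the real script, `parse_args()` would be called without an argument list so that the values come from the command line.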
The URL list’s format
The file containing the URL list (UTF-8 encoded text file) must have one URL per line. The URLs can be wrong, the file can be empty.
https://regex101.com/
https://docs.python.org/3/this-url-will-404.html
https://www.sbb.ch
http://fr-cybersecurity.ch/
There is an example in the template repository of lab22.
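Since the file may be empty and may contain wrong URLs, reading it defensively helps. The following sketch uses hypothetical helper names (`parse_url_lines`, `read_url_list`) and makes two assumptions that the assignment does not state: lines that are not syntactically valid http(s) URLs are skipped, and a missing file is treated like an empty one.

```python
from pathlib import Path
from urllib.parse import urlsplit


def parse_url_lines(text: str) -> list[str]:
    """Return the http(s) URLs found in text, one candidate per line."""
    urls = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # blank line; an empty file simply yields an empty list
        try:
            parts = urlsplit(candidate)
        except ValueError:
            continue  # malformed URL, e.g. an unclosed IPv6 bracket
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(candidate)
    return urls


def read_url_list(path: Path) -> list[str]:
    """Read the UTF-8 URL list (assumption: a missing file counts as empty)."""
    try:
        return parse_url_lines(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return []
```

A URL such as https://docs.python.org/3/this-url-will-404.html is syntactically valid and passes this filter; its failure only shows up when the page is requested, so the download step still needs its own error handling.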
Invocation examples
python3 image_scraper.py
Important
All arguments are optional. The image scraper script is called image_scraper.py
You must use this name!
Here are some example invocations:
# default arguments: grayscale conversion, input file is ./urls.txt, results go to the ../images directory
python3 image_scraper.py
# same as above, but no grayscale conversion
python3 image_scraper.py --nc
# scraping the URLs in file toto.txt
python3 image_scraper.py --URLlist=toto.txt
# scraping the URLs in file toto.txt and writing to /tmp
python3 image_scraper.py --dest=/tmp --URLlist=toto.txt
Important
The test battery will run the script directly from the src/ directory. All imports from additional files located in the same directory must be written without directory prefixes.
Additional specifications
- All additional libraries (e.g. aiohttp) must be listed in the file pyproject.toml. With the help of this file, an automatic installation of the libraries with poetry (`poetry install`) should be possible
- Consider using an existing library (e.g. Beautiful Soup) for parsing the HTML part and finding the links for the images
- The same applies to the grayscale conversion (e.g. OpenCV or Pillow)
- You should follow redirections (e.g. https://cutt.ly/concurp2-2425 → https://concurp2-lecture.kube-ext.isc.heia-fr.ch) to get the images of the final destination. However, you should not follow links on the web page to other pages. Redirections are signalled by the HTTP status 301 Moved Permanently
- You must provide correct exception handling. Try to stop only the coroutines that raise exceptions and let the other coroutines continue their work
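For the last point, one option is `asyncio.gather` with `return_exceptions=True`, which returns a failing coroutine's exception as its result instead of propagating it, so the sibling coroutines keep running. The sketch below is an illustration under assumptions: `download` is a hypothetical stand-in that fails for any URL containing "404", and `run_all` is a hypothetical helper name.

```python
import asyncio


async def download(url: str) -> str:
    """Hypothetical stand-in for one download coroutine."""
    await asyncio.sleep(0.01)  # simulates network latency
    if "404" in url:
        raise RuntimeError(f"cannot fetch {url}")
    return f"ok: {url}"


async def run_all(urls: list[str]) -> tuple[list[str], list[BaseException]]:
    """Run every download; collect failures instead of cancelling siblings."""
    results = await asyncio.gather(
        *(download(u) for u in urls),
        return_exceptions=True,  # exceptions come back as ordinary results
    )
    successes = [r for r in results if not isinstance(r, BaseException)]
    failures = [r for r in results if isinstance(r, BaseException)]
    return successes, failures


if __name__ == "__main__":
    ok, failed = asyncio.run(
        run_all(
            [
                "https://www.sbb.ch",
                "https://docs.python.org/3/this-url-will-404.html",
            ]
        )
    )
    print(ok)
    print(failed)
```

Note that `asyncio.TaskGroup` behaves differently: when one task fails, it cancels the remaining tasks of the group. With a task group, each coroutine would need its own `try`/`except` to get the isolation requested here.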
Submission and deadline
The code must be committed to the main branch of your group git repository. The teacher will not search for more recent versions in other branches.
Commit and comments
Your comments in the code and your commit messages form your report! They are therefore expected to be self-explanatory and in-depth. Commit often and describe precisely what you were doing.
10 coding commandments
Don’t forget the coding commandments. The grading will rely on them quite strongly. Here again is the link: 10 coding commandments
Teams of Two and Deadline
The last accepted commit must be made by Wed, 10.06.2026 - 23h59, P16.
Teams of Two
- Group 01 / lab22-a (bertrand.marmy, elvin.kuci, mailys.heimgart)
- Group 02 / lab22-b (matteo.membrez, bastien.bussard)
- Group 03 / lab22-c (noe.henchoz, yoan.gilliand)
- Group 04 / lab22-d (remi.carrard, axel.buro)
- Group 05 / lab22-e (robin.golliard, orell.wandji)
- Group 06 / lab22-f (luc.vogt, elioamen.thalmann)
- Group 07 / lab22-g (simon.losey, ismael.castella)
- Group 08 / lab22-h (danilo.anzile, julien.gumy)
- Group 09 / lab22-i (rodrigo.carracor, tom.yerly)