How do I generate the movies page?
<2025-08-16 Sat>
The movies section of my web page has reached 700 movies. So… it is the perfect moment to show how I generate it.
First version
For the record, the first version was totally manual. I used the data provided by Letterboxd, as you can see in the image below. I edited the CSV with a Python script to get the org table, then I uploaded it here: a manual copy&paste.
It was an effective way of achieving my initial goal, but to be honest, I didn't like it. You and my engineer's mind know that this process must be automated.
NOTE: I don't have the script :c
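I no longer have that script, but a minimal sketch of what the conversion could have looked like is below. This is hypothetical code, not the original: the column handling is generic, and the file name watched.csv is just an assumption about the Letterboxd export.

# csv_to_org.py -- hypothetical reconstruction of the old manual step:
# turn a Letterboxd CSV export into an org-mode table ready to copy&paste.
import csv
import sys

def csv_to_org_table(csv_path):
    """Format the rows of a CSV file as an org-mode table."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f))
    if not rows:
        return ""
    lines = ["| " + " | ".join(rows[0]) + " |"]                  # header row
    lines.append("|" + "+".join("---" for _ in rows[0]) + "|")   # separator row
    for row in rows[1:]:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

if __name__ == "__main__":
    # Usage: python csv_to_org.py watched.csv > movies-table.org
    print(csv_to_org_table(sys.argv[1]))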
Second version
The new version is a fully automated Python script run by a cron job (for more information, check /etc/cron.d/website).
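For reference, an entry in that cron file could look something like this. The schedule, user, and script path here are placeholders, not copied from my real /etc/cron.d/website:

# /etc/cron.d/website -- illustrative entry; schedule, user and path are placeholders
# m h dom mon dow user    command
0 4 * * *   rhyloo  /usr/bin/python3 /home/rhyloo/scripts/update_movies.py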
The following code was written with AI (ChatGPT). I shared my first version and the tweaks I wanted with ChatGPT, then I edited the result to get what I wanted.
What about the API?
Letterboxd doesn't offer a free or paid API for personal projects, so I wrote a simple scraper to retrieve my movie history, minimising the number of requests to avoid bans or extra load on their servers.
The code
Here are the libraries I use for the scraper.
import re           # Regular expressions for pattern matching and text processing
import os           # Operating system interface (file paths, environment, etc.)
import time         # Time-related functions (sleep, timestamps, etc.)
import requests     # HTTP requests to interact with web resources
import fileinput    # Read or modify files line by line
from bs4 import BeautifulSoup   # HTML/XML parsing and web scraping
from datetime import datetime   # Date and time manipulation
I defined five variables:
- base_url: This defines the URL where the script will search for movies.
- last_movie_url: This defines the URL used to retrieve the date on which the last movie was watched.
- output_file: This defines a temporary file, /tmp/movies_list.txt, where the scraped list is stored.
- output_org: This defines the path of the .org file the script generates (the value shown below is illustrative).
- status: The script sometimes stops working because Letterboxd updates their page, changing class names and IDs. I use this variable "to send" a notification on the page when that happens.
base_url = "https://letterboxd.com/rhyloo/films/by/date/page/"
last_movie_url = "https://letterboxd.com/rhyloo/films/diary/"
output_file = "/tmp/movies_list.txt"
output_org = "/tmp/movies.org"  # generated .org file (illustrative path)
status = True
get_movie_names is the core of the scraper. It looks for the <li> tag and the griditem class to find the movies. As I said, it usually fails on the class name because Letterboxd changes it (this has happened to me at least three times). I also clean the names to remove any strange characters.
def clean_name(name):
    """Remove unwanted characters from a movie name, keeping punctuation like :,.?!"-"""
    return re.sub(r'[^\w\s:,.?!"-]', '', name).strip()

def get_movie_names(page_url):
    """Fetch movie names from a page and return a list of cleaned titles."""
    global status  # update the module-level flag so the page can report failures
    try:
        status = True
        response = requests.get(page_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find the poster containers
        poster_containers = soup.find_all('li', class_='griditem')
        movie_names = []
        for container in poster_containers:
            img_tag = container.find('img')
            if img_tag and img_tag.get('alt'):
                movie_names.append(clean_name(img_tag['alt']))
        return movie_names
    except Exception as e:
        status = False
        return []
Here are some more auxiliary functions. get_last_date_movie simply retrieves the date on which I last watched a movie, while write_to_file saves new movies to a file. Finally, split_into_rows splits the movie list into rows of three.
def get_last_date_movie():
    """
    Fetch the date of the last movie from a webpage.

    Returns:
        str: The date in format 'DD-MM-YYYY' if found.
        "Opsss!": If the date element is missing.
        None: If an error occurs (e.g., network issue, invalid HTML).
    """
    try:
        response = requests.get(last_movie_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        date_link = soup.find(class_='daydate')
        if date_link:
            date_link = re.search(r'(\d{4}).(\d{2}).(\d{2})', str(date_link))
            year = date_link.group(1)
            month = date_link.group(2)
            day = date_link.group(3)
            last_movie = day + '-' + month + '-' + year
            return last_movie
        else:
            return "Opsss!"
    except Exception as e:
        return None

def write_to_file(file_path, movie_names):
    """
    Write movie names to a file.

    - If the file exists and is not empty, replace its first line with new movie names.
    - If it doesn't exist or is empty, create it and write all names.
    """
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        # Edit the existing file in-place (replace the first line with new movies)
        with fileinput.input(file_path, inplace=1) as f:
            for xline in f:
                if fileinput.isfirstline():
                    for element in movie_names:
                        print(element)
                else:
                    print(xline.strip('\r\n'))
    else:
        # Create a new file and write movie names line by line
        with open(file_path, 'w', encoding='utf-8') as file:
            for name in movie_names:
                file.write(name + '\n')

def split_into_rows(movies):
    """
    Divide a list of movies into rows of a fixed number of columns (default 3 per row).
    """
    rows = []
    while movies:
        row = movies[:3]
        rows.append(row)
        movies = movies[3:]
    return rows
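For example, split_into_rows just chops the list into groups of three; the padding to exactly three columns happens later, in the table loop:

split_into_rows(["Dune", "Alien", "Heat", "Up"])
# -> [['Dune', 'Alien', 'Heat'], ['Up']]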
This is where the main part of the script starts. It reads the last movie saved in the movies_list.txt file. Then, in a loop, it iterates over each page and checks whether that movie is already in the newly scraped list; if so, it stops.
# MAIN
if os.path.exists(output_file):
    with open(output_file, 'r', encoding='utf-8') as myfile:
        first_line = myfile.readline()
else:
    first_line = ""

page_number = 1
all_movie_names = []

while True:
    page_url = base_url + str(page_number) + "/"
    movie_names = get_movie_names(page_url)
    if not movie_names:
        break
    all_movie_names.extend(movie_names)
    if first_line.strip('\n') in all_movie_names:
        target_index = all_movie_names.index(first_line.strip('\n'))
        all_movie_names = all_movie_names[:target_index+1]
        break
    page_number += 1
    time.sleep(30)

write_to_file(output_file, all_movie_names)
Finally, I create an .org file, which is exported to HTML using org-publish.
chars_to_be_delated = ' '
with open(output_file, 'r') as myfile:
    contenido = myfile.read()
contenido_filtrado = ''.join(c for c in contenido if c not in chars_to_be_delated)
with open(output_file, 'w') as myfile_filtered:
    myfile_filtered.write(contenido_filtrado)

actual_date = datetime.now().strftime("%Y-%m-%d %a")

with open(output_file, 'r', encoding='utf-8') as myfile:
    movies = [linea.strip() for linea in myfile.readlines()]

max_movie = len(movies)
num_digits = len(str(max_movie))
movies_size_column = f"{num_digits}em"
rows = split_into_rows(movies)
last_movie_date = get_last_date_movie()

if status:
    html_status = f"@@html:<div class=\"warning\">✅ The script is working.</div>@@"
else:
    html_status = f"@@html:<div class=\"warning\">⚠️ The script is not working.</div>@@"

org_mode = f"""#+TITLE: MOVIES
#+author: J. L. Benavides
#+date: <{actual_date}>
#+last_modified: [{actual_date}]
#+DESCRIPTION: Personal blog - Jorge Benavides Macías - Watched movies
#+OPTIONS: toc:nil num:nil ^:nil

I am a film lover, I really enjoy watching films very much. It's a hobby I can do alone or with others. Actually I have watched more than {max_movie-max_movie%100} films, below is the full list since I started registering them on [[https://letterboxd.com/][letterboxd]].\n\nI used to update this content [[file:how-do-i-get-movies-draft.org][manually]], but now it is [[file:how-do-i-get-movies-draft.org][automatic]], so it's always up to date.\n\nBy the way my user is [[https://letterboxd.com/rhyloo/films/by/date/][Rhyloo]].

{html_status}
@@html:<div class="last-update">Last update: {actual_date}</div>@@
@@html:<div class="last-update">Total movies watched: {max_movie}</div>@@
@@html:<div class="last-update">Date last movie: {last_movie_date}</div>@@

#+ATTR_HTML: :class table-movies :style "grid-template-columns: {movies_size_column} 1fr;"
| <c> | <c> | <c> | <c> | <c> | <c> |
| No. | Movie Name | No. | Movie Name | No. | Movie Name |
|-----+--------------------------------------------------------+-----+----------------------------------------------------------------------+-----+-------------------------------------------------------------------------------------|
"""

number = 1
for row in rows:
    row.extend([''] * (3 - len(row)))
    org_mode += f"| {number:<3} | {row[0]:<50} |"
    if row[1]:
        org_mode += f" {number+1:<3} | {row[1]:<68} |"
    else:
        org_mode += f" | |"
    if row[2]:
        org_mode += f" {number+2:<3} | {row[2]:<80} |\n"
    else:
        org_mode += f" | |\n"
    number += 3

with open(output_org, 'w', encoding='utf-8') as myfile_org:
    myfile_org.write(org_mode)
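The HTML export itself happens on the Emacs side. A minimal org-publish project definition would look roughly like this; the project name and directories are placeholders, my real configuration differs:

;; Minimal org-publish project sketch; name and paths are placeholders.
(setq org-publish-project-alist
      '(("movies"
         :base-directory "~/website/org/"
         :publishing-directory "~/website/public_html/"
         :publishing-function org-html-publish-to-html)))

;; Then publish with: M-x org-publish RET movies RET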
Final thoughts
I'm not particularly proud of this code. I didn't write it entirely myself, so it's a bit of a mess: the famous spaghetti code. I could simplify it, or at least clean it up to make it more readable.
Maybe I could remove the functions… in this code they are not necessary because I don't reuse them. Something sequential like the steps below (see the sketch after the list):
- Step ONE: Get data
- Step TWO: Save data
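A rough, untested sketch of what that sequential version could look like (only the first page, to keep it short):

# Hypothetical sequential version: no functions, just the two steps.
import requests
from bs4 import BeautifulSoup

# Step ONE: get data (only the first page, for brevity)
movies = []
response = requests.get("https://letterboxd.com/rhyloo/films/by/date/page/1/")
soup = BeautifulSoup(response.text, "html.parser")
for container in soup.find_all("li", class_="griditem"):
    img_tag = container.find("img")
    if img_tag and img_tag.get("alt"):
        movies.append(img_tag["alt"])

# Step TWO: save data
with open("/tmp/movies_list.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(movies) + "\n")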
Also, I find Python a bit of a headache. The code never feels clean to me; it's just a bunch of lines in one file with unknown libraries. I prefer C.
Ghostbusters/Transcript Modified
Are you troubled by strange noises in the middle of the night?
Do you experience feelings of dread in your basement or attic?
Have you or any of your family ever seen a spook, specter or ghost?
If the answer is yes, then don't wait another minute. Pick up your phone and call the professionals.
Contact me at rhyloo dot com!