Engineering, data and two butterfly effects.

<2025-08-13 Wed>

For the last two weeks, I had been dealing with my personal Roman Empire of accounting. So after updating my accounts –which were four months out of date– and changing the accounting software from GnuCash to Hledger (advantages of plain text accounting), I had the idea to analyse all my grocery receipts.

The task is fairly straightforward from a data standpoint:

A large number of receipts in PDF format (database).
Extract the data and do some parsing.
Plot it and perform some basic statistical analysis. (Spoiler: I just calculated the total expenses by month, at least for this post.)

Here is an example of the data I am talking about: 20250809 Mercadona 122,05 €.pdf.

As you can see, the file size is 36–37 kB and the file name includes the date, supermarket and total amount. Well, I have 148 files of fun :D!

So let's start the mini-project.

Init

After a long Google search ‒https://www.google.com/search?q=pdf+extract+text+python‒ and reading the first result: Reddit: What’s the Best Python Library for Extracting Text from PDFs?, I decided to use PyMuPDF.

import pymupdf

Then I wrote this code to open the file and to print the extracted data. I simplified the output with dots "…" for the blog. If you execute this code, you will get a long version.

doc = pymupdf.open("./20250809 Mercadona 122,05 €.pdf")
for page in doc:
    content = page.get_text()
    print(content)

MERCADONA, S.A.   A-46103834
...
3 FONT NATURA 8L
0,92
2,76
...
1 SIN LACT DESN NAT
1,35
...
1 SANDIA BAJA SEMILLAS
7,625 kg
0,55 €/kg
4,19
...

This print provided me the keys to write the core of this "extractor":

I can split the data using the NewLine character (\n).
I can get the date from the receipt.
I can remove the first elements, maybe 11.
I can remove the last elements from "TOTAL (€)" to the end.
There are at least three cases of data, for now.
- Quantity (in Spanish "cantidad") is greater than 1, so there will a unit price.
- Quantity is equal to 1.
- Quantity is equal to 1, but there is a €/kg.

The core

Handling the split is so easy in Python, with just this line:

content = content.split('\n')

Get the date from the receipt. This line is a big hardcore and maybe produce troubles in future. What would happen if the index changed?, or if there were spaces?

date = content[5]

Remove any unnecessary data. Hardcode the first part.

content = content[11:content.index("TOTAL (€)")]

With all these lines, we got the following output:

doc = pymupdf.open("./20250809 Mercadona 122,05 €.pdf")
for page in doc:
    content = page.get_text()
    content = content.split('\n')
    date = content[5]
    content = content[11:content.index("TOTAL (€)")]
    print(content)

['3 FONT NATURA 8L', '0,92', '2,76', '1 GELLY 0% SABORES', '1,00', '1 CABALLA FTE OLIV BPK', '2,20', '1 SIN LACT DESN NAT', '1,35', '1 HIERBABUENA 40 GR', '1,35', '1 FIDEO CABELLO', '0,80', '1 NUEZ NATURAL', '2,70', '1 JAMON COC.BRASEADO', '2,25', '1 APIO VERDE', '0,99', '1 COLG. TRIPLE ACCION', '1,95', '1 FILETE MELVA OLIVA', '1,85', '1 HARINA PARA BIZCOCHO', '1,05', '1 ATUN CLARO NATURAL', '4,20', '1 FILETE CABEZA LOMO', '4,53', '1 HARINA DE PIZZA', '1,45', '1 COPOS DE AVENA', '1,35', '1 QUESO FRESCO', '2,30', '1 MANTEQUILLA S/LACTOS', '2,65', '3 ARROZ LARGO', '1,35', '4,05', '1 SAL FINA', '0,40', '1 ACEITUNA S/HUESO P3', '1,70', '1 LENTEJA PARDINA', '2,10', '1 COSTILLA CHURRASCO', '4,21', '1 SALMOREJO FRESCO', '2,90', '2 GARBANZO M.COCIDO', '0,75', '1,50', '1 24 HUEVOS FRESCOS', '4,75', '1 ESPINACAS', '1,24', '1 PAPEL HUMEDO WC', '1,55', '1 POLLO ENTERO LIMPIO', '7,88', '1 LEJIA DETERG.LIMON', '1,15', '1 PIMIENTA NEGRA MOLID', '1,30', '1 PROT.NORMAL PLEGADO', '1,10', '1 VELA TALCO', '1,85', '2 ALUBIA BLANCA COCIDA', '0,80', '1,60', '1 CANELA RAMA', '1,50', '1 VINAGRE LIMPIEZA', '0,90', '1 HARINAMAIZ PRECOCIDA', '1,65', '1 ICEBERG', '1,03', '2 PAN CENTENO 51 %', '1,47', '2,94', '1 CEPILLO + ACTION', '1,50', '1 MAIZ DULCE PACK-3', '1,60', '1 IMPULSOR ROYAL 80GR', '2,25', '1 HIGIENICO DOBLE ROLL', '4,50', '1 PERCARBONATO BLANQ.', '1,65', '2 LECHE SEMI', '0,88', '1,76', '1 CURCUMA', '1,35', '1 40 B.CIERRA FÁCIL', '1,65', '1 C. FRESA EXTRA 0%', '1,60', '1 GEL CREMA', '3,95', '1 ROLLO COCINA JUMBO', '3,65', '1 INFUSION JENGIBRE', '1,35', '1 DEO CREMA', '1,60', '1 FRESH 0%ALCOHOL ENJ.', '1,35', '1 QUESO RALLADO S/LACT', '1,80', '1 SANDIA BAJA SEMILLAS', '7,625 kg', '0,55 €/kg', '4,19', '1 LIMON', '0,436 kg', '3,15 €/kg', '1,37', '1 BANANA', '0,644 kg', '1,40 €/kg', '0,90']

Now, in pursuit of using Pandas, I have considered the following columns for splitting the data:

File

Date

Quantity

Description

Unit Price

Total Price

One row per product in a large DataFrame.

The param that lets me handle different cases is Quantity with a regular expression (needs the re library) and Python operators. This gives me Quantity and Description. This allows me to differentiate between the three cases listed above.

import re
Quantity = re.search(r'^\d\s', content[0]).group().replace(" ", "")
Description = content[0][len(Quantity)+1:]
print(content[0])
print(Quantity)
print(Description)

3 FONT NATURA 8L
3
FONT NATURA 8L

To handle these three cases, I analysed the possible descriptions.

If Quantity is greater than 1, there will be three elements:
Description (previously split), Unit Price as element in the list and Total Price.
If Quantity is equal to 1, there are two options:
1. It is a type of fruit or vegetable with a unit price in euros per kilogramme. In this case, it will have four elements:
  Description, kilograms, euro per kilogram and total price.
2. It has one element so the total price is the unit price. It will have two elements::
  Description and Total Price.

This scheme, plus Pandas, forms the basis of the extractor.

import pandas as pd

df = pd.DataFrame()

doc = pymupdf.open("./20250809 Mercadona 122,05 €.pdf")
for page in doc:
    content = page.get_text()
    content = content.split('\n')
    Date = content[5]
    content = content[11:content.index("TOTAL (€)")]

    index = 0
    while index < len(content):

        Quantity = re.search(r'^\d\s', content[index]).group().replace(" ", "")
        Description = content[index][len(Quantity)+1:]

        if int(Quantity) > 1:
            UnitPrice = content[index+1]
            TotalPrice = content[index+2]
            index += 3
        else:
            regexp = re.compile(r'\d\skg$')
            if regexp.search(content[index+1]):
                UnitPrice = content[index+2]
                TotalPrice = content[index+3]
                index += 4
            else:
                UnitPrice = content[index+1]
                TotalPrice = content[index+1]
                index += 2
        row = pd.DataFrame({
                'File': [doc],
                'Date': [Date],
                'Quantity': [Quantity],
                'Description': [Description],
                'Unit Price': [UnitPrice.replace(" ", "").replace("€/kg", "")],
                'Total Price': [TotalPrice.replace(" ", "")]
            })
        df = pd.concat([df, row], ignore_index=True)

print(df.loc[56,:])

File           (page 0 of ./20250809 Mercadona 122,05 €.pdf)
Date                                       09/08/2025 14:43 
Quantity                                                   1
Description                                           BANANA
Unit Price                                              1,40
Total Price                                             0,90
Name: 56, dtype: object

Improves on core.

Now that I have analysed one file, I just have to do 147 more. Here is the full code.
Modifications:

Add import glob: For locating and handling multiple files based on their paths efficiently.
Add seaborn: A Python data visualization library based on matplotlib
Add matplotlib: Is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Added a loop iteration beetwen receipts.
Remove space in Date.
Change 'File': [doc] by 'File': [CurrentFile]
Reset UnitPrice each time start a new product.
Some tickets have an special way to write "FISH" (PESCADO in spanish) so they should treated different.
Split the code into functions just for a better reading.

The next block of code is main.py. This file uses three functions to process the data: ExtractData as I showed above and two more to do some parsing in a good format: CleanData and SummaryByMonth. Finally I plot all in xkcd format.

import pymupdf
import re
import pandas as pd
import glob
import matplotlib.pyplot as plt
import seaborn as sns

image = "accounting-post-files/butterfly_effect_detected.svg"

summary = SummaryByMonth(CleanData(ExtractData("accounting-post-files/tickets/*.pdf")))
summary["Month"] = summary["Month"].astype(str)
summary["Total Expense"] = pd.to_numeric(summary["Total Expense"], errors="coerce")

with plt.xkcd():
    fig, ax = plt.subplots(figsize=(10, 5))
    fig.subplots_adjust(bottom=0.2)

    for spine in ("top", "right"):
        ax.spines[spine].set_visible(False)

    sns.lineplot(data=summary, x="Month", y="Total Expense", marker="o", ax=ax)

    ax.set_xlabel("")
    ax.set_ylabel("EUROS (€)")
    plt.xticks(rotation=45, fontsize=8, ha="right")

    annotate_event('FIRST BUTTERFLY EFFECT\n(I started working)', 0, 5)
    annotate_event('SECOND BUTTERFLY EFFECT\n(I got my license)', 15, 20)

    fig.text(0.5, 0.05, 'MONTH', ha='center')
    fig.suptitle("TOTAL EXPENSES")

    plt.savefig(image)
    print(image)

Final thoughts

This post has resulted in many things. First of all, this is my first published post. I am very happy about this milestone because I have lots of drafts but have never had the courage to publish any of them.

As for my thoughts on this mini-project… It was a good evening writing Python code. The results were, in a way, truly revealing. I mean, the first butterfly effect in my finances was starting to work, because before that I had never had this type of expense.

However, the second butterfly effect came as a revelation to me. How can a car change your life? I only got my licence recently, so why am I spending more? I have no evidence of this, but it could be a correlation. It makes sense to me when you consider that with a car, it's easier to buy food because you don't have to carry it, and if the purchase depends on its weight, you'll think twice about how to carry it.

Regarding the project, I would like to continue. Maybe a second post could enhance this analysis with more statistics, and why not, a new parameter such as location? I would like to publish the experiment as a tool for the community — maybe it could publish it in social network to generate some traffic.

Without further delay… Thanks for reading.

Ghostbusters/Transcript Modified
Are you troubled by strange noises in the middle of the night?
Do you experience feelings of dread in your basement or attic?
Have you or any of your family ever seen a spook, specter or ghost?
If the answer is yes, then don't wait another minute. Pick up your phone and call the professionals.

Contact at me at rhyloo dot com!