Natural Unit Prices for Grocery Flyers: Why $/kg Hides What You Actually Pay
When your grocery price scraper normalises everything to $/kg, you lose the signal consumers actually see. Here is how Panier Futé learned to parse 473 ml, barquette, sac, and botte without hardcoding a lookup table.
When I started scraping Quebec grocery flyers for Panier Futé, the unit price problem looked trivial. Every flyer has a base price and a quantity. Normalise to $/kg and move on. That is what most grocery comparison sites do.
It is also wrong for the user.
A can of diced tomatoes costs $1.99 for 796 ml. Showing "$2.50/kg" is technically correct but meaningless to someone deciding whether that is a good deal. The shelf label at the store says "$1.99/796 ml." The consumer compares that against the brand next to it at $2.29 for the same 796 ml can. Nobody converts to per-kilo in their head.
Worse, the normalised price actively misleads across categories. A $4.99 bag of chips at 200 g works out to $24.95/kg, which makes it look absurdly expensive next to $2.50/kg tomatoes. But nobody substitutes chips for tomatoes. The comparison is useless.
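The arithmetic behind those two figures, as a minimal sketch. The helper name `price_per_kg` is hypothetical, and treating 796 ml of canned tomatoes as roughly 796 g is a simplifying assumption.

```python
def price_per_kg(price: float, grams: float) -> float:
    """Normalise a package price to dollars per kilogram."""
    return price / (grams / 1000)


chips = price_per_kg(4.99, 200)
tomatoes = price_per_kg(1.99, 796)  # assumes 1 ml weighs about 1 g

# prints: chips: $24.95/kg, tomatoes: $2.50/kg
print(f"chips: ${chips:.2f}/kg, tomatoes: ${tomatoes:.2f}/kg")
```

Both numbers are accurate, yet placing them side by side says nothing a shopper can act on.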
What the flyers actually say
Quebec grocery flyers (IGA, Metro, Super C, Maxi) use a baffling variety of packaging units:
- Weight/volume in parentheses: "1 ch (473 ml)", "3 barq (170 g)", "1 paquet (900 g)"
- Clean weight measures: "2.2 kg", "700 g", "1.5 L", "3 lb"
- Named units: "1 pot", "3 tetes", "2 boules", "1 botte", "1 contenant"
- Count-based: "3 unites", "12", "24"
- Compound: "1 sac (10 lb)", "1 caissette (12 x 355 ml)"
No single normalisation covers this. And French grocery terms add another layer: "barquette" (a tray of meat or berries), "botte" (a bunch of greens), "contenant" (container). These are not translatable to weight without knowing the product.
The rule cascade
The first attempt was a simple extraction: grab the leading number from the quantity string, divide the price, append "/unit." That worked for "3 unites" but fell apart immediately on "1 ch (473 ml)" because the count was 1 but the real unit was inside parentheses.
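A rough reconstruction of that first attempt, to make the failure concrete. The function name `naive_unit_price` and the sample prices are made up for illustration; only the quantity strings come from the flyers described above.

```python
import re
from typing import Optional


def naive_unit_price(price: float, qty: str) -> Optional[str]:
    """Divide the price by the leading number and ignore everything else."""
    m = re.match(r"\s*(\d+)", qty)
    if not m or int(m.group(1)) == 0:
        return None
    count = int(m.group(1))
    return f"${price / count:.2f}/unit"


print(naive_unit_price(5.97, "3 unites"))       # $1.99/unit
print(naive_unit_price(3.99, "1 ch (473 ml)"))  # $3.99/unit
```

The second result is the problem: the leading count is 1, so the 473 ml inside the parentheses never reaches the label.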
The fix was a priority-ordered cascade of regex patterns, each trying a more specific match before falling through:
    import re

    # The wrapper name and signature are illustrative; prix is the flyer price,
    # qty is the raw quantity string.
    def format_unit_price(prix, qty):
        # 1) Check parentheses: "1 ch (473 ml)", "3 barq (170 g)"
        m = re.search(r'\((\d+[.,]?\d*)\s*(ml|g|kg|oz|lb|l)\)', qty, re.IGNORECASE)
        if m:
            amount = m.group(1).replace(",", ".")
            unit = m.group(2).lower()
            # Show "$1.99/796 ml", not "$2.50/kg"
            return f"${prix:.2f}/{amount} {unit}"

        # 2) Direct weight: "2.2 kg", "700 g"
        m = re.match(r'(\d+[.,]?\d*)\s*(kg|lb|g|ml|l)\b', qty, re.IGNORECASE)
        if m:
            amount = float(m.group(1).replace(",", "."))
            unit = m.group(2).lower()
            # Items under 1 kg (or 1 L) show the total price per package
            if unit in ("g", "ml") and amount < 1000:
                return f"${prix:.2f}/{amount:.0f} {unit}"
            # Bulk items show the price per unit
            unit_price = prix / amount
            return f"${unit_price:.2f}/{unit}"

        # 3) Named units: "1 pot", "3 tetes", "1 botte"
        m = re.match(r'(\d+)\s+(\S+)', qty)
        if m:
            count = int(m.group(1))
            unit_word = m.group(2).rstrip("s")  # crude singular: "tetes" -> "tete"
            if count <= 1:  # also guards against dividing by a zero count
                return f"${prix:.2f}/{unit_word}"
            unit_price = prix / count
            return f"${unit_price:.2f}/{unit_word}"

        # Nothing matched (for example a bare "12"): the caller decides what to show
        return None
The key insight: never show a normalised unit when the shelf unit is right there. A 473 ml can of beer costs $3.99. Show "$3.99/473 ml," not "$8.44/L." The shopper recognises the can size instantly. The per-litre number triggers cognitive load and breeds distrust.
When normalisation still matters
The cascade falls back to $/kg or $/L only when the quantity contains no natural packaging unit at all. Two cases:
- Bulk items where the consumer does the dividing themselves: a 5 kg bag of flour. Nobody buys a 5 kg bag by the gram. Show "$0.78/kg" because the consumer compares it against the 2 kg bag at "$0.92/kg." The comparison happens in the same category, same product type.
- Commodities where per-unit IS the shelf unit: loose apples at $2.49/lb, or bulk oats at $1.99/kg. There is no packaging unit to preserve.
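A minimal sketch of that fallback for the flour case. The helper `normalised_price` and the `TO_BASE` table are hypothetical names, the package prices ($3.90 and $1.84) are back-computed from the per-kilo figures above rather than taken from a real flyer, and pounds are left out because loose produce is already priced per pound.

```python
import re
from typing import Optional

# Factor to convert each unit into the base unit shown to the shopper.
TO_BASE = {
    "g": (0.001, "kg"),
    "kg": (1.0, "kg"),
    "ml": (0.001, "L"),
    "l": (1.0, "L"),
}


def normalised_price(price: float, qty: str) -> Optional[str]:
    """Return a $/kg or $/L label for a bare weight or volume, else None."""
    m = re.fullmatch(r"\s*(\d+(?:[.,]\d+)?)\s*(kg|g|ml|l)\s*", qty, re.IGNORECASE)
    if not m:
        return None
    amount = float(m.group(1).replace(",", "."))
    if amount <= 0:
        return None
    factor, base = TO_BASE[m.group(2).lower()]
    return f"${price / (amount * factor):.2f}/{base}"


print(normalised_price(3.90, "5 kg"))  # $0.78/kg
print(normalised_price(1.84, "2 kg"))  # $0.92/kg
```

Because both bags are flour, the two labels are directly comparable, which is the only situation where the normalised number earns its place.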
What this taught me about data pipelines
Grocery flyer data is not clean. It enters the pipeline from structured JSON feeds, HTML scraping, and manual entry. Each source has its own convention for quantities. Trying to normalise at ingestion loses information you cannot reconstruct later.
The better approach: store the raw quantity string verbatim, then compute display prices at render time with a priority cascade. The raw string is the source of truth. If the cascade misses a case, you fix the renderer, not the data. You do not need a migration. You do not need to reprocess 10,000 historical items.
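One way this split could look, sketched under assumptions: `FlyerItem`, `render_price`, and the toy `per_pot` rule are hypothetical names, and the items and prices are invented. The point is only that the stored record never changes while the list of rules can grow.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class FlyerItem:
    name: str
    price: float
    raw_qty: str  # stored verbatim from the source, never rewritten


# A rule returns a display label, or None to let the next rule try.
Rule = Callable[[float, str], Optional[str]]


def render_price(item: FlyerItem, rules: Sequence[Rule]) -> str:
    """Apply rules in priority order at render time."""
    for rule in rules:
        label = rule(item.price, item.raw_qty)
        if label is not None:
            return label
    return f"${item.price:.2f}"  # no rule matched: show the plain price


def per_pot(price: float, qty: str) -> Optional[str]:
    """Toy rule that only understands the string "1 pot"."""
    return f"${price:.2f}/pot" if qty.strip() == "1 pot" else None


print(render_price(FlyerItem("yogourt", 3.49, "1 pot"), [per_pot]))  # $3.49/pot
print(render_price(FlyerItem("oeufs", 4.29, "12"), [per_pot]))       # $4.29
```

Covering a new quantity format then means appending a rule to the list; the stored rows stay exactly as they were ingested.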
This is a general principle that applies far beyond groceries. Any time you scrape or ingest human-facing data, store the original representation. Build your parser as a read-time function that can evolve independently of your storage schema. Your future self will thank you when the fifth grocery chain introduces "1 poche (2 x 1 kg)" and the old cascade breaks.
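As an illustration of how cheap such a fix can be when the raw string is still available, here is a sketch of a rule for compound quantities. The name `compound_price`, the sample prices, and the choice to display the multipack as "count x size" are all assumptions, not the rule Panier Futé ships.

```python
import re
from typing import Optional

# Matches "(12 x 355 ml)" or "(2 x 1 kg)" anywhere in the quantity string.
COMPOUND = re.compile(
    r"\((\d+)\s*x\s*(\d+(?:[.,]\d+)?)\s*(ml|g|kg|lb|l)\)",
    re.IGNORECASE,
)


def compound_price(price: float, qty: str) -> Optional[str]:
    """Label a multipack with its shelf description, or return None."""
    m = COMPOUND.search(qty)
    if not m:
        return None
    count = int(m.group(1))
    size = m.group(2).replace(",", ".")
    unit = m.group(3).lower()
    return f"${price:.2f}/{count} x {size} {unit}"


print(compound_price(19.99, "1 caissette (12 x 355 ml)"))  # $19.99/12 x 355 ml
print(compound_price(6.49, "1 poche (2 x 1 kg)"))          # $6.49/2 x 1 kg
print(compound_price(1.99, "1 ch (473 ml)"))               # None
```

Placed ahead of the other patterns, a rule like this would handle the new format without touching a single stored row.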
The bottom line
Unit prices exist to help consumers compare. If your normalisation makes comparison harder instead of easier, you are defeating the purpose. Show people the units they see on the shelf. Fall back to normalised units only when there is no shelf unit to preserve.
The cascade is three regex patterns and sixty lines of Python. It replaced a single $/kg line that was technically correct and practically useless.