GlobeScraper | Jamie – Software Developer

By the numbers

Scale at a glance.

API routes

30+ Prisma models

Scraper sources

45+ CLI scripts

~35k listings scraped

50+ React components

970 schema lines

300+ GeoJSON districts

Features

What the platform actually does.

Rental marketplace

Search with filters (city, district, beds, type, price), pagination, image carousels, saved listings. Data from 6 active scraper sources across Cambodia.

Scraping pipeline

7 source adapters (Cheerio + Playwright for Cloudflare bypass). Parallel workers, atomic queue claiming, content fingerprinting, human-like pacing with jittered delays.

AI content engine

End-to-end article pipeline: competitor research via Serper.dev, gap analysis, Gemini 3 Flash generation, Imagen 4.0 images, auto-publish with SEO scoring.

Community

Public/private profiles, connections, DMs, meetups with RSVPs, trust panels, and a report/moderation system. Rate-limited via Upstash Redis.

Analytics heatmap

Interactive Leaflet map with 300+ Cambodia district boundaries. Daily/monthly price indices, KPI cards, trend charts, volatility analysis.

Email campaigns

9 block types, 5 template presets, AI content generation via Gemini, Resend integration with delivery tracking and scheduled cron delivery.

Rental pipeline

How data flows from source sites to the marketplace.

Discover

Crawl category pages from 6 sources, extract listing URLs, enqueue in ScrapeQueue.

Process

Atomic claiming, fetch + parse, upsert with content fingerprinting, geocoded titles.
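Content fingerprinting in the Process step could look roughly like the sketch below: hash a normalized view of the listing so that re-scrapes with only cosmetic differences can be detected as unchanged. The `ListingContent` shape, the field names, and the choice of SHA-256 are assumptions for illustration, not the project's actual schema or algorithm.

```typescript
import { createHash } from "node:crypto";

// Hypothetical listing shape; field names are illustrative, not the real schema.
interface ListingContent {
  title: string;
  description: string;
  priceUsd: number | null;
  bedrooms: number | null;
}

/** Collapses whitespace and lowercases so cosmetic edits do not change the hash. */
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

/** SHA-256 over the normalized fields, in a fixed order, as a hex string. */
function fingerprint(listing: ListingContent): string {
  const canonical = JSON.stringify([
    normalize(listing.title),
    normalize(listing.description),
    listing.priceUsd,
    listing.bedrooms,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

/** True when a re-scraped listing differs from the stored fingerprint. */
function hasChanged(storedFingerprint: string | null, scraped: ListingContent): boolean {
  return storedFingerprint !== fingerprint(scraped);
}
```

An upsert could then compare the stored fingerprint with the fresh one and only rewrite the row when `hasChanged` returns true.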

AI review

Gemini classifies residential vs non-residential, corrects types, rewrites descriptions.

Index

Daily aggregation: median, mean, p25, p75 by city, district, beds, and type.
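As an illustration of the Index step, the sketch below computes mean, median, p25, and p75 per (city, district, beds, type) group. The `PricedListing` shape and the linear-interpolation percentile method are assumptions; the source does not say how the percentiles are actually computed, and a production version might do this in SQL instead.

```typescript
// Hypothetical input shape; field names are illustrative only.
interface PricedListing {
  city: string;
  district: string;
  beds: number;
  type: string;
  priceUsd: number;
}

interface PriceStats {
  count: number;
  mean: number;
  p25: number;
  median: number;
  p75: number;
}

/** Percentile by linear interpolation; `sorted` must be ascending and non-empty. */
function percentile(sorted: number[], p: number): number {
  const rank = p * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  const weight = rank - lower;
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

/** Summary statistics for one group of prices, or null for an empty group. */
function summarize(prices: number[]): PriceStats | null {
  if (prices.length === 0) return null;
  const sorted = [...prices].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    count: sorted.length,
    mean,
    p25: percentile(sorted, 0.25),
    median: percentile(sorted, 0.5),
    p75: percentile(sorted, 0.75),
  };
}

/** Groups listings by city, district, beds, and type, then summarizes each group. */
function indexByGroup(listings: PricedListing[]): Map<string, PriceStats> {
  const groups = new Map<string, number[]>();
  for (const l of listings) {
    const key = [l.city, l.district, l.beds, l.type].join("|");
    const prices = groups.get(key) ?? [];
    prices.push(l.priceUsd);
    groups.set(key, prices);
  }
  const result = new Map<string, PriceStats>();
  for (const [key, prices] of groups) {
    const stats = summarize(prices);
    if (stats) result.set(key, stats);
  }
  return result;
}
```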

Stack

What powers each layer.

Framework

Next.js 14 (App Router)

Language

TypeScript 5.5

Database

MySQL + Prisma 5.18

Auth

Auth.js v5 (NextAuth)

AI

Gemini 3 Flash + Imagen 4.0

Scraping

Cheerio + Playwright

Rate limiting

Upstash Redis

Email

Resend + Vercel Cron

Maps

Leaflet + GeoJSON

Styling

Vanilla CSS (BEM)

Deploy

Vercel + Hetzner VPS

Testing

Vitest + Playwright

Decisions

Key architectural choices and why.

No Tailwind

All styling uses vanilla CSS with custom properties and BEM naming. Full control, no dependency bloat, easy to debug in DevTools.

Playwright for Cloudflare

Khmer24 blocks HTTP scrapers with Cloudflare WAF. Playwright with headless Chromium bypasses it. Other sources use lightweight Cheerio.

Human-like pacing

Jittered delays (1.2–2 s), random breathers every 40–70 pages, night-idle simulation, and a skip probability, all intended to reduce the risk of detection and bans.
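A minimal sketch of the delay and breather logic, using the ranges quoted above. The function names, the injectable `rng` parameter, and the 30-second breather length are assumptions for illustration; night idle and skip probability are left out to keep it short.

```typescript
// Hypothetical pacing helpers; names and structure are illustrative only.

/** Uniform random number in [min, max). `rng` is injectable for testing. */
function randomBetween(min: number, max: number, rng: () => number = Math.random): number {
  return min + rng() * (max - min);
}

/** Jittered per-page delay in milliseconds: 1.2 s up to 2 s. */
function pageDelayMs(rng: () => number = Math.random): number {
  return randomBetween(1200, 2000, rng);
}

/** Number of pages until the next longer breather: an integer in 40..70. */
function pagesUntilBreather(rng: () => number = Math.random): number {
  return 40 + Math.floor(rng() * 31);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Visits each URL with a jittered delay and an occasional longer pause. */
async function crawlPaced(
  urls: string[],
  visit: (url: string) => Promise<void>,
  breatherMs = 30_000, // assumed breather length; not stated in the source
): Promise<void> {
  let untilBreather = pagesUntilBreather();
  for (const url of urls) {
    await visit(url);
    untilBreather -= 1;
    if (untilBreather <= 0) {
      await sleep(breatherMs);
      untilBreather = pagesUntilBreather();
    } else {
      await sleep(pageDelayMs());
    }
  }
}
```

Drawing a fresh breather interval each time keeps the pause pattern from being periodic, which is the point of the jitter.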

Atomic queue claiming

Parallel workers claim batches via SQL UPDATE…LIMIT. No coordinator process, no locking conflicts, works with N workers.
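One way such a claim could be written is sketched below: each worker stamps up to N pending rows with a unique token in a single UPDATE, then reads back only the rows carrying its token. The `Db` interface, the column names (`status`, `claimedBy`, `claimedAt`), and the status values are assumptions; only the `ScrapeQueue` name and the UPDATE with LIMIT idea come from the text. MySQL accepts ORDER BY and LIMIT on single-table UPDATE statements.

```typescript
import { randomUUID } from "node:crypto";

// Minimal database interface assumed for this sketch; a real project would
// more likely go through Prisma's raw query helpers or a MySQL driver.
interface Db {
  /** Runs a write statement and resolves to the affected row count. */
  execute(sql: string, params: unknown[]): Promise<number>;
  query<T>(sql: string, params: unknown[]): Promise<T[]>;
}

interface QueueRow {
  id: number;
  url: string;
}

// Column names and status values are assumptions, not the real schema.
const CLAIM_SQL = `
  UPDATE ScrapeQueue
     SET status = 'PROCESSING', claimedBy = ?, claimedAt = NOW()
   WHERE status = 'PENDING'
   ORDER BY id
   LIMIT ?`;

const FETCH_SQL = `SELECT id, url FROM ScrapeQueue WHERE claimedBy = ?`;

/** Claims up to `batchSize` pending rows for this worker and returns them. */
async function claimBatch(db: Db, batchSize: number): Promise<QueueRow[]> {
  // A fresh token per claim means the follow-up SELECT sees only this batch.
  const token = randomUUID();
  const claimed = await db.execute(CLAIM_SQL, [token, batchSize]);
  if (claimed === 0) return [];
  return db.query<QueueRow>(FETCH_SQL, [token]);
}
```

Because the UPDATE is a single statement, two workers running it concurrently should not end up with the same row: whichever statement locks a row first flips its status, and the other no longer matches it.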

Gemini over GPT

Google Gemini 3 Flash is fast, cheap, and handles structured JSON output well. Used for both article generation and listing classification.
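When a model returns structured JSON, the calling code still has to validate it before trusting it. The sketch below shows one way to do that for listing classification; the `ListingReview` shape and its field names are hypothetical, since the actual prompt and response schema are not shown here, and no Gemini API call is made.

```typescript
// Hypothetical response shape; the real prompt and schema are not shown in the source.
interface ListingReview {
  isResidential: boolean;
  propertyType: string;
  description: string;
}

/** Parses a raw model response and returns null when it does not match the expected shape. */
function parseReview(raw: string): ListingReview | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const r = data as Record<string, unknown>;
  if (
    typeof r.isResidential !== "boolean" ||
    typeof r.propertyType !== "string" ||
    typeof r.description !== "string"
  ) {
    return null;
  }
  return {
    isResidential: r.isResidential,
    propertyType: r.propertyType,
    description: r.description,
  };
}
```

Returning null instead of throwing lets the caller decide whether to retry the request or leave the listing unreviewed.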

Hetzner for scrapers

Scraper scripts need Playwright (browser automation), which can't run on Vercel serverless functions. A Hetzner CX23 VPS runs the daily and weekly scrapes on a schedule.