From arXiv to Inbox in 8 Stages: Automating a Weekly AI Podcast

What It Does

Every Wednesday and Saturday at 10 AM, a Windows batch script wakes up on my machine, pulls the latest code from main, and kicks off an 8-stage pipeline. No human involvement. By the time I check my email, a new podcast episode exists on the site, subscribers have been notified, and the Netlify deploy has already finished.

The pipeline fetches 30 recent AI papers from arXiv, has Gemini 2.0 Flash rank them for podcast potential, launches a headless-ish Chrome to drive NotebookLM's web UI, downloads the generated audio, uploads it to Google Drive, generates written content, commits the episode to the repository, pushes to main to trigger a Netlify deploy, and emails every subscriber a notification with a "Listen Now" button.

Here is what the console output looks like on a successful run:

The whole thing takes about 25 minutes. Most of that is NotebookLM generating audio. The rest is fast.

This post walks through every stage, with real code from the production pipeline, including the bugs that took longer to fix than any of the AI integrations.

The 8 Stages

Stage 1: arXiv Fetch

The pipeline starts by pulling 30 recent papers from the cs.AI category on arXiv. The API is free, XML-based, and surprisingly reliable.

The query URL ends up being: http://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=30&start=0

One thing worth noting: the API occasionally fails silently, returning valid XML with zero entries. The pipeline has a retry with a 5-second backoff, plus a clean exit if zero papers come back. No sense generating a podcast about nothing.
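A minimal sketch of this stage, assuming the HTTP call is injected so the retry logic can be exercised without the network. The function names and the `fetchText` parameter are illustrative, not the production code; only the URL shape, the single retry, and the 5-second backoff come from the description above.

```typescript
const ARXIV_API = "http://export.arxiv.org/api/query";

// Build the query URL by hand so "cat:cs.AI" keeps its literal colon.
function buildArxivUrl(category: string, maxResults: number): string {
  const params = [
    `search_query=cat:${category}`,
    "sortBy=submittedDate",
    "sortOrder=descending",
    `max_results=${maxResults}`,
    "start=0",
  ];
  return `${ARXIV_API}?${params.join("&")}`;
}

// Count Atom <entry> elements; zero means the "silent failure" case.
function countEntries(xml: string): number {
  return (xml.match(/<entry>/g) ?? []).length;
}

// Returns the feed XML, or null if every attempt came back empty.
async function fetchFeedWithRetry(
  url: string,
  fetchText: (url: string) => Promise<string>, // hypothetical injected fetcher
  retries = 1,
  backoffMs = 5000,
): Promise<string | null> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const xml = await fetchText(url);
    if (countEntries(xml) > 0) return xml;
    if (attempt < retries) {
      await new Promise<void>((resolve) => setTimeout(resolve, backoffMs));
    }
  }
  return null;
}
```

A caller would treat a `null` result as the clean-exit case and stop before any later stage runs.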

Stage 2: Gemini 2.0 Flash Ranking

Thirty papers is too many for a podcast episode. The pipeline sends all of them to Gemini 2.0 Flash with a structured prompt that evaluates each paper on four criteria: novelty, business impact, talkability, and breadth of interest.

Gemini returns a JSON array. The pipeline parses it, maps indices back to the full paper objects, sorts by score, and takes the top 3. If the JSON parsing fails (it does, maybe 5% of the time), there is a fallback that just takes the first three papers with a default score of 5.

Before parsing, the pipeline strips markdown code fences from the response with a chain of regex .replace() calls. This is necessary because Gemini sometimes wraps its JSON output in triple backticks even when you explicitly tell it not to. This is one of those patterns you learn to code around defensively.
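The fence stripping and the fallback can be sketched together as follows. The response shape (`index` and `score` fields) and all names here are assumptions for illustration, since the real prompt and schema are not shown in this post.

```typescript
// Hypothetical shapes for illustration only.
interface Paper { title: string; url: string }
interface RankedPaper { paper: Paper; score: number }

// Built from a single backtick so this listing never contains a literal fence.
const FENCE = "`".repeat(3);

// Remove opening fences (with or without a "json" tag) and closing fences.
function stripCodeFences(raw: string): string {
  return raw.replace(new RegExp(FENCE + "(?:json)?\\s*", "g"), "").trim();
}

function rankTopPapers(raw: string, papers: Paper[], topN = 3): RankedPaper[] {
  try {
    const parsed: unknown = JSON.parse(stripCodeFences(raw));
    if (!Array.isArray(parsed)) throw new Error("expected a JSON array");
    return (parsed as { index: number; score: number }[])
      // Map each index back to its paper, dropping indices that do not exist.
      .flatMap((r) => {
        const paper = papers[r.index];
        return paper ? [{ paper, score: r.score }] : [];
      })
      .sort((a, b) => b.score - a.score)
      .slice(0, topN);
  } catch {
    // Fallback: first N papers with a default score of 5.
    return papers.slice(0, topN).map((paper) => ({ paper, score: 5 }));
  }
}
```

Note that this sketch only handles the code-fence case. Trailing commas or comments in the response would still fail `JSON.parse` and land in the fallback branch.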

Stage 3: NotebookLM Browser Automation

This is the most interesting stage and the most fragile. NotebookLM does not have a public API. To generate podcast audio, the pipeline launches a real Chrome browser using Playwright, drives the NotebookLM web UI, and downloads the result.

The key detail is launchPersistentContext with a dedicated user data directory at ~/.notebooklm-automation. This directory holds a Chrome profile where Google is already authenticated. The first time you run the script, you log in manually. After that, the session persists.

The automation sequence is: navigate to NotebookLM, create a new notebook, click "Websites" in the source panel, paste all arXiv URLs into the textarea, click "Insert", wait for sources to process, then click "Customize Audio Overview", select "Deep Dive" format with "Long" duration, fill in the episode focus prompt, and click "Generate".

Then you wait. Audio generation takes 8-15 minutes. The pipeline polls every 15 seconds, checking for a Play button or an <audio> element in the DOM.
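The polling loop itself is simple. Here is a sketch of a generic helper, assuming the readiness check is passed in as a function; the name `pollUntil` and the 20-minute ceiling are my assumptions here, while the 15-second interval comes from the description above.

```typescript
// Poll `check` every `intervalMs` until it returns true or the time budget runs out.
async function pollUntil(
  check: () => Promise<boolean>,
  intervalMs = 15_000,
  timeoutMs = 20 * 60_000, // assumed ceiling, a bit above the 8-15 minute range
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return true;
    // Give up if the next poll would land past the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With Playwright, the check could be something like `async () => (await page.locator("audio").count()) > 0`, combined with a similar query for the Play button.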

Downloading the audio is its own adventure. You cannot just fetch the audio URL — it is served from googlevideo.com with browser cookies and authentication. The pipeline intercepts the response body directly when the Play button is clicked, captures the raw audio buffer from the Playwright response event, and falls back to a "More" menu download if interception fails.

The failure mode here is real: if Chrome is already open with the same user data directory, Playwright cannot launch a persistent context. The pipeline crashes immediately. This means if I forget to close Chrome before the scheduled run, the episode does not get generated. I have not automated closing Chrome because that would kill whatever I am actually working on.

Beyond audio, the pipeline also generates a video overview, a briefing doc report, a mind map screenshot, and a quiz — all through the same browser automation session. Each artifact is generated sequentially, extracted from the UI, and passed to later stages.

Stages 4-5: Google Drive Upload

The audio MP3 (typically 15-20MB) and optional video MP4 need to be hosted somewhere accessible. The pipeline uploads them to a Google Drive folder using the Drive API with OAuth2.

Notice the return value: /media/${fileId}, not a Google Drive URL. This is important. Google Drive direct download links come with CORS restrictions, Content Security Policy issues, and redirect chains that break <audio> and <video> elements in the browser. Every browser handles them differently. I spent two days fighting with this before building a server-side proxy.

The Media Proxy

The /media/[id] route is a Next.js API route that proxies requests through to Google Drive. It solves every browser compatibility problem in one place.

Three details here that each took their own debugging session:

confirm=t: Google Drive shows a virus scan warning page for files over ~100MB. Without this parameter, you get an HTML page instead of audio data. The proxy checks for content-type: text/html in the response and returns a 404 if Drive sends back HTML instead of audio.
Accept-Ranges: bytes: Without this header, the browser's audio player cannot seek. It will play from the beginning, but clicking to skip forward does nothing. The Range header forwarding enables proper seeking by passing the browser's byte range requests through to Drive.
Cache-Control with s-maxage: This caches the audio at the Netlify CDN edge for a week. Without it, every play request hits Google Drive directly, which is slow and counts against API quotas.
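The three details above can be sketched as small pure helpers that a route handler would call. This is not the production route: the Drive URL pattern, the helper names, and the exact Cache-Control string are assumptions, and the Next.js handler wiring is left out.

```typescript
type HeaderMap = Record<string, string>;

const ONE_WEEK_SECONDS = 7 * 24 * 60 * 60;

// Assumed Drive download URL shape; confirm=t skips the virus-scan interstitial.
function driveDownloadUrl(fileId: string): string {
  return `https://drive.google.com/uc?export=download&id=${encodeURIComponent(fileId)}&confirm=t`;
}

// Forward the browser's Range header, if present, so Drive can return partial content.
function upstreamHeaders(rangeHeader: string | null): HeaderMap {
  return rangeHeader ? { Range: rangeHeader } : {};
}

// Drive answers with an HTML page instead of media when the download is blocked.
function isHtmlResponse(contentType: string | null): boolean {
  return (contentType ?? "").toLowerCase().includes("text/html");
}

// Headers sent back to the browser: seekable, and cacheable at the CDN edge for a week.
function proxyResponseHeaders(contentType: string, contentRange: string | null): HeaderMap {
  const headers: HeaderMap = {
    "Content-Type": contentType,
    "Accept-Ranges": "bytes",
    "Cache-Control": `public, s-maxage=${ONE_WEEK_SECONDS}`,
  };
  if (contentRange) headers["Content-Range"] = contentRange;
  return headers;
}
```

In a handler built on these, the upstream status (200 or 206) would be passed through unchanged, and an HTML response would be turned into a 404.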

Stage 6: Gemini Content Generation

While NotebookLM handles the audio, Gemini 2.0 Flash generates the written content for the episode page: a title, summary, key insights, executive summary, and per-paper deep dives.

If NotebookLM's Briefing Doc report was successfully extracted in Stage 3, it replaces the Gemini-generated executive summary. The NotebookLM report tends to be better — more nuanced, better structured — because it had access to the full paper content through its source processing, not just the abstracts.

The output is an MDX file with YAML frontmatter containing the audio URL, paper metadata, key insights, and quiz data. This file goes into src/app/resources/episodes/ and is picked up by the Next.js build.
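To make the output format concrete, here is a sketch of rendering the frontmatter. The field names are hypothetical (the real schema also carries quiz data, omitted here), and the sketch leans on the fact that a JSON string literal is also a valid YAML double-quoted scalar, which avoids hand-rolled escaping.

```typescript
// Hypothetical subset of the episode metadata.
interface EpisodeMeta {
  title: string;
  audioUrl: string;
  papers: { title: string; url: string }[];
  keyInsights: string[];
}

// Render YAML frontmatter followed by the MDX body.
function renderEpisodeMdx(meta: EpisodeMeta, body: string): string {
  const lines = [
    "---",
    `title: ${JSON.stringify(meta.title)}`,
    `audioUrl: ${JSON.stringify(meta.audioUrl)}`,
    "papers:",
    ...meta.papers.flatMap((p) => [
      `  - title: ${JSON.stringify(p.title)}`,
      `    url: ${JSON.stringify(p.url)}`,
    ]),
    "keyInsights:",
    ...meta.keyInsights.map((k) => `  - ${JSON.stringify(k)}`),
    "---",
    "",
    body,
  ];
  return lines.join("\n");
}
```

Quoting every string matters more than it looks: a title containing a colon would otherwise be parsed as a nested YAML mapping.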

Stage 7: Git Commit and Push

The batch script handles this outside of the Node.js pipeline:

The push to main triggers a Netlify build and deploy. Within 2-3 minutes, the new episode is live on the site. No manual deploy step. No staging environment. The episode MDX file is the deployment artifact.

Stage 8: Email Notification

The final stage sends an email to every active subscriber. Subscribers come from Netlify form submissions — anyone who filled out the newsletter, contact, or appointment booking form on the site.

Each email gets a personalized unsubscribe link: https://agor.me/api/unsubscribe?email=user@example.com.
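A small sketch of building that link, with the helper name as my own placeholder. The one design choice worth showing is percent-encoding the address: characters like `+` are legal in emails but have a different meaning in a query string, so the raw form shown above is only safe for simple addresses.

```typescript
const SITE_ORIGIN = "https://agor.me";

// Build a per-subscriber unsubscribe URL with the email safely encoded.
function buildUnsubscribeUrl(email: string): string {
  return `${SITE_ORIGIN}/api/unsubscribe?email=${encodeURIComponent(email)}`;
}
```

The receiving route gets the original address back after standard query-string decoding.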

This stage had the most annoying bug of the entire project.

The Bug That Should Not Have Been Hard

The email module originally looked like this:

The transporter was created at module load time. In the pipeline script, dotenv.config() is called at the top of generate-resources.ts, but only after that file's imports have been evaluated. By the time dotenv.config() executes, the email module has already been imported and the transporter has already been created with undefined credentials. The fix: lazy evaluation.

The Proxy wrapper is there for backward compatibility — other parts of the codebase import transporter directly and call methods on it. The proxy intercepts every property access and creates a fresh transporter with the current (now-loaded) environment variables.
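The lazy-evaluation pattern can be sketched without any mail library. Here a tiny `Transporter` interface stands in for the real nodemailer object, and `SMTP_HOST` is an assumed variable name; the point is only that the environment is read inside the factory, at call time.

```typescript
// Stand-in for the real transporter so the sketch runs without dependencies.
interface Transporter {
  host: string | undefined;
  describe(): string;
}

function createTransporter(): Transporter {
  // Read at call time, not at import time (SMTP_HOST is a hypothetical name).
  const host = process.env.SMTP_HOST;
  return { host, describe: () => `smtp://${host ?? "unset"}` };
}

// Every property access builds a transporter from the current environment,
// so code that imports `transporter` directly keeps working unchanged.
const transporter = new Proxy({} as Transporter, {
  get(_target, prop) {
    const real = createTransporter();
    const value = Reflect.get(real, prop);
    return typeof value === "function" ? value.bind(real) : value;
  },
});
```

Creating a fresh object on each access is wasteful for a real SMTP connection pool; caching the first instance inside the factory would be a reasonable refinement once the loading order is fixed.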

This bug took 90 minutes to diagnose. The error message was "SMTP credentials incomplete. HOST=false, USER=false, PASS=false", which made it look like the .env.local file was missing or malformed. It was not. The credentials were there. They just were not loaded yet when the module initialized.

The Plumbing That Took Longer Than The AI

If you are building something like this, here is what will actually consume your time.

Windows Task Scheduler. The batch script is registered with schtasks. The tricky part is getting the environment right — Task Scheduler runs in a different session with different PATH and environment variables than your terminal. The batch file explicitly loads .env.local by parsing it line by line with a for /f loop:
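The production script does this parsing in batch. For readers who would rather see the logic than the for /f syntax, here is the same line-by-line idea in TypeScript. It is an illustration under my own assumptions (comment lines skipped, one pair of surrounding quotes stripped), not a port of the batch file.

```typescript
// Parse KEY=VALUE lines, ignoring blanks and # comments.
function parseEnvFile(text: string): Record<string, string> {
  const vars: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    // Split on the first "=" only, so values may contain "=" themselves.
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const first = value[0];
    const quoted = value.length >= 2 && (first === '"' || first === "'") && value.endsWith(first);
    if (quoted) value = value.slice(1, -1);
    vars[key] = value;
  }
  return vars;
}
```

Splitting on the first `=` only is the detail that tends to bite: URLs and base64 secrets routinely contain `=` in the value.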

Google OAuth2 refresh tokens. The Drive upload uses OAuth2 with a refresh token obtained through the OAuth2 Playground. Getting this right requires: creating OAuth credentials in Google Cloud Console, adding https://developers.google.com/oauthplayground as an authorized redirect URI, going to the playground, configuring it to use your own OAuth credentials (not Google's), authorizing Drive scopes, exchanging the authorization code for a refresh token, and storing that token in .env.local. If the token expires (it can, depending on the OAuth app's publishing status), the entire pipeline breaks silently: the upload returns a 401 and falls back to saving the MP3 locally in the repo, which bloats the git history.
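One way to make that failure loud instead of silent is a pre-flight token refresh before the upload starts. The sketch below is not part of the pipeline as described; it only builds the request for Google's token endpoint and classifies the response, with the function names as placeholders.

```typescript
const TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token";

// Form body for exchanging a refresh token for a new access token.
function buildRefreshBody(
  clientId: string,
  clientSecret: string,
  refreshToken: string,
): URLSearchParams {
  return new URLSearchParams({
    client_id: clientId,
    client_secret: clientSecret,
    refresh_token: refreshToken,
    grant_type: "refresh_token",
  });
}

// The token endpoint reports an expired or revoked refresh token as 400 invalid_grant.
function isRefreshTokenDead(status: number, body: { error?: string }): boolean {
  return status === 400 && body.error === "invalid_grant";
}
```

POSTing the body to `TOKEN_ENDPOINT` at the start of a run, and aborting when `isRefreshTokenDead` is true, would stop the run before the local-MP3 fallback can bloat the repository.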

The .env.local reconstruction problem. The production environment variables live in Netlify. The pipeline runs locally. These need to stay in sync. Netlify has netlify env:pull which creates a .env file, but it is named .env, not .env.local, and it contains variables for the Netlify Functions runtime that do not exist locally (like NETLIFY_ACCESS_TOKEN). I maintain .env.local manually, which is a maintenance burden I have not solved elegantly.

Log rotation. The batch script keeps the last 20 log files and deletes older ones. A small thing, but without it the logs directory grows indefinitely and you only discover it when your disk fills up three months later.
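The rotation rule is easy to express as a pure function. The production version lives in the batch script; this TypeScript sketch assumes log file names embed a sortable timestamp (the `run-YYYY-MM-DD.log` pattern in the usage example is hypothetical), so that sorting by name is sorting by age.

```typescript
// Return the file names that fall outside the newest `keep` logs.
function logsToDelete(fileNames: string[], keep = 20): string[] {
  // Copy before sorting so the caller's array is left untouched.
  const newestFirst = [...fileNames].sort().reverse();
  return newestFirst.slice(keep);
}
```

Given 25 files named run-2025-01-01.log through run-2025-01-25.log, the function returns the five oldest, and it returns an empty list when there are 20 or fewer.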

What Breaks (And How)

I want to be honest about the fragility here, because every automation blog post makes things sound more reliable than they are.

NotebookLM cannot launch if Chrome is already open. Playwright's persistent context mode locks the user data directory. If my browser is open with the same profile, the pipeline crashes at Stage 3 with a lock file error. Mitigation: the automation uses a separate profile directory (~/.notebooklm-automation), but if I ever opened Chrome with that profile manually and forgot to close it, the scheduled run fails.

NotebookLM's UI changes without warning. The automation uses CSS selectors and aria-label attributes to find buttons: button[aria-label="Customize Audio Overview"], button:has-text("Generate"), textarea[aria-label="What should the AI hosts focus on in this episode?"]. When Google updates the NotebookLM UI — and they do, regularly — these selectors break. There is no API contract. There is no changelog. The pipeline just fails and the log says "timeout waiting for selector."

Google Drive virus scan blocks large files. Files over roughly 100MB trigger a virus scan interstitial page. The confirm=t parameter bypasses it, but this is an undocumented behavior that could change at any time. If Google removes this bypass, every audio file over 100MB stops streaming through the proxy.

Module-level env var reads. The email bug I described is a pattern that appears everywhere in Node.js codebases. Any module that reads process.env at import time (instead of at call time) will silently get undefined values if it is imported before dotenv.config() runs. The fix is always the same: lazy evaluation. But you have to know to look for it.

Gemini JSON parsing. About 5% of the time, Gemini returns JSON wrapped in markdown code fences, or with trailing commas, or with comments. The pipeline strips code fences and has fallback content for every field, but occasionally the fallback produces a bland episode title like "AI Papers Weekly: Latest Research" instead of something engaging. Not a crash, but a quality degradation.

The Takeaway

Production AI pipelines are 20% AI and 80% plumbing.

The AI parts — calling the arXiv API, prompting Gemini, driving NotebookLM — are conceptually straightforward. You read the docs, you write the prompt, you parse the output. A competent developer can get each of these working in isolation in an afternoon.

The plumbing is where the value is. Range request proxying for audio seeking. OAuth2 refresh token lifecycle management. Environment variable loading order in module systems. Scheduled task execution on Windows. Git auto-commit without conflicts. Email delivery with personalized unsubscribe links. Log rotation. Error recovery. Fallback paths for every stage that can fail.

Anyone can call an API. Few can make it run unattended, twice a week, for months, producing something a subscriber actually wants to listen to.

The full pipeline code is in production at agor.me/resources. If you're building similar automation — let's talk.