The MP4 That Can't Stream: Why Your Browser Stutters (and How to Fix It)
The Problem: It Works Locally, But Not in the Browser
You download a video, open it in VLC — perfect. You drop it into a web page or a self-hosted streaming app, click play, and… nothing. Blank screen. Infinite spinner. Or a 2-second delay before playback starts, every single time.
The video file is valid. The browser supports H.264. The server is sending the right bytes. What gives?
The answer is hiding in the MP4 container’s internal structure. Most developers know MP4 is a “container format” — but few realize there are structural variants that look identical from the outside yet behave completely differently in a browser streaming context. Two particularly nasty variants are concatenated MP4s and fragmented MP4s.
A Quick Primer: How an MP4 Is Organized
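An MP4 file is built from ISO Base Media File Format (ISOBMFF) boxes (also called atoms). A conventional, non-fragmented MP4 contains at least these three boxes, shown here in web-friendly order (the moov box can also sit after mdat, at the end of the file):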
[ftyp] → [moov] → [mdat]
- ftyp: File type box. Declares the brand (e.g., mp42, isom).
- moov: Movie box, the index. Contains sample tables, track metadata, and byte offsets telling the decoder where each frame lives inside mdat.
- mdat: Media data box. Holds the actual audio and video frames.
The browser needs the moov box to know the video’s duration, dimensions, codec, and — crucially — where each frame is. That’s what enables seeking, time display, and the playhead.
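To make the box layout concrete, here is a minimal sketch of a top-level box walker. The function name iter_top_level_boxes is illustrative rather than part of any library, and the sketch assumes the standard ISOBMFF header layout: a 4-byte big-endian size, a 4-byte type, and an optional 64-bit size when the 32-bit size field equals 1.

```python
import os
import struct
from typing import Iterator, Tuple


def iter_top_level_boxes(path: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, offset, size) for each top-level box in the file."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset + 8 <= file_size:
            f.seek(offset)
            size, box_type = struct.unpack(">I4s", f.read(8))
            header_len = 8
            if size == 1:
                # A size of 1 means a 64-bit size follows the type field.
                ext = f.read(8)
                if len(ext) < 8:
                    break
                size = struct.unpack(">Q", ext)[0]
                header_len = 16
            elif size == 0:
                # A size of 0 means the box runs to the end of the file.
                size = file_size - offset
            if size < header_len:
                break  # corrupt header; stop rather than loop forever
            yield box_type.decode("latin-1"), offset, size
            offset += size
```

Iterating over iter_top_level_boxes("input.mp4") and printing each tuple shows at a glance whether moov comes before or after mdat, and whether any unexpected boxes follow.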
When streaming over HTTP (as every browser
Variant 1: Concatenated MP4s — The Frankenstein File
What It Looks Like
[ftyp] → [moov] → [mdat A] → [mdat B] → [mdat C]
A concatenated MP4 has multiple consecutive mdat boxes after a single moov. This typically happens when files are naively joined with cat, or when download tools (notably youtube-dl / yt-dlp after an interrupted and resumed download) write multiple media segments into one container.
The Root Cause
The moov box’s sample tables only reference the first mdat box. The second and third mdat boxes are orphaned — the decoder doesn’t know about them. The file is not technically malformed (it follows ISOBMFF grammar), but it’s semantically broken for any player that actually processes the sample table.
The Symptom
- Blank player: Browser shows a black screen or loading spinner forever.
- Only first N seconds play: Some players detect the first mdat, play that segment, then freeze.
- Works in VLC, fails in browser: VLC is forgiving and often brute-force scans the entire file to find media. Browsers follow the spec strictly and rely on the moov sample table.
Where to Look
Hex dump or binary scan. Look for the mdat ASCII marker after the moov box. If you see:
... mdat ... mdat ... mdat
consecutively (no moof or moov between them), you have a concatenated MP4. You can detect this programmatically by walking the box hierarchy:

    import os


    def _read_box(f):
        """Read one box header. Return (type, payload_length), or None at EOF or on a corrupt size."""
        header = f.read(8)
        if len(header) < 8:
            return None
        size = int.from_bytes(header[0:4], "big")
        box_type = header[4:8]
        header_len = 8
        if size == 1:
            # 64-bit "largesize" follows the type field (common for mdat over 4GB)
            ext = f.read(8)
            if len(ext) < 8:
                return None
            size = int.from_bytes(ext, "big")
            header_len = 16
        elif size == 0:
            # Box extends to the end of the file
            size = header_len + os.fstat(f.fileno()).st_size - f.tell()
        if size < header_len:
            return None  # corrupt size; stop instead of seeking backwards
        return box_type, size - header_len


    def is_concat_mp4(filepath: str) -> bool:
        with open(filepath, "rb") as f:
            # Validate ftyp header
            first = _read_box(f)
            if first is None or first[0] != b"ftyp":
                return False
            f.seek(first[1], 1)
            # Walk top-level boxes until moov has been skipped
            for box_type, payload in iter(lambda: _read_box(f), None):
                f.seek(payload, 1)
                if box_type == b"moov":
                    break
            else:
                return False  # reached EOF without finding moov
            # Count the consecutive mdat boxes that follow moov
            mdat_count = 0
            for box_type, payload in iter(lambda: _read_box(f), None):
                if box_type == b"mdat":
                    mdat_count += 1
                    f.seek(payload, 1)
                elif box_type in (b"moof", b"moov"):
                    return False  # fragmented, not concat
                else:
                    break
            return mdat_count > 1

Variant 2: Fragmented MP4s (FMP4) — The Segment Monster
What It Looks Like
[ftyp] → [moov] → [moof] → [mdat] → [moof] → [mdat] → [moof] → [mdat]
A fragmented MP4 (FMP4) breaks the media into Movie Fragments. Instead of one giant moov + one giant mdat, it stores a lightweight initial moov (often called the “movie header”) plus alternating moof (movie fragment) and mdat pairs. Each moof+mdat segment is self-contained: the moof describes how to decode the following mdat.
This is the standard segment format for MPEG-DASH and is also used by modern HLS (which historically relied on MPEG-TS segments), where segments need to be independently addressable for adaptive bitrate switching. But when you take a fragmented MP4 and serve it as a plain file over HTTP Range requests, things break.
The Root Cause
With fragmented MP4s, the browser doesn’t know the full duration or frame layout until it has parsed every moof box. The initial moov only contains the “skeleton” — track IDs, timescale — but not the sample table. The browser must:
- Download and parse the initial moov.
- Start downloading segments.
- Parse each moof to discover frame offsets.
- Only then can it report duration, enable seeking, and render frames.
For a 60-minute video, this means the browser might need to download and parse dozens or hundreds of moof boxes before it can build a complete picture. Every seek or load triggers this process anew.
The Symptom
- 1–2 second delay on every page load: The browser is parsing moof boxes before playback can begin.
- Seeking is sluggish or broken: After a seek, the browser must find the nearest moof+mdat pair’s keyframe.
- Duration shows “NaN” or “0:00” initially: The player can’t know the length until it has processed all moof boxes.
- Works fine in VLC: VLC pre-scans the entire file and builds an internal index. The browser spec doesn’t require this.
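The missing-duration symptom can be checked directly by reading the duration field stored in the movie header (mvhd) inside moov. The sketch below is illustrative only: the helper names find_box and mvhd_duration_seconds are hypothetical, the code works on bytes already held in memory (so pass it only a slice that contains moov), and it assumes the standard mvhd layout for versions 0 and 1. In a fragmented file this value may be zero or not meaningful, since the real timing lives in the fragments.

```python
import struct
from typing import Optional, Tuple


def find_box(data: bytes, box_type: bytes, start: int = 0,
             end: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return (payload_start, payload_end) of the first sibling box of box_type."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, typ = struct.unpack(">I4s", data[pos:pos + 8])
        header_len = 8
        if size == 1:  # 64-bit size follows the type field
            if pos + 16 > end:
                return None
            size = struct.unpack(">Q", data[pos + 8:pos + 16])[0]
            header_len = 16
        elif size == 0:  # box runs to the end of the enclosing range
            size = end - pos
        if size < header_len:
            return None  # corrupt size
        if typ == box_type:
            return pos + header_len, min(pos + size, end)
        pos += size
    return None


def mvhd_duration_seconds(data: bytes) -> Optional[float]:
    """Read duration / timescale from moov/mvhd; None if it cannot be found."""
    moov = find_box(data, b"moov")
    if moov is None:
        return None
    mvhd = find_box(data, b"mvhd", moov[0], moov[1])
    if mvhd is None:
        return None
    p, payload_end = mvhd
    if payload_end - p < 1:
        return None
    version = data[p]  # first byte of the FullBox header
    if version == 1:
        # version/flags (4) + creation (8) + modification (8), then u32 + u64
        if payload_end - p < 32:
            return None
        timescale, duration = struct.unpack(">IQ", data[p + 20:p + 32])
    else:
        # version/flags (4) + creation (4) + modification (4), then u32 + u32
        if payload_end - p < 20:
            return None
        timescale, duration = struct.unpack(">II", data[p + 12:p + 20])
    return duration / timescale if timescale else None
```

A result of 0.0 on a file that clearly contains minutes of video is a strong hint that the duration has to be reconstructed from the fragments.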
Where to Look
Open the file in a hex viewer or scan for the ASCII string moof:

    def is_fragmented_mp4(filepath: str) -> bool:
        # Quick heuristic: look for the ASCII bytes "moof" in the first 10MB.
        with open(filepath, "rb") as f:
            data = f.read(10 * 1024 * 1024)  # First 10MB
        return b"moof" in data

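In practice, detecting fragmented MP4s is simpler than detecting concatenated ones: check for the presence of moof in the file. Treat a raw substring match as a quick heuristic rather than proof, though. The four bytes moof can also occur by chance inside compressed media data (a false positive), and a scan limited to the first 10MB misses a file whose first fragment starts later (a false negative). Walking the top-level boxes avoids both problems.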
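For a stricter check than a substring match, a box walk can look for moof only at the top level of the file. This is a sketch under the same ISOBMFF header assumptions as the rest of the article; the function name has_top_level_moof is made up for illustration.

```python
import os
import struct


def has_top_level_moof(path: str) -> bool:
    """Return True if a moof box appears at the top level of the file."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset + 8 <= file_size:
            f.seek(offset)
            size, box_type = struct.unpack(">I4s", f.read(8))
            if box_type == b"moof":
                return True
            header_len = 8
            if size == 1:
                # 64-bit size follows the type field
                ext = f.read(8)
                if len(ext) < 8:
                    break
                size = struct.unpack(">Q", ext)[0]
                header_len = 16
            elif size == 0:
                break  # this box runs to EOF, so nothing can follow it
            if size < header_len:
                break  # corrupt header; stop rather than loop forever
            offset += size
    return False
```

Because the walker jumps from header to header and never reads box payloads, it stays fast on large files and ignores any moof bytes that happen to occur inside mdat.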
The Single Fix: ffmpeg -movflags faststart
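Both problems share a common solution: remux with faststart. Faststart (also called “Web Optimized” or “MOOV at front”) is applied as part of a stream-copy remux, and the combined operation does two things:
- Merges all media segments into a single mdat box, eliminating concatenation artifacts.
- Rewrites the sample table so the moov box is complete and placed at the beginning of the file (before the mdat), eliminating the fragmented structure.
Strictly speaking, FFmpeg's MP4 muxer writes the single mdat and the complete moov during the remux; the faststart flag then adds a second pass that moves the moov in front of the mdat.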
The command:
ffmpeg -i input.mp4 -c copy -movflags faststart -f mp4 -y output.mp4
Why This Works
| Flag | Effect |
|---|---|
| -c copy | Stream copies: no re-encoding, no quality loss, near-instant. |
| -movflags faststart | Repositions the moov box to the front of the file and rebuilds the sample table from all media data. |
| -f mp4 | Forces MP4 output format (ensures correct structure). |
| -y | Overwrite output without prompting. |
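If the remux needs to run from a script, the command above can be wrapped roughly as follows. This is a sketch, not a hardened tool: the function names and the ".part" temporary suffix are assumptions made for illustration, ffmpeg is assumed to be on PATH, and src and dst are assumed to be different files.

```python
import os
import subprocess
from typing import List


def build_remux_command(src: str, dst: str) -> List[str]:
    """Build the same ffmpeg invocation as shown above, as an argument list."""
    return ["ffmpeg", "-i", src, "-c", "copy",
            "-movflags", "faststart", "-f", "mp4", "-y", dst]


def remux_faststart(src: str, dst: str) -> None:
    """Remux src into dst, writing to a temporary file first."""
    tmp = dst + ".part"  # hypothetical suffix; -f mp4 makes the extension irrelevant
    try:
        # check=True raises CalledProcessError if ffmpeg exits non-zero
        subprocess.run(build_remux_command(src, tmp), check=True)
        os.replace(tmp, dst)  # only replace dst once ffmpeg has succeeded
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up a partial file after a failure
```

Writing to a temporary path first means a failed or interrupted remux never leaves a half-written file where the player expects a valid one.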
After remux:
[ftyp] → [moov] → [mdat (single, complete)]
The browser receives moov first, immediately knows the full structure, and can start playback instantly. Seeking works. Duration shows correctly. No more blank screen.
Why -c copy Matters
Re-encoding a 4GB video could take hours. Stream copying (-c copy) tells FFmpeg to copy the compressed bitstream verbatim: no decode, no re-encode. The operation is pure container surgery: it reads the existing frames and writes them into a new container with correct box structure. Because the work is limited by disk throughput rather than CPU, a 10GB video typically remuxes in seconds to a couple of minutes, depending on storage speed.
Verification
Always verify the output:
ffprobe -v quiet -print_format json -show_format -show_streams output.mp4
Check that:
- format.nb_streams equals the expected count (usually 2: video + audio).
- streams[].codec_type and codec_name match the original.
- The file opens instantaneously in a browser <video> element.
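The stream checks in this list can be automated by comparing the ffprobe JSON for the original and the remuxed file. The sketch below only parses JSON text that has already been captured from the ffprobe command above; the function names are illustrative, and the only fields it relies on are streams[].codec_type and streams[].codec_name.

```python
import json
from typing import List, Tuple


def stream_signature(ffprobe_json: str) -> List[Tuple[str, str]]:
    """Return a sorted list of (codec_type, codec_name) pairs from ffprobe JSON."""
    info = json.loads(ffprobe_json)
    return sorted(
        (s.get("codec_type", "?"), s.get("codec_name", "?"))
        for s in info.get("streams", [])
    )


def same_streams(original_json: str, remuxed_json: str) -> bool:
    """True if both files expose the same set of streams and codecs."""
    return stream_signature(original_json) == stream_signature(remuxed_json)
```

Sorting makes the comparison independent of stream order, which is a deliberate simplification: a stream-copy remux should not change codecs, so any difference here suggests a stream was dropped or altered.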
Suggested Image Description
A simplified ISOBMFF box diagram showing three MP4 structures side by side. Left: a normal MP4 with [ftyp] → [moov] → [mdat], clean and compact. Center: a concatenated MP4 with [ftyp] → [moov] → [mdat A] → [mdat B] → [mdat C], three mdat boxes in a row, the second and third orphaned. Right: a fragmented MP4 with [ftyp] → [moov] → [moof] → [mdat] → [moof] → [mdat], alternating moof/mdat pairs. Each moof box is highlighted in red to emphasize the overhead. Below all three, an arrow pointing to the fixed version: [moov at front] → [mdat, single]. Use a dark background with blue boxes for normal structure, red highlights for problematic elements, and green for the fixed output.
Summary
| Variant | Symptom | Cause | Detection |
|---|---|---|---|
| Concatenated MP4 | Blank player, only first N seconds play | Multiple orphaned mdat boxes; moov only references the first | Binary scan for consecutive mdat after moov |
| Fragmented MP4 | 1–2s startup delay, seeking broken | Alternating moof+mdat segments; browser must parse all moof boxes | Binary scan for moof in first 10MB |
Both are trivially fixed with a single FFmpeg command. No re-encoding needed. No quality loss. Just a few seconds of container surgery, and your browser gets the well-behaved MP4 it deserves.
Happy streaming 🎥