Porch Pirates vs. Delivery Drivers: Building a Zero-Shot Video Classifier That Actually Works
How we went from 0% accuracy to 100% by throwing away action recognition and detecting state change instead.
The Problem
You have a Ring-style porch camera. It records 24/7. You want to know two things:
- Delivery: Someone placed a package on my porch.
- Theft: Someone took a package from my porch.
Sounds simple. It isn’t. The visual difference between “placing a package down” and “picking a package up” in any given video frame is essentially zero. A person bent over a box on a porch looks the same whether they’re dropping it off or stealing it. Every model we tried got this wrong — until we stopped trying to recognize actions and started detecting state change.
This post walks through the full journey: what we tried, why it failed, the insight that fixed it, and where the approach breaks down.
Table of Contents
- From Stream to Clip: The Upstream Problem
- Attempt 1: V-JEPA2 — Supervised Action Classes
- Attempt 2: X-CLIP — Zero-Shot Video-Text Matching
- The Insight: State Change, Not Action Recognition
- The Working System: Two-Tier State-Change Detection
- Test Results
- Where This Breaks Down & What Comes Next
1. From Stream to Clip: The Upstream Problem
Before any classifier sees a frame, something has to decide when to look. A porch camera generates ~2.5 million frames per day at 30fps. You can’t run a vision-language model on every one of them.
The production architecture for Project Argus uses a two-stage pipeline:
Stage 1: Motion Detection + Lightweight Object Detection
The camera’s continuous RTSP stream feeds into a circular buffer (last 10 seconds at 30fps = 300 frames). A lightweight edge model runs on every frame:
RTSP Stream
|
v
[Frame Buffer: 300 frames, circular]
|
v
[Stage 1: YOLO-NAS / MobileNet-SSD] <-- runs on every frame, ~5ms/frame
|
+-- No person detected --> discard, keep buffering
|
+-- Person detected on porch -->
|
v
[Wait for person to exit frame or N seconds timeout]
|
v
[Extract the buffered clip]
|
v
[Stage 2: Video Classifier] <-- runs on ~10sec clips, ~7s/clip
Why buffer before the trigger? Because by the time you detect a person, they may already be mid-action. The circular buffer ensures you have the 5-10 seconds of footage leading up to the detection, which often contains the critical “before” state (empty porch vs. package already there).
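To make that concrete, here is a minimal sketch of the pre-trigger buffer using Python's collections.deque. The callback shape and names are illustrative assumptions, not the Argus implementation:

# Minimal sketch of the pre-trigger circular buffer (names are illustrative)
from collections import deque

FPS = 30
BUFFER_SECONDS = 10
buffer = deque(maxlen=FPS * BUFFER_SECONDS)  # last 300 frames; oldest drop off

def on_frame(frame, person_detected: bool):
    buffer.append(frame)
    if person_detected:
        # The snapshot contains the ~10s leading up to the detection,
        # which usually includes the "before" state of the porch
        return list(buffer)
    return None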
What makes a good clip? The classifier needs to see a state transition. The ideal clip:
- Starts with the “before” state (empty porch or package on porch)
- Shows a person interacting with the scene
- Ends with the “after” state (package on porch or empty porch)
A 10-second window around the person-detection trigger typically captures this. If the person is on-screen for longer, you can take the first and last 2-3 seconds and skip the middle — the classifier only cares about the endpoints.
Stage 1 Model Choices
| Model | Size | Speed (M3) | Notes |
|---|---|---|---|
| YOLO-NAS Nano | 3.5M params | ~3ms/frame | Best accuracy/speed ratio for edge |
| MobileNet-SSD v2 | 4.3M params | ~5ms/frame | Solid fallback, well-supported |
| YOLO11n | 2.6M params | ~2ms/frame | Ultralytics, easiest to deploy |
The key constraint: Stage 1 must be fast enough to run on every frame and sensitive enough to never miss a person. False positives are fine (the classifier will reject them). False negatives mean missed thefts.
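As a sketch of what Stage 1 can look like with the last option in the table, here is a person-presence trigger built on Ultralytics YOLO11n; the confidence threshold is an assumption, deliberately low to favor recall:

# Sketch of a Stage 1 trigger with Ultralytics YOLO11n. The low confidence
# threshold is an assumption: false positives are cheap, missed people are not.
from ultralytics import YOLO

detector = YOLO("yolo11n.pt")

def person_on_porch(frame) -> bool:
    # classes=[0] keeps only the COCO "person" class
    results = detector(frame, classes=[0], conf=0.25, verbose=False)
    return len(results[0].boxes) > 0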
For the rest of this post, we assume Stage 1 has done its job and handed us a 10-second clip to classify.
2. Attempt 1: V-JEPA2 — Supervised Action Classes
Our first attempt used Meta’s V-JEPA2, a state-of-the-art video understanding model.
The Idea
V-JEPA2 (facebook/vjepa2-vitg-fpc64-384-ssv2) is a ViT-Giant model fine-tuned on Something-Something v2 (SSv2), a dataset of 174 fine-grained temporal action classes like “Putting something onto something” and “Taking something from somewhere.” The plan: classify the video into SSv2 classes, then map keywords to security actions.
# The V-JEPA2 approach (simplified)
import torch
from transformers import AutoModelForVideoClassification

device = "mps" if torch.backends.mps.is_available() else "cpu"

MODEL = AutoModelForVideoClassification.from_pretrained(
    "facebook/vjepa2-vitg-fpc64-384-ssv2",
    torch_dtype=torch.float16,  # FP16: see Problem 1 below
).eval().to(device)

# ... run inference, get predicted SSv2 class label ...
predicted_label = MODEL.config.id2label[predicted_idx].lower()

# Brittle keyword mapping
if "putting" in predicted_label or "dropping" in predicted_label:
    print("=> Delivery Detected")
elif "taking" in predicted_label or "picking" in predicted_label:
    print("=> Theft Detected")
Why It Failed
Problem 1: The model is enormous. ViT-Giant has over 1 billion parameters. On an M3 MacBook with MPS, it barely fits in memory, inference is slow, and FP16 on MPS produces numerical garbage. This isn’t an edge-deployable model.
Problem 2: SSv2 classes don’t map to security events. SSv2 was designed for fine-grained manipulation tasks filmed on tabletops. Its classes include things like “Pushing something from left to right” and “Pretending to put something behind something.” These are the wrong ontology for porch camera footage. The keyword mapping ("putting" in label) is fragile — the model might predict “Putting something on a surface” for a theft if the person briefly sets the package down before walking away.
Problem 3: It’s a closed-vocabulary classifier. V-JEPA2 can only output one of 174 SSv2 labels. It has no concept of “porch,” “package,” “delivery,” or “theft.” You’re forcing a square peg into a round hole.
Verdict: Wrong model for the job. Massive, slow, and the class ontology doesn’t match the domain.
3. Attempt 2: X-CLIP — Zero-Shot Video-Text Matching
The second attempt used Microsoft’s X-CLIP, a CLIP-based model with temporal attention across video frames.
The Idea
X-CLIP (microsoft/xclip-base-patch16-zero-shot) extends CLIP to video by adding cross-frame attention. Unlike V-JEPA2, it’s zero-shot — you provide natural language descriptions of what you’re looking for, and the model scores the video against each description.
# The X-CLIP approach (simplified)
import torch
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("microsoft/xclip-base-patch16-zero-shot")
model = AutoModel.from_pretrained("microsoft/xclip-base-patch16-zero-shot").eval()

labels = [
    "a person dropping off a package to the porch",
    "a person picking up a package from the porch",
    "an empty front porch with no activity",
]

# video_frames: frames sampled from the clip (see Problem 1 for the catch)
inputs = processor(text=labels, videos=video_frames, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_video.softmax(dim=-1)
Why It Failed
Problem 1: API breakage. In Transformers 5.x, the X-CLIP processor silently ignores the videos= kwarg and only produces pixel_values through images=. The code runs without error but feeds no visual data to the model. A subtle, infuriating bug.
Problem 2: FP16 on MPS. Originally we used torch.float16, which produces silent numerical corruption on Apple Silicon’s MPS backend. Scores come out as garbage.
Problem 3 (the fundamental one): Temporal attention doesn’t capture directionality. We tested X-CLIP with the correct API (images=, FP32) and it still failed on theft detection. We created a synthetic theft video by time-reversing a delivery video. X-CLIP scored both the original delivery and the reversed theft as ~65% “delivery.” The temporal attention sees the same content (person, packages, porch) regardless of order. It cannot distinguish “package appears” from “package disappears.”
This is the critical insight: video-level temporal attention helps with what is happening (basketball vs. tennis) but not with the direction of change (delivery vs. theft).
Verdict: Closer to the right idea (zero-shot, natural language), but temporal attention alone can’t solve this problem.
4. The Insight: State Change, Not Action Recognition
After both attempts failed, we looked at the actual frames:
delivery2.mp4 — Package Delivery
| First Frame | Middle Frame | Last Frame |
|---|---|---|
| Empty porch, doormat only | Person bending over boxes | Two boxes stacked on porch |
theft_synthetic.mp4 — Package Theft (reversed delivery)
| First Frame | Middle Frame | Last Frame |
|---|---|---|
| Two boxes stacked on porch | Person interacting with boxes | Empty porch, doormat only |
Look at the middle frames. The person is bent over packages on a porch in both cases. Any frame-level action classifier will see the same thing. “Placing” and “picking up” are visually identical mid-action.
But look at the first and last frames. The difference is obvious:
- Delivery: First frame has no package. Last frame has a package. Package appeared.
- Theft: First frame has a package. Last frame has no package. Package disappeared.
The key insight: don’t classify the action. Classify the state. Then detect which direction the state changed.
This is fundamentally more robust because:
- “Box on porch” vs. “empty porch” is a trivial distinction for any CLIP model
- The temporal signal (presence increases vs. decreases) is unambiguous
- The middle frames (person interacting) don’t matter — only the endpoints
5. The Working System: Two-Tier State-Change Detection
Algorithm
Input: 10-second video clip
Output: DELIVERY | THEFT (with confidence)
1. Extract 8 uniformly-spaced frames from the video
2. For each frame, compute a "package presence score":
- Score the frame against 4 "package present" text prompts
("a cardboard box sitting on a porch", "delivery packages
stacked near a front door", ...)
- Score the frame against 4 "package absent" text prompts
("an empty porch with no packages", "a clean doorstep
with nothing on it", ...)
- Softmax across all 8 prompts
- Sum the "present" probabilities = presence score (0.0 to 1.0)
3. Split the 8 frames into early half (frames 0-3) and late half (frames 4-7)
4. Compute delta = mean(late_presence) - mean(early_presence)
- delta > 0 --> package appeared --> DELIVERY
- delta < 0 --> package disappeared --> THEFT
- |delta| maps to confidence (larger change = more certain)
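Step 1 is plain uniform sampling. A minimal sketch with OpenCV (the decoding library is an implementation detail; cv2 here is an assumption):

# Sketch of step 1: extract 8 uniformly spaced frames from a clip
import cv2

def extract_frames(video_path: str, n: int = 8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [round(i * (total - 1) / (n - 1)) for i in range(n)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames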
The text prompts are carefully engineered to describe scene state, not actions:
# "Package present" prompts (what we're detecting)
PACKAGE_PRESENT_PROMPTS = [
"a cardboard box sitting on a porch",
"delivery packages stacked near a front door",
"a parcel left on a doorstep",
"cardboard boxes on the ground near an entrance",
]
# "Package absent" prompts (the baseline)
PACKAGE_ABSENT_PROMPTS = [
"an empty porch with no packages",
"a clean doorstep with nothing on it",
"a front door entrance with no boxes or deliveries",
"an empty entryway with just a doormat",
]
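Putting steps 2 through 4 together, here is a condensed sketch of the Tier 1 scoring loop. It assumes frames holds the 8 sampled frames (PIL images or RGB arrays); the confidence cap mirrors how results are reported later, not a documented formula:

# Condensed sketch of steps 2-4 with the Tier 1 backbone (helper names assumed)
import torch
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip2-base-patch16-224").eval()
ALL_PROMPTS = PACKAGE_PRESENT_PROMPTS + PACKAGE_ABSENT_PROMPTS

def presence_score(frame) -> float:
    inputs = processor(text=ALL_PROMPTS, images=frame,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one score per prompt
    probs = logits.softmax(dim=-1)                    # softmax across all 8
    return probs[:len(PACKAGE_PRESENT_PROMPTS)].sum().item()  # "present" mass

scores = [presence_score(f) for f in frames]
early, late = scores[:4], scores[4:]
delta = sum(late) / 4 - sum(early) / 4
verdict = "DELIVERY" if delta > 0 else "THEFT"
confidence = min(0.99, 0.5 + abs(delta) * 2.5)  # heuristic, see section 7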
Why Two Tiers?
Both tiers use the same algorithm but different vision-language backbones:
| | Tier 1 | Tier 2 |
|---|---|---|
| Model | SigLIP2-base | OpenAI CLIP ViT-B/32 |
| Source | google/siglip2-base-patch16-224 | openai/clip-vit-base-patch32 |
| Parameters | 86M | 150M |
| Architecture | SigLIP2 (sigmoid loss, no softmax) | Original CLIP (contrastive) |
| Inference time | ~7s on MPS | ~5s on MPS |
| Role | Primary classifier | Independent second opinion |
The two-tier design provides redundancy. In auto mode, Tier 1 runs first. If its confidence falls below 60%, Tier 2 runs as a tiebreaker. In both mode, both run and the system reports both opinions.
Using two architecturally different models (SigLIP2 vs. CLIP) trained on different data ensures their failure modes don’t correlate. If SigLIP2 is confused by unusual lighting, CLIP might not be, and vice versa.
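A sketch of the auto-mode dispatch; classify() is a hypothetical wrapper around the scoring loop above, and the tie-break rule is an assumption since the post only fixes the 60% escalation threshold:

# Sketch of "auto" mode: Tier 2 runs only when Tier 1 is unsure
TIER2_THRESHOLD = 0.60

def classify_auto(frames):
    verdict, conf = classify(frames, backbone="siglip2")   # Tier 1
    if conf >= TIER2_THRESHOLD:
        return verdict, conf
    verdict2, conf2 = classify(frames, backbone="clip")    # Tier 2 tiebreaker
    # Assumed tie-break: trust whichever tier is more confident
    return (verdict, conf) if conf >= conf2 else (verdict2, conf2)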
Why SigLIP2 over original CLIP for Tier 1?
SigLIP2 (February 2025) uses a sigmoid loss instead of CLIP’s contrastive softmax loss. This makes it better at independent per-prompt scoring — each image-text pair gets its own score rather than competing in a softmax. For our use case (scoring 8 prompts independently per frame), this is a better fit. It’s also 40% smaller than CLIP ViT-B/32.
Apple Silicon Considerations
Both models run on MPS (Metal Performance Shaders) with two critical constraints:
- Always FP32. MPS FP16 support is incomplete and produces silent numerical errors. The 3-4x throughput gain is not worth wrong answers.
- PYTORCH_ENABLE_MPS_FALLBACK=1 must be set so that the handful of tensor operations MPS doesn't implement fall back to CPU transparently instead of erroring out.
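In code, both constraints look like this minimal sketch (the env var has to be set before torch is imported):

# MPS setup shared by both tiers: FP32 only, CPU fallback enabled
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # must precede the torch import

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# `model` is whichever backbone was loaded above; never float16 on MPS
model = model.to(device=device, dtype=torch.float32)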
6. Test Results
Test Videos
| Video | Resolution | Duration | Content |
|---|---|---|---|
| delivery1.mp4 | 1080x1920 | 13.1s | Person approaches porch, places Amazon boxes, leaves |
| delivery2.mp4 | 3840x2160 | 10.6s | Person carries boxes to empty porch, stacks them, leaves |
| theft_synthetic.mp4 | 3840x2160 | 10.6s | Time-reversed delivery2.mp4 (boxes disappear from porch) |
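For reference, a minimal sketch of how a time-reversed clip like theft_synthetic.mp4 can be produced with OpenCV (the post doesn't specify which tool was actually used):

# Reverse delivery2.mp4 frame-by-frame to create a synthetic theft clip
import cv2

cap = cv2.VideoCapture("delivery2.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
cap.release()

h, w = frames[0].shape[:2]
out = cv2.VideoWriter("theft_synthetic.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
for frame in reversed(frames):
    out.write(frame)
out.release()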
delivery1.mp4 — Package Delivery
| First Frame | Middle Frame | Last Frame |
|---|---|---|
| Person approaching, no box visible yet | Person scanning/placing Amazon boxes | Boxes left on porch, person gone |
Tier 1 (SigLIP2) — Package Presence Timeline:
Frame 0 (early): 0.03 <-- no package yet
Frame 1 (early): ███████████████████ 0.96
Frame 2 (early): ███████████████████ 0.95
Frame 3 (early): ██████████████████ 0.91
Frame 4 (late ): ███████████████████ 0.99
Frame 5 (late ): ██████████████████ 0.94
Frame 6 (late ): ██████████████████ 0.92
Frame 7 (late ): ████████████████████ 1.00 <-- package on porch
Early mean: 0.714 | Late mean: 0.961 | Delta: +0.247
VERDICT: DELIVERY (99% confidence)
The timeline tells a clear story: frame 0 has near-zero package presence (person just arriving), then presence jumps and stays high through the end. The positive delta (+0.247) confidently indicates a delivery.
Tier 2 (CLIP) confirms: Delta: +0.278, DELIVERY (99% confidence).
delivery2.mp4 — Package Delivery
Tier 1 (SigLIP2) — Package Presence Timeline:
Frame 0 (early): 0.00 <-- empty porch
Frame 1 (early): ███████████████████ 0.96
Frame 2 (early): ███████████████████ 0.99
Frame 3 (early): ███████████████████ 0.97
Frame 4 (late ): ███████████████████ 0.96
Frame 5 (late ): ███████████████████ 0.96
Frame 6 (late ): ███████████████████ 0.98
Frame 7 (late ): ████████████████████ 1.00 <-- boxes stacked on porch
Early mean: 0.732 | Late mean: 0.976 | Delta: +0.243
VERDICT: DELIVERY (99% confidence)
Nearly identical pattern to delivery1. The first frame is a perfectly empty porch (presence = 0.00), the last is boxes stacked on the porch (presence = 1.00). Textbook delivery.
Tier 2 (CLIP) confirms: Delta: +0.221, DELIVERY (99% confidence).
theft_synthetic.mp4 — Package Theft
| First Frame | Middle Frame | Last Frame |
|---|---|---|
| Boxes stacked on porch | Person interacting with boxes | Empty porch, doormat only |
Tier 1 (SigLIP2) — Package Presence Timeline:
Frame 0 (early): ████████████████████ 1.00 <-- boxes on porch
Frame 1 (early): ███████████████████ 0.98
Frame 2 (early): ███████████████████ 0.96
Frame 3 (early): ███████████████████ 0.96
Frame 4 (late ): ███████████████████ 0.96
Frame 5 (late ): ███████████████████ 0.99
Frame 6 (late ): ███████████████████ 0.98
Frame 7 (late ): 0.00 <-- empty porch
Early mean: 0.974 | Late mean: 0.732 | Delta: -0.242
VERDICT: THEFT (99% confidence)
The mirror image of a delivery. Presence starts at 1.00 (package clearly visible) and drops to 0.00 (empty porch). The negative delta (-0.242) confidently indicates theft.
Tier 2 (CLIP) confirms: Delta: -0.220, THEFT (99% confidence).
Summary Table
| Video | Expected | Tier 1 (SigLIP2) | Tier 2 (CLIP) | Delta (T1) |
|---|---|---|---|---|
| delivery1.mp4 | DELIVERY | DELIVERY 99% | DELIVERY 99% | +0.247 |
| delivery2.mp4 | DELIVERY | DELIVERY 99% | DELIVERY 99% | +0.243 |
| theft_synthetic.mp4 | THEFT | THEFT 99% | THEFT 99% | -0.242 |
6/6 correct across both tiers. Both models independently agree on every video.
7. Where This Breaks Down & What Comes Next
The state-change approach works cleanly on these test videos. Here’s where it won’t.
Known Limitations
1. The person blocks the “before” or “after” state. If the delivery driver is still standing in front of the package in the last frame, the model can’t see whether a package is there. The algorithm depends on the endpoints showing a clear scene state. Mitigation: instead of comparing the first and last halves, find the frames with the minimum and maximum person presence and use the surrounding frames (see the sketch after this list).
2. Multiple packages — partial theft. If there are 3 packages on the porch and someone steals 1, the package-presence score barely changes (still high in both early and late frames). The delta will be tiny and the system will likely classify it as “no event.” This requires object-level tracking, not scene-level classification.
3. The package is visually ambiguous. A brown cardboard box on a brown wooden porch. A small envelope. A clear bag. The text prompts are biased toward stereotypical Amazon boxes. Unusual package types may not register as “package present.”
4. Night footage / IR camera modes. All our test videos are daytime, well-lit footage. Porch cameras frequently operate in infrared at night, producing grayscale images with very different visual characteristics. The CLIP/SigLIP2 models were trained overwhelmingly on daytime RGB images. Performance on IR footage is an open question.
5. The synthetic theft video is too clean. Time-reversing a delivery produces a perfect inverse. Real theft looks different: the person may approach from a different angle, move faster, look around nervously, or partially occlude the camera. The temporal symmetry of our test data flatters the algorithm. After building Argus, we did test it on real theft videos: it detected the theft in every case except those where the package was visually ambiguous, i.e. it was hard to tell whether a package was on the porch at all. In those cases the model produced low presence scores, confidence stayed low, and the system reported no state change.
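Returning to limitation 1, here is one reading of the person-occlusion mitigation as a sketch: score frames for person presence and compare the least-occluded frame from each half. Prompt wording and helper names are assumptions:

# One reading of the occlusion mitigation: compare the clearest frame from
# each half instead of blind first/last halves. person_score is assumed to
# use the same prompt-scoring recipe as presence_score above.
PERSON_PROMPTS = [
    "a person standing on a porch",
    "a porch with no people in view",
]

def clearest_endpoints(frames, person_score):
    p = [person_score(f) for f in frames]  # P(person visible) per frame
    half = len(frames) // 2
    before = min(range(half), key=lambda i: p[i])
    after = min(range(half, len(frames)), key=lambda i: p[i])
    return frames[before], frames[after]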
Improving Within the Current Framework
Better frame selection. Instead of uniformly sampling 8 frames, use optical flow or frame differencing to find the “key moments” — the frames where the scene changes most. This would make the early/late comparison more robust when the action happens in the middle 20% of the clip.
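A hedged sketch of the frame-differencing variant (optical flow would be the heavier alternative), assuming frames holds BGR arrays for the whole clip:

# Pick the k most "active" frames by frame differencing, keeping the endpoints
import cv2
import numpy as np

def key_frames(frames, k=8):
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # Per-frame activity = mean absolute difference from the previous frame
    activity = [0.0] + [float(np.mean(cv2.absdiff(a, b)))
                        for a, b in zip(grays, grays[1:])]
    keep = {0, len(frames) - 1}            # always keep the endpoints
    for i in np.argsort(activity)[::-1]:   # then the most active frames
        if len(keep) >= k:
            break
        keep.add(int(i))
    return [frames[i] for i in sorted(keep)]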
Prompt ensembling with negatives. Add “distractor” prompts for common false positives: “a person standing on a porch,” “a doormat on a porch step,” “a dog on a porch.” This helps the softmax distinguish package presence from other porch-related objects.
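Concretely, the distractors from the paragraph above just extend the prompt pool that feeds the per-frame softmax:

# Distractor prompts added to the softmax pool, so "person on porch" stops
# masquerading as "package present"
DISTRACTOR_PROMPTS = [
    "a person standing on a porch",
    "a doormat on a porch step",
    "a dog on a porch",
]
ALL_PROMPTS = (PACKAGE_PRESENT_PROMPTS + PACKAGE_ABSENT_PROMPTS
               + DISTRACTOR_PROMPTS)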
Calibration. The current confidence score (0.5 + |delta| * 2.5) is a heuristic. With more test data, you could calibrate it against actual error rates to produce meaningful probabilities.
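One way to do that, sketched with scikit-learn's isotonic regression; the logged arrays here are hypothetical:

# Fit a monotone map from |delta| to empirical accuracy, replacing the
# 0.5 + |delta| * 2.5 heuristic. abs_deltas / was_correct are assumed logs.
from sklearn.isotonic import IsotonicRegression

calibrator = IsotonicRegression(y_min=0.5, y_max=1.0, out_of_bounds="clip")
calibrator.fit(abs_deltas, was_correct)   # |delta| -> P(verdict correct)
confidence = calibrator.predict([abs(delta)])[0]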
The VLM Path: Where This Should Eventually Go
The state-change approach is a clever hack around a fundamental limitation: CLIP-family models understand scenes, not events. They can tell you what’s in a frame but not what happened across frames. We worked around this by reducing “what happened” to “how did the scene state change” — but this only works because our two categories (delivery vs. theft) happen to correspond to opposite state transitions.
For a more general security system that needs to handle categories like:
- Someone checking the mail (not theft)
- A child retrieving their own package (not theft)
- Someone placing a different object (not a delivery)
- A delivery driver returning to take a photo (not theft)
…you need a model that actually understands temporal narrative.
Vision-Language Models (VLMs) are the right tool here. Models like:
| Model | Size | Approach |
|---|---|---|
| Qwen2.5-VL (7B/72B) | 7-72B | Processes video natively with dynamic resolution |
| InternVL 2.5 | 8-78B | Strong video QA, good on temporal reasoning |
| LLaVA-Video | 7-72B | Temporal pooling over video frames |
| Gemini 2.0 Flash | API | Native video input, multimodal reasoning |
These models can be prompted with natural language and produce natural language explanations:
Prompt: "Watch this 10-second porch camera clip. Describe what happens
step by step. Then classify: is this a delivery, a theft, or neither?
Explain your reasoning."
Response: "A person in a hoodie approaches the porch from the left.
There are two cardboard boxes already on the porch. The person bends
down, picks up both boxes, and walks away to the right. The porch is
empty at the end. This is a THEFT --- the person took packages that
were already there."
The advantages over our approach:
- Genuine temporal reasoning — the model understands the sequence of events, not just endpoint states
- Natural language explanations — you get an audit trail for why the system flagged something
- Generalization — new categories can be added by changing the prompt, not the code
- Context awareness — “the homeowner retrieved their own package” requires understanding that the same person lives there
The disadvantages:
- Speed — 7B+ parameter models take 10-30 seconds per clip, even on a good GPU. Not viable for edge deployment today.
- Cost — API-based VLMs (Gemini, GPT-4V) cost $0.01-0.05 per clip. At 50 events/day, that’s $0.50-2.50/day/camera.
- Hallucination — VLMs can confidently describe events that didn’t happen. For a security system, a hallucinated theft is worse than a missed one.
- Privacy — Sending porch camera footage to a cloud API raises obvious concerns.
The Hybrid Architecture
The practical path is a hybrid:
[Stage 1: YOLO edge model] -- runs on every frame, ~3ms
|
v
[Stage 2: SigLIP2 state-change] -- runs on clips, ~7s, LOCAL
|
+-- High confidence DELIVERY --> notify homeowner
|
+-- High confidence THEFT --> trigger siren immediately
|
+-- Low confidence / ambiguous -->
|
v
[Stage 3: VLM analysis] -- runs on ambiguous clips only
| ~15s, cloud or local 7B
v
[Detailed classification + explanation]
Stage 2 (our state-change detector) handles the easy 80% of cases — clear deliveries and obvious thefts — locally, fast, and privately. Only genuinely ambiguous events get escalated to the expensive VLM, which can apply deeper reasoning.
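As pseudocode-level Python, the routing might look like this; the threshold value and all function names are assumptions, only the three-way split comes from the design above:

# Sketch of the hybrid routing logic from the diagram
HIGH_CONFIDENCE = 0.90  # assumed cutoff for "high confidence"

def handle_event(clip):
    verdict, conf = state_change_classify(clip)      # Stage 2: local, ~7s
    if conf >= HIGH_CONFIDENCE and verdict == "DELIVERY":
        notify_homeowner(clip)
    elif conf >= HIGH_CONFIDENCE and verdict == "THEFT":
        trigger_siren()                              # immediate, no cloud trip
    else:
        report = vlm_analyze(clip)                   # Stage 3: ambiguous only
        notify_homeowner(clip, explanation=report)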
This is the architecture we’re building toward with Project Argus. The state-change detector isn’t the final answer — it’s a fast, reliable filter that keeps the expensive stuff from running on every motion event. And for a $50 porch camera, a zero-shot classifier that runs locally on a Raspberry Pi-class device and gets the obvious cases right is already more useful than what most commercial systems offer today.
All models run locally on Apple Silicon (MPS) with no training data required. Built with SigLIP2 and CLIP ViT-B/32 via HuggingFace Transformers.