Advertising Experience Quality — Longitudinal Research Portfolio

01 · Abstract

Why Advertising Experience Research Matters

Ad-supported streaming is no longer a niche product tier. Netflix's ad-supported plan reached approximately 94 million monthly active users in 2025, with U.S. members averaging around 41 hours of monthly viewing. By 2026, Netflix reported that its ads plan reached more than 250 million global monthly active viewers, confirming that this growth has already arrived rather than merely being projected. Note: Netflix's public advertising metrics shifted from "monthly active users" to "monthly active viewers" between these reporting periods; the latter reflects household-level viewership and should be interpreted as an indicator of scale rather than a directly comparable subscriber count. The speed and scale of this shift make advertising experience quality a timely strategic research problem, not just a technical modeling exercise. How platforms balance monetization with member experience will be a defining competitive variable in the next phase of streaming growth.

As streaming platforms shift toward ad-supported models, understanding how advertising experience shapes member satisfaction and retention becomes a strategic research priority. Yet the core mechanism — how ad fatigue accumulates over time and at what point it begins to damage the viewing experience — remains poorly understood in a longitudinal context.

This demonstration builds a transparent, evidence-grounded research framework to answer three questions: (1) How do ad load and personalization independently affect member satisfaction over a 12-month period? (2) Does ad fatigue mediate the effect of ad load on churn intent? (3) Is the relationship between fatigue and satisfaction bidirectional, and if so, which direction dominates?

I generated a synthetic longitudinal panel (3,150 members × 12 months) using parameters calibrated to published research and industry benchmarks, then applied a full measurement and modeling pipeline including Confirmatory Factor Analysis, Generalized Estimating Equations, Weibull Survival Analysis, and Cross-Lagged Panel Modeling.

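Metric	Value
Simulated members	3,150
Monthly time points	12
Total observations	37,800
Factorial conditions	Ad load × personalization
Survey items (6 constructs)	18
Analytical models	CFA, GEE, Weibull survival analysis, cross-lagged panel model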

End-to-End Research Lifecycle

The Netflix Ads Experiences role calls for "expert navigation of the full research lifecycle from foundational and generative exploration to tactical execution." This portfolio is organized along that lifecycle. Each phase answers a distinct research question and feeds the next.

Phase 0 is anchored by my prior published computational qualitative work (NLP/topic modeling on large-scale social media data) — see About section. Phases 1–3 are demonstrated below.

03 · Evidence-Based Parameters

How Parameters Were Calibrated

Every numerical value in the simulation has an explicit justification. Parameters were either (a) taken directly from published estimates, (b) derived by back-calculation from a defensible target outcome, or (c) stated explicitly as design assumptions. No key parameter was left undocumented; each was sourced, calibrated, or explicitly treated as a modeling assumption subject to sensitivity analysis.

Calibration Philosophy

Where no direct empirical estimate was available, the approach was to define a defensible target outcome, solve for the parameter value that produces it, and stress-test conclusions across a plausible range. This keeps the simulation transparent and contestable rather than dependent on any single assumed value.

Source Hierarchy

Level	Source Type	Examples in This Study
Theoretical	Peer-reviewed academic basis	Ducoffe (1996); Li et al. (2002); Cho & Cheon (2004); Koyck (1954); Leone (1995) meta-analysis
Public company	Official Netflix communications	Netflix 2026 Upfront figures; Netflix Help Center ad-tier descriptions
Industry benchmark	Cross-industry research reports	Parks Associates 2024; Conviva 2023; Deloitte 2026; MediaPost
Directional	Agency/vendor blog estimates	Strategus 2025; Mynt Agency 2025 (used directionally, not as exact values)
Modeling assumption	Back-calculated or design choices	Diversity mitigation m = 0.49; adstock→fatigue increment 0.28; initial fatigue distribution (all sensitivity-tested)

Core Parameter Table

Parameter	Value	Justification	Source
Target monthly viewing	41 hrs	Netflix ad-tier U.S. average viewing reported across multiple industry sources	Marketing Brew / Netflix upfront (2025)
Default ad load	4.5 min/hr	Industry-reported estimate: multiple sources place Netflix's ad tier in the 4–5 min/hr range. Used as a design assumption; not a formally published Netflix specification.	Industry-reported estimate (MediaPost, Omdia, 2023–24); used as simulation baseline
Adstock carryover λ	0.47	Median λ from meta-analysis of 128 advertising studies	Leone (1995)
Fatigue persistence α	0.72	Estimated from wear-out literature showing fatigue persists across viewing sessions	Pechmann & Stewart (1988); Kronrod & Huber (2019)
Diversity mitigation coefficient	0.49	Back-calculated from 40% lifecycle extension benchmark for 3–4 creative variants. Equation: (1−m×0.30)/(1−m×0.80)=1.40 → m=0.49	Mynt Agency (2025); Strategus OTT benchmark (2025)
Adstock → fatigue increment	0.28	Back-calculated to target a steady-state fatigue of ≈ 0.48 on the [0,1] scale at medium load: increment = 0.48×(1−α)/(adstock_norm_SS×div_factor)	Derived from Strategus (2025) OTT wear-out cycle
Personalization → relevance (β)	0.40	Personalized ads increase ad tolerance by ~40%	Google/Ipsos (2019)
Baseline churn rate	~2%/mo	SVOD annual churn 20–25% → monthly 1.7–2.1%. Logit intercept calibrated accordingly.	Parks Associates (2024)
Post-ad drop-off baseline	17%	Viewers abandoning content within 30s of mid-roll ad insertion	Conviva Streaming Benchmark (2023)
Ad-supported streaming share	68%	Proportion of streaming subscribers using ad-supported options	Deloitte Digital Media Trends (2026)
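
The two back-calculations in the table can be reproduced in a few lines. This is a minimal sketch, not the original calibration script: the function names are hypothetical, and the steady-state inputs `adstock_norm_ss=0.66` and `div_factor=0.73` are illustrative assumptions chosen to be consistent with the table, not values reported in it.

```python
def solve_diversity_mitigation(target_ratio=1.40, low_diversity=0.30, high_diversity=0.80):
    """Solve (1 - m*low) / (1 - m*high) = target_ratio for m."""
    return (target_ratio - 1.0) / (target_ratio * high_diversity - low_diversity)


def fatigue_increment(target_fatigue, alpha, adstock_norm_ss, div_factor):
    """Invert the AR(1) steady state f = alpha*f + increment*adstock_norm_ss*div_factor."""
    return target_fatigue * (1.0 - alpha) / (adstock_norm_ss * div_factor)


m = solve_diversity_mitigation()

# Illustrative steady-state inputs (assumptions, not values from the table).
increment = fatigue_increment(target_fatigue=0.48, alpha=0.72,
                              adstock_norm_ss=0.66, div_factor=0.73)

print(f"diversity mitigation m = {m:.2f}")
print(f"adstock -> fatigue increment = {increment:.2f}")
```

The first function rearranges (1−m×0.30)/(1−m×0.80)=1.40 into m = (1.40−1)/(1.40×0.80−0.30); the second inverts the AR(1) steady-state condition for fatigue.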

Theoretical Frameworks Integrated

Framework	Role in Model	Source
Advertising Value Model	Relevance, trust → satisfaction (+); intrusiveness, fatigue → satisfaction (−)	Ducoffe (1996); Brackett & Carr (2001)
Web Ad Intrusiveness Scale	Intrusiveness construct definition and item structure	Li, Edwards & Lee (2002)
Internet Ad Avoidance Model	Goal impediment, ad clutter, prior negative experience → avoidance behavior	Cho & Cheon (2004)
Adstock / Koyck Model	Carryover effect: adstock_t = exposure_t + λ × adstock_{t-1}	Koyck (1954); Leone (1995)
Ad Wear-out / "Wearout Wearout"	Fatigue accumulation and the short-term vs. long-term reversal pattern	Pechmann & Stewart (1988); Kronrod & Huber (2019)
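
Under constant exposure, the Koyck recursion in the table converges to a steady state of exposure / (1 − λ). The sketch below illustrates this with λ = 0.47. The monthly exposure of 4.5 min/hr × 41 hrs is an assumption that combines two parameter-table values, and `adstock_path` is a hypothetical helper name.

```python
def adstock_path(exposures, lam):
    """Apply adstock_t = exposure_t + lam * adstock_{t-1}, starting from zero."""
    state, path = 0.0, []
    for exposure in exposures:
        state = exposure + lam * state
        path.append(state)
    return path


LAMBDA = 0.47                   # carryover from the parameter table
MONTHLY_EXPOSURE = 4.5 * 41     # assumed: ad minutes per month = min/hr x viewing hours

path = adstock_path([MONTHLY_EXPOSURE] * 12, LAMBDA)
steady_state = MONTHLY_EXPOSURE / (1 - LAMBDA)

for month, value in enumerate(path, start=1):
    print(f"month {month:2d}: adstock = {value:7.1f} "
          f"({value / steady_state:.1%} of steady state)")
```

With this carryover value, adstock itself sits within about 1% of its steady state by month 7. The slower plateau in fatigue therefore comes from the fatigue persistence term rather than from adstock.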

Model Limitation Note

Kronrod & Huber (2019) show that advertising-induced irritation reverses over longer time horizons (short-term negative → long-term positive for brand preference as memory effects dominate). The current AR(1) fatigue model assumes monotonic accumulation within the 12-month window — appropriate for short-term experience research, but a known simplification. Extending to a non-linear decay model would be a natural next step with real data.

04 · Reproducible Code

Data Generation & Analysis Pipeline

Key excerpts are shown below. The full pipeline runs in a single Python environment.

ads_experience_synthetic_v2.py: Core simulation engine (key excerpts)

# Evidence-Informed Synthetic Longitudinal Data Generator
# Parameters calibrated from published literature — see Parameter Table above

from dataclasses import dataclass
import numpy as np

@dataclass
class EffectProfile:
    # All coefficients in 5-point scale absolute units (not standardized)
    # This ensures between-condition differences are visible in output
    name: str = "base"
    personalization_to_relevance: float = 1.20   # Google/Ipsos 2019: +40% tolerance
    diversity_fatigue_mitigation: float = 0.49   # Back-calc: 40% lifecycle extension
    adload_to_intrusiveness:      float = 0.28   # Li, Edwards & Lee (2002)
    adstock_to_fatigue_increment: float = 0.28   # Back-calc: SS fatigue ≈ 0.48 at 4.5min/hr
    relevance_to_satisfaction:    float = 0.40   # Ducoffe (1996); Kim & Han (2014)
    intrusiveness_to_satisfaction: float = -0.35
    fatigue_to_satisfaction:      float = -0.32
    trust_to_satisfaction:        float = 0.35
    continuance_to_churn_logodds: float = -0.80
    fatigue_to_churn_logodds:     float = 1.20   # Parks Associates 2024: ~2%/mo baseline

def simulate_one_scenario(cfg, ep):
    """
    Core longitudinal data generator.
    Time-varying dynamics:
      [1] Adstock: adstock_t = exposure_t + λ × adstock_{t-1}  [Koyck 1954]
      [2] Fatigue: AR(1) in bounded [0,1] state with variety mitigation
      [3] Latent constructs: generated in absolute 5-pt scale units
      [4] Behavioral outcomes: logistic models calibrated to benchmarks
    """
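    # Excerpt: rng, n, ad_exposure, diversity, noise, re_satisfaction, price_sensitivity,
    # and the *_lat latent scores are set up in the full source (not shown here).
    fatigue_state = rng.uniform(0.02, 0.06, size=n)  # New subscriber: minimal prior fatigue
    adstock_state = np.zeros(n)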

    for month in range(1, cfg.n_months + 1):
        # Adstock update (Koyck carryover)
        adstock_state = ad_exposure + cfg.adstock_lambda * adstock_state
        # Saturating transform to [0, 1): equals 0.5 when adstock_state = 180
        adstock_norm  = adstock_state / (adstock_state + 180.0)

        # Fatigue update: AR(1) with variety mitigation (Kronrod & Huber 2019)
        fat_increment = (ep.adstock_to_fatigue_increment * adstock_norm
                        * (1 - ep.diversity_fatigue_mitigation * diversity))
        fatigue_state = np.clip(cfg.fatigue_persistence * fatigue_state
                                + fat_increment + noise, 0, 1)

        # Satisfaction structural model (Ducoffe 1996)
        satisfaction_lat = (3.00
            + ep.relevance_to_satisfaction      * (relevance_lat    - 3.0)
            + ep.trust_to_satisfaction          * (trust_lat        - 3.0)
            + ep.intrusiveness_to_satisfaction  * (intrusiveness_lat - 2.5)
            + ep.fatigue_to_satisfaction        * (fatigue_lat      - 2.5)
            + re_satisfaction)

        # Churn intent: logistic, calibrated to ~2%/month (Parks Associates 2024)
        churn_logit = (-4.20
            + ep.continuance_to_churn_logodds * (continuance_lat - 3.0)
            + ep.fatigue_to_churn_logodds     * fatigue_state
            + 0.25 * price_sensitivity)
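
To see what the churn intercept implies, the logit can be converted to a monthly probability. This is a small sketch using the coefficients from the excerpt above. `monthly_churn_prob` is a hypothetical helper, and holding continuance at 3.0 and price sensitivity at 0 is an assumption made for illustration.

```python
import math


def logistic(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def monthly_churn_prob(fatigue_state, continuance_lat=3.0, price_sensitivity=0.0):
    """Churn-intent probability from the logit in the excerpt above."""
    churn_logit = (-4.20
                   - 0.80 * (continuance_lat - 3.0)
                   + 1.20 * fatigue_state
                   + 0.25 * price_sensitivity)
    return logistic(churn_logit)


p_no_fatigue = monthly_churn_prob(0.0)    # roughly 1.5% per month
p_steady = monthly_churn_prob(0.48)       # roughly 2.6% per month at the target steady state
print(f"{p_no_fatigue:.3%}  {p_steady:.3%}")
```

Under these reference values the two probabilities bracket the ~2%/month benchmark from the parameter table.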

ads_analysis_pipeline.py: Full analysis pipeline (key excerpts)

# ── CFA: Confirmatory Factor Analysis ──────────────────────────────
import semopy
model_spec = """
Relevance     =~ relevance_1 + relevance_2 + relevance_3
Intrusiveness =~ intrusiveness_1 + intrusiveness_2 + intrusiveness_3
Trust         =~ trust_1 + trust_2 + trust_3
Fatigue       =~ fatigue_1 + fatigue_2 + fatigue_3
Satisfaction  =~ satisfaction_1 + satisfaction_2 + satisfaction_3
Continuance   =~ continuance_1 + continuance_2 + continuance_3
"""
model = semopy.Model(model_spec)
model.fit(df_month6[item_cols])        # month-6 cross-section, 18 item columns
fit_stats = semopy.calc_stats(model)   # fit indices incl. chi-square, CFI, TLI, RMSEA, GFI

# ── GEE: Generalized Estimating Equations ──────────────────────────
# GEE chosen over LMM because ad condition variables are constant
# within members (between-subject design), causing singularity in
# LMM's random effect estimation. GEE with exchangeable correlation
# structure provides valid inference for this design.
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Gaussian
from statsmodels.genmod.cov_struct import Exchangeable

gee_m2 = GEE.from_formula(
    "satisfaction_score ~ time_c + load_z + pers_z + fatigue_score",
    groups=df["member_id"], data=df,
    family=Gaussian(), cov_struct=Exchangeable()
).fit()

# ── Cross-Lagged Panel Model ───────────────────────────────────────
# Within-person demeaning removes between-subject confounds,
# isolating the lagged temporal relationship.
import statsmodels.formula.api as smf
for col in ["fatigue_score", "satisfaction_score", "fat_lag1", "sat_lag1"]:
    m = df.groupby("member_id")[col].transform("mean")
    df[col + "_dv"] = df[col] - m

path_A = smf.ols("satisfaction_score_dv ~ fat_lag1_dv + sat_lag1_dv + time_c",
                 data=df).fit(cov_type="HC3")
path_B = smf.ols("fatigue_score_dv ~ sat_lag1_dv + fat_lag1_dv + time_c",
                 data=df).fit(cov_type="HC3")

# ── Survival Analysis ──────────────────────────────────────────────
# Weibull AFT for covariate inference; Kaplan-Meier for visualization.
from lifelines import WeibullAFTFitter
waf = WeibullAFTFitter(penalizer=0.01)
waf.fit(surv_df, duration_col="duration", event_col="event")

Full source code: ads_experience_synthetic_v2.py (simulation engine) · ads_analysis_pipeline.py (analysis pipeline) · Python 3.12 · statsmodels 0.14.6 · lifelines 0.30.3 · semopy 2.3.11

📦 Full reproducible code available upon request. The complete simulation engine (~400 lines) and analysis pipeline (~350 lines) include all parameter definitions with inline source citations, factorial design loop, sensitivity sweep, and diagnostic visualizations. Please reach out directly for the full codebase.

05 · Measurement Model

Confirmatory Factor Analysis

What Each Construct Measures

Construct	Definition	Theoretical Basis
Relevance	Perceived informativeness and personal relevance of ads shown	Ducoffe (1996) Ad Value Model
Intrusiveness	Perceived interruption and goal impediment from ad insertion. Distinct from irritation — it is the cognitive perception that the ad has disrupted the viewing flow, regardless of emotional valence.	Li, Edwards & Lee (2002)
Trust	Perceived credibility and trustworthiness of ads shown	Choi & Rifon (2002)
Ad Fatigue	Accumulated sense of weariness from repeated ad exposure over time (time-varying)	Pechmann & Stewart (1988)
Satisfaction	Overall satisfaction with the viewing experience during the period	Ducoffe (1996); Brackett & Carr (2001)
Continuance Intent	Intention to continue using the service in the coming months	Bhattacherjee (2001)

Model Fit Indices

Index	Value	Threshold
CFI	1.000	≥ .95 ✓
TLI	1.000	≥ .95 ✓
RMSEA	.000	≤ .06 ✓
GFI	.996	≥ .90 ✓
χ²(120) p-value	.718

χ²(120) = 110.67, p = .718. A non-significant χ² indicates good model-data fit. Note on perfect fit indices: CFI = 1.000 and RMSEA = .000 follow directly from χ² falling below its degrees of freedom (110.67 < 120), at which point both indices hit their bounds. That outcome is expected when the data-generating process exactly mirrors the assumed factor structure, as it does here by construction, and it is a known limitation of evaluating measurement models on synthetic data. In practice, real survey data from streaming members would introduce correlated residuals (adjacent items sharing method variance), measurement drift over time, and respondent heterogeneity. A realistic expectation for a well-designed real study would be CFI ≈ .95–.98 and RMSEA ≈ .03–.06. The Trust construct's weaker performance (AVE = .387) is the more instructive result here: it shows how the CFA flags a construct whose items fail to converge, the pattern one would expect if members conceptualize ad credibility along more than one dimension.

Factor Loadings & Convergent Validity

Construct	Items	Std. Loading Range	AVE	CR	α	Verdict
Satisfaction	3	.842 – .847	.712	.881	.881	✓ Strong
Fatigue	3	.820 – .833	.685	.867	.867	✓ Strong
Intrusiveness	3	.782 – .789	.618	.829	.829	✓ Good
Continuance	3	.762 – .776	.592	.813	.814	✓ Good
Relevance	3	.718 – .736	.529	.771	.771	✓ Adequate
Trust	3	.596 – .651	.387	.655	.654	△ Weak

AVE = Average Variance Extracted (threshold ≥ .50); CR = Composite Reliability (threshold ≥ .70). Trust falls below both thresholds, indicating its three items do not converge sufficiently on a single factor. In real survey data, this pattern would often point to the multi-dimensional nature of ad trust (brand trust vs. data privacy trust vs. ad content credibility). A revised instrument would separate these facets.
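
For reference, AVE and CR can be computed directly from standardized loadings. This is a minimal sketch assuming uncorrelated residuals. The two endpoint loadings come from the Trust row, the middle loading (0.620) is a hypothetical fill-in, and the function names are mine.

```python
def average_variance_extracted(loadings):
    """AVE = mean of squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = sum(loadings)
    error = sum(1.0 - l ** 2 for l in loadings)
    return total ** 2 / (total ** 2 + error)


# Endpoints come from the Trust row; the middle loading (0.620) is a hypothetical fill-in.
trust_loadings = [0.596, 0.620, 0.651]

ave = average_variance_extracted(trust_loadings)
cr = composite_reliability(trust_loadings)
print(f"AVE = {ave:.3f} (threshold .50), CR = {cr:.3f} (threshold .70)")
```

With loadings in this range, both statistics land below their thresholds, which is the pattern the Trust row reports.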

Discriminant Validity (Fornell-Larcker)

Each diagonal cell = AVE. Off-diagonal cells = squared inter-factor correlation (r²). Discriminant validity holds when the diagonal exceeds all values in the same row and column.

[Fornell-Larcker matrix: Relevance, Intrusiveness, Trust, Fatigue, Satisfaction, Continuance]

The Satisfaction ↔ Continuance pair fails the criterion (r² = .830 exceeds both AVEs of .712 and .592). This is expected — in consumer research, overall satisfaction and continuance intention are conceptually very close constructs. Practically, they can be treated as a combined "loyalty" dimension. Trust also fails against Satisfaction (r² = .428 > AVE .387), consistent with its convergent validity weakness.
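
The criterion itself is a simple comparison, sketched below with the AVEs and the two squared correlations reported in this section. The third pair is a hypothetical value added only to show a passing case, and `fornell_larcker_failures` is an illustrative name.

```python
def fornell_larcker_failures(ave, squared_corr):
    """Return construct pairs whose r^2 exceeds at least one of the two AVEs."""
    failures = []
    for (a, b), r2 in squared_corr.items():
        if r2 > min(ave[a], ave[b]):
            failures.append((a, b))
    return failures


ave = {"Satisfaction": 0.712, "Continuance": 0.592, "Trust": 0.387, "Relevance": 0.529}
squared_corr = {
    ("Satisfaction", "Continuance"): 0.830,   # reported above
    ("Trust", "Satisfaction"): 0.428,         # reported above
    ("Relevance", "Trust"): 0.200,            # hypothetical value, for contrast only
}
print(fornell_larcker_failures(ave, squared_corr))
```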

06 · Analysis Results

Longitudinal Analysis Results

6.1 — Satisfaction & Fatigue Trajectories

Across all conditions, satisfaction declines monotonically from month 1 to month 12 as ad fatigue accumulates. The effect is most pronounced under high ad load (6.5 min/hr), where average satisfaction drops from 2.84 to 2.20 — a 0.64-point decline on the 5-point scale.

Member Satisfaction Over 12 Months — by Ad Load

Ad Fatigue Accumulation Over 12 Months — by Ad Load

Reading the Charts

Satisfaction and fatigue move in opposite directions — and the gap between low and high ad load conditions widens over time. This reflects the compounding nature of the AR(1) fatigue process: early differences are small, but by month 6–8 the system approaches steady state and divergence is maximized.

6.2 — GEE Longitudinal Model

Generalized Estimating Equations were used to model repeated satisfaction measurements while accounting for within-member correlation (exchangeable working correlation structure). Three nested models were compared using QIC (Quasi-likelihood under the Independence model Criterion).

Model	Predictors	QIC	ΔQIC (vs. M1)	Verdict
M1 Basic	Time + Ad Load + Personalization	34,681	—
M2 + Fatigue	M1 + Fatigue Score	34,679	−2	✓ Selected
M3 + Interaction	M2 + Load × Personalization	34,688	+7	Not supported

M2 Coefficient Table

Predictor	β	SE	z	p	Interpretation
Intercept	3.734	.017	215.9	***	Expected satisfaction at mean time, load, and personalization with fatigue score at 0
Time (centered)	−0.031	.001	−10.8	***	Satisfaction declines ~0.03 points per month
Ad Load (z-scored)	−0.353	.009	−38.9	***	1 SD increase in ad load (= 2 min/hr) → −0.35 satisfaction
Personalization (z-scored)	+0.200	.009	+22.8	***	1 SD increase in personalization → +0.20 satisfaction
Fatigue Score	−0.314	.006	−56.6	***	1-point fatigue increase → −0.31 satisfaction
Load × Personalization	−0.001	.011	−0.08	.933	Interaction not significant (M3 not supported)

*** p < .001. β values are unstandardized GEE estimates. Load and Personalization z-scored (SD=2 min/hr and SD=0.25 respectively) for comparability. The Load × Personalization row comes from M3.

GEE Coefficients (M2) — Effect on Satisfaction

6.3 — Factorial Comparison at Month 12

Two-way ANOVA on month-12 satisfaction confirms large independent effects of both ad load (η² = .297) and personalization (η² = .083), with no significant interaction (η² = .000, p = .855). Ad load and personalization operate as additive, independent levers.

Month-12 Satisfaction Heatmap — Ad Load × Personalization Level

The range from worst condition (high load + low personalization: 1.91) to best condition (low load + high personalization: 3.56) spans 1.65 points on the 5-point scale — a substantial effect with meaningful real-world implications for member experience.

6.4 — Survival Analysis: Churn Intent

A Weibull AFT model regressed time-to-first-churn-intent on member-level average fatigue, satisfaction, ad load, personalization, and price sensitivity. Key finding: ad load does not directly predict when members develop churn intent. Fatigue and satisfaction do.

Predictor	Coef	exp(coef)	p	Interpretation
Mean Fatigue	−0.301	.740	***	Higher fatigue → churn intent arrives 26% sooner
Mean Satisfaction	+0.289	1.336	***	Higher satisfaction → churn intent delayed 34%
Price Sensitivity	−0.255	.775	**	Price-sensitive members reach churn intent sooner
Ad Load	+0.008	1.008	.634	Not significant — effect is fully mediated by fatigue
Personalization	+0.005	1.005	.961	Not significant — effect mediated by satisfaction/fatigue
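
The exp(coef) column is the AFT time ratio: the multiplicative change in time-to-churn-intent per one-unit increase in the predictor. A short sketch of that conversion using the coefficients in the table; the dictionary and helper name are illustrative.

```python
import math

aft_coefs = {
    "Mean Fatigue": -0.301,
    "Mean Satisfaction": 0.289,
    "Price Sensitivity": -0.255,
    "Ad Load": 0.008,
    "Personalization": 0.005,
}


def time_ratio(coef):
    """exp(coef): multiplicative change in time-to-event per one-unit increase."""
    return math.exp(coef)


for name, coef in aft_coefs.items():
    ratio = time_ratio(coef)
    direction = "later" if ratio > 1 else "sooner"
    print(f"{name:18s} exp(coef) = {ratio:.3f}  -> {abs(ratio - 1):.0%} {direction}")
```

Small differences in the third decimal relative to the table can arise because the table's coefficients are themselves rounded.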

Key Mediation Finding

Ad load predicts churn intent only indirectly — its effect operates through fatigue accumulation. This is the Weibull AFT equivalent of a full mediation pattern: when fatigue is in the model, the direct ad load coefficient is near zero and non-significant. Practically, this means the most powerful intervention for reducing churn is not reducing ad load per se, but managing the experience of ad load through fatigue mitigation strategies (creative variety, personalization, natural break placement).

Monthly Churn Intent Rate Over 12 Months — by Ad Load

6.5 — Cross-Lagged Panel Model

To test the temporal directionality of the fatigue–satisfaction relationship, I applied a within-person demeaned cross-lagged model. This removes all stable between-person differences (including condition effects) and tests whether changes in one variable precede changes in the other.

Path	β	SE	t	p	R²
Path A: Fatigue_t−1 → Satisfaction_t	−0.098	.006	−16.79	***	.058
Path B: Satisfaction_t−1 → Fatigue_t	−0.034	.005	−6.88	***	.253

Both paths are statistically significant, confirming a bidirectional relationship. However, Path A is approximately 3× stronger than Path B (|−0.098| vs |−0.034|). Within the simulated longitudinal model, fatigue emerges as the dominant temporal antecedent of satisfaction — members who feel more fatigued in month t−1 report meaningfully lower satisfaction in month t, even after controlling for their prior satisfaction level. This pattern would need to be replicated in real member data before treating it as a generalizable finding.

07 · Phase 3 — Experimental Validation

Product Experiment: Creative Rotation Frequency

The longitudinal phase identified a clear product opportunity: creative diversity reduces fatigue accumulation, and fatigue mediates the ad-load-to-churn pathway. But correlational evidence — even causal-ordered evidence — has limits for product decisions. The next question is sharper and more actionable:

Research Question

If we increase creative rotation frequency, does the resulting fatigue reduction translate into measurable improvements in member experience and behavior?

This is the kind of question only a controlled experiment can answer. The longitudinal model says "yes, in theory." Product needs to know "yes, on real members, at this magnitude, for these segments, over this timeframe."

7.1 — Proposed Pre-analysis Plan

Note: This is an illustrative pre-analysis plan demonstrating how I would lock in design decisions before running the experiment. It is not registered with an external preregistration repository (e.g., OSF, AsPredicted); in production this plan would be formally registered.

Element	Specification
Primary hypothesis	Arm B (1-week rotation) reduces fatigue vs. Arm A (2-week control) over 4-week treatment period
Design	3-arm parallel trial — A (2-wk, control), B (1-wk, faster), C (4-wk, monotonicity check)
Sample	4,500 members (1,500 per arm); pre-treatment fatigue stratified randomization recommended
Duration	8 weeks: 4 pre-treatment baseline + 4 post-treatment exposure
Primary outcome	Weekly self-reported fatigue (5-point scale)
Secondary outcomes	Satisfaction · Episode completion · Churn intent
Primary analysis	Difference-in-Differences via GEE with exchangeable working correlation
Variance reduction	CUPED using pre-treatment fatigue baseline
Multiple testing	Benjamini-Hochberg FDR across 4 outcomes (q < 0.05)
Stopping rule	No interim analyses; full 8-week sample analyzed once

7.2 — Power Analysis (Simulation-Based)

Rather than relying on closed-form approximations that assume independence (which doesn't hold in this longitudinal design), I conducted simulation-based power analysis: generate 100 synthetic trials at each candidate sample size, fit the planned DiD model, and count the proportion of trials that detect the effect at α = 0.05.

Power Curve · 100 Monte Carlo Simulations per Sample Size

Conclusion: 300 members per arm achieves 96% power for the expected effect size. The actual study uses 1,500/arm for comfortable margin and to enable subgroup analyses.
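
The logic of the simulation-based approach can be sketched in a simplified form. This is not the planned GEE DiD model: as a stand-in, it z-tests the difference in pre-to-post change scores between two arms, which is the two-period DiD contrast. The change-score SD of 0.45 is a hypothetical value, so the printed power figures are illustrative and will not match the curve above.

```python
import math
import random
import statistics


def trial_detects_effect(n_per_arm, effect, sd_change, rng, z_crit=1.96):
    """Simulate one two-arm trial on pre-to-post change scores and z-test the difference."""
    control = [rng.gauss(0.0, sd_change) for _ in range(n_per_arm)]
    treated = [rng.gauss(effect, sd_change) for _ in range(n_per_arm)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt((statistics.variance(treated) + statistics.variance(control)) / n_per_arm)
    return abs(diff / se) > z_crit


def estimate_power(n_per_arm, effect=-0.125, sd_change=0.45, n_sims=100, seed=0):
    """Share of simulated trials that reject the null at two-sided alpha = 0.05."""
    rng = random.Random(seed)
    hits = sum(trial_detects_effect(n_per_arm, effect, sd_change, rng) for _ in range(n_sims))
    return hits / n_sims


for n in (50, 100, 200, 300):
    print(f"n per arm = {n:3d}: estimated power = {estimate_power(n):.2f}")
```

With only 100 simulations per sample size, each power estimate carries a Monte Carlo standard error of a few percentage points, so more replications would give a smoother curve.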

7.3 — Validation Checks

Two checks before touching the outcomes:

Sample Ratio Mismatch (SRM)

The classic A/B test pitfall: silent assignment bugs that cause the arms to be unbalanced. A chi-square test on observed vs. expected assignment proportions: χ² = 0.00, p = 1.00 → PASS. No assignment irregularities.
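
A minimal sketch of the SRM check for three equally sized arms. It uses the closed-form chi-square survival function for 2 degrees of freedom, exp(−x/2), so no statistics library is needed. The skewed counts are a hypothetical example of an assignment bug, and the function names are illustrative.

```python
import math


def srm_chi_square(observed):
    """Pearson chi-square of observed arm counts against an equal split."""
    expected = sum(observed) / len(observed)
    return sum((count - expected) ** 2 / expected for count in observed)


def p_value_three_arms(chi_square):
    """Chi-square survival function for df = 2 (three arms), which is exp(-x/2)."""
    return math.exp(-chi_square / 2.0)


balanced = [1500, 1500, 1500]      # planned allocation
skewed = [1600, 1450, 1450]        # hypothetical assignment bug

for counts in (balanced, skewed):
    stat = srm_chi_square(counts)
    print(counts, f"chi2 = {stat:.2f}, p = {p_value_three_arms(stat):.4f}")
```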

Baseline Balance (with SMD)

A subtle methodological point worth highlighting: with 4,500 members, traditional F-tests on baseline covariates can show "statistical significance" even when the actual differences are practically negligible. This is not Type I error inflation; it is high statistical power, which at large N lets the test flag differences too small to matter. Best practice is to report Standardized Mean Differences (SMD) instead, with conventional thresholds: |SMD| < 0.10 indicates negligible imbalance.

Covariate	F-test p	SMD (B–A)	SMD (C–A)	max |SMD|	Verdict
Pre-treatment fatigue	.006	+0.041	−0.016	0.041	✓ negligible
Pre-treatment satisfaction	.031	+0.040	−0.003	0.040	✓ negligible
Ad tolerance	.009	−0.019	+0.036	0.036	✓ negligible
Price sensitivity	.004	+0.060	+0.023	0.060	✓ negligible

All F-tests are "significant" by p-value, but all SMDs are well below the 0.10 threshold for meaningful imbalance. This is the large-sample randomization-check artifact — reporting only F-tests here would be misleading.
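
For concreteness, the SMD is the difference in arm means divided by the pooled standard deviation. A small sketch follows, assuming the common pooled-SD definition with equal weighting of the two arms. The data are tiny hypothetical arms chosen only to make the arithmetic easy to follow.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """(mean_t - mean_c) divided by the pooled standard deviation of the two arms."""
    pooled_sd = math.sqrt((statistics.variance(treated) + statistics.variance(control)) / 2.0)
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd


def is_negligible(smd, threshold=0.10):
    """Apply the conventional |SMD| < 0.10 rule."""
    return abs(smd) < threshold


# Tiny hypothetical arms, only to show the arithmetic.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [1.1, 2.1, 3.1, 4.1, 5.1]

smd = standardized_mean_difference(treated, control)
print(f"SMD = {smd:+.3f}, negligible: {is_negligible(smd)}")
```

Because the SMD is scaled by the standard deviation rather than the standard error, it does not shrink or grow with sample size the way a test statistic does.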

7.4 — Primary Analysis: Difference-in-Differences

The DiD design isolates the treatment effect from any pre-existing differences between arms. Coefficient estimates are the period × arm interaction, fitted via GEE with exchangeable working correlation to handle within-member dependence.

Contrast	DiD Effect	SE	95% CI	p	Interpretation
Arm B vs Arm A (1-wk vs 2-wk rotation)	−0.125	0.009	[−0.142, −0.109]	<.001	Faster rotation reduces fatigue by 0.125 points on 5-pt scale
Arm C vs Arm A (4-wk vs 2-wk rotation)	+0.118	0.008	[+0.101, +0.134]	<.001	Slower rotation increases fatigue by 0.118 points; monotonicity confirmed
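
One way to write the mean model behind these contrasts is sketched below. The notation is mine, not taken from the pipeline: Post_t indicates the treatment period, Arm_ik indicates membership in arm k, and Arm A is the reference.

```latex
\mathbb{E}[Y_{it}] = \beta_0 + \beta_1\,\mathrm{Post}_t
  + \sum_{k \in \{B, C\}} \gamma_k\,\mathrm{Arm}_{ik}
  + \sum_{k \in \{B, C\}} \delta_k\,\bigl(\mathrm{Post}_t \times \mathrm{Arm}_{ik}\bigr),
\qquad
\delta_k = \bigl(\bar{Y}_{k,\mathrm{post}} - \bar{Y}_{k,\mathrm{pre}}\bigr)
         - \bigl(\bar{Y}_{A,\mathrm{post}} - \bar{Y}_{A,\mathrm{pre}}\bigr)
```

In this notation the table reports δ_B = −0.125 and δ_C = +0.118, while the γ_k terms absorb any pre-existing level differences between arms.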

Monotonicity Check

The symmetric pattern of arms B and C around the control (one reduces fatigue by ~0.13, the other increases by ~0.12) is methodologically important: it confirms the effect is dose-responsive in the diversity dimension, not a quirk of any single comparison. This is exactly what arm C was designed to test.

7.5 — CUPED Variance Reduction

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique from Microsoft (Deng et al., 2013). The idea: members who are naturally higher on the outcome at baseline will also be higher post-treatment. By adjusting each member's post-treatment outcome for their baseline, we strip out predictable variance and tighten the treatment effect estimate.

Treatment Effect Estimate · Naive vs. CUPED-Adjusted

Method	Effect	SE	95% CI Width
Naive t-test	−0.114	0.0078	0.0305
CUPED-adjusted	−0.118	0.0073	0.0286
Improvement	12.6% variance reduction → same precision with ~13% smaller sample
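
The adjustment can be sketched in a few lines. This is a minimal single-covariate version under stated assumptions: θ is estimated as cov(pre, post) / var(pre) on pooled data, the helper names are hypothetical, and the five data points are made up to show the mechanics rather than drawn from the experiment.

```python
from statistics import fmean, variance


def covariance(x, y):
    """Sample covariance with an n - 1 denominator."""
    mean_x, mean_y = fmean(x), fmean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def cuped_adjust(post, pre):
    """Return post-period outcomes with the baseline-predictable part removed."""
    theta = covariance(pre, post) / variance(pre)
    pre_mean = fmean(pre)
    return [y - theta * (x - pre_mean) for x, y in zip(pre, post)]


# Hypothetical pooled data (both arms together), only to show the mechanics.
pre_fatigue = [1.0, 2.0, 3.0, 4.0, 5.0]
post_fatigue = [1.2, 1.9, 3.2, 3.8, 5.1]

adjusted = cuped_adjust(post_fatigue, pre_fatigue)
print(f"mean: {fmean(post_fatigue):.3f} -> {fmean(adjusted):.3f}")
print(f"variance: {variance(post_fatigue):.3f} -> {variance(adjusted):.3f}")
```

The adjustment leaves the mean unchanged and removes a share of variance equal to the squared pre/post correlation, so a 12.6% reduction corresponds to a correlation of roughly 0.35 between baseline and post-treatment fatigue.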

7.6 — Heterogeneous Treatment Effects

Average treatment effects can mask important variation across member segments. By splitting members into pre-treatment fatigue tertiles, we can ask: which members benefit most from faster rotation?

Treatment Effect by Pre-treatment Fatigue Tertile (Arm B vs A)

All three segments show statistically significant benefit, but the Mid-fatigue segment shows the largest effect (β = −0.136). This is the actionable nuance: low-fatigue members have less room to improve, and high-fatigue members may have other drivers (e.g., the price sensitivity that also shortened time to churn intent in the survival analysis). The Mid-fatigue segment (members who are starting to feel the load but haven't given up) is the optimal target for this intervention.

7.7 — Secondary Outcomes & FDR Correction

Running multiple outcome tests inflates the Type I error rate. The Benjamini-Hochberg procedure controls the false discovery rate (FDR) — the expected proportion of false positives among rejected nulls — at q < 0.05.

Outcome	Effect	Raw p	BH q-value	FDR-Sig?
Fatigue (primary)	−0.125	<.001	<.001	✓ Yes
Satisfaction	+0.056	<.001	<.001	✓ Yes
Churn intent	+0.107	.541	.639	— No
Episode completion	+0.041	.639	.639	— No
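
The q-values in the table follow from the standard Benjamini-Hochberg step-up adjustment, sketched here without external libraries. The two "<.001" entries are represented by 0.0001, a hypothetical stand-in since the exact p-values are not reported.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# "<.001" entries are represented by 0.0001, a hypothetical stand-in.
raw_p = {"Fatigue": 0.0001, "Satisfaction": 0.0001,
         "Churn intent": 0.541, "Episode completion": 0.639}
q = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
for outcome, value in q.items():
    print(f"{outcome:20s} q = {value:.3f}  significant: {value < 0.05}")
```

Churn intent inherits q = .639 from episode completion because its own scaled value (.541 × 4/3 ≈ .721) is capped by the larger p-value's q.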

Interpretation of Mixed Results

Fatigue and satisfaction respond immediately and substantially. Churn intent and completion do not — at least not in 4 weeks. This is consistent with the longitudinal phase: behavioral consequences accumulate over months, not weeks. A 4-week experiment can validate the upstream mechanism (attitudinal response) but cannot validate the downstream business impact (retention). A longer-duration study or a behavioral surrogate would be needed for that.

7.8 — Product Recommendations

Translating findings into actionable product strategy is where research either earns trust or doesn't. Based on this experiment, my recommendations to the product team would be:

Adopt faster creative rotation as default

Moving from 2-week to 1-week rotation reduces fatigue accumulation by ~25% relative to baseline trajectory. The CUPED-adjusted effect is precise (CI: [−0.13, −0.10]). Costs of more frequent creative refresh should be weighed against this experience gain.

Target the Mid-fatigue segment for highest ROI

If full rollout has cost constraints, prioritize members in the middle fatigue tertile: they show the largest treatment response and represent the segment most likely on the trajectory toward problematic experience.

Extend to a longer-duration follow-up to validate retention impact

Attitudinal improvements are confirmed; behavioral (churn) impact requires a longer observation window. Recommend a 16-week or 26-week extended experiment with a longitudinal hold-out to measure downstream retention.

Build into the ongoing measurement system

The instrumentation here (fatigue, satisfaction, completion, churn intent) should be operationalized as ongoing tracking. Rotation cadence experiments are unlikely to be the only relevant intervention; the next likely candidates are placement (mid-roll vs. natural-break) and personalization quality.

08 · Key Findings

Summary of Demonstration Findings

All findings below are derived from synthetic data generated under the modeling assumptions documented in §3. They illustrate what the analytic pipeline would reveal under those assumptions, not validated claims about real Netflix members. Each pattern is a testable hypothesis for a future study with real member data, not a generalizable conclusion.

In the Synthetic Framework, Fatigue Mediates the Ad Load → Churn Intent Pathway

Under the specified data-generating assumptions, ad load does not directly predict churn intent timing (Weibull AFT, p = .634). Its effect operates through fatigue accumulation. The implication this framework would help test in a real study: managing the fatigue experience (creative variety, personalization, natural break placement) may be a more tractable lever than reducing ad inventory.

Ad Load and Personalization Appear as Independent, Additive Levers

The interaction between ad load and personalization was non-significant in this simulation (β = −0.001, p = .933). Both factors show large independent effects (η² = .297 and .083 respectively). This framework would allow product and monetization teams to test whether the two levers indeed operate independently in real-member data, or whether genuine interactions emerge that the synthetic model does not capture.

The Simulated Longitudinal Model Identifies Fatigue as the Dominant Temporal Antecedent of Satisfaction

Within this synthetic framework, cross-lagged analysis shows fatigue at t−1 predicting satisfaction at t (β = −0.098, p < .001) about 3× more strongly than the reverse path (β = −0.034). This is a temporal-ordering pattern produced by the simulation, not a causal claim; establishing whether fatigue actually precedes satisfaction in real data would require a designed study with appropriate identification strategy.

The Simulated Satisfaction Decline Emerges Early and Persists

Under high ad load (6.5 min/hr) in the synthetic model, satisfaction drops 0.15 points in the first two months alone (from 2.84 to 2.69) and continues declining to 2.20 by month 12. The framework suggests that, if real-member data show a similar pattern, experience interventions may be most impactful early in the membership lifecycle, before fatigue reaches the plateau of its AR(1) steady state.

Measurement Diagnostic: Trust Construct Requires Refinement

The Trust construct showed weak convergent validity in the demonstration (AVE = .387, below the .50 threshold). This is a measurement diagnostic example, not an empirical finding about real ad-trust attitudes. It illustrates the kind of psychometric issue a careful CFA can surface: a revised instrument would distinguish brand trust, data privacy comfort, and ad content credibility as separate facets.
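For readers less familiar with the diagnostic, AVE is the mean of the squared standardized loadings, and composite reliability (CR) follows from the same loadings. The sketch below uses four hypothetical loadings chosen only so that the arithmetic lands near the reported .387; they are not the loadings estimated in the demonstration.

```python
def average_variance_extracted(loadings):
    """AVE: mean of squared standardized factor loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

def composite_reliability(loadings):
    """CR: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    true_part = sum(loadings) ** 2
    error_part = sum(1 - l ** 2 for l in loadings)
    return true_part / (true_part + error_part)

# Hypothetical standardized loadings for a four-item Trust scale.
trust_loadings = [0.70, 0.62, 0.58, 0.58]

ave = average_variance_extracted(trust_loadings)
cr = composite_reliability(trust_loadings)
meets_ave_threshold = ave >= 0.50  # Fornell-Larcker convergent validity criterion

print(f"AVE = {ave:.3f}, CR = {cr:.3f}, AVE >= .50: {meets_ave_threshold}")
```

Under these assumed loadings a scale can clear the conventional .70 reliability bar while still failing the AVE criterion, which is one reason both statistics are usually reported together.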

09 · Business Implications

Implications for Streaming Ad Experience Research

For Product & Ads Experience Teams

If the mediation pattern observed in this simulation replicates in real-member data, it would have a direct product implication: optimizing fatigue management may be a more tractable target than ad load reduction, because it can be addressed through product levers (break placement, creative rotation frequency, personalization quality) that don't directly reduce ad inventory or revenue. A 40% reduction in fatigue accumulation through creative diversity — estimated from OTT industry benchmarks — could meaningfully extend the time before members reach dissatisfaction thresholds. This is a hypothesis the framework is designed to test, not a validated effect.

For Measurement Design

Ad fatigue should be measured as a time-varying construct within longitudinal survey designs, not as a single cross-sectional snapshot. The AR(1) dynamics observed here suggest that surveys administered at months 3–4 will underestimate eventual steady-state fatigue, while surveys administered after month 8 will better capture the plateau. Diary study designs or embedded in-app pulse surveys (1–2 items per session) would provide higher temporal resolution than traditional monthly surveys.
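The survey-timing point follows from how an AR(1) process approaches its plateau. A minimal sketch, assuming constant monthly exposure, a starting level of zero, and a hypothetical persistence parameter φ = 0.8 (the value calibrated in §3 is not restated here):

```python
def ar1_path(phi, monthly_input, n_months, start=0.0):
    """Iterate f_t = phi * f_{t-1} + monthly_input and return [f_1, ..., f_n]."""
    path, level = [], start
    for _ in range(n_months):
        level = phi * level + monthly_input
        path.append(level)
    return path

def share_of_plateau(phi, month):
    """Closed form for a path starting at zero: f_t / f* = 1 - phi**t."""
    return 1 - phi ** month

PHI = 0.8            # hypothetical persistence, for illustration only
MONTHLY_INPUT = 1.0  # arbitrary units; it cancels out of the share
plateau = MONTHLY_INPUT / (1 - PHI)  # steady state f* = c / (1 - phi)

path = ar1_path(PHI, MONTHLY_INPUT, 12)
for month in (3, 4, 8, 12):
    print(month, round(path[month - 1] / plateau, 3))
```

With φ = 0.8, fatigue has reached roughly 49% of its plateau at month 3, 59% at month 4, 83% at month 8, and 93% at month 12. A lower persistence value would converge faster, so the recommended survey window depends directly on the calibrated φ.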

For Research Methodology

This demonstration illustrates the value of connecting attitudinal survey data (what members say) with behavioral signals (what they do: drop-off rates, episode completion, churn). The Weibull AFT finding — that attitudinal fatigue predicts churn timing more directly than ad load — would be impossible to detect from behavioral data alone, and invisible from survey data without longitudinal follow-up.

Research Agenda Suggestion

The most impactful next study would be a 6-month longitudinal diary study with embedded behavioral linkage: monthly 5-item pulse surveys (fatigue, relevance, intrusiveness) linked to individual-level streaming logs. The synthetic framework developed here provides a starting measurement instrument, expected effect sizes for power analysis, and an analysis plan — ready to be adapted and recalibrated with real member data, not deployed as-is.
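As a rough illustration of how expected effect sizes feed a power analysis, the sketch below uses the normal-approximation formula for comparing two group means at a single time point. It ignores repeated measures and clustering, so it is a starting point only; the effect sizes and the retention rate are hypothetical inputs, not values from the simulation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample mean comparison.

    d is the standardized mean difference (Cohen's d).
    Formula: n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

def inflate_for_attrition(n, retention):
    """Recruitment target so that n participants remain at the final wave."""
    return ceil(n / retention)

for d in (0.3, 0.5):  # hypothetical effect sizes
    n = n_per_group(d)
    print(d, n, inflate_for_attrition(n, retention=0.8))
```

A longitudinal design with several waves per member would usually need fewer participants than this single-wave figure suggests, but the gain depends on the within-person correlation, which a pilot would have to estimate.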

Limitations of This Demonstration

Limitation	Impact	How to Address in Real Study
Synthetic data cannot capture real member heterogeneity	Effect sizes may differ substantially	Pilot survey (n≈200) to recalibrate parameters
AR(1) fatigue model assumes monotonic accumulation	Underestimates long-term fatigue decay (Kronrod & Huber reversal)	Extended observation window (>18 months) with non-linear growth model
Trust construct has weak AVE (.387)	Trust-related path coefficients are attenuated	Revise to distinguish brand trust, privacy comfort, content credibility
Single-period condition assignment (members in one condition throughout)	Cannot test within-person condition changes	Crossover design or randomized condition change at mid-study
No content type moderator	Fatigue may differ by content genre (sports vs. drama vs. series)	Add content-type as a moderator in the structural model

10 · About This Work

Researcher Profile & Purpose of This Demonstration

Why I Built This

This project is designed as a methodological bridge: it demonstrates how I would enter a new product domain, translate business questions into measurable constructs, generate evidence-informed synthetic data, and prepare a validated analysis plan — before working with real member-level data.

My research background is in educational psychology, psychometrics, and quantitative methods — not advertising or consumer product research. Rather than claiming experience I don't have, I chose to demonstrate how I would approach an unfamiliar product domain: by grounding the problem in published theory, building a transparent and testable measurement framework, and applying longitudinal methods appropriate to the research question.

In short: I can translate a product question into a measurable longitudinal research design, build an evidence-informed synthetic prototype before accessing internal data, and identify which behavioral and attitudinal signals should be validated with real member data. That is what this portfolio demonstrates.

This framework — evidence-informed synthetic data, factorial sensitivity analysis, and a full CFA → GEE → Survival → CLPM pipeline — represents how I would design and analyze a first study before having access to real member data.

Core Methodological Competencies

✓	Survey instrument design & psychometric validation
✓	Confirmatory Factor Analysis (CFA) & SEM
✓	Longitudinal modeling (GEE, LMM, growth models)
✓	Survival / event history analysis
✓	Cross-lagged panel models & temporal ordering in longitudinal observational data
✓	Factorial design & sensitivity analysis
✓	Simulation-based research design
✓	Python (statsmodels, lifelines, semopy, scikit-learn)
✓	R (lavaan, lme4, survival) — for cross-validation
→	Conjoint analysis & MaxDiff (actively developing)

Synthetic Data Disclosure (Full)

This entire report is based on computationally generated synthetic data. No Netflix member data, internal metrics, proprietary research, or confidential information was used or referenced. Parameter values were derived exclusively from (a) peer-reviewed academic literature, (b) publicly available industry reports, and (c) mathematical back-calculation from published benchmarks. This work is intended solely as a methodological portfolio demonstration.

11 · References

Selected References

Theoretical frameworks, empirical benchmarks, and industry sources used in parameter calibration and model design.

Reference	Role in This Study
Bhattacherjee, A. (2001). Understanding information systems continuance: An expectation-confirmation model. MIS Quarterly, 25(3), 351–370.	Theoretical basis for Continuance Intention construct
Brackett, L. K., & Carr, B. N. (2001). Cyberspace advertising vs. other media: Consumer vs. mature student attitudes. Journal of Advertising Research, 41(5), 23–32.	Ad value → satisfaction path coefficients
Cho, C.-H., & Cheon, H. J. (2004). Why do people avoid advertising on the internet? Journal of Advertising, 33(4), 89–97.	Ad avoidance model; intrusiveness → churn pathways
Choi, S. M., & Rifon, N. J. (2002). Antecedents and consequences of web advertising credibility. Journal of Interactive Advertising, 3(1), 12–24.	Trust construct theoretical basis
Conviva. (2023). State of Streaming. Conviva Inc.	Post-ad drop-off baseline (~17%); behavioral benchmarks
Deloitte. (2026). Digital Media Trends Survey. Deloitte Insights.	Ad-supported streaming adoption rate (~68% of subscribers)
Ducoffe, R. H. (1996). Advertising value and advertising on the web. Journal of Advertising Research, 36(5), 21–35.	Advertising Value Model; relevance/trust → satisfaction (+), intrusiveness → satisfaction (−)
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18(1), 39–50.	AVE and CR thresholds; Fornell-Larcker discriminant validity criterion
Kim, Y. J., & Han, J. (2014). Why smartphone advertising attracts customers: A model of Web advertising, flow, and personalization. Computers in Human Behavior, 33, 256–269.	Mobile advertising meta-analytic path coefficient estimates
Koyck, L. M. (1954). Distributed Lags and Investment Analysis. North-Holland.	Adstock carryover model structure: adstock_t = exposure_t + λ × adstock_{t-1}
Kronrod, A., & Huber, J. (2019). Ad wearout wearout: How time can reverse the negative effect of frequent advertising repetition on brand preference. International Journal of Research in Marketing, 36(2), 306–324.	Fatigue persistence estimates; short-term vs. long-term reversal pattern; model limitation
Leone, R. P. (1995). Generalizing what is known about temporal aggregation and advertising carryover. Marketing Science, 14(3), G141–G150.	Adstock λ = 0.47 (meta-analytic median across 128 studies)
Li, H., Edwards, S. M., & Lee, J.-H. (2002). Measuring the intrusiveness of advertisements: Scale development and validation. Journal of Advertising, 31(2), 37–47.	Intrusiveness construct definition, scale structure, and items
Mynt Agency. (2025). Predicting Ad Fatigue: When to Refresh Your Creative. Retrieved from articles.myntagency.com	40% lifecycle extension from creative rotation; TV refresh cycle benchmarks
Netflix. (2026, May). Netflix Upfront 2026 Presentation. Netflix, Inc.	250M+ global monthly active viewers on ad-supported plan (2026)
Parks Associates. (2024). OTT Video Market Tracker. Parks Associates.	SVOD annual churn rate ~20–25%; monthly baseline ~1.9% calibration
Pechmann, C., & Stewart, D. W. (1988). Advertising repetition: A critical review of wearin and wearout. Current Issues and Research in Advertising, 11(1-2), 285–329.	Fatigue persistence across sessions; wear-out dynamics
Reuters. (2025). Netflix ad-supported tier reaches 94 million monthly active users. Reuters.	2025 ad-tier MAU baseline (94 million)
Strategus. (2025). What is Ad Creative Fatigue & How to Minimize it on OTT/CTV Channels. Retrieved from strategus.com	OTT creative wear-out cycle 3–4 weeks; diversity mitigation parameter calibration