Portfolio Demonstration · Synthetic Research

Measuring Advertising Experience Quality in Streaming:
A Longitudinal Research Framework

An evidence-informed synthetic data demonstration showing how attitudinal, behavioral, and longitudinal measurement methods can be applied to advertising experience research on streaming platforms.

What This Demonstration Shows

I built this portfolio to demonstrate how I would enter an unfamiliar product domain without overclaiming prior advertising experience. The page shows that I can:

  1. translate an ads-experience product question into measurable constructs,
  2. generate evidence-informed synthetic longitudinal data from public benchmarks and theory,
  3. validate measurement quality and estimate temporal behavioral pathways, and
  4. convert findings into testable product recommendations.

Seungman Kim, PhD
Educational Psychology · Statistics & Measurement
Senior Research Associate, TTUHSC School of Nursing
Methods Used: CFA · GEE · Survival Analysis · Cross-Lagged Panel Model · Factorial ANOVA
Data: Synthetic longitudinal panel, 3,150 members × 12 months
Tools: Python · semopy · statsmodels · lifelines · Chart.js

This report uses entirely synthetic data generated from publicly available industry benchmarks and peer-reviewed literature. No Netflix internal data was accessed or approximated. This demonstration is not intended to estimate Netflix's actual advertising outcomes; rather, it shows how I would structure, simulate, and validate a longitudinal research design before applying it to real member-level data.

Why Advertising Experience Research Matters

Ad-supported streaming is no longer a niche product tier. Netflix's ad-supported plan reached approximately 94 million monthly active users in 2025, with U.S. members averaging around 41 hours of monthly viewing. By 2026, Netflix reported that its ads plan reached more than 250 million global monthly active viewers — confirming that this growth has already arrived, not merely projected. Note: Netflix's public advertising metrics shifted from "monthly active users" to "monthly active viewers" between these reporting periods; the latter reflects household-level viewership and should be interpreted as an indicator of scale rather than a directly comparable subscriber count. The rapid scale of this shift makes advertising experience quality a timely strategic research problem, not just a technical modeling exercise. How platforms balance monetization with member experience will be a defining competitive variable in the next phase of streaming growth.

As streaming platforms shift toward ad-supported models, understanding how advertising experience shapes member satisfaction and retention becomes a strategic research priority. Yet the core mechanism — how ad fatigue accumulates over time and at what point it begins to damage the viewing experience — remains poorly understood in a longitudinal context.

This demonstration builds a transparent, evidence-grounded research framework to answer three questions: (1) How do ad load and personalization independently affect member satisfaction over a 12-month period? (2) Does ad fatigue mediate the effect of ad load on churn intent? (3) Is the relationship between fatigue and satisfaction bidirectional, and if so, which direction dominates?

I generated a synthetic longitudinal panel (3,150 members × 12 months) using parameters calibrated to published research and industry benchmarks, then applied a full measurement and modeling pipeline including Confirmatory Factor Analysis, Generalized Estimating Equations, Weibull Survival Analysis, and Cross-Lagged Panel Modeling.

3,150 simulated members
12 monthly time points
37,800 total observations
9 factorial conditions
18 survey items (6 constructs)
4 analytical models

End-to-End Research Lifecycle

The Netflix Ads Experiences role calls for "expert navigation of the full research lifecycle from foundational and generative exploration to tactical execution." This portfolio is organized along that lifecycle. Each phase answers a distinct research question and feeds the next.

Lifecycle phases, ordered from generative to evaluative:

PHASE 0 · Foundational Exploration: What language do members use about advertising? Computational NLP on social media data (LDA · sentiment). Prior published work.
PHASE 1 · Construct Measurement: Are our constructs valid and reliable? Survey design; CFA · IRT; reliability/validity. This portfolio §5.
PHASE 2 · Longitudinal Observation: How does ad experience evolve over time? GEE · Survival; cross-lagged; temporal ordering. This portfolio §6.
PHASE 3 · Experimental Validation: Does intervention X actually work? A/B/C product test; CUPED · HTE · FDR; product decisions. This portfolio §7.

Phase 0 is anchored by my prior published computational qualitative work (NLP/topic modeling on large-scale social media data) — see About section. Phases 1–3 are demonstrated below.

Research Framework & Design

Factorial Structure

The design crosses three levels of ad load (2.5, 4.5, 6.5 min/hr) with three levels of personalization (low, medium, high), yielding 9 experimental conditions with 350 simulated members each. A third factor, ad diversity (creative variety), was varied within conditions to assess its fatigue-mitigating effect.

Longitudinal Dynamics

Unlike cross-sectional designs, this framework models two time-dependent processes that make ad experience fundamentally different from a single-exposure study:

Adstock Carryover (Koyck, 1954)
Advertising effects carry over across months. Each month's cumulative ad exposure is adstock_t = exposure_t + λ × adstock_{t-1}, where λ = 0.47 (Leone 1995 meta-analysis median). The "weight" of advertising therefore compounds: members who have seen many ads feel their effect more strongly.

Fatigue Accumulation (AR-1 with mitigation)
Ad fatigue accumulates as a first-order autoregressive process with a persistence coefficient α = 0.72 (from Pechmann & Stewart, 1988; Kronrod & Huber, 2019). Ad variety (creative rotation) reduces the monthly increment, calibrated from OTT industry benchmarks showing 3–4 creative variants extend ad lifecycle by ~40%.
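
To make the carryover recursion concrete, the minimal sketch below iterates adstock_t = exposure_t + λ × adstock_{t-1} for a constant monthly exposure and compares the result with the geometric-series limit exposure / (1 − λ). The exposure of 100 is an arbitrary illustrative unit, not a parameter from the simulation, and the function name is hypothetical.

```python
def adstock_path(exposures, lam=0.47, start=0.0):
    """Return the Koyck adstock series for a sequence of per-period exposures."""
    path, state = [], start
    for x in exposures:
        state = x + lam * state  # this period's exposure plus carried-over stock
        path.append(state)
    return path


# Constant exposure of 100 (arbitrary units, illustration only) for 12 months
path = adstock_path([100.0] * 12)
steady_state = 100.0 / (1 - 0.47)  # limit of the recursion under constant exposure

print("month 1:", round(path[0], 1))
print("month 12:", round(path[-1], 1))
print("closed-form limit:", round(steady_state, 1))
```

With λ = 0.47 the series converges within a few months, settling at roughly 1.9 times a single month's exposure.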

Structural Model

The theoretical backbone integrates three published frameworks:

[Diagram: structural model. Nodes: Ad Load; Personalization; Relevance; Intrusiveness; Trust; Ad Fatigue (AR-1); Satisfaction; Continuance Intent; Churn Intent. Dynamics: Adstock + AR(1), λ = 0.47, α = 0.72.]

How Parameters Were Calibrated

Every numerical value in the simulation has an explicit justification. Parameters were either (a) taken directly from published estimates, (b) derived by back-calculation from a defensible target outcome, or (c) stated explicitly as design assumptions. No key parameter was left undocumented; each was sourced, calibrated, or explicitly treated as a modeling assumption subject to sensitivity analysis.

Calibration Philosophy
All major simulation parameters were either drawn directly from published sources, calibrated from public industry benchmarks via back-calculation, or explicitly labeled as modeling assumptions and evaluated through sensitivity analysis. Where no direct empirical estimate was available, the approach was to define a defensible target outcome, solve for the parameter value that produces it, and stress-test conclusions across a plausible range — making the simulation transparent and contestable rather than dependent on any single assumed value.

Source Hierarchy

Level | Source Type | Examples in This Study
Theoretical | Peer-reviewed academic basis | Ducoffe (1996); Li et al. (2002); Cho & Cheon (2004); Koyck (1954); Leone (1995) meta-analysis
Public company | Official Netflix communications | Netflix 2026 Upfront figures; Netflix Help Center ad-tier descriptions
Industry benchmark | Cross-industry research reports | Parks Associates 2024; Conviva 2023; Deloitte 2026; MediaPost
Directional | Agency/vendor blog estimates | Strategus 2025; Mynt Agency 2025 (used directionally, not as exact values)
Modeling assumption | Back-calculated or design choices | Diversity mitigation δ = 0.49; adstock→fatigue increment 0.28; initial fatigue distribution (all sensitivity-tested)

Core Parameter Table

Parameter | Value | Justification | Source
Target monthly viewing | 41 hrs | Netflix ad-tier U.S. average viewing reported across multiple industry sources | Marketing Brew / Netflix upfront (2023)
Default ad load | 4.5 min/hr | Industry-reported estimate: multiple sources place Netflix's ad tier in the 4–5 min/hr range. Used as a design assumption; not a formally published Netflix specification. | Industry-reported estimate (MediaPost, Omdia, 2023–24); used as simulation baseline
Adstock carryover λ | 0.47 | Median λ from meta-analysis of 128 advertising studies | Leone (1995)
Fatigue persistence α | 0.72 | Estimated from wear-out literature showing fatigue persists across viewing sessions | Pechmann & Stewart (1988); Kronrod & Huber (2019)
Diversity mitigation coefficient | 0.49 | Back-calculated from 40% lifecycle extension benchmark for 3–4 creative variants. Equation: (1 − m × 0.30) / (1 − m × 0.80) = 1.40 → m = 0.49 | Mynt Agency (2025); Strategus OTT benchmark (2025)
Adstock → fatigue increment | 0.28 | Back-calculated targeting fatigue steady-state ≈ 0.48 (on the [0, 1] scale) at medium load. increment = 0.48 × (1 − α) / (adstock_norm_SS × div_factor) | Derived from Strategus (2025) OTT wear-out cycle
Personalization → relevance (β) | 0.40 | Personalized ads increase ad tolerance by ~40% | Google/Ipsos (2019)
Baseline churn rate | ~2%/mo | SVOD annual churn 20–25% → monthly 1.7–2.1%. Logit intercept calibrated accordingly. | Parks Associates (2024)
Post-ad drop-off baseline | 17% | Viewers abandoning content within 30s of mid-roll ad insertion | Conviva Streaming Benchmark (2023)
Ad-supported streaming share | 68% | Proportion of streaming subscribers using ad-supported options | Deloitte Digital Media Trends (2026)
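
The two back-calculations in the table can be reproduced in a few lines. The sketch below solves the diversity equation for m and then applies the increment formula. Several inputs are my own assumptions for illustration and are not stated in the table: monthly exposure is taken as ad load × viewing hours (4.5 × 41), the normalization constant 180 is borrowed from the simulation excerpt, and average diversity is set to 0.5. The function names are hypothetical.

```python
def solve_diversity_mitigation(extension=1.40, low_div=0.30, high_div=0.80):
    """Solve (1 - m*low_div) / (1 - m*high_div) = extension for m."""
    # Rearranged: 1 - m*low = ext - ext*m*high  ->  m = (ext - 1) / (ext*high - low)
    return (extension - 1.0) / (extension * high_div - low_div)


def back_calculate_increment(target_ss, alpha, adstock_norm_ss, div_factor):
    """increment = target * (1 - alpha) / (adstock_norm_SS * div_factor)."""
    return target_ss * (1.0 - alpha) / (adstock_norm_ss * div_factor)


m = solve_diversity_mitigation()

# ASSUMPTIONS (illustration only, not taken from the parameter table):
exposure = 4.5 * 41.0                         # min/hr x monthly viewing hours
adstock_ss = exposure / (1 - 0.47)            # steady-state adstock under constant exposure
adstock_norm_ss = adstock_ss / (adstock_ss + 180.0)
div_factor = 1 - m * 0.5                      # average diversity assumed to be 0.5

increment = back_calculate_increment(0.48, 0.72, adstock_norm_ss, div_factor)
print("m =", round(m, 3), "| increment =", round(increment, 3))
```

Under these assumed inputs the solved m matches the tabled 0.49 and the increment lands near the tabled 0.28; different exposure or diversity assumptions would shift the increment.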

Theoretical Frameworks Integrated

Framework | Role in Model | Source
Advertising Value Model | Relevance, trust → satisfaction (+); intrusiveness, fatigue → satisfaction (−) | Ducoffe (1996); Brackett & Carr (2001)
Web Ad Intrusiveness Scale | Intrusiveness construct definition and item structure | Li, Edwards & Lee (2002)
Internet Ad Avoidance Model | Goal impediment, ad clutter, prior negative experience → avoidance behavior | Cho & Cheon (2004)
Adstock / Koyck Model | Carryover effect: adstock_t = exposure_t + λ × adstock_{t-1} | Koyck (1954); Leone (1995)
Ad Wear-out | Fatigue accumulation and the short-term vs. long-term reversal pattern | Pechmann & Stewart (1988); Kronrod & Huber (2019)

Model Limitation Note
Kronrod & Huber (2019) show that advertising-induced irritation reverses over longer time horizons (short-term negative → long-term positive for brand preference as memory effects dominate). The current AR(1) fatigue model assumes monotonic accumulation within the 12-month window — appropriate for short-term experience research, but a known simplification. Extending to a non-linear decay model would be a natural next step with real data.

Data Generation & Analysis Pipeline

Key excerpts of the code are shown below. The full pipeline runs in a single Python environment.

ads_experience_synthetic_v2.py — Core simulation engine (key excerpts)
# Evidence-Informed Synthetic Longitudinal Data Generator
# Parameters calibrated from published literature — see Parameter Table above

from dataclasses import dataclass
import numpy as np

@dataclass
class EffectProfile:
    # All coefficients in 5-point scale absolute units (not standardized)
    # This ensures between-condition differences are visible in output
    name: str = "base"
    personalization_to_relevance: float = 1.20   # Google/Ipsos 2019: +40% tolerance
    diversity_fatigue_mitigation: float = 0.49   # Back-calc: 40% lifecycle extension
    adload_to_intrusiveness:      float = 0.28   # Li, Edwards & Lee (2002)
    adstock_to_fatigue_increment: float = 0.28   # Back-calc: SS fatigue ≈ 0.48 at 4.5min/hr
    relevance_to_satisfaction:    float = 0.40   # Ducoffe (1996); Kim & Han (2014)
    intrusiveness_to_satisfaction: float = -0.35
    fatigue_to_satisfaction:      float = -0.32
    trust_to_satisfaction:        float = 0.35
    continuance_to_churn_logodds: float = -0.80
    fatigue_to_churn_logodds:     float = 1.20   # Parks Associates 2024: ~2%/mo baseline

def simulate_one_scenario(cfg, ep):
    """
    Core longitudinal data generator.
    Time-varying dynamics:
      [1] Adstock: adstock_t = exposure_t + λ × adstock_{t-1}  [Koyck 1954]
      [2] Fatigue: AR(1) in bounded [0,1] state with variety mitigation
      [3] Latent constructs: generated in absolute 5-pt scale units
      [4] Behavioral outcomes: logistic models calibrated to benchmarks
    """
    fatigue_state = rng.uniform(0.02, 0.06, size=n)  # New subscriber: minimal prior fatigue
    adstock_state = np.zeros(n)

    for month in range(1, cfg.n_months + 1):
        # Adstock update (Koyck carryover)
        adstock_state = ad_exposure + cfg.adstock_lambda * adstock_state
        adstock_norm  = adstock_state / (adstock_state + 180.0)

        # Fatigue update: AR(1) with variety mitigation (Kronrod & Huber 2019)
        fat_increment = (ep.adstock_to_fatigue_increment * adstock_norm
                        * (1 - ep.diversity_fatigue_mitigation * diversity))
        fatigue_state = np.clip(cfg.fatigue_persistence * fatigue_state
                                + fat_increment + noise, 0, 1)

        # Satisfaction structural model (Ducoffe 1996)
        satisfaction_lat = (3.00
            + ep.relevance_to_satisfaction      * (relevance_lat    - 3.0)
            + ep.trust_to_satisfaction          * (trust_lat        - 3.0)
            + ep.intrusiveness_to_satisfaction  * (intrusiveness_lat - 2.5)
            + ep.fatigue_to_satisfaction        * (fatigue_lat      - 2.5)
            + re_satisfaction)

        # Churn intent: logistic, calibrated to ~2%/month (Parks Associates 2024)
        churn_logit = (-4.20
            + ep.continuance_to_churn_logodds * (continuance_lat - 3.0)
            + ep.fatigue_to_churn_logodds     * fatigue_state
            + 0.25 * price_sensitivity)
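
The excerpt above omits its setup (random generator, sample size, exposure, noise), so it does not run on its own. The sketch below is a minimal self-contained version of dynamics [1] and [2] plus the fatigue term of the churn logit. It is not the full engine: exposure as ad load × 41 viewing hours, uniform diversity on [0, 1], and a noise SD of 0.02 are my assumptions, and the continuance and price-sensitivity terms are dropped.

```python
import numpy as np


def simulate_fatigue(ad_load=4.5, n=350, n_months=12, lam=0.47, alpha=0.72,
                     increment=0.28, mitigation=0.49, seed=0):
    """Toy adstock + AR(1) fatigue loop; returns an (n_months, n) fatigue array."""
    rng = np.random.default_rng(seed)
    ad_exposure = ad_load * 41.0                # ASSUMPTION: min/hr x monthly hours
    diversity = rng.uniform(0.0, 1.0, size=n)   # ASSUMPTION: creative variety in [0, 1]
    fatigue = rng.uniform(0.02, 0.06, size=n)   # minimal prior fatigue, as in the excerpt
    adstock = np.zeros(n)
    history = np.empty((n_months, n))
    for t in range(n_months):
        adstock = ad_exposure + lam * adstock               # Koyck carryover
        adstock_norm = adstock / (adstock + 180.0)          # squash into (0, 1)
        fat_increment = increment * adstock_norm * (1 - mitigation * diversity)
        noise = rng.normal(0.0, 0.02, size=n)               # ASSUMPTION: noise scale
        fatigue = np.clip(alpha * fatigue + fat_increment + noise, 0.0, 1.0)
        history[t] = fatigue
    return history


def churn_probability(fatigue, intercept=-4.20, fatigue_coef=1.20):
    """Logistic churn-intent probability using only the fatigue term."""
    return 1.0 / (1.0 + np.exp(-(intercept + fatigue_coef * fatigue)))


low = simulate_fatigue(ad_load=2.5)
high = simulate_fatigue(ad_load=6.5)
print("month-12 mean fatigue (low, high):", low[-1].mean(), high[-1].mean())
print("month-12 mean churn probability (high load):", churn_probability(high[-1]).mean())
```

Because both runs share a seed, the only difference between them is ad load, which makes the gap in month-12 fatigue easy to attribute.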
ads_analysis_pipeline.py — Full analysis pipeline (key excerpts)
# ── CFA: Confirmatory Factor Analysis ──────────────────────────────
import semopy

model_spec = """
Relevance     =~ relevance_1 + relevance_2 + relevance_3
Intrusiveness =~ intrusiveness_1 + intrusiveness_2 + intrusiveness_3
Trust         =~ trust_1 + trust_2 + trust_3
Fatigue       =~ fatigue_1 + fatigue_2 + fatigue_3
Satisfaction  =~ satisfaction_1 + satisfaction_2 + satisfaction_3
Continuance   =~ continuance_1 + continuance_2 + continuance_3
"""
model = semopy.Model(model_spec)
model.fit(df_month6[item_cols])
fit_stats = semopy.calc_stats(model)

# ── GEE: Generalized Estimating Equations ──────────────────────────
# GEE chosen over LMM because ad condition variables are constant
# within members (between-subject design), causing singularity in
# LMM's random effect estimation. GEE with exchangeable correlation
# structure provides valid inference for this design.
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Gaussian
from statsmodels.genmod.cov_struct import Exchangeable

gee_m2 = GEE.from_formula(
    "satisfaction_score ~ time_c + load_z + pers_z + fatigue_score",
    groups=df["member_id"], data=df,
    family=Gaussian(), cov_struct=Exchangeable()
).fit()

# ── Cross-Lagged Panel Model ───────────────────────────────────────
# Within-person demeaning removes between-subject confounds,
# isolating the lagged temporal relationship.
import statsmodels.formula.api as smf

for col in ["fatigue_score", "satisfaction_score", "fat_lag1", "sat_lag1"]:
    m = df.groupby("member_id")[col].transform("mean")
    df[col + "_dv"] = df[col] - m

path_A = smf.ols("satisfaction_score_dv ~ fat_lag1_dv + sat_lag1_dv + time_c",
                 data=df).fit(cov_type="HC3")
path_B = smf.ols("fatigue_score_dv ~ sat_lag1_dv + fat_lag1_dv + time_c",
                 data=df).fit(cov_type="HC3")

# ── Survival Analysis ──────────────────────────────────────────────
# Weibull AFT for covariate inference; Kaplan-Meier for visualization.
from lifelines import WeibullAFTFitter

waf = WeibullAFTFitter(penalizer=0.01)
waf.fit(surv_df, duration_col="duration", event_col="event")

Full source code: ads_experience_synthetic_v2.py (simulation engine) · ads_analysis_pipeline.py (analysis pipeline) · Python 3.12 · statsmodels 0.14.6 · lifelines 0.30.3 · semopy 2.3.11

📦 Full reproducible code available upon request. The complete simulation engine (~400 lines) and analysis pipeline (~350 lines) include all parameter definitions with inline source citations, factorial design loop, sensitivity sweep, and diagnostic visualizations. Please reach out directly for the full codebase.

Confirmatory Factor Analysis

What Each Construct Measures

Construct | Definition | Theoretical Basis
Relevance | Perceived informativeness and personal relevance of ads shown | Ducoffe (1996) Ad Value Model
Intrusiveness | Perceived interruption and goal impediment from ad insertion. Distinct from irritation: it is the cognitive perception that the ad has disrupted the viewing flow, regardless of emotional valence. | Li, Edwards & Lee (2002)
Trust | Perceived credibility and trustworthiness of ads shown | Choi & Rifon (2002)
Ad Fatigue | Accumulated sense of weariness from repeated ad exposure over time (time-varying) | Pechmann & Stewart (1988)
Satisfaction | Overall satisfaction with the viewing experience during the period | Ducoffe (1996); Brackett & Carr (2001)
Continuance Intent | Intention to continue using the service in the coming months | Bhattacherjee (2001)

Model Fit Indices

CFI = 1.000 (≥ .95 ✓)
TLI = 1.000 (≥ .95 ✓)
RMSEA = .000 (≤ .06 ✓)
GFI = .996 (≥ .90 ✓)
χ²(120) p-value = .718

χ²(120) = 110.67, p = .718. A non-significant χ² indicates good model-data fit. Note on perfect fit indices: CFI = 1.000 and RMSEA = .000 follow directly from χ² falling below its degrees of freedom, which is the expected outcome when the data-generating process mirrors the assumed factor structure, as it does here by construction. This is a known limitation of evaluating measurement models on synthetic data. In practice, real survey data from streaming members would introduce correlated residuals (adjacent items sharing method variance), measurement drift over time, and respondent heterogeneity. A realistic expectation for a well-designed real study would be CFI ≈ .95–.98 and RMSEA ≈ .03–.06. The Trust construct's weaker performance (AVE = .387) is the more instructive result here. Because the data are synthetic, it cannot establish how real members conceptualize ad credibility, but it illustrates how the pipeline flags a construct whose items may be tapping more than one dimension.
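
The link between χ² ≤ df and the boundary values of RMSEA and CFI can be checked with the standard formulas. The sketch below is illustrative: N = 3,150 assumes the CFA used one row per member, and the baseline (independence-model) χ² of 20,000 on 153 df is a hypothetical placeholder, since the report does not state it.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA point estimate; truncated at zero when chi2 <= df."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2, df, chi2_null, df_null):
    """Comparative Fit Index from model and baseline chi-square values."""
    d_model = max(chi2 - df, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null


# Reported model chi-square; N and the baseline chi-square are ASSUMPTIONS
print("RMSEA:", rmsea(110.67, 120, 3150))
print("CFI:", cfi(110.67, 120, chi2_null=20000.0, df_null=153))

# A mildly misfitting model (hypothetical chi-square) moves both off the boundary
print("RMSEA, chi2 = 180:", round(rmsea(180.0, 120, 3150), 4))
```

Some software divides by N rather than N − 1 in the RMSEA denominator; at this sample size the difference is negligible.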

Factor Loadings & Convergent Validity

Construct | Items | Std. Loading Range | AVE | CR | α | Verdict
Satisfaction | 3 | .842 – .847 | .712 | .881 | .881 | ✓ Strong
Fatigue | 3 | .820 – .833 | .685 | .867 | .867 | ✓ Strong
Intrusiveness | 3 | .782 – .789 | .618 | .829 | .829 | ✓ Good
Continuance | 3 | .762 – .776 | .592 | .813 | .814 | ✓ Good
Relevance | 3 | .718 – .736 | .529 | .771 | .771 | ✓ Adequate
Trust | 3 | .596 – .651 | .387 | .655 | .654 | △ Weak

AVE = Average Variance Extracted (threshold ≥ .50); CR = Composite Reliability (threshold ≥ .70). Trust falls below both thresholds, indicating its three items do not converge sufficiently on a single factor. In real survey data, a pattern like this would plausibly reflect the multi-dimensional nature of ad trust (brand trust vs. data privacy trust vs. ad content credibility), and a revised instrument would separate these facets.
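
Both indices are simple functions of the standardized loadings. The sketch below computes them for a Trust-like construct. Only the end points of the loading range (.596 and .651) come from the table; the middle loading of .620 is a hypothetical value chosen for illustration.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """CR: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = sum(loadings)
    error = sum(1 - l ** 2 for l in loadings)  # error variance of a standardized item
    return total ** 2 / (total ** 2 + error)


# End points from the Trust row; the middle loading is HYPOTHETICAL
trust_loadings = [0.596, 0.620, 0.651]

ave = average_variance_extracted(trust_loadings)
cr = composite_reliability(trust_loadings)
print("AVE:", round(ave, 3), "meets .50 threshold:", ave >= 0.50)
print("CR:", round(cr, 3), "meets .70 threshold:", cr >= 0.70)
```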

Discriminant Validity (Fornell-Larcker)

Each diagonal cell = AVE. Off-diagonal cells = squared inter-factor correlation (r²). Discriminant validity holds when the diagonal exceeds all values in the same row and column.

Relevance | Intrusiveness | Trust | Fatigue | Satisfaction | Continuance

The Satisfaction ↔ Continuance pair fails the criterion (r² = .830 exceeds both AVEs of .712 and .592). This is expected — in consumer research, overall satisfaction and continuance intention are conceptually very close constructs. Practically, they can be treated as a combined "loyalty" dimension. Trust also fails against Satisfaction (r² = .428 > AVE .387), consistent with its convergent validity weakness.
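
The Fornell-Larcker comparison can be expressed as a small check over construct pairs. The sketch below uses the AVEs from the loadings table and the two squared correlations quoted in the text; the Fatigue–Intrusiveness value of .30 is a hypothetical entry added only to show a passing pair.

```python
def fornell_larcker_violations(ave_by_construct, squared_corr):
    """Return pairs whose squared correlation exceeds either construct's AVE."""
    violations = []
    for (a, b), r2 in squared_corr.items():
        if r2 > min(ave_by_construct[a], ave_by_construct[b]):
            violations.append((a, b, r2))
    return violations


ave_by_construct = {
    "Satisfaction": 0.712, "Continuance": 0.592, "Trust": 0.387,
    "Fatigue": 0.685, "Intrusiveness": 0.618,
}
squared_corr = {
    ("Satisfaction", "Continuance"): 0.830,   # reported in the text
    ("Trust", "Satisfaction"): 0.428,         # reported in the text
    ("Fatigue", "Intrusiveness"): 0.300,      # HYPOTHETICAL, for illustration
}

violations = fornell_larcker_violations(ave_by_construct, squared_corr)
for a, b, r2 in violations:
    print(f"{a} vs {b}: r2 = {r2:.3f} exceeds the smaller AVE")
```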

Longitudinal Analysis Results

6.1 — Satisfaction & Fatigue Trajectories

Across all conditions, satisfaction declines monotonically from month 1 to month 12 as ad fatigue accumulates. The effect is most pronounced under high ad load (6.5 min/hr), where average satisfaction drops from 2.84 to 2.20 — a 0.64-point decline on the 5-point scale.

Member Satisfaction Over 12 Months — by Ad Load
Ad Fatigue Accumulation Over 12 Months — by Ad Load
Reading the Charts
Satisfaction and fatigue move in opposite directions — and the gap between low and high ad load conditions widens over time. This reflects the compounding nature of the AR(1) fatigue process: early differences are small, but by month 6–8 the system approaches steady state and divergence is maximized.

6.2 — GEE Longitudinal Model

Generalized Estimating Equations were used to model repeated satisfaction measurements while accounting for within-member correlation (exchangeable working correlation structure). Three nested models were compared using QIC (Quasi-likelihood under the Independence model Criterion).

Model | Predictors | QIC | ΔQIC | Verdict
M1 Basic | Time + Ad Load + Personalization | 34,681
M2 + Fatigue | M1 + Fatigue Score | 34,679 | −2 | ✓ Selected
M3 + Interaction | M2 + Load × Personalization | 34,688 | +7 | Not supported

M2 Coefficient Table

Predictor | β | SE | z | p | Interpretation
Intercept | 3.734 | .017 | 215.9 | *** | Baseline satisfaction at average conditions
Time (centered) | −0.031 | .001 | −10.8 | *** | Satisfaction declines ~0.03 points per month
Ad Load (z-scored) | −0.353 | .009 | −38.9 | *** | 1 SD increase in ad load (= 2 min/hr) → −0.35 satisfaction
Personalization (z-scored) | +0.200 | .009 | +22.8 | *** | 1 SD increase in personalization → +0.20 satisfaction
Fatigue Score | −0.314 | .006 | −56.6 | *** | 1-point fatigue increase → −0.31 satisfaction
Load × Personalization | −0.001 | .011 | −0.08 | .933 | Interaction not significant (M3 not supported)

*** p < .001. β values are unstandardized GEE estimates. Load and Personalization z-scored (SD=2 min/hr and SD=0.25 respectively) for comparability.

GEE Coefficients (M2) — Effect on Satisfaction

6.3 — Factorial Comparison at Month 12

Two-way ANOVA on month-12 satisfaction shows a large independent effect of ad load (η² = .297) and a medium-sized effect of personalization (η² = .083), with no significant interaction (η² = .000, p = .855). Ad load and personalization operate as additive, independent levers.

Month-12 Satisfaction Heatmap — Ad Load × Personalization Level

The range from worst condition (high load + low personalization: 1.91) to best condition (low load + high personalization: 3.56) spans 1.65 points on the 5-point scale — a substantial effect with meaningful real-world implications for member experience.

6.4 — Survival Analysis: Churn Intent

A Weibull AFT model regressed time-to-first-churn-intent on member-level average fatigue, satisfaction, ad load, personalization, and price sensitivity. Key finding: ad load does not directly predict when members develop churn intent. Fatigue and satisfaction do.

Predictor | Coef | exp(coef) | p | Interpretation
Mean Fatigue | −0.301 | .740 | *** | Higher fatigue → churn intent arrives 26% sooner
Mean Satisfaction | +0.289 | 1.336 | *** | Higher satisfaction → churn intent delayed 34%
Price Sensitivity | −0.255 | .775 | ** | Price-sensitive members reach churn intent sooner
Ad Load | +0.008 | 1.008 | .634 | Not significant; effect is fully mediated by fatigue
Personalization | +0.005 | 1.005 | .961 | Not significant; effect mediated by satisfaction/fatigue

Key Mediation Finding
Ad load predicts churn intent only indirectly — its effect operates through fatigue accumulation. This is the Weibull AFT equivalent of a full mediation pattern: when fatigue is in the model, the direct ad load coefficient is near zero and non-significant. Practically, this means the most powerful intervention for reducing churn is not reducing ad load per se, but managing the experience of ad load through fatigue mitigation strategies (creative variety, personalization, natural break placement).
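
In an accelerated failure time model, exp(coef) is a time ratio: the multiplicative change in expected time-to-event per one-unit increase in the predictor. The short sketch below converts the reported coefficients into percentage changes in time to churn intent; the helper names are hypothetical.

```python
import math

# Coefficients as reported in the Weibull AFT table
aft_coefs = {
    "Mean Fatigue": -0.301,
    "Mean Satisfaction": 0.289,
    "Price Sensitivity": -0.255,
    "Ad Load": 0.008,
}


def time_ratio(coef):
    """Multiplicative change in time-to-event per unit increase in the predictor."""
    return math.exp(coef)


def percent_change_in_time(coef):
    """Same quantity as a percentage: negative means the event arrives sooner."""
    return 100.0 * (math.exp(coef) - 1.0)


for name, coef in aft_coefs.items():
    print(f"{name}: time ratio {time_ratio(coef):.3f}, "
          f"{percent_change_in_time(coef):+.1f}% time to churn intent")
```

A ratio below 1 shortens the time to churn intent and a ratio above 1 lengthens it, which is why the fatigue and satisfaction coefficients carry opposite signs.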
Monthly Churn Intent Rate Over 12 Months — by Ad Load

6.5 — Cross-Lagged Panel Model

To test the temporal directionality of the fatigue–satisfaction relationship, I applied a within-person demeaned cross-lagged model. This removes all stable between-person differences (including condition effects) and tests whether changes in one variable precede changes in the other.

Path | β | SE | t | p | (unlabeled)
Path A: Fatigue(t−1) → Satisfaction(t) | −0.098 | .006 | −16.79 | *** | .058
Path B: Satisfaction(t−1) → Fatigue(t) | −0.034 | .005 | −6.88 | *** | .253

Both paths are statistically significant, confirming a bidirectional relationship. However, Path A is approximately 3× stronger than Path B (|−0.098| vs |−0.034|). Within the simulated longitudinal model, fatigue emerges as the dominant temporal antecedent of satisfaction — members who feel more fatigued in month t−1 report meaningfully lower satisfaction in month t, even after controlling for their prior satisfaction level. This pattern would need to be replicated in real member data before treating it as a generalizable finding.

[Diagram: Path A (dominant): Fatigue(t−1) → Satisfaction(t), β = −0.098***. Path B (~3× weaker): Satisfaction(t−1) → Fatigue(t), β = −0.034***.]

Product Experiment: Creative Rotation Frequency

The longitudinal phase identified a clear product opportunity: creative diversity reduces fatigue accumulation, and fatigue mediates the ad-load-to-churn pathway. But correlational evidence — even causal-ordered evidence — has limits for product decisions. The next question is sharper and more actionable:

Research Question
If we increase creative rotation frequency, does the resulting fatigue reduction translate into measurable improvements in member experience and behavior?

This is the kind of question only a controlled experiment can answer. The longitudinal model says "yes, in theory." Product needs to know "yes, on real members, at this magnitude, for these segments, over this timeframe."

7.1 — Proposed Pre-analysis Plan

Note: This is an illustrative pre-analysis plan demonstrating how I would lock in design decisions before running the experiment. It is not registered with an external preregistration repository (e.g., OSF, AsPredicted); in production this plan would be formally registered.

Element | Specification
Primary hypothesis | Arm B (1-week rotation) reduces fatigue vs. Arm A (2-week control) over 4-week treatment period
Design | 3-arm parallel trial: A (2-wk, control), B (1-wk, faster), C (4-wk, monotonicity check)
Sample | 4,500 members (1,500 per arm); pre-treatment fatigue stratified randomization recommended
Duration | 8 weeks: 4 pre-treatment baseline + 4 post-treatment exposure
Primary outcome | Weekly self-reported fatigue (5-point scale)
Secondary outcomes | Satisfaction · Episode completion · Churn intent
Primary analysis | Difference-in-Differences via GEE with exchangeable working correlation
Variance reduction | CUPED using pre-treatment fatigue baseline
Multiple testing | Benjamini-Hochberg FDR across 4 outcomes (q < 0.05)
Stopping rule | No interim analyses; full 8-week sample analyzed once

7.2 — Power Analysis (Simulation-Based)

Rather than relying on closed-form approximations that assume independence (which doesn't hold in this longitudinal design), I conducted simulation-based power analysis: generate 100 synthetic trials at each candidate sample size, fit the planned DiD model, and count the proportion of trials that detect the effect at α = 0.05.

Power Curve · 100 Monte Carlo Simulations per Sample Size

Conclusion: 300 members per arm achieves 96% power for the expected effect size. The actual study uses 1,500/arm for comfortable margin and to enable subgroup analyses.
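
The simulate-fit-count loop can be sketched compactly. The version below is a deliberate simplification: it swaps the planned GEE difference-in-differences model for a two-sample z-test on per-member change scores, and the effect size (−0.125) and change-score SD (0.5) are assumptions for illustration, so its power values will not reproduce the curve reported here.

```python
import numpy as np
from math import erf, sqrt


def simulated_power(n_per_arm, effect=-0.125, sd_change=0.5, n_sims=100, seed=0):
    """Share of simulated two-arm trials that reject H0 at alpha = 0.05."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        # Per-member pre-to-post change scores (ASSUMED normal with common SD)
        control = rng.normal(0.0, sd_change, n_per_arm)
        treated = rng.normal(effect, sd_change, n_per_arm)
        diff = treated.mean() - control.mean()
        se = sqrt(treated.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
        z = diff / se
        p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal p-value
        hits += p < 0.05
    return hits / n_sims


for n in (50, 150, 300, 600):
    print(n, "per arm -> estimated power", simulated_power(n))
```

With only 100 simulations per sample size, each power estimate carries Monte Carlo error of a few percentage points, so the curve is best read for its shape rather than its exact values.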

7.3 — Validation Checks

Two checks before touching the outcomes:

Sample Ratio Mismatch (SRM)

The classic A/B test pitfall: silent assignment bugs that cause the arms to be unbalanced. A chi-square test on observed vs. expected assignment proportions: χ² = 0.00, p = 1.00 → PASS. No assignment irregularities.
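
A minimal sketch of the SRM check for three equally weighted arms follows. It uses only the standard library: with three arms the test has 2 degrees of freedom, and the chi-square survival function for df = 2 is exactly exp(−χ²/2). The unbalanced counts in the second call are hypothetical.

```python
import math


def srm_chi_square_three_arms(counts):
    """Chi-square goodness-of-fit test against equal allocation across three arms."""
    if len(counts) != 3:
        raise ValueError("this sketch handles exactly three arms (df = 2)")
    expected = sum(counts) / 3
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    p_value = math.exp(-chi2 / 2)  # exact survival function for df = 2
    return chi2, p_value


# Balanced assignment, as in this study: 1,500 per arm
print(srm_chi_square_three_arms([1500, 1500, 1500]))

# HYPOTHETICAL unbalanced assignment that should raise a flag
print(srm_chi_square_three_arms([1600, 1450, 1450]))
```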

Baseline Balance (with SMD)

A subtle methodological point worth highlighting: with 4,500 members, traditional F-tests on baseline covariates can show "statistical significance" even when the actual differences are practically negligible. This is a sensitivity issue rather than Type I error inflation: at large N the test has enough power to flag differences far too small to matter. Best practice is to report Standardized Mean Differences (SMD) instead, with the conventional threshold that |SMD| < 0.10 indicates negligible imbalance.

Covariate | F-test p | SMD (B–A) | SMD (C–A) | max abs. SMD | Verdict
Pre-treatment fatigue | .006 | +0.041 | −0.016 | 0.041 | ✓ negligible
Pre-treatment satisfaction | .031 | +0.040 | −0.003 | 0.040 | ✓ negligible
Ad tolerance | .009 | −0.019 | +0.036 | 0.036 | ✓ negligible
Price sensitivity | .004 | +0.060 | +0.023 | 0.060 | ✓ negligible

All F-tests are "significant" by p-value, but all SMDs are well below the 0.10 threshold for meaningful imbalance. This is the large-sample randomization-check artifact — reporting only F-tests here would be misleading.
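
For reference, the SMD is the difference in arm means divided by a pooled standard deviation. The sketch below uses the common pooling that averages the two arm variances; the tiny input lists are made-up values that only demonstrate the calculation.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """(mean_t - mean_c) / sqrt((var_t + var_c) / 2), using sample variances."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


# Made-up baseline scores for two arms (illustration only)
arm_b = [3.1, 2.9, 3.4, 3.0]
arm_a = [3.0, 2.8, 3.3, 2.9]

smd = standardized_mean_difference(arm_b, arm_a)
print("SMD:", round(smd, 3), "| negligible:", abs(smd) < 0.10)
```

Unlike a p-value, the SMD does not shrink or grow with sample size, which is what makes it a stable yardstick for balance.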

7.4 — Primary Analysis: Difference-in-Differences

The DiD design isolates the treatment effect from any pre-existing differences between arms. Coefficient estimates are the period × arm interaction, fitted via GEE with exchangeable working correlation to handle within-member dependence.

Contrast | DiD Effect | SE | 95% CI | p | Interpretation
Arm B vs Arm A (1-wk vs 2-wk rotation) | −0.125 | 0.009 | [−0.142, −0.109] | <.001 | Faster rotation reduces fatigue by 0.125 points on 5-pt scale
Arm C vs Arm A (4-wk vs 2-wk rotation) | +0.118 | 0.008 | [+0.101, +0.134] | <.001 | Slower rotation increases fatigue by 0.118 points; monotonicity confirmed

Monotonicity Check
The symmetric pattern of arms B and C around the control (one reduces fatigue by ~0.13, the other increases by ~0.12) is methodologically important: it confirms the effect is dose-responsive in the diversity dimension, not a quirk of any single comparison. This is exactly what arm C was designed to test.

7.5 — CUPED Variance Reduction

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique from Microsoft (Deng et al., 2013). The idea: members who are naturally higher on the outcome at baseline will also be higher post-treatment. By adjusting each member's post-treatment outcome for their baseline, we strip out predictable variance and tighten the treatment effect estimate.

Treatment Effect Estimate · Naive vs. CUPED-Adjusted
Method | Effect | SE | 95% CI Width
Naive t-test | −0.114 | 0.0078 | 0.0305
CUPED-adjusted | −0.118 | 0.0073 | 0.0286
Improvement | 12.6% variance reduction → same precision with ~13% smaller sample
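
The CUPED adjustment itself is one line: subtract θ × (pre − mean(pre)) from each post-treatment outcome, with θ = cov(post, pre) / var(pre) estimated on the pooled sample. The sketch below applies it to synthetic data; the sample size, coefficients, and noise scales are assumptions chosen so that the baseline explains roughly an eighth of the outcome variance.

```python
import numpy as np


def cuped_adjust(post, pre):
    """Return CUPED-adjusted outcomes: post - theta * (pre - mean(pre))."""
    theta = np.cov(post, pre, ddof=1)[0, 1] / np.var(pre, ddof=1)
    return post - theta * (pre - pre.mean())


def effect_and_se(y, treated):
    """Difference in means between arms and its standard error."""
    diff = y[treated].mean() - y[~treated].mean()
    se = np.sqrt(y[treated].var(ddof=1) / treated.sum()
                 + y[~treated].var(ddof=1) / (~treated).sum())
    return diff, se


# Synthetic illustration: every number below is an ASSUMPTION
rng = np.random.default_rng(42)
n = 3000
treated = rng.integers(0, 2, size=n).astype(bool)
pre = rng.normal(2.5, 0.3, size=n)                       # baseline fatigue
post = 1.6 + 0.36 * pre - 0.12 * treated + rng.normal(0.0, 0.28, size=n)

adjusted = cuped_adjust(post, pre)
naive = effect_and_se(post, treated)
cuped = effect_and_se(adjusted, treated)
print("naive  effect, SE:", naive)
print("CUPED  effect, SE:", cuped)
```

The adjustment leaves the overall mean unchanged and shrinks the standard error in proportion to how strongly the baseline predicts the outcome.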

7.6 — Heterogeneous Treatment Effects

Average treatment effects can mask important variation across member segments. By splitting members into pre-treatment fatigue tertiles, we can ask: which members benefit most from faster rotation?

Treatment Effect by Pre-treatment Fatigue Tertile (Arm B vs A)

All three segments show statistically significant benefit, but the Mid-fatigue segment shows the largest effect (β = −0.136). This is the actionable nuance: low-fatigue members have less room to improve, and high-fatigue members may have other drivers (e.g., long-term subscriber frustration we saw in the survival analysis). The Mid-fatigue segment — members who are starting to feel the load but haven't given up — is the optimal target for this intervention.

7.7 — Secondary Outcomes & FDR Correction

Running multiple outcome tests inflates the Type I error rate. The Benjamini-Hochberg procedure controls the false discovery rate (FDR) — the expected proportion of false positives among rejected nulls — at q < 0.05.

Outcome | Effect | Raw p | BH q-value | FDR-Sig?
Fatigue (primary) | −0.125 | <.001 | <.001 | ✓ Yes
Satisfaction | +0.056 | <.001 | <.001 | ✓ Yes
Churn intent | +0.107 | .541 | .639 | — No
Episode completion | +0.041 | .639 | .639 | — No

Interpretation of Mixed Results
Fatigue and satisfaction respond immediately and substantially. Churn intent and completion do not — at least not in 4 weeks. This is consistent with the longitudinal phase: behavioral consequences accumulate over months, not weeks. A 4-week experiment can validate the upstream mechanism (attitudinal response) but cannot validate the downstream business impact (retention). A longer-duration study or a behavioral surrogate would be needed for that.
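
The q-values in the table can be reproduced with a short implementation of the Benjamini-Hochberg step-up procedure. In the sketch below, the two p-values reported only as "<.001" are replaced by 0.0005 as a stand-in; the other two are taken as reported.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


outcomes = ["Fatigue", "Satisfaction", "Churn intent", "Episode completion"]
raw_p = [0.0005, 0.0005, 0.541, 0.639]  # first two are STAND-INS for "<.001"

q = benjamini_hochberg(raw_p)
for name, p, qv in zip(outcomes, raw_p, q):
    print(f"{name}: p = {p}, q = {qv:.3f}, significant at q < 0.05: {qv < 0.05}")
```

Churn intent ends up with the same q-value as episode completion because the monotonicity step caps its adjusted value (.541 × 4/3) at the q-value of the next-largest p.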

7.8 — Product Recommendations

Translating findings into actionable product strategy is where research either earns trust or doesn't. Based on this experiment, my recommendations to the product team would be:

A. Adopt faster creative rotation as default
Moving from 2-week to 1-week rotation reduces fatigue accumulation by ~25% relative to the baseline trajectory. The CUPED-adjusted effect is precise (CI: [−0.13, −0.10]). The cost of more frequent creative refresh should be weighed against this experience gain.
B. Target the Mid-fatigue segment for highest ROI
If full rollout is cost-constrained, prioritize members in the middle fatigue tertile: they show the largest treatment response and are the segment most likely to be on a trajectory toward a problematic experience.
C. Extend to a longer-duration follow-up to validate retention impact
Attitudinal improvements are confirmed; behavioral (churn) impact requires a longer observation window. Recommend a 16-week or 26-week extended experiment with a longitudinal hold-out to measure downstream retention.
D. Build into the ongoing measurement system
The instrumentation used here (fatigue, satisfaction, completion, churn intent) should be operationalized as ongoing tracking. Rotation cadence is unlikely to be the only relevant intervention; the next likely candidates are placement (mid-roll vs. natural-break) and personalization quality.

Summary of Demonstration Findings

All findings below are derived from synthetic data generated under the modeling assumptions documented in §3. They illustrate what the analytic pipeline would reveal under those assumptions, not validated claims about real Netflix members. Each pattern is a testable hypothesis for a future study with real member data, not a generalizable conclusion.

01. In the Synthetic Framework, Fatigue Mediates the Ad Load → Churn Intent Pathway
Under the specified data-generating assumptions, ad load does not directly predict churn intent timing (Weibull AFT, p = .634). Its effect operates through fatigue accumulation. The implication this framework would help test in a real study: managing the fatigue experience (creative variety, personalization, natural break placement) may be a more tractable lever than reducing ad inventory.
02. Ad Load and Personalization Appear as Independent, Additive Levers
The interaction between ad load and personalization was non-significant in this simulation (β = −0.001, p = .933). Both factors show large independent effects (η² = .297 and .083, respectively). This framework would allow product and monetization teams to test whether the two levers indeed operate independently in real-member data, or whether genuine interactions emerge that the synthetic model does not capture.
03. The Simulated Longitudinal Model Identifies Fatigue as the Dominant Temporal Antecedent of Satisfaction
Within this synthetic framework, cross-lagged analysis shows fatigue at t−1 predicting satisfaction at t (β = −0.098, p < .001) about 3× more strongly than the reverse path (β = −0.034). This is a temporal-ordering pattern produced by the simulation, not a causal claim; establishing whether fatigue actually precedes satisfaction in real data would require a designed study with an appropriate identification strategy.
04. The Simulated Satisfaction Decline Emerges Early and Persists
Under high ad load (6.5 min/hr) in the synthetic model, satisfaction drops 0.15 points in the first two months alone (from 2.84 to 2.69) and continues declining to 2.20 by month 12. The framework suggests that, if real-member data show a similar pattern, experience interventions may be most impactful early in the membership lifecycle, before fatigue reaches its AR(1) steady-state plateau.
05. Measurement Diagnostic: Trust Construct Requires Refinement
The Trust construct showed weak convergent validity in the demonstration (AVE = .387, below the .50 threshold). Importantly, this is a measurement diagnostic example, not an empirical finding about real ad-trust attitudes. It illustrates the kind of psychometric issue a careful CFA can surface: a revised instrument would distinguish brand trust, data privacy comfort, and ad content credibility as separate facets.
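The cross-lagged comparison in finding 03 can be illustrated with a simplified sketch. It is not the SEM-based CLPM used in the report: it assumes a pooled lagged regression on simulated data, estimated with a hand-written two-predictor OLS, and every coefficient, seed, and name (`two_predictor_ols`, `TRUE_FAT_TO_SAT`, and so on) is a hypothetical choice.

```python
import random

def two_predictor_ols(x1, x2, y):
    """Slopes of y on x1 and x2 (with intercept) via the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical data-generating cross-lagged paths (not the report's parameters).
TRUE_FAT_TO_SAT = -0.10   # fatigue(t-1) -> satisfaction(t)
TRUE_SAT_TO_FAT = -0.03   # satisfaction(t-1) -> fatigue(t)

rng = random.Random(7)
fat_lag, sat_lag, fat_now, sat_now = [], [], [], []
for _member in range(2000):
    fat, sat = rng.gauss(0, 0.4), rng.gauss(0, 0.4)
    for _month in range(6):
        new_fat = 0.6 * fat + TRUE_SAT_TO_FAT * sat + rng.gauss(0, 0.3)
        new_sat = 0.6 * sat + TRUE_FAT_TO_SAT * fat + rng.gauss(0, 0.3)
        fat_lag.append(fat)
        sat_lag.append(sat)
        fat_now.append(new_fat)
        sat_now.append(new_sat)
        fat, sat = new_fat, new_sat

# Each outcome is regressed on its own lag (autoregressive path) and the other lag (cross-lagged path).
sat_auto, fat_to_sat = two_predictor_ols(sat_lag, fat_lag, sat_now)
fat_auto, sat_to_fat = two_predictor_ols(fat_lag, sat_lag, fat_now)
print(f"fatigue(t-1) -> satisfaction(t): {fat_to_sat:+.3f}")
print(f"satisfaction(t-1) -> fatigue(t): {sat_to_fat:+.3f}")
```

Comparing the two cross-lagged slopes while controlling for each variable's own lag is what supports a temporal-ordering statement; it does not by itself rule out unmeasured confounders or stable between-person differences.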

Implications for Streaming Ad Experience Research

For Product & Ads Experience Teams

If the mediation pattern observed in this simulation replicates in real-member data, it would have a direct product implication: optimizing fatigue management may be a more tractable target than ad load reduction, because it can be addressed through product levers (break placement, creative rotation frequency, personalization quality) that don't directly reduce ad inventory or revenue. A 40% reduction in fatigue accumulation through creative diversity — estimated from OTT industry benchmarks — could meaningfully extend the time before members reach dissatisfaction thresholds. This is a hypothesis the framework is designed to test, not a validated effect.

For Measurement Design

Ad fatigue should be measured as a time-varying construct within longitudinal survey designs, not a single cross-sectional measure. The AR(1) dynamics observed here suggest that surveys administered at months 3–4 will underestimate eventual steady-state fatigue, while surveys administered after month 8 will better capture the plateau. Diary study designs or embedded in-app pulse surveys (1–2 items per session) would provide higher temporal resolution than traditional monthly surveys.
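The timing argument above follows from how an AR(1) process approaches its plateau. A minimal sketch, assuming a persistence parameter of 0.7 and a constant monthly increment of 0.3 (both hypothetical, not the report's calibrated values), with `ar1_path` as an illustrative helper:

```python
def ar1_path(phi, intercept, months, start=0.0):
    """Deterministic AR(1) trajectory: level_t = intercept + phi * level_{t-1}."""
    path = []
    level = start
    for _ in range(months):
        level = intercept + phi * level
        path.append(level)
    return path

PHI = 0.7          # assumed month-to-month persistence of fatigue
INTERCEPT = 0.3    # assumed constant monthly fatigue increment

steady_state = INTERCEPT / (1 - PHI)               # long-run plateau
path = ar1_path(PHI, INTERCEPT, months=12)
share_of_plateau = [level / steady_state for level in path]

for month in (3, 4, 8, 12):
    print(f"month {month:>2}: {share_of_plateau[month - 1]:.0%} of steady state")
```

Starting from zero, the share of the plateau reached by month t is 1 − φ^t, so under these assumed values a survey at months 3 to 4 would capture roughly two-thirds to three-quarters of eventual fatigue, while one after month 8 would capture over 90%.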

For Research Methodology

This demonstration illustrates the value of connecting attitudinal survey data (what members say) with behavioral signals (what they do: drop-off rates, episode completion, churn). The Weibull AFT finding — that attitudinal fatigue predicts churn timing more directly than ad load — would be impossible to detect from behavioral data alone, and invisible from survey data without longitudinal follow-up.

Research Agenda Suggestion
The most impactful next study would be a 6-month longitudinal diary study with embedded behavioral linkage: monthly 5-item pulse surveys (fatigue, relevance, intrusiveness) linked to individual-level streaming logs. The synthetic framework developed here provides a starting measurement instrument, expected effect sizes for power analysis, and an analysis plan — ready to be adapted and recalibrated with real member data, not deployed as-is.
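As one way the expected effect sizes could feed a power analysis, the sketch below uses the standard normal-approximation formula for comparing two independent group means. It is a rough planning aid under stated assumptions: it ignores repeated measures, clustering, and attrition, and the effect sizes and the helper name `n_per_group` are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_power = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

# Hypothetical standardized effect sizes (Cohen's d) to plan around.
ASSUMED_EFFECT_SIZES = {"small": 0.2, "medium": 0.5}

required = {label: n_per_group(d) for label, d in ASSUMED_EFFECT_SIZES.items()}
for label, n in required.items():
    print(f"{label:<6} effect: about {n} members per group")
```

A longitudinal design with repeated pulse surveys would usually need fewer members than this cross-sectional approximation suggests, but expected dropout over six months pushes the recruitment target back up, so a simulation-based power analysis on the synthetic panel would be the more faithful next step.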

Limitations of This Demonstration

| Limitation | Impact | How to Address in Real Study |
|---|---|---|
| Synthetic data cannot capture real member heterogeneity | Effect sizes may differ substantially | Pilot survey (n≈200) to recalibrate parameters |
| AR(1) fatigue model assumes monotonic accumulation | Underestimates long-term fatigue decay (Kronrod & Huber reversal) | Extended observation window (>18 months) with non-linear growth model |
| Trust construct has weak AVE (.387) | Trust-related path coefficients are attenuated | Revise to distinguish brand trust, privacy comfort, content credibility |
| Single-period condition assignment (members in one condition throughout) | Cannot test within-person condition changes | Crossover design or randomized condition change at mid-study |
| No content type moderator | Fatigue may differ by content genre (sports vs. drama vs. series) | Add content-type as a moderator in the structural model |
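The AVE diagnostic in the Trust row can be made concrete with the Fornell-Larcker formulas. A minimal sketch: the four standardized loadings below are hypothetical values chosen only to land near the reported AVE, not the loadings estimated in this demonstration, and the helper names are illustrative.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized factor loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

def composite_reliability(loadings):
    """CR: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return total / (total + error)

# Hypothetical standardized loadings for a four-item Trust scale.
trust_loadings = [0.70, 0.62, 0.58, 0.58]

ave = average_variance_extracted(trust_loadings)
cr = composite_reliability(trust_loadings)
print(f"AVE = {ave:.3f} (threshold .50)")
print(f"CR  = {cr:.3f} (threshold .70)")
```

With loadings in this range, composite reliability can clear .70 while AVE stays below .50, which is the typical signature of items that hang together loosely because they tap more than one facet.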

Researcher Profile & Purpose of This Demonstration

Why I Built This

This project is designed as a methodological bridge: it demonstrates how I would enter a new product domain, translate business questions into measurable constructs, generate evidence-informed synthetic data, and prepare a validated analysis plan — before working with real member-level data.

My research background is in educational psychology, psychometrics, and quantitative methods — not advertising or consumer product research. Rather than claiming experience I don't have, I chose to demonstrate how I would approach an unfamiliar product domain: by grounding the problem in published theory, building a transparent and testable measurement framework, and applying longitudinal methods appropriate to the research question.

In short: I can translate a product question into a measurable longitudinal research design, build an evidence-informed synthetic prototype before accessing internal data, and identify which behavioral and attitudinal signals should be validated with real member data. That is what this portfolio demonstrates.

This framework — evidence-informed synthetic data, factorial sensitivity analysis, and a full CFA → GEE → Survival → CLPM pipeline — represents how I would design and analyze a first study before having access to real member data.

Core Methodological Competencies

Survey instrument design & psychometric validation
Confirmatory Factor Analysis (CFA) & SEM
Longitudinal modeling (GEE, LMM, growth models)
Survival / event history analysis
Cross-lagged panel models & temporal ordering in longitudinal observational data
Factorial design & sensitivity analysis
Simulation-based research design
Python (statsmodels, lifelines, semopy, scikit-learn)
R (lavaan, lme4, survival) — for cross-validation
Conjoint analysis & MaxDiff (actively developing)
Synthetic Data Disclosure (Full)
This entire report is based on computationally generated synthetic data. No Netflix member data, internal metrics, proprietary research, or confidential information was used or referenced. Parameter values were derived exclusively from (a) peer-reviewed academic literature, (b) publicly available industry reports, and (c) mathematical back-calculation from published benchmarks. This work is intended solely as a methodological portfolio demonstration.

Selected References

Theoretical frameworks, empirical benchmarks, and industry sources used in parameter calibration and model design.

| Reference | Role in This Study |
|---|---|
| Bhattacherjee, A. (2001). Understanding information systems continuance: An expectation-confirmation model. MIS Quarterly, 25(3), 351–370. | Theoretical basis for Continuance Intention construct |
| Brackett, L. K., & Carr, B. N. (2001). Cyberspace advertising vs. other media: Consumer vs. mature student attitudes. Journal of Advertising Research, 41(5), 23–32. | Ad value → satisfaction path coefficients |
| Cho, C.-H., & Cheon, H. J. (2004). Why do people avoid advertising on the internet? Journal of Advertising, 33(4), 89–97. | Ad avoidance model; intrusiveness → churn pathways |
| Choi, S. M., & Rifon, N. J. (2002). Antecedents and consequences of web advertising credibility. Journal of Interactive Advertising, 3(1), 12–24. | Trust construct theoretical basis |
| Conviva. (2023). State of Streaming. Conviva Inc. | Post-ad drop-off baseline (~17%); behavioral benchmarks |
| Deloitte. (2026). Digital Media Trends Survey. Deloitte Insights. | Ad-supported streaming adoption rate (~68% of subscribers) |
| Ducoffe, R. H. (1996). Advertising value and advertising on the web. Journal of Advertising Research, 36(5), 21–35. | Advertising Value Model; relevance/trust → satisfaction (+), intrusiveness → satisfaction (−) |
| Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18(1), 39–50. | AVE and CR thresholds; Fornell-Larcker discriminant validity criterion |
| Kim, Y. J., & Han, J. (2014). Why smartphone advertising attracts customers: A model of Web advertising, flow, and personalization. Computers in Human Behavior, 33, 256–269. | Mobile advertising meta-analytic path coefficient estimates |
| Koyck, L. M. (1954). Distributed Lags and Investment Analysis. North-Holland. | Adstock carryover model structure: adstock_t = exposure_t + λ × adstock_{t-1} |
| Kronrod, A., & Huber, J. (2019). Ad wearout wearout: How time can reverse the negative effect of frequent advertising repetition on brand preference. International Journal of Research in Marketing, 36(2), 306–324. | Fatigue persistence estimates; short-term vs. long-term reversal pattern; model limitation |
| Leone, R. P. (1995). Generalizing what is known about temporal aggregation and advertising carryover. Marketing Science, 14(3), G141–G150. | Adstock λ = 0.47 (meta-analytic median across 128 studies) |
| Li, H., Edwards, S. M., & Lee, J.-H. (2002). Measuring the intrusiveness of advertisements: Scale development and validation. Journal of Advertising, 31(2), 37–47. | Intrusiveness construct definition, scale structure, and items |
| Mynt Agency. (2025). Predicting Ad Fatigue: When to Refresh Your Creative. Retrieved from articles.myntagency.com | 40% lifecycle extension from creative rotation; TV refresh cycle benchmarks |
| Netflix. (2026, May). Netflix Upfront 2026 Presentation. Netflix, Inc. | 250M+ global monthly active viewers on ad-supported plan (2026) |
| Parks Associates. (2024). OTT Video Market Tracker. Parks Associates. | SVOD annual churn rate ~20–25%; monthly baseline ~1.9% calibration |
| Pechmann, C., & Stewart, D. W. (1988). Advertising repetition: A critical review of wearin and wearout. Current Issues and Research in Advertising, 11(1-2), 285–329. | Fatigue persistence across sessions; wear-out dynamics |
| Reuters. (2025). Netflix ad-supported tier reaches 94 million monthly active users. Reuters. | 2025 ad-tier MAU baseline (94 million) |
| Strategus. (2025). What is Ad Creative Fatigue & How to Minimize it on OTT/CTV Channels. Retrieved from strategus.com | OTT creative wear-out cycle 3–4 weeks; diversity mitigation parameter calibration |