TL;DR
A user generated a correlated bivariate data set and found that an ordinary least squares (OLS) regression line looked 'tilted' relative to the principal axis given by the leading eigenvector of the covariance matrix. Commenters explained that the difference comes from the methods' assumptions: OLS minimizes vertical errors and treats X as exact, while PCA/total least squares (TLS) treats variance in X and Y symmetrically.
What happened
The asker simulated a correlated two-dimensional sample (20,000 points), plotted the points, fitted a standard linear least-squares line via numpy.polyfit, and overplotted the leading eigenvector of the sample covariance matrix. The OLS line appeared not to align with the major axis of the data cloud, prompting the question. Replies pointed out that this is expected: OLS optimizes the sum of squared vertical residuals and therefore estimates the conditional expectation E[Y|X=x], whereas the covariance eigenvector (PCA) identifies the direction of maximal variance in the joint distribution and is closely related to orthogonal or total least squares. Several answers supplied algebraic and geometric intuition (including the OLS slope formula β_OLS = σ_xy / σ_x^2), noted that the two lines coincide only in special cases (essentially perfect linear correlation; with equal X and Y variances the major axis runs at 45° while the OLS slope equals the correlation ρ, so the lines match only when |ρ| = 1), and reminded the asker that choosing OLS versus TLS/PCA depends on which error model matches the data.
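A minimal sketch of the setup follows; the dependency matrix, offset, and seed are illustrative assumptions, since the post's exact values are not given.

```python
# Sketch of the setup: simulate correlated (X, Y), fit OLS with numpy.polyfit,
# and compute the leading eigenvector of the sample covariance matrix.
# The dependency matrix, offset, and seed are assumed values, not from the source.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

latent = rng.standard_normal((2, n))            # independent Gaussian latent variables
dependency = np.array([[1.0, 0.0],
                       [0.6, 0.3]])             # assumed dependency (mixing) matrix
offset = np.array([[2.0], [1.0]])               # assumed offset
x, y = dependency @ latent + offset             # correlated sample (X, Y)

# OLS: minimizes vertical squared errors; slope = sigma_xy / sigma_x^2
slope_ols, intercept_ols = np.polyfit(x, y, deg=1)

# PCA: the eigenvector with the largest eigenvalue points along the major axis
cov = np.cov(x, y)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
major_axis = eigvecs[:, -1]
slope_pca = major_axis[1] / major_axis[0]

print(f"OLS slope: {slope_ols:.3f}, major-axis (PCA) slope: {slope_pca:.3f}")
```

With these assumed values the OLS slope comes out near 0.6 while the major-axis slope is visibly steeper, reproducing the 'tilt' the asker noticed.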
Why it matters
- Different fitting methods optimize different loss functions and answer different questions (conditional prediction vs. symmetric description of the cloud).
- Using OLS when X has measurement error attenuates the estimated slope compared with methods that account for errors in both variables (see the sketch after this list).
- Visual intuition can conflict with OLS: when eyeballing a fit, people tend to favor the orthogonal (major-axis) line rather than the conditional expectation.
- Model choice affects predictions: OLS estimates E[Y|X] while TLS/PCA characterizes the principal axis of variation.
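As a concrete illustration of the measurement-error point above, here is a small sketch showing how noise in X attenuates the OLS slope; the true slope and noise levels are assumed values, not taken from the source.

```python
# Errors-in-variables sketch: noise in X biases the OLS slope toward zero.
# The true slope and noise levels below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_slope = 1.0

x_true = rng.standard_normal(n)
y = true_slope * x_true + 0.1 * rng.standard_normal(n)   # small noise in Y only
x_noisy = x_true + 0.5 * rng.standard_normal(n)          # measurement error in X

slope_clean, _ = np.polyfit(x_true, y, deg=1)
slope_noisy, _ = np.polyfit(x_noisy, y, deg=1)

# Classical attenuation: E[slope_noisy] ~ true_slope * var(x_true) / (var(x_true) + var(noise))
print(f"OLS slope with exact X: {slope_clean:.3f}")       # ~1.0
print(f"OLS slope with noisy X: {slope_noisy:.3f}")       # ~0.8, biased toward zero
```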
Key facts
- The user generated data by transforming Gaussian latent variables with a dependency matrix, adding an offset, and drawing 20,000 samples.
- The covariance matrix of (X, Y) was computed, and the eigenvector associated with its largest eigenvalue was plotted to indicate the direction of maximum variance (first principal component).
- OLS was fitted with numpy.polyfit, which minimizes vertical (Y) squared errors and yields slope β_OLS = σ_xy / σ_x^2.
- PCA finds the direction of maximal joint variance; the corresponding eigenvector is not generally the OLS regression line.
- Total least squares (TLS) or orthogonal regression minimizes orthogonal distances and is more symmetric in X and Y than OLS.
- The PCA/TLS line and the OLS line coincide only in special cases, essentially perfect linear correlation; equal X and Y variances alone are not sufficient, since the OLS slope then equals the correlation while the major axis runs at 45° (illustrated in the sketch below).
- Commenters suggested small plotting markers to reduce overplotting and noted prior literature showing humans often intuit orthogonal fits rather than OLS.
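To make the coincidence condition concrete, here is a small sketch comparing the OLS slope with the major-axis slope; the covariance values are illustrative assumptions, not taken from the source.

```python
# Compare the OLS slope with the PCA major-axis slope for simulated bivariate
# normal data. Covariance values are illustrative assumptions.
import numpy as np

def ols_and_pca_slopes(var_x, var_y, cov_xy, n=50_000, seed=2):
    """Return (OLS slope, major-axis slope) for one simulated sample."""
    rng = np.random.default_rng(seed)
    cov = [[var_x, cov_xy], [cov_xy, var_y]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    sample_cov = np.cov(x, y)
    slope_ols = sample_cov[0, 1] / sample_cov[0, 0]   # beta_OLS = sigma_xy / sigma_x^2
    _, eigvecs = np.linalg.eigh(sample_cov)
    v = eigvecs[:, -1]                                # eigenvector of the largest eigenvalue
    return slope_ols, v[1] / v[0]

# Equal variances, correlation 0.8: OLS slope ~0.8, major-axis slope ~1.0 (they differ)
print(ols_and_pca_slopes(var_x=1.0, var_y=1.0, cov_xy=0.8))
# Near-perfect correlation: the two slopes approximately coincide
print(ols_and_pca_slopes(var_x=1.0, var_y=1.0, cov_xy=0.999))
```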
What to watch next
- Check the error model: decide whether X should be treated as exact or noisy before choosing OLS vs TLS/PCA.
- If your goal is prediction of Y given X, OLS targets E[Y|X] and is the appropriate estimator under the usual assumptions.
- If you want a symmetric description of the data cloud (major axis), consider PCA or TLS; otherwise, stick with OLS.
Quick glossary
- Ordinary least squares (OLS): A regression method that fits a line by minimizing the sum of squared vertical residuals, treating the independent variable as exact.
- Principal component analysis (PCA): A technique that finds orthogonal directions of maximum variance in multivariate data; the leading eigenvector of the covariance matrix points along the major axis.
- Total least squares (TLS): A fitting method that minimizes orthogonal distances from points to a line, treating errors in both variables symmetrically.
- Covariance matrix: A matrix summarizing pairwise variances and covariances of multivariate data; its eigenvalues and eigenvectors characterize spread and principal directions.
Reader FAQ
Does the OLS line go through the center of the data?
Yes — OLS passes through the sample mean (μx, μy); the apparent tilt comes from minimizing vertical errors, which flattens the slope relative to the major axis, not from the line missing the center of the cloud.
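A quick numerical check of this, using arbitrary illustrative data rather than the original post's values:

```python
# Check that the OLS line passes through (mean(x), mean(y)); data are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000)
y = 0.7 * x + rng.standard_normal(1_000)

slope, intercept = np.polyfit(x, y, deg=1)
print(np.isclose(slope * x.mean() + intercept, y.mean()))   # True
```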
Is PCA the same as linear regression?
No — PCA finds directions of maximal joint variance, while regression (OLS) estimates the conditional expectation E[Y|X] by minimizing vertical errors.
Should I use TLS instead of OLS for correlated data?
It depends on the noise assumptions: use TLS/PCA when errors in X and Y are both relevant; use OLS when X is effectively error-free. The source emphasizes method choice should match the data model.
Will PCA give better predictions for Y given X?
Not confirmed in the source.

Sources
- Why does a least squares fit appear to have a bias when applied to simple data?
- Ordinary least squares
- 7 Classical Assumptions of Ordinary Least Squares (OLS …
- In OLS regression, how can you have the errors correlated …