Physical Sciences
February 2, 2022

A parametric framework for multidimensional linear regression

While Ordinary Least Squares regression establishes fundamental concepts in data analysis, it requires the independent variable to be error-free and the variance of the error term to be constant. Dr Stanley Luck, statistics consultant and founding member of Vector Analytics LLC in the US, has developed a parametric framework for multidimensional linear regression. It offers a more general basis for measurement error regression, one that still applies when the data for both variables are subject to error.

Dr Stanley Luck, statistics consultant and founding member of Vector Analytics LLC, Delaware, was motivated to investigate the applied algebraic foundations of data analysis after working on a collaborative research and development project involving the identification of beneficial agronomic variation in maize. This project involved the application of genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) methods to identify genetic variants. After performing high-dimensional searches of the data, Luck observed that the results from the classification and regression tree (CART) analyses did not correspond well with the GWAS.

Luck uncovered an extensive research literature on the subject. His applied algebraic investigation of the merits of various effect size measures, and of their associated statistical methodologies, has already been reported in two recent journal publications. In this third phase of his research into the foundations of data analysis, he investigated the problem of fitting a multidimensional line to data that are subject to stochastic error (ie, random variation that can produce an unexpected outcome even when both the model and its parameters are correct). This led him to develop a novel parametric framework for multidimensional linear regression.

Luck identified beneficial agronomic variation in maize using genome-wide association studies.

Ordinary Least Squares regression
Ordinary Least Squares regression is a common statistical technique for modelling a two-dimensional linear relationship between an independent variable, x, and a dependent variable, y. It produces the straight line that minimises the sum of the squares (the least squares) of the difference between the observed and predicted values.
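The least-squares criterion described above can be illustrated with a minimal sketch. The data values and the function name `ols_fit` are invented for illustration, and the closed-form expressions used are the standard textbook ones for a single predictor, not code taken from Luck's work.

```python
import numpy as np

def ols_fit(x, y):
    """Return (intercept, slope) of the line minimising the sum of squared residuals in y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    # Closed-form least-squares slope: sum of cross-deviations over sum of squared x-deviations
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Hypothetical observations, roughly following y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

intercept, slope = ols_fit(x, y)
residuals = y - (intercept + slope * x)
sum_of_squares = float(np.sum(residuals ** 2))
print(f"intercept={intercept:.2f}, slope={slope:.2f}, SSE={sum_of_squares:.3f}")
```

Only the vertical distances (the residuals in y) enter the criterion, which is why the method implicitly treats x as error-free.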

Luck explains that while Ordinary Least Squares regression serves a definitive role in establishing fundamental concepts in data analysis, it requires the independent variable to be error-free and the variance of the residual, or error term, to be constant (homoscedastic). If the assumption of constant error variance is violated, the Weighted Least Squares method can be used instead. This is an extension of Ordinary Least Squares regression in which non-negative weights are applied to the data points. The error-free condition, however, remains a requirement of the Moore-Penrose inverse algorithm that is used to estimate the parameters of the Weighted Least Squares regression model. Furthermore, if the independent variable is subject to error, the Ordinary Least Squares estimate of the slope is biased towards zero, and the Pearson correlation coefficient, which measures the strength of the linear relationship between two variables, is likewise attenuated. This has spurred Luck’s longstanding research effort to develop a more general framework for measurement error regression in which the data for both variables can be subject to error: an errors-in-variables regression model.
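The attenuation effect mentioned above can be reproduced with a small simulation. This is a sketch under stated assumptions: the true slope, the noise levels, and the sample size are all invented, and the errors are taken to be independent and normally distributed. Under those assumptions the expected Ordinary Least Squares slope shrinks by the factor var(x) / (var(x) + var(error in x)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_slope = 2.0                                   # hypothetical true slope

x_true = rng.normal(0.0, 1.0, n)                   # error-free predictor, variance 1
y = true_slope * x_true + rng.normal(0.0, 0.5, n)  # response with its own error
x_obs = x_true + rng.normal(0.0, 1.0, n)           # predictor measured with error, variance 1

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_clean = ols_slope(x_true, y)                 # close to the true slope
slope_noisy = ols_slope(x_obs, y)                  # shrunk towards zero
expected_noisy = true_slope * 1.0 / (1.0 + 1.0)    # true slope times the shrinkage factor

corr_clean = np.corrcoef(x_true, y)[0, 1]
corr_noisy = np.corrcoef(x_obs, y)[0, 1]           # correlation is attenuated as well

print(f"slope without x-error: {slope_clean:.3f}")
print(f"slope with x-error:    {slope_noisy:.3f} (expected about {expected_noisy:.1f})")
print(f"correlation: {corr_clean:.3f} -> {corr_noisy:.3f}")
```

With equal variances for the predictor and its measurement error, the fitted slope comes out at roughly half the true value, even though the underlying linear relationship is unchanged.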

Comparison of various measurement error regression methods.

Measurement error
The study of measurement error is a sub-discipline of statistics with an extensive literature and a long history. Luck relates how the wide range of opinions about both the statistical framework and the methodology of measurement error suggests that the standard textbook treatment of linear regression may be incomplete. Moreover, confusion about the fundamental role of measurement error models and Weighted Least Squares optimisation in partitioning the effects of errors contributes to the problem of irreproducibility in data analysis.

The chain rule and linear regression
Luck discusses his novel idea that statistical measures of linear dependence, including covariance, correlation, and regression slope, are all subject to the chain rule. The chain rule is used to differentiate a function of a function, that is, a composite function of the form f(g(x)). He also notes that the standard linear regression framework is bivariate because it is based on the Cartesian representation y = f(x), in which the dependent variable y is explained by the independent variable x. Applying the chain rule to this linear relationship led him to discover a novel parametric framework for linear regression.
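As a worked illustration of how the chain rule acts on linear relationships, consider two exact (error-free) linear maps. The symbols a, b, c, d and z are introduced here purely for illustration and are not taken from Luck's papers.

```latex
\[
\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx},
\qquad
y = a + b\,x,\quad z = c + d\,y
\;\Longrightarrow\;
\frac{dz}{dx} = d\,b .
\]
\[
\operatorname{cov}(x, z) = d\,\operatorname{cov}(x, y) = d\,b\,\operatorname{var}(x).
\]
```

In this idealised setting the slope of z on x is the product of the two intermediate slopes, and the covariance carries the same product of factors. How these relations behave once every variable is observed with error is the question the parametric framework addresses.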

Bounded ranges of the regression slopes for the errors in both x and y.

Parametric vs Cartesian representation
A curve can be defined using a Cartesian equation, an equation in terms of x and y only. Alternatively, a parametric equation can be used, in which both x and y are functions of a third variable (usually t). Luck demonstrates how employing a parametric representation, rather than a Cartesian equation, enabled him to obtain a more general framework for linear regression that also takes the experimental error in all variables into account. Using the chain rule, he transformed the ordinary linear regression method to the parametric representation (x(t), y(t)), with t corresponding to an element of a convex set (a set in which the line segment joining any two of its points lies entirely within the set).
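The difference between the two representations can be sketched numerically. The example below makes a strong simplifying assumption, namely that the parameter values t are known exactly, and all coefficients and noise levels are invented. It is an illustration of the chain-rule identity dy/dx = (dy/dt) / (dx/dt) and should not be read as Luck's estimation algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
b_x, b_y = 1.0, 2.0                            # hypothetical slopes dx/dt and dy/dt

t = rng.normal(0.0, 1.0, n)                    # parameter values, assumed known here
x = 0.5 + b_x * t + rng.normal(0.0, 0.7, n)    # x(t) observed with error
y = -1.0 + b_y * t + rng.normal(0.0, 0.7, n)   # y(t) observed with error

def ols_slope(u, v):
    """Least-squares slope of v on u."""
    return np.cov(u, v)[0, 1] / np.var(u, ddof=1)

slope_x_on_t = ols_slope(t, x)                 # estimates dx/dt
slope_y_on_t = ols_slope(t, y)                 # estimates dy/dt

# Chain rule: dy/dx = (dy/dt) / (dx/dt)
parametric_slope = slope_y_on_t / slope_x_on_t

# Cartesian fit of y on x, which ignores the error in x
cartesian_slope = ols_slope(x, y)

print(f"parametric dy/dx: {parametric_slope:.3f} (true value {b_y / b_x:.1f})")
print(f"Cartesian  dy/dx: {cartesian_slope:.3f}")
```

Because both x and y are regressed on the same parameter, the error in x no longer sits in the role of the independent variable, so the ratio of the two slopes stays close to the true value while the Cartesian estimate is pulled towards zero.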

Multidimensional linear measurement error regression
Luck then extended his parametric framework for two-dimensional linear regression from bivariate point data to multidimensional vector data. The result is a new framework for fitting a multidimensional line to a set of linearly related variable vectors.

Overdispersion of the residual effects in replicated RNA-Seq data.
Distribution of the statistical parameters in parametric linear regression.

In this measurement error model, the relationship between variable vectors is represented by a weighted average, with the weights determined from an error model for the input data. Here, the weighted average corresponds to the minimum coefficient of variation for the error (the smallest ratio of the standard deviation of the error to the mean) and to the optimal signal-to-noise ratio. In this ratio, the signal is the difference in response values and the noise is the natural variation within the system. This weighted average serves as the independent variable for parametric multidimensional linear regression. Luck adds that, without any loss of generality, t can be regarded as a fixed variable because of the homogeneous coordinates property of the slope vector: algebraically, all points are treated equally.
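To make the idea of error-model weights concrete, the sketch below uses inverse-variance weighting, a standard technique that minimises the error variance of a weighted average of independent measurements of the same quantity. The article does not specify Luck's error model or his exact weights, so the choice of weights, the function names, and the error standard deviations here are all assumptions.

```python
import numpy as np

def inverse_variance_weights(sigmas):
    """Weights proportional to 1 / sigma^2, normalised to sum to one."""
    inv = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    return inv / inv.sum()

def weighted_error_sd(weights, sigmas):
    """Standard deviation of a weighted sum of independent errors."""
    weights = np.asarray(weights, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    return float(np.sqrt(np.sum((weights * sigmas) ** 2)))

# Hypothetical error standard deviations for three replicate measurements
sigmas = np.array([1.0, 2.0, 4.0])

w_opt = inverse_variance_weights(sigmas)
w_equal = np.full(sigmas.size, 1.0 / sigmas.size)

sd_opt = weighted_error_sd(w_opt, sigmas)      # error SD with inverse-variance weights
sd_equal = weighted_error_sd(w_equal, sigmas)  # error SD with a plain average

print("weights:", np.round(w_opt, 4))
print(f"error SD, inverse-variance weights: {sd_opt:.4f}")
print(f"error SD, equal weights:            {sd_equal:.4f}")
```

For a fixed mean, a smaller error standard deviation means a smaller coefficient of variation and a higher signal-to-noise ratio. In this toy case the weighted average is less noisy than even the most precise single measurement, whereas the unweighted average is noisier than that measurement because it gives the poorest replicate equal say.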

In the parametric representation, the covariances form a parametric covariance vector. Moreover, the covariance is a measure of the linear dependence between the variable vectors and is therefore subject to the chain rule. Consequently, Luck was able to achieve a parametric generalisation of the Pearson correlation in the form of a parametric correlation tensor, which measures the multi-way parametric correlation (a tensor is a generalisation of scalars and vectors).

Conical quadratic error parameters for replicated RNA-Seq data.
Noise reduction for a weighted average of variable vectors.

Practical applications of the multidimensional parametric framework
Among the many possible applications for the parametric framework for linear regression in the big data world is RNA sequencing (RNA-Seq). RNA-Seq is used to find the exact sequence of the building blocks that make up the RNA (ribonucleic acid) molecules in a cell. It analyses the transcriptome, the collection of gene readouts in a cell, to learn more about which of the genes encoded in the DNA are turned on or off. The simplest way to quantify RNA-Seq gene expression is to count the number of reads that align with each gene. The result is known as a read count, where a read is a short nucleotide sequence that has been sequenced, so the count is the number of reads that overlap a particular genomic position.
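The read-counting step described above can be sketched as a toy example. The gene intervals, read positions, and the function name `count_reads` are hypothetical, and real RNA-Seq pipelines rely on dedicated alignment and quantification tools; this only illustrates the counting idea.

```python
# Hypothetical gene intervals: name -> (start, end), with end exclusive
genes = {"geneA": (100, 200), "geneB": (300, 450)}

# Hypothetical aligned reads as (start, end) positions on the same reference
reads = [(90, 120), (150, 180), (190, 210), (310, 340), (500, 530)]

def count_reads(genes, reads):
    """Count, for each gene, the reads whose interval overlaps the gene interval."""
    counts = {name: 0 for name in genes}
    for r_start, r_end in reads:
        for name, (g_start, g_end) in genes.items():
            # Two half-open intervals overlap when each starts before the other ends
            if r_start < g_end and r_end > g_start:
                counts[name] += 1
    return counts

read_counts = count_reads(genes, reads)
print(read_counts)  # {'geneA': 3, 'geneB': 1}
```

The last read falls outside both intervals and is not counted. Replicate experiments yield one such count per gene per replicate, and it is the scatter between those replicate counts that a measurement error model has to describe.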

In this research, Luck applies the multidimensional parametric framework algorithm to publicly available replicate RNA-Seq data. He presents a conical dispersion analysis of the error and shows how the algorithm estimates the measurement error regression parameters together with the quadratic error in RNA-Seq.

Wider implications
Luck remarks that the fact that statistical measures for linear dependence, such as regression slopes, covariance, and correlation coefficients, are subject to the chain rule has broad implications for multivariate statistics and data science. He concludes that ‘outreach that communicates the key findings from this work and helps to remove misconceptions about measurement error is important because of the application of linear regression methodology in many disciplines’.

RNA sequencing is among the many possible applications for the parametric framework for linear regression.

Personal Response

What do you envisage to be the next stage in your research into the foundations of data analysis?

The next stage of my research involves the development of data analysis methods for complex systems, such as nursing homes, organisms, and economic systems. I expect that accounting for the degrees of freedom, unbalanced sample size, and measurement error is necessary for obtaining reproducible results in data analysis for such systems. However, there is an additional mathematical complication arising from the fact that the performance of such systems is determined by complex interactions between many components. Furthermore, there will be alternative ways to optimise performance; there is not a unique well-defined solution. Instead, the objective is to explore the space of solutions using data analysis methods such as CART, to obtain functional information that helps in the development of engineering models for predicting improved performance. The specification of cost–benefit trade-offs between different forms of variation is required and the criteria for substantive significance for effect size will vary depending on the particular application. There is no ‘one effect size fits all’ approach in data analysis for complex systems.

This feature article was created with the approval of the research team featured. This is a collaborative production, supported by those featured to aid free of charge, global distribution.
