2.2.10. PCA - Principal Component Analysis
Principal component analysis (PCA), also known as the Hotelling transform or the Karhunen-Loève transform, is a technique for reducing the total number of variables. It is used when the data in a sample contain redundancy, that is, correlated data, probably because several variables measure the same event. This redundancy is what makes it possible to reduce the number of variables. To check whether it exists, analyze the correlation matrix.
PCA rewrites the coordinates of a data set in another coordinate system that is more convenient for analyzing the data. The new coordinates are linear combinations of the original variables, are represented on orthogonal axes, and are obtained in order of decreasing variance. The first principal component therefore holds more information about the data than the second, which holds no information already captured by the first, and so on. Because the axes are orthogonal, the principal components are uncorrelated.
The number of principal components equals the number of original variables, and together they carry the same statistical information. The method nevertheless allows the number of variables to be reduced because, when the variables are strongly correlated, the first few principal components usually hold more than 90% of the statistical information in the original data.
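The steps described above can be sketched in a few lines of Python with NumPy. This is only an illustration of the technique, not the code used by the PCA command; the function name `pca_from_data` and the random sample are hypothetical.

```python
import numpy as np

def pca_from_data(data):
    """Eigen-decomposition of the correlation matrix of `data` (rows = observations)."""
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)      # m x m correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return eigvals, eigvecs, cumulative

# Hypothetical sample: three noisy measurements of the same underlying event.
rng = np.random.default_rng(0)
event = rng.normal(size=200)
data = np.column_stack([event + 0.1 * rng.normal(size=200) for _ in range(3)])

eigvals, eigvecs, cumulative = pca_from_data(data)
print(eigvals)      # as many eigenvalues as variables; they sum to m = 3
print(cumulative)   # share of total variance held by the first 1, 2, 3 components
```

Because the three columns are almost copies of one another, nearly all of the variance ends up in the first component.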
Consider, for example, a data set represented by two normalized variables whose original axes describe a plane. Representing a point requires information from both variables.
Original Data
Now consider a new axis that results from a linear combination of these two variables. This new axis points in the direction of greatest variability in the sample and is therefore a principal component of this data set.
PCA Axis
In this example, a point in the plane can be described by a single coordinate, or score, on the new axis. The relationship between the variables is then given by their factor loadings: the cosine of the new axis's angle of inclination for the variable on the horizontal axis, and the sine of that angle for the variable on the vertical axis.
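The two-variable case can be checked numerically. The sketch below assumes a hypothetical correlation of 0.9 between the two normalized variables; the unit-length direction of the new axis then has the cosine and sine of the inclination angle as its components.

```python
import numpy as np

r = 0.9                                   # hypothetical correlation between the variables
corr = np.array([[1.0, r],
                 [r, 1.0]])

eigvals, eigvecs = np.linalg.eigh(corr)
first = eigvecs[:, np.argmax(eigvals)]    # direction of greatest variability
first = first * np.sign(first[0])         # fix the sign so the axis points "right"

angle = np.arctan2(first[1], first[0])    # inclination of the new axis
print(np.degrees(angle))                  # inclination in degrees
print(first, (np.cos(angle), np.sin(angle)))
```

For two normalized, positively correlated variables the new axis lies on the diagonal of the plane, and the variance along it is 1 + r.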
With more variables the example can no longer be drawn, but the same technique applies, always preserving the orthogonality of the new axes.
PCA can thus reduce the number of variables. The process itself accumulates the redundancy present in the data into the principal components. Even so, from the PCA results alone it is usually not possible to identify which variables belong to which factor and, therefore, to interpret the meaning of each component and understand the dependence structure.
To solve this, rotation procedures (VARIMAX, for example, for orthogonal rotation) are applied to the principal components. These procedures are linear transformations of the factors, and their purpose is to allow the meaning of the principal components to be interpreted. Put simply, rotation makes the data easier to interpret because it aims to increase the loadings of each variable on a few factors and to reduce the number of factors on which each variable loads.
One point of confusion must be cleared up: PCA is not the same as factor analysis. Although the two methods often give similar results, there are major conceptual differences between them. Put simply, factor analysis looks for a causal structure among the variables, while PCA only seeks to reduce the set of variables while keeping as much explanatory power (total variance) as possible. Even so, PCA can be used to associate a principal component with a real factor.
The result of PCA is a set of variables (principal components), each related to the original variables by a linear combination whose coefficients are the factor loadings. In the same way, each original variable is related to the new variables through the factor loadings. Because the original variables may be heterogeneous, the sample data can be normalized to make the results easier to interpret and the components easier to extract. This can be done by using the correlation matrix of the original variables.
PCA is also used in image processing and facial recognition to reduce the number of variables that represent an image.
In finance, PCA can be used, for example, to represent the behavior of a yield curve (whose points in time are correlated) with a smaller set of variables, so that the behavior of the whole curve can be simulated through these new variables.
Another financial application of PCA is modeling or choosing the variables for use with APT (Arbitrage Pricing Theory).
PCA relies on the concepts of eigenvalue and eigenvector, summarized as follows:
Let T: V -> V be a linear operator. A non-null vector v belonging to V is called an eigenvector of T if there exists a real number λ such that:
$T(v) = \lambda v$
where λ is called the eigenvalue of T associated with the eigenvector v.
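The definition can be verified numerically. The sketch below uses a small symmetric matrix as a hypothetical operator T (correlation matrices are symmetric, which is why `numpy.linalg.eigh` is used) and checks that applying T to each eigenvector only rescales it by its eigenvalue.

```python
import numpy as np

T = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # hypothetical linear operator on R^2

eigvals, eigvecs = np.linalg.eigh(T)       # columns of eigvecs are the eigenvectors

for lam, v in zip(eigvals, eigvecs.T):
    residual = T @ v - lam * v             # zero (up to rounding) for an eigenpair
    print(lam, np.max(np.abs(residual)))
```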
2.2.10.1. PCA Command
Access:
- Menu - Metrixus | PCA
- Toolbar Metrixus
Description:
Performs PCA on the supplied data. Returns a set of indicators for the analysis, including the factor loadings of the variables. Uses the correlation matrix of the data (columns or variables), either calculated or supplied, to carry out the normalization.
Allows the results to be interpreted and the principal components to be selected according to three criteria:
- Kaiser criterion, or eigenvalue 1: only components with an eigenvalue greater than 1 are considered, which means the component explains more variance than a single variable;
- Proportion criterion: the cumulative proportion of variance is examined and a cutoff level is established, representing the total variance explained by the selected components; and
- Scree test: through analysis of a chart, only the components before a certain break in the curve, if one exists, are considered.
Only the Kaiser and Proportion criteria are used to select principal components. Choose the Kaiser criterion or the Proportion criterion in the options. For the Proportion criterion, you must enter the cutoff for the cumulative proportion at which components are considered. The default is 80%.
To build the Scree test, select the option Build "Scree test" chart. Selecting it has no effect on which principal components are selected; it is used only to compare the chart with the selected criterion.
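The two selection criteria can be sketched as follows. The function names are hypothetical, and the eigenvalues are the ones shown in the yield curve example later in this section.

```python
import numpy as np

eigenvalues = np.array([6.774, 0.134, 0.054, 0.028, 0.004, 0.003, 0.002])

def select_by_kaiser(eigenvalues):
    """Number of components whose eigenvalue is greater than 1."""
    return int(np.sum(np.asarray(eigenvalues) > 1.0))

def select_by_proportion(eigenvalues, cutoff=0.80):
    """Number of leading components needed to reach the cumulative cutoff."""
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    count = int(np.searchsorted(cumulative, cutoff)) + 1
    return min(count, len(eigenvalues))

print(select_by_kaiser(eigenvalues))             # 1
print(select_by_proportion(eigenvalues))         # 1 (default cutoff of 80%)
print(select_by_proportion(eigenvalues, 0.98))   # 2
```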
The PCA command can apply a VARIMAX orthogonal rotation to the factor loading matrix, which makes the results easier to interpret. To do so, select the option VARIMAX rotation, enter in VARIMAX precision the precision to be used in the iterative axis rotation, and enter in Maximum iterations the maximum number of iterations to be performed. The defaults are a precision of 0.001 and 50 iterations. The iterative process stops when it reaches either the maximum number of iterations or the specified precision. The VARIMAX rotation option creates additional data for interpreting the results.
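A common way to compute a VARIMAX rotation is the SVD-based iteration sketched below. It is not necessarily the algorithm used by this command: in particular, the stopping rule (relative improvement of the criterion below the precision) is an assumption, since the manual does not state how precision is measured. The loadings are the F1 and F2 columns of the yield curve example later in this section.

```python
import numpy as np

def varimax(loadings, precision=1e-3, max_iter=50):
    """Return (rotated_loadings, rotation_matrix) for an m x k loading matrix."""
    L = np.asarray(loadings, dtype=float)
    m, k = L.shape
    R = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Gradient of the varimax criterion, mapped back to a rotation via SVD.
        grad = L.T @ (rotated**3 - rotated * (rotated**2).sum(axis=0) / m)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt
        new_objective = s.sum()
        if new_objective < objective * (1.0 + precision):
            break                          # assumed stopping rule, see lead-in
        objective = new_objective
    return L @ R, R

factors = np.array([[0.975, -0.199], [0.986, -0.163], [0.993, -0.086],
                    [0.996, 0.037], [0.989, 0.120], [0.974, 0.177],
                    [0.973, 0.115]])
rotated, R = varimax(factors)
print(R)          # orthogonal rotation matrix
print(rotated)    # rotated loadings
```

Because R is orthogonal, the sum of squared loadings of each variable is the same before and after the rotation.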
The data range must be a contiguous area in which each column holds the values (real numbers) of one variable. At least two columns are required, and the number of rows must be at least equal to the number of columns. Cells that contain text or are empty are ignored, together with the corresponding data in the other columns. The data range must be selected before this command is called.
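The row filtering described above can be illustrated with plain Python. The helper names are hypothetical and the sketch is only an approximation of the rule: for instance, it would accept a text cell that happens to contain a number.

```python
import math

def clean_rows(rows):
    """Keep only rows in which every cell is a real number."""
    cleaned = []
    for row in rows:
        try:
            values = [float(cell) for cell in row]
        except (TypeError, ValueError):
            continue                      # text or empty cell: drop the whole row
        if all(math.isfinite(v) for v in values):
            cleaned.append(values)
    return cleaned

def has_valid_shape(rows):
    """At least two columns and at least as many rows as columns."""
    if not rows:
        return False
    n_cols = len(rows[0])
    return n_cols >= 2 and len(rows) >= n_cols

sample = [[1, 2], ["x", 3], [4, ""], [5, 6], [None, 7]]
valid = clean_rows(sample)
print(valid)                    # [[1.0, 2.0], [5.0, 6.0]]
print(has_valid_shape(valid))   # True
```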
Principal component analysis can also be performed directly from the correlation matrix of the variables. To do so, select the option Range of cells represents correlation matrix. In this case, the information about the data sample (means, standard deviations, and correlation matrix) is not reported. The input matrix must be an area with the same number of rows and columns and must contain at least two rows and two columns of correlations. Again, cells that contain text or are empty are ignored, together with all corresponding data in the other columns. This may cause correlation matrix format errors or inconsistent results! The correlation matrix option can be selected only if the range of valid cells forms a square, symmetric matrix that represents correlations between data (values between -1 and 1).
Important:
When the input data represent a correlation matrix, the matrix must be square (number of columns equal to number of rows) and symmetric, with a diagonal equal to 1 and all other values between -1 and 1!
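These requirements translate directly into a validation routine. The sketch below is a hypothetical helper, with an assumed numerical tolerance, and not the check performed by the command itself.

```python
import numpy as np

def is_valid_correlation_matrix(matrix, tol=1e-8):
    """Square (at least 2 x 2), symmetric, unit diagonal, entries in [-1, 1]."""
    a = np.asarray(matrix, dtype=float)
    if a.ndim != 2 or a.shape[0] != a.shape[1] or a.shape[0] < 2:
        return False
    if not np.allclose(a, a.T, atol=tol):
        return False
    if not np.allclose(np.diag(a), 1.0, atol=tol):
        return False
    return bool(np.all(np.abs(a) <= 1.0 + tol))

print(is_valid_correlation_matrix([[1.0, 0.5], [0.5, 1.0]]))   # True
print(is_valid_correlation_matrix([[1.0, 0.5], [0.4, 1.0]]))   # False: not symmetric
```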
Important:
If any mathematical operator has been applied to the original data, the PCA represents linear combinations of the variables as modified by that operator. Remember to reverse the operation before using the principal components!
This command creates a new file containing the results in tables and charts.
Charts and worksheets without colors are easier to print and also run faster.
The PCA result is a new worksheet of static data, that is, data not linked to the source from which the result was produced. This new sheet contains the following information, where n is the number of valid observations and m is the number of variables (when a correlation matrix is supplied, the input data are an m x m matrix):
- Mean: mean of each variable or column. Not reported when a correlation matrix is used as input.

- St. Dev.: sample standard deviation of each variable or column. Not reported when a correlation matrix is used as input.

Important:
No mathematical operator (a logarithm, for example) is applied when the statistical parameters of the data, such as the mean and standard deviation, are determined. The mean shown must therefore be interpreted as the expected value, and the standard deviation as the volatility!
Important:
All data are treated as samples, so all standard deviation calculations are based on the sample formula, not the population formula.
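The difference between the two formulas is the divisor: n - 1 for a sample and n for a population. In NumPy this is controlled by the `ddof` argument, as in this small sketch with hypothetical values.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical column

sample_std = values.std(ddof=1)       # divides by n - 1 (the formula described above)
population_std = values.std(ddof=0)   # divides by n

print(sample_std, population_std)
```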
- Correlations: m x m matrix containing the correlations between the variables. Not reported when a correlation matrix is used as input.

- Eigenvalues: characteristic values of the correlation matrix of the variables. As many eigenvalues are shown as there are variables or columns (m), and each eigenvalue represents the variance explained by the corresponding component. The eigenvalues are determined using a Hessenberg reduction of the correlation matrix and the QR algorithm, based on Householder transformations, for the iterations. The precision of the QR iterations used to determine the eigenvalues is fixed at 0.00001.

Important:
Each component explains less variance than its predecessor, and that variance has not been explained by any earlier component!
Important:
The correlation matrix, whether obtained from the data or supplied directly, must be positive definite! Otherwise there will be eigenvalues less than 0 and the PCA results will not be valid! If negative eigenvalues occur, the application stops!
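The idea behind the QR iterations and the negative eigenvalue check can be sketched as below. This is a simplified illustration under stated assumptions: it uses the plain (unshifted) QR iteration from NumPy, omits the Hessenberg reduction, and measures convergence by the largest below-diagonal entry, which may differ from what the command does. The 3 x 3 matrices are hypothetical.

```python
import numpy as np

def qr_eigenvalues(matrix, precision=1e-5, max_iter=10_000):
    """Eigenvalues of a symmetric matrix by unshifted QR iteration."""
    a = np.array(matrix, dtype=float)
    for _ in range(max_iter):
        q, r = np.linalg.qr(a)
        a = r @ q                          # similar matrix, same eigenvalues
        if np.max(np.abs(np.tril(a, k=-1))) < precision:
            break                          # below-diagonal part has vanished
    eigvals = np.sort(np.diag(a))[::-1]
    if eigvals[-1] < 0:
        raise ValueError("negative eigenvalue: matrix is not positive definite")
    return eigvals

corr = np.array([[1.0, 0.8, 0.6],
                 [0.8, 1.0, 0.7],
                 [0.6, 0.7, 1.0]])
print(qr_eigenvalues(corr))                # eigenvalues in decreasing order
```

A matrix such as [[1, 0.9, -0.9], [0.9, 1, 0.9], [-0.9, 0.9, 1]] looks like a correlation matrix but has a negative eigenvalue, so the function raises an error for it.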
- Difference: the difference between successive eigenvalues. No value is shown for the last component.

- Proportion: the percentage of the total variance represented by each component.

- Accumulated: the cumulative percentage of variance up to the current component.

Important:
Analyzing the proportion and the accumulated proportion makes it possible to apply the proportion criterion of the PCA technique.
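The three indicators follow directly from the eigenvalues. The sketch below uses the rounded eigenvalues from the yield curve example later in this section, so the last digit of a result may differ from the table shown there.

```python
import numpy as np

eigenvalues = np.array([6.774, 0.134, 0.054, 0.028, 0.004, 0.003, 0.002])

difference = eigenvalues[:-1] - eigenvalues[1:]    # no value for the last component
proportion = eigenvalues / eigenvalues.sum()       # share of the total variance
accumulated = np.cumsum(proportion)                # cumulative share

print(np.round(difference, 3))
print(np.round(100 * proportion, 2))
print(np.round(100 * accumulated, 2))
```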
- Factors: table containing the k factor loadings for each variable, generated according to the criterion chosen. Each loading is the correlation between a factor and a variable (the squared loading is the coefficient of determination, that is, the percentage of the variable's variance explained by the factor). If the Kaiser criterion (eigenvalue greater than 1) is chosen, only factors whose eigenvalue is greater than 1 are considered.

- PCA: table containing the k principal components, that is, the k normalized factor loadings. The principal components equal the factor loadings divided by the square root of the respective eigenvalues. They also equal the eigenvectors of the correlation matrix.

Important:
There is no difference in direction between the Factors and PCA vectors. PCA is only a normalization of the factor loadings. Both Factors and PCA can be used as the result of principal component analysis. To compare principal components, it is useful to look at the normalized factor loadings!
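The relationship between the two tables can be checked with the figures of the yield curve example later in this section: multiplying each PCA column by the square root of its eigenvalue should give back the Factors column, up to the rounding of the published values.

```python
import numpy as np

pca = np.array([[0.375, -0.545], [0.379, -0.445], [0.382, -0.234],
                [0.383, 0.102], [0.380, 0.327], [0.374, 0.484],
                [0.374, 0.314]])               # PCA1, PCA2 columns
eigenvalues = np.array([6.774, 0.134])         # eigenvalues of components 1 and 2

factors = pca * np.sqrt(eigenvalues)           # loadings = eigenvector * sqrt(eigenvalue)
print(np.round(factors, 3))

unit_length = (pca**2).sum(axis=0)             # each PCA column has length close to 1
print(np.round(unit_length, 3))
```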
Important:
Because the original data are normalized through the correlation matrix, the principal components are generated from data with mean 0 and variance 1. After new data are generated from the factor loadings and scores, they must be converted back to the mean and variance of the actual variable!
- Exp. Var.: the variance explained by each principal component. It corresponds to the variance given by the eigenvalues of the principal components.

- Communality: percentage of a variable's variance that is explained by the selected principal components or factor loadings. It equals the sum of the squared factor loadings of the variable.

- Total Var.: the total variance extracted by the selected principal components. It equals the sum of the variances explained by each selected component and also the sum of the communalities of all variables. In other words, the total variance explained by the principal components equals the sum of the percentages of each variable's variance explained by these components.
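These three quantities are sums of squared loadings taken in different directions. The sketch below recomputes them from the rounded Factors table of the yield curve example later in this section, so small rounding differences from the published figures are expected.

```python
import numpy as np

factors = np.array([[0.975, -0.199], [0.986, -0.163], [0.993, -0.086],
                    [0.996, 0.037], [0.989, 0.120], [0.974, 0.177],
                    [0.973, 0.115]])              # rows V1..V7, columns F1, F2

communality = (factors**2).sum(axis=1)      # per variable: sum over factors
explained_var = (factors**2).sum(axis=0)    # per factor: sum over variables
total_var = communality.sum()               # equals explained_var.sum()

print(np.round(communality, 3))
print(np.round(explained_var, 3))
print(round(total_var, 3))
```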

- VARIMAX: matrix of factor loadings after VARIMAX orthogonal rotation, which makes the factors easier to interpret. It is a set of additional data generated according to the precision and iteration parameters set for VARIMAX rotation in the PCA command options. It contains:
- Orthogonal rotation matrix ORT.: the linear transformation applied to the original factor loadings, obtained by VARIMAX;
- Matrix of rotated factor loadings;
- The variance explained by the rotated factor loadings;
- Communality between the variables and the rotated factor loadings; and
- Total variance explained by the rotated principal components.
VARIMAX is one of the most common orthogonal rotations and was originally proposed by Henry Kaiser.
The purpose of rotation is only to make the results easier to interpret. The factor loadings and the variance explained by each factor therefore change. However, the total variance and the communalities (the percentage of each variable's variance explained by the factor loadings) remain the same!
In studies of financial and economic models (APT, for example), rotation is very useful for interpreting the variables.
Example using a yield curve:
The following example illustrates the use of PCA on yield curves. The objective is to reduce the number of variables that make up the curve so that the yield curve can be simulated. The data for the vertices are close to the behavior of the yield curve in 2001, but must be regarded as hypothetical here.
- Proportion criterion of 98% (the new variables must account for 98% of the variation of all yield curves)
- VARIMAX rotation is not used, because the generated principal components do not need to be interpreted
- Scree test chart is built
- Brazilian yield curve for 2001 with several vertices. Range I1:O249
Because the logarithmic operator is attractive for determining the statistical parameters of financial asset returns, a logarithmic operator is applied to all data.
Results:
Total number of days and variables analyzed.
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
| Means | -1.693 | -1.663 | -1.622 | -1.556 | -1.509 | -1.452 | -1.470 |
| St. Dev. | 0.119 | 0.133 | 0.148 | 0.169 | 0.175 | 0.203 | 0.179 |
|
| Correlations | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
| V1 | 1.000 | 0.991 | 0.980 | 0.958 | 0.936 | 0.922 | 0.929 |
| V2 | 0.991 | 1.000 | 0.994 | 0.976 | 0.956 | 0.930 | 0.939 |
| V3 | 0.980 | 0.994 | 1.000 | 0.990 | 0.975 | 0.951 | 0.951 |
| V4 | 0.958 | 0.976 | 0.990 | 1.000 | 0.994 | 0.975 | 0.963 |
| V5 | 0.936 | 0.956 | 0.975 | 0.994 | 1.000 | 0.982 | 0.965 |
| V6 | 0.922 | 0.930 | 0.951 | 0.975 | 0.982 | 1.000 | 0.951 |
| V7 | 0.929 | 0.939 | 0.951 | 0.963 | 0.965 | 0.951 | 1.000 |
|
Statistical indicators for the data (natural logarithm of the yield curve).
|
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| Eigenvalue | 6.774 | 0.134 | 0.054 | 0.028 | 0.004 | 0.003 | 0.002 |
| Difference | 6.640 | 0.080 | 0.025 | 0.024 | 0.001 | 0.001 | |
| Proportion | 96.78% | 1.91% | 0.77% | 0.41% | 0.06% | 0.04% | 0.03% |
| Accumulated | 96.78% | 98.69% | 99.46% | 99.87% | 99.93% | 99.97% | 100.00% |
|
Table of eigenvalues of the correlation matrix. Note that the first principal component is responsible for more than 90% of the total variance of the yield curve.
Criterion: Proportion 98%
|
| Factors | F1 | F2 | | PCA | PCA1 | PCA2 |
| V1 | 0.975 | -0.199 | | V1 | 0.375 | -0.545 |
| V2 | 0.986 | -0.163 | | V2 | 0.379 | -0.445 |
| V3 | 0.993 | -0.086 | | V3 | 0.382 | -0.234 |
| V4 | 0.996 | 0.037 | | V4 | 0.383 | 0.102 |
| V5 | 0.989 | 0.120 | | V5 | 0.380 | 0.327 |
| V6 | 0.974 | 0.177 | | V6 | 0.374 | 0.484 |
| V7 | 0.973 | 0.115 | | V7 | 0.374 | 0.314 |
| Exp. Var. | 6.774 | 0.134 | | | | |
|
Because the criterion used was the proportion with a cutoff of 98%, two principal factors were included. The tables above show the factor loadings and the eigenvectors, or normalized principal components.
|
| Communality | Ci |
| V1 | 0.991 |
| V2 | 0.998 |
| V3 | 0.994 |
| V4 | 0.993 |
| V5 | 0.992 |
| V6 | 0.981 |
| V7 | 0.959 |
| | |
| Total Var. | 6.908 |
|
Communality table (percentage of the variance of each point of interest explained by the chosen principal components) and total variance.

Autovalores=Eigenvalues ; Fator=Factor
Scree test chart highlighting the dominant share of the first principal component in explaining all the variables.
The same results can be obtained if the correlation matrix is supplied directly.
It is important to point out that the principal components generated (the new variables) can be used, for example, to simulate the yield curve: the score of each component can be determined by applying the factor loadings (or PCAs) to the original data. With these scores, the statistical parameters of the new variables can be determined and an oscillation model (Monte Carlo, for example) applied to those parameters, so that the whole yield curve can be built from these oscillations.
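The score calculation and a simple Monte Carlo step can be sketched as follows. The data are hypothetical random series, not the yield curve of the example, and drawing the simulated scores from a normal distribution is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
common = rng.normal(size=(250, 1))
data = common + 0.2 * rng.normal(size=(250, 3))     # hypothetical correlated series

z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)   # normalized data
corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = z @ eigvecs                      # one score per observation and component
score_std = scores.std(axis=0, ddof=1)    # equals sqrt(eigenvalue) of each component

# Monte Carlo draw for the first component only, mapped back to normalized variables.
simulated_scores = rng.normal(0.0, score_std[0], size=1000)
simulated_z = np.outer(simulated_scores, eigvecs[:, 0])
print(simulated_z.shape)                  # (1000, 3)
```

The simulated values are still on the normalized scale and have to be converted back to the scale of each original variable before use.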
Once again, the principal components are generated from normalized data (using the correlations between the data). In this example, a logarithmic operator was also applied to the data. To simulate, for example, a point for the 21-day vertex from the principal components (factors and scores), the inverse of the normalization must be applied to the result, that is, multiply by the standard deviation and add the mean, both for the 21-day vertex. Finally, the logarithmic operator must be inverted, that is, the exponential of the result must be calculated.
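The two inversions can be written as a short function. This sketch makes two assumptions that the example does not state: variable V1 is used as a stand-in for the 21-day vertex, and the scores are standardized to unit variance so that they combine with the Factors loadings. The function name `rebuild_rate` is hypothetical.

```python
import numpy as np

mean_log, std_log = -1.693, 0.119          # Means and St. Dev. of V1 (log data)
loadings_v1 = np.array([0.975, -0.199])    # F1 and F2 loadings of V1

def rebuild_rate(std_scores, loadings, mean, std):
    """Map standardized component scores back to a rate for one variable."""
    z = np.dot(std_scores, loadings)   # normalized log rate (mean 0, variance 1 scale)
    log_rate = z * std + mean          # undo the normalization
    return np.exp(log_rate)            # undo the logarithm

# A day on which the first component is one standard deviation above its mean.
rate = rebuild_rate(np.array([1.0, 0.0]), loadings_v1, mean_log, std_log)
print(rate)
```

With all scores equal to zero the function returns the exponential of the mean of the log data, which is the central level of the rebuilt series.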
The chart below shows the original yield curve for the 21-, 42-, and 63-day vertices and the yield curve rebuilt from the PCA1 component alone (96.78% of the variance is explained by this component).

PCA1 x Dados Originais=PCA1 x Original Data ; Taxa=Rate ; Data=Date
Note the power of a single principal component to explain the variance of the whole yield curve.