
Instructions on how to run ParLeS

1. Data file format


Before you can run ParLeS you will need to format your data correctly.

Please read carefully. For more detailed instructions, with figures, see the document in
the \Program Files\ParLeS\Help directory of your computer.

a. For the Calibration data:

- The first row must contain header information, i.e. labels that include those for your
samples ('S1', 'S2', etc.), your response variables (e.g. 'carbon', 'pH', etc.) and labels for
your predictor variables (e.g. the wavelengths/wavenumbers).
- The first column should contain sample labels.
- The next column(s) should contain response variables (i.e. the y-variables). Note
that ParLeS accepts more than one response variable (see below).
- The columns after that should contain the predictor variables (i.e. the X-data, e.g.
NIR spectra).

Example of format for calibration file, containing labels, a single response variable
(OC) and NIR spectra (700-2500nm):

Label   OC     700    702    ...   2500
S1      2.56   0.35   0.37   ...   0.67
S2      1.35   0.32   0.33   ...   0.62
.
.
.
Etc.

Note: you can have more than one response variable in your files. You will be asked to
select the y-variable you want to model or test in the appropriate sections of the
software.
If you have more than one response variable, place them in the second, third, etc.
columns, after the sample labels and before the predictor variables (i.e. before the
X-data).

Example of format for calibration file, containing labels, three response variables (OC,
pH and N) and NIR spectra (700-2500nm):

Sample name   OC     pH    N     700    702    ...   2500
S1            2.56   6.5   1.3   0.35   0.37   ...   0.67
S2            1.35   7.3   1.8   0.32   0.33   ...   0.62
.
.
.
Etc.

b. For the Prediction/test data

- As for calibration data, but a prediction file requires a column of zeroes
in place of the y-variable, as these are the unknowns that you want to predict using
your model.

Example of format for prediction file:

Label   OC   700    702    ...   2500
S1      0    0.35   0.37   ...   0.67
S2      0    0.32   0.33   ...   0.62
.
.
.
Etc.

- If you want to test your models with independent test data then your file format
will be as in a. above, i.e. including the response variable data to be used to test the
models. Remember to also include headers as in a. above.

Prepare your data files for import into ParLeS and save them as
tab delimited (ASCII) text files.
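
The layout described in this section can be produced with a few lines of Python. This is only a minimal sketch: the function names (write_parles_table, as_prediction_rows) are hypothetical, the sample values are the ones from the examples above, and only three wavelength columns are written for brevity.

```python
import csv
import io

def write_parles_table(handle, x_labels, rows, y_names=("OC",), label_header="Label"):
    """Write labels, y-variable(s) and X-data as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header row: sample-label column, response names, then predictor labels.
    writer.writerow([label_header, *y_names, *x_labels])
    for label, y_values, spectrum in rows:
        writer.writerow([label, *y_values, *spectrum])

def as_prediction_rows(rows):
    """Replace every y-value with 0, as required for files of unknowns."""
    return [(label, [0] * len(y_values), spectrum) for label, y_values, spectrum in rows]

# Values copied from the calibration example (three wavelengths shown).
rows = [
    ("S1", [2.56], [0.35, 0.37, 0.67]),
    ("S2", [1.35], [0.32, 0.33, 0.62]),
]
wavelengths = [700, 702, 2500]

buffer = io.StringIO()
write_parles_table(buffer, wavelengths, rows)
calibration_text = buffer.getvalue()

buffer = io.StringIO()
write_parles_table(buffer, wavelengths, as_prediction_rows(rows))
prediction_text = buffer.getvalue()
```

To create real files, pass a handle opened with open(path, "w", encoding="ascii", newline="") instead of the in-memory buffer.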

2. Importing data into ParLeS

a. Import data for modelling


In this tab, you may choose to:
(i) select a file (with the above formatting) for modelling, or

(ii) check the box labelled 'Check to join files from a directory' to merge
multiple spectroscopic files in x,y format (e.g. where x is frequency and y is
reflectance) into a single file.

If you select (i) then:


- Use the 'Get file for modelling' browse folder button to select the path and your data
file.
- Press the 'IMPORT DATA FOR MODELLING' button and you will see the header
information of your file
- Using the numeric control 'Total number of y variables', select the total number of
response variables in your file. For example, if you have 3 response variables as in the
example above then you will write '3' in this numeric control. If you only have 1
response variable then write '1'.
- Using the numeric control 'Select y variable for modelling', select the response
variable you want to model. Remember that ParLeS uses the PLSR1 algorithm, i.e.
models a single y-variable at a time. Following from the above example, if you want to
model pH then you would write '2' in this control, if you want to model N then you
write '3'.
- Press the 'IMPORT DATA FOR MODELLING' button again. If your data file is correctly
formatted and you have correctly identified the total number of y-variables and the
y-variable you want to model, you should see a sample of your data in the windows
labelled 'y variables', 'Labels', 'Selected y', 'X variables' and ' '. You will also see a
sample of your spectra on the graph.
- If you cannot see the correct data in the windows, or you cannot see the spectra, then
before you proceed you will need to check your data file and, if necessary, remake the
file by carefully following the above instructions.
If the file format is incorrect or you have incorrectly identified the total number of y
variables in your data file then you will be able to see this in the sample data windows
and more than likely your spectra will not plot correctly.
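
The role of the two numeric controls can be illustrated outside ParLeS. The sketch below is not ParLeS code; split_parles_table is a hypothetical helper that splits a correctly formatted table into labels, the selected y-variable and the X-data, using the same 1-based numbering as the controls.

```python
def split_parles_table(text, total_y, selected_y):
    """Split tab-delimited text into labels, the selected y-variable and X.

    total_y and selected_y mirror the two numeric controls; selected_y is
    1-based, so 2 picks the second response column.
    """
    if not 1 <= selected_y <= total_y:
        raise ValueError("selected_y must lie between 1 and total_y")
    lines = [line.split("\t") for line in text.splitlines() if line.strip()]
    header, body = lines[0], lines[1:]
    x_start = 1 + total_y                      # column 0 holds the sample labels
    x_labels = header[x_start:]
    labels = [row[0] for row in body]
    y = [float(row[selected_y]) for row in body]
    X = [[float(value) for value in row[x_start:]] for row in body]
    return labels, header[selected_y], y, x_labels, X

# The three-response example from section 1 (three wavelengths shown).
example = (
    "Sample name\tOC\tpH\tN\t700\t702\t2500\n"
    "S1\t2.56\t6.5\t1.3\t0.35\t0.37\t0.67\n"
    "S2\t1.35\t7.3\t1.8\t0.32\t0.33\t0.62\n"
)
labels, y_name, y, x_labels, X = split_parles_table(example, total_y=3, selected_y=2)
```

Calling the helper with total_y=1 on the same text would place the pH and N columns at the start of X, which is the kind of mistake that shows up as badly plotted spectra.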

If you select (ii) then the files to be merged need to be:


- in a single directory which you specify in the 'Directory with files to join' control (the
best thing to do is to copy the file path rather than use the browse button), and
- of the same type, i.e. with the same file extension protocol, which you specify in the
'File extension' control. For example text files will have a '.txt' extension. Note that the
files should be in ASCII format.

You may then run the program using the 'IMPORT DATA FOR MODELLING' button.
Once the software has run, a sample of the merged spectra will be displayed. This may
take some time depending on the number of files that you have. If the sample spectra do
not appear to plot properly, then an error has occurred and you should check that you
have the correct directory or that you have the correct file extension.
The merged file may be saved by checking the 'SAVE MERGED FILE' control or it
may be further analysed in ParLeS (see below).
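
For readers who want to check a merge by hand, the idea can be sketched in Python. This is an illustration under stated assumptions, not the ParLeS implementation: each file is assumed to hold two columns (x, y) with no header, all files are assumed to share the same x-axis, and the file name (without extension) is used as the sample label. The names read_xy and merge_xy_directory are hypothetical.

```python
import os
import tempfile

def read_xy(path):
    """Read one two-column (x, y) ASCII file; tabs, spaces or commas accepted."""
    xs, ys = [], []
    with open(path, encoding="ascii") as handle:
        for line in handle:
            parts = line.replace(",", " ").split()
            if len(parts) >= 2:
                xs.append(parts[0])
                ys.append(parts[1])
    return xs, ys

def merge_xy_directory(directory, extension=".txt"):
    """Merge every matching file into one header row plus one row per file."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(extension))
    if not names:
        raise ValueError("no files with extension " + extension + " in " + directory)
    header, rows = None, []
    for name in names:
        xs, ys = read_xy(os.path.join(directory, name))
        if header is None:
            header = ["Label"] + xs            # x-values become the column headers
        elif xs != header[1:]:
            raise ValueError(name + " has a different x-axis from the first file")
        rows.append([os.path.splitext(name)[0]] + ys)
    return [header] + rows

# Small demonstration with two temporary files.
with tempfile.TemporaryDirectory() as folder:
    for name, values in (("S1", ("0.35", "0.37")), ("S2", ("0.32", "0.33"))):
        with open(os.path.join(folder, name + ".txt"), "w", encoding="ascii") as f:
            f.write("700\t" + values[0] + "\n702\t" + values[1] + "\n")
    merged = merge_xy_directory(folder)
```

The merged table here has no response column; the exact layout of the file that ParLeS saves is not specified in these instructions.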

b. Import data for prediction


In this tab you may select a file either (i) to predict unknowns or (ii) to test your
models with independent test data. Refer to the file format instructions in 1 above.

For the prediction of 'unknowns' the file requires a column of zeroes replacing the y
variable. In this case your 'Total number of y variables' will be '1' and the 'Select y
variable for prediction' will also be '1'.

For testing your models with independent test data:


- Using the numeric control 'Total number of y variables', select the total number of
response variables in your test file.
- Using the numeric control 'Select y variable for modelling', select the response
variable for which you want to test your model.
- Press the 'IMPORT DATA FOR MODELLING' button. If your data file is correctly
formatted and you have correctly identified the total number of y-variables and the
y-variable you want to test, you should see a sample of your data in the windows
labelled 'y variables' (this is 'All test y variables' in earlier versions of ParLeS), 'Labels',
'Selected y', 'X variables' and ' '. The graph will show a sample of the spectra used for the
predictions.
- If you cannot see the correct data in the windows or you cannot see the spectra then
before you proceed you will need to check your data file and if necessary remake the
file by carefully following the above instructions.

If the file format is incorrect or you have incorrectly identified the total number of y
variables in your data file then you will be able to see this in the sample data windows
and more than likely your spectra will not plot correctly.

3. Data transformations, preprocessing and pretreatments


The 'Data Manipulations' tab (called 'Preprocessing' in earlier versions of ParLeS) can
be used to transform, preprocess and pretreat your spectra.

From the drop-down menus select the desired combination of transformation,
preprocessing and pretreatment to apply. You can test any combination of methods as
long as you understand what they do and you carefully follow the instructions.

Using the drop-down menus you can perform the following transformations and
preprocessing:
- Data transformation - transform diffuse reflectance (R) data to Log(1/R) or to
Kubelka-Munk units, K/S = (1-R)^2/(2R). You may also transform from Log(1/R) to R.
- Light scatter and baseline corrections - correct data for light scattering effects,
etc. using Multiplicative Signal Correction (MSC), Standard Normal Variate (SNV),
SNV with quadratic de-trending, wavelet de-trending or SNV with wavelet de-trending.
The wavelet de-trending level specifies the number of levels of the wavelet
decomposition, which is approximately (1 - trend level)*log2(Ls), where Ls is the signal
length. When the trend level is zero, the signal trend is equal to zero and the de-trended
signal is identical to the input signal. Wavelet de-trending may be thought of as a form
of baseline correction.
- De-noising/Smoothing - de-noise data using a Median filter, the Savitzky-Golay
filter or Wavelet de-noising. For the Median filter, select the rank to be used in the
filtering. For the Savitzky-Golay filter, first select the number of data points to fit the
curve and then the order of the polynomial you wish to fit. For the Wavelet de-noising,
select the desired wavelet scale for de-noising. ParLeS uses a Daubechies wavelet with 4
vanishing moments.
- Differentiation - correct the data for baseline, particle size, etc. using first or
second derivatives together with the desired sampling interval.
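
Some of these operations are simple enough to write out, which may help when checking what a given menu choice does to a spectrum. The sketch below is not taken from ParLeS. It assumes that Log(1/R) is a base-10 logarithm, that SNV uses the sample standard deviation of each spectrum, and that the "sampling interval" of the derivative is a gap counted in data points; all function names are hypothetical.

```python
import math
import statistics

def to_log_inverse_r(spectrum):
    """Reflectance R -> log10(1/R) (base 10 is an assumption)."""
    return [math.log10(1.0 / r) for r in spectrum]

def to_kubelka_munk(spectrum):
    """Reflectance R -> K/S = (1 - R)^2 / (2R)."""
    return [(1.0 - r) ** 2 / (2.0 * r) for r in spectrum]

def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own mean and SD."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(value - mean) / sd for value in spectrum]

def first_derivative(spectrum, interval=1):
    """Finite-difference first derivative over a gap of `interval` points."""
    return [spectrum[i + interval] - spectrum[i]
            for i in range(len(spectrum) - interval)]

# One of the example spectra from section 1 (three wavelengths shown).
reflectance = [0.35, 0.37, 0.67]
absorbance = to_log_inverse_r(reflectance)
corrected = snv(absorbance)
slope = first_derivative(corrected)
```

Each function works on a single spectrum (one row of the X-data), so a whole data set is processed by applying it row by row.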
The software also offers a number of methods for pretreating the predictor data. Using
the drop-down menu you can select which data pretreatment (or enhancement) to use
before you move onto the multivariate modelling. The choices include:
- Mean centre,
- Variance scale,
- Mean centre & variance scale

NOTE: it is common practice, although not imperative, to 'Mean Centre' your data
before PCA and PLSR.
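
The three pretreatments act on the columns of the X-data (one column per wavelength). A minimal sketch, assuming that "variance scale" means dividing each column by its sample standard deviation; pretreat is a hypothetical name and this is not ParLeS code.

```python
import statistics

def pretreat(X, mean_centre=True, variance_scale=False):
    """Pretreat the columns of X; returns the new X plus the column means and SDs."""
    columns = list(zip(*X))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]   # a constant column gives SD = 0
    treated = []
    for row in X:
        new_row = []
        for value, mean, sd in zip(row, means, sds):
            if mean_centre:
                value -= mean          # subtract the column mean
            if variance_scale:
                value /= sd            # divide by the column standard deviation
            new_row.append(value)
        treated.append(new_row)
    return treated, means, sds

X = [[1.0, 10.0], [3.0, 30.0]]
centred, means, sds = pretreat(X)                                   # 'Mean centre'
autoscaled, _, _ = pretreat(X, mean_centre=True, variance_scale=True)
```

After mean centring and variance scaling, both columns of the example carry the same values even though their original ranges differ by a factor of ten, which is the purpose of variance scaling.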

Once the particular combination is selected, press the 'RUN SELECTION' button. The
first graph will show your raw data and the graph on the bottom part of the ParLeS
window will show you the combined transformed, preprocessed and pre-treated spectra.
You may investigate the effect of each algorithm separately by selecting it and then
pressing the 'RUN SELECTION' button.

For example if you have diffuse reflectance data you may choose to transform these to
Log(1/R); correct for light scattering effects using the MSC; de-noise your signal using
the wavelet de-noising at scale = 2; take the first derivative and mean centre your data
before you perform PCA or PLSR.

You can save the manipulated data to a file using the 'SAVE MANIPULATED DATA'
control (called 'SAVE PREPROCESSED DATA' in earlier versions of ParLeS). The saved
file will be a tab delimited text file.

4. Principal Components Analysis (PCA)


ParLeS implements an iterative PCA algorithm based on the NIPALS algorithm
described in Martens & Naes (1989).
In the PCA tab, using the numeric control or slide bar you need to select the maximum
number of PCA components to calculate.

The progress bar shows the component currently iterating.
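
For orientation, a textbook form of NIPALS is sketched below. It extracts one component at a time, which is why a per-component progress indicator is natural. This is an illustration only and not the ParLeS source: the starting vector, the convergence test and the internal mean centring are all assumptions, and nipals_pca is a hypothetical name.

```python
import math

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Return (scores, loadings, explained %) for column-mean-centred X."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    E = [[row[j] - means[j] for j in range(m)] for row in X]    # residual matrix
    total_ss = sum(v * v for row in E for v in row)
    scores, loadings, explained = [], [], []
    for _ in range(n_components):
        # Start from the residual column with the largest sum of squares.
        col_ss = [sum(row[j] ** 2 for row in E) for j in range(m)]
        j0 = max(range(m), key=col_ss.__getitem__)
        if col_ss[j0] <= 1e-12 * total_ss:
            break                                   # nothing left to explain
        t = [row[j0] for row in E]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loadings: project the residuals onto the scores, then normalise.
            load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in load))
            load = [v / norm for v in load]
            # Scores: project the residuals onto the loadings.
            t_new = [sum(E[i][j] * load[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change < tol * sum(v * v for v in t):
                break
        # Deflate: remove the part of X explained by this component.
        E = [[E[i][j] - t[i] * load[j] for j in range(m)] for i in range(n)]
        scores.append(t)
        loadings.append(load)
        explained.append(100.0 * sum(v * v for v in t) / total_ss)
    return scores, loadings, explained

# Two perfectly correlated columns: one component explains everything.
scores, loadings, explained = nipals_pca([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], 2)
```

In the example the second requested component is skipped because no variation remains after the first.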


The results from the PCA are displayed in a number of graphics that include:
- the loadings vs. wavelength/wavenumber plot. Using the numeric controls on this plot
you may select to view the loading for each principal component separately or all
loadings simultaneously
- the scores vs. scores plot. Using the numeric controls on this plot you may select the
scores for the principal component that you want to plot.
- the loadings vs. loadings plot. Using the numeric controls on this plot you may select
the loadings for the principal component that you want to plot.
- the percent variation of the predictor data that is explained by each component

Note that in ParLeS version 3.1 you can interact with the scores vs. scores and loadings
vs. loadings plot. Glide your mouse over the data points and click on the point that you
want to identify. The point will change colour and its label will be briefly displayed on
the graph.

The PCA scores and loadings can be saved to tab delimited text files by checking the
'SAVE PCA SCORES & LOADINGS' check box. Two separate dialogues will appear
once you check to save: the first will ask you to give a name for the scores file and the
second will ask you to provide a name for the loadings file.

5. Jackknife cross validation


The cross validation tab can be used to help determine the optimal number of PLSR
factors to model. The results are shown in a number of graphics showing appropriate
assessment statistics.
In the PLSR Cross validation tab, using the numeric slide bar or control select the
maximum number of factors for the leave-one-out cross validation.

With large data sets it may be too computationally expensive to use leave-one-out, so
you could, for example, use leave-ten-out. To do this, type the number of samples 'n' to
leave out. To help you decide, the total number of samples in your dataset is given in
the numeric indicator 'No. Samples'.
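
The effect of the leave-n-out setting on the amount of computation can be seen in a short sketch. It is a generic illustration, not ParLeS code: the held-out groups are assumed to be consecutive blocks of samples (how ParLeS forms the groups is not stated here), and a trivial stand-in model that predicts the training mean replaces PLSR so that the example stays short. All names are hypothetical.

```python
import math

def leave_n_out_folds(n_samples, n_out=1):
    """Consecutive blocks of n_out sample indices; the last block may be smaller."""
    indices = list(range(n_samples))
    return [indices[i:i + n_out] for i in range(0, n_samples, n_out)]

def cross_validate(X, y, fit, predict, n_out=1):
    """Return the cross-validated predictions and their RMSE."""
    predictions = [None] * len(y)
    for held_out in leave_n_out_folds(len(y), n_out):
        keep = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in keep], [y[i] for i in keep])   # one fit per fold
        for i in held_out:
            predictions[i] = predict(model, X[i])
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predictions, y)) / len(y))
    return predictions, rmse

# Stand-in model (NOT PLSR): always predicts the mean of the training y-values.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x_row):
    return model

X = [[0.1], [0.2], [0.3]]
y = [1.0, 2.0, 3.0]
cv_predictions, cv_rmse = cross_validate(X, y, fit_mean, predict_mean, n_out=1)
```

With 1000 samples, leave-one-out requires 1000 model fits for each number of factors, whereas leave-ten-out requires 100.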

To start the cross validation, press the 'RUN X-VAL' button. The progress bar indicates
how much of the data has been cross validated.
The results of the cross validation are displayed in the following graphics:
- the root mean squared error of cross validation (RMSE) vs. the number of factors
- R2 and Q2 statistics vs. the number of factors
- the Akaike Information Criterion (AIC) vs. the number of factors. Note that the AIC
penalises additional factors and so favours parsimonious models.
- the observed vs. cross validation predictions for a selected number of factors, where
the user may select the cross-validated predictions to plot using the numeric control
'Select X-Val model to plot'. The fitted line and equation are also given. For this cross
validated model, various assessment statistics are given: R2, R2adjusted, RMSE, mean
error (ME), the standard deviation of the error (SDE) and the RPD.
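
Several of these statistics can be reproduced from a saved file of observed and predicted values. The definitions below are common ones and are assumptions here, since these instructions do not define them: the error is taken as observed minus predicted, R2 as the squared correlation between observed and predicted values, and the RPD as the standard deviation of the observed values divided by the RMSE. The function name is hypothetical.

```python
import math
import statistics

def assessment_statistics(observed, predicted):
    """RMSE, ME, SDE, R2 and RPD under the definitions assumed above."""
    errors = [o - p for o, p in zip(observed, predicted)]
    rmse = math.sqrt(statistics.fmean([e * e for e in errors]))
    me = statistics.fmean(errors)                 # bias
    sde = statistics.stdev(errors)                # spread of the errors about the bias
    mo, mp = statistics.fmean(observed), statistics.fmean(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    r2 = sxy * sxy / (sxx * syy)                  # squared Pearson correlation
    rpd = statistics.stdev(observed) / rmse
    return {"RMSE": rmse, "ME": me, "SDE": sde, "R2": r2, "RPD": rpd}

# Predictions that are perfectly correlated with the observations but biased by 0.5.
stats = assessment_statistics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

The example shows why more than one statistic is reported: R2 is 1 because the predictions are perfectly correlated with the observations, yet the RMSE and ME reveal a constant bias.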

The cross validation results can be saved by checking the 'SAVE X-VAL RESULTS'
check box. Two separate dialogs will appear once you check to save: the first will ask
you to give a name for the assessment statistics file and the second will ask you to
provide a name for the observed vs. cross validation predictions for the selected number
of factors.
Note: if you do not need to cross-validate, proceed to the PLSR modelling tab.

6. Partial Least Squares Regression (PLSR)


The orthogonalised PLSR 1 algorithm implemented in ParLeS is that described by
Martens & Naes (1989). In the PLSR Modelling tab you may select the optimal number
of factors to model, using the slide bar or numerical indicator.
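
As a point of reference, the orthogonalised PLS1 algorithm in its usual textbook form can be sketched as follows. This is an illustration rather than the ParLeS source: mean centring of X and y is done inside the function as an assumption, the regression coefficients are recovered by probing the fitted (linear) predictor rather than by the matrix formula, and all function names are hypothetical.

```python
import math

def pls1_fit(X, y, n_factors):
    """Orthogonalised PLS1 with mean centring; returns the model as a dict."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]    # X residuals
    f = [value - y_mean for value in y]                          # y residuals
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]    # X'y
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]                                # loading weights
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]    # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
        qa = sum(f[i] * t[i] for i in range(n)) / tt             # y loading
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - qa * t[i] for i in range(n)]
        W.append(w)
        P.append(p)
        q.append(qa)
    return {"x_mean": x_mean, "y_mean": y_mean, "W": W, "P": P, "q": q}

def pls1_predict(model, x_row):
    """Predict one sample by repeating the deflation steps on its spectrum."""
    e = [v - mu for v, mu in zip(x_row, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, qa in zip(model["W"], model["P"], model["q"]):
        t = sum(a * b for a, b in zip(e, w))
        y_hat += qa * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat

def pls1_coefficients(model):
    """Recover b and b0 by probing the linear predictor with unit steps."""
    x_mean, y_mean = model["x_mean"], model["y_mean"]
    b = []
    for j in range(len(x_mean)):
        probe = list(x_mean)
        probe[j] += 1.0
        b.append(pls1_predict(model, probe) - y_mean)
    b0 = y_mean - sum(bj * mu for bj, mu in zip(b, x_mean))
    return b, b0

# Toy data generated from y = 2*x1 - x2 + 1, modelled with two factors.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
y = [1.0, 4.0, 2.0, 6.0]
model = pls1_fit(X, y, n_factors=2)
b, b0 = pls1_coefficients(model)
```

With as many factors as predictors the toy model reproduces the generating equation; with real spectra far fewer factors than wavelengths are used, which is what the cross validation helps to choose.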

Once the number of factors to model is selected, run the software using the 'RUN
PLSR MODELLING' button. Results from the PLSR modelling are shown in a number
of graphs:
- Scores vs. scores plot
- Scores vs. y plot
- Regression coefficients (B) vs. wavelength/wavenumber plot
- Spectral loadings (P) and loading weights (W) vs. wavelength/wavenumber plot
- Variable importance for projection (VIP) vs. wavelength/wavenumber plot
- Sorted VIP and wavelength/wavenumber table
- the percent variation of each of the predictor and response data that is explained by
each factor in the PLSR model

Note that in ParLeS version 3.1 you can interact with the scores vs. scores; scores vs. y
plot; regression coefficients vs. wavelength/wavenumber plot and the VIP vs.
wavelength/wavenumber plot. Glide your mouse over the data points and click on the
point that you want to identify. The point will change colour and its label will be briefly
displayed on the graph.
The PLSR model (scores, regression coefficients (b), the intercept (b0), spectral
loadings and loading weights) as well as the VIP results can be saved to tab delimited
text files by checking the 'SAVE SCORES; b, b0, p, w; and VIP' check box. Three
separate dialogues will appear once you check to save: the first will ask you to give a
name for the PLSR scores file; the second will ask you to provide a name for the
regression coefficients file; and the third for the VIP results file.

7. Prediction
To make PLSR predictions, press 'RUN PREDICTIONS'. The predictions use the model
selected in the 'PLSR Model' tab (see 6. above). The program will run and the results
and assessment statistics will be displayed.

The results from the PLSR predictions are displayed in a number of graphics and
assessed using various statistics:
- a sample of the spectra used for predictions
- the predicted values
- when using a test data set, the residuals (observed - predicted)
- when using a test data set, the observed vs. predicted and the fitted line, also showing
its equation
- the following assessment statistics: R2, R2adjusted, RMSE and confidence intervals,
mean error (ME), the standard deviation of the error (SDE) and the RPD
- a histogram of the predicted values and their descriptive statistics

The predictions can be saved to a file using the 'SAVE PREDICTIONS' check-box.

8. Bootstrap aggregation-PLSR (bagging-PLSR)


To make the bagging-PLSR predictions first you need to select the number of bootstraps
to use for bagging (the default is 30 bootstraps) as well as the number of PLSR factors
to use. Then press the 'RUN BAGGING-PLSR' button. The program will run and results
and assessment statistics will be displayed.
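
The idea behind bagging can be sketched generically. The example below is not ParLeS code: a straight-line fit on a single predictor stands in for the PLSR model, the interval is taken as the mean plus or minus 1.96 standard deviations of the bootstrap predictions (how ParLeS derives its 95% confidence intervals is not stated here), and all names are hypothetical.

```python
import random
import statistics

def fit_line(x, y):
    """Stand-in model (NOT PLSR): least-squares line y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def bagging(x, y, x_new, n_bootstraps=30, seed=0):
    """Return (mean, lower, upper) per unknown and the out-of-bag predictions."""
    rng = random.Random(seed)
    n = len(y)
    new_preds = [[] for _ in x_new]       # bootstrap predictions for each unknown
    oob_preds = [[] for _ in range(n)]    # predictions made while a sample was out of bag
    for _ in range(n_bootstraps):
        # Draw n samples with replacement from the calibration data.
        chosen = [rng.randrange(n) for _ in range(n)]
        while len({x[i] for i in chosen}) < 2:     # a line needs two distinct x-values
            chosen = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line([x[i] for i in chosen], [y[i] for i in chosen])
        for k, value in enumerate(x_new):
            new_preds[k].append(intercept + slope * value)
        # Samples not drawn this time are "out of bag" and act as test data.
        for i in set(range(n)) - set(chosen):
            oob_preds[i].append(intercept + slope * x[i])
    summary = []
    for preds in new_preds:
        mean = statistics.fmean(preds)             # the bagged prediction
        half_width = 1.96 * statistics.stdev(preds)
        summary.append((mean, mean - half_width, mean + half_width))
    oob = [statistics.fmean(p) if p else None for p in oob_preds]
    return summary, oob

# Noise-free toy data, y = 2x + 1, so every bootstrap model agrees.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0 * v + 1.0 for v in x]
summary, oob = bagging(x, y, x_new=[10.0])
```

With real, noisy data the bootstrap models differ from one another, and the spread of their predictions is what gives each bagged prediction its interval.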

The results from bagging-PLSR are displayed in a number of graphics and assessed
using various statistics:
- the observed vs. predicted from the bootstraps
- the out-of-bag statistics, which may also be used to evaluate the models
- a plot of the predicted values and their 95% confidence intervals
- the descriptive statistics of the predictions
- the observed vs. predicted and the fitted line, also showing its equation
- the following assessment statistics: R2, R2adjusted, RMSE and confidence intervals,
mean error (ME), the standard deviation of the error (SDE) and the RPD

The bagging-PLSR predictions and confidence intervals can be saved to a file using the
'SAVE BAGGED' check-box.

Once finished, you can exit ParLeS using the 'EXIT PROGRAM' button.

9. Errors
If the file format is incorrect, the software will either not run or run incorrectly.

10. Conditions of use


Please refer to the ParLeS license agreement.

You may not use the software for commercial purposes, unless you have obtained
permission, in writing, from Raphael VISCARRA ROSSEL (r.viscarra-
rossel@usyd.edu.au or tel. +61 413 326 457)

If ParLeS is used in research you agree to cite the following reference:

Viscarra Rossel, R.A. 2007. ParLeS: Software for chemometric analysis of
spectroscopic data. Chemometrics and Intelligent Laboratory Systems (in press).
doi: 10.1016/j.chemolab.2007.06.006

For more up-to-date citation information you may also visit:
http://www.usyd.edu.au/su/agric/acpa/people/rvrossel/Publications.htm

I will appreciate comments/suggestions for further improvements to ParLeS. In essence,
ParLeS is still under development.
11. Disclaimer
I have taken all care to ensure that ParLeS is operationally sound. However, it is
supplied 'as is' and no warranty is provided or implied. I assume no liability for
damages, direct or consequential that may result from its use.

2007 R. VISCARRA ROSSEL, The University of Sydney
