Open Source GIS, DATAMINING and Statistics: Dr. V.V. Venkata Ramana


 The COIN-OR initiative: The Common Optimisation INterface for
Operations Research (COIN-OR) is a broad initiative to advance
open source for the operations research community.
 Its main thrust is to build an open-source repository of OR
software, with the expectation of reaping community benefits
analogous to those seen in other open-source projects.
 A repository cannot be created and sustained without a community.
 To that end, COIN-OR serves to educate, to promote awareness, to
provoke discussions, to encourage developers and users, and to
otherwise build an open-source community for OR.
The public initiative was spearheaded by the IBM Research Division
and kicked off with a conference presentation, the first organisational
meeting, and the launch of the project's Web site
(http://www.coin-or.org/) at the 17th International Symposium on
Mathematical Programming in August 2000.
The software repository on the COIN-OR Web site was seeded with
two state-of-the-art projects opened by IBM Research under the OSI-
certified IBM Public License (IPL), and with two newly initiated projects.
COIN-OR is intended for all aspects of OR; however, the four initial
contributions featured tools for large-scale mixed-integer linear
programming and combinatorial optimisation.
Operations Research
OpenForecast: I have not used this, but it looks promising.
FLOPC++: An algebraic modelling language implemented as a
C++ class library.
Zimpl: a language to translate the mathematical model of a
problem into a linear or (mixed-) integer mathematical
program expressed in .lp or .mps file format.
Cliquer: routines for clique searching.

Mathematical/Statistical applications
R: statistics, graphics, and more. Similar to S-PLUS (both are
based on the language S).
Maxima: computer algebra, similar to Mathematica or Maple.
Octave: matrix-based mathematics, similar to and "mostly
compatible" with MATLAB.
GNU Scientific Library: C library for mathematical functions,
including random variables, statistics, linear algebra, etc.
PSPP: similar to SPSS. It is not too far along in its
development, however.
Open source
•Apophenia - a library of statistical functions for C, on the same
level of abstraction as most stats packages.
•Bayesian Filtering Library
•DAP - A free replacement for SAS
•gretl - Gnu Regression, Econometrics and Time-series Library
•JMulTi
•OpenBUGS
•OpenEpi - A web-based, open source, operating-system-independent
series of programs for use in epidemiology and statistics
•Ploticus - software for generating a variety of graphs from raw
data
Open source
R Commander - GUI interface for R
Shogun, an open source Large Scale Machine Learning
toolbox that provides several SVM (Support Vector
Machine) implementations (like libSVM, SVMlight) under a
common framework and interfaces to Octave, Matlab,
Python, R
SOCR
Statistical Lab - R-based and focusing on educational
purposes
WinBUGS
Xlisp-stat
Public domain
CSPro
Epi Info
X-12-ARIMA
Freeware
ADMB
BV4.1
GeoDA
Winpepi - package of statistical programs for epidemiologists
WinIDAMS
Zaitun Time Series
Add-ons
Analyse-it - add-on to Microsoft Excel for statistical analysis
SigmaXL - add-on to Microsoft Excel for graphical and statistical analysis
SPC XL - add-on to Microsoft Excel for general statistics
SUDAAN - add-on to SAS and SPSS for statistical surveys
•Algebra (52)
•Algebraic Geometry (9)
•Calculators (64)
•Calculus (9)
•Chaos and Fractals (64)
•Coding Theory (4)
•Combinatorics (91)
•Computational Geometry (22)
•Differential Equations (7)
•Educational (100)
•Finite Element Analysis (82)
•Geometry (27)
•Graphing (61)
•Logic (28)
•Mathematical Economics and Financial Mathematics (4)
•Number Theory (56)
•Numerical Analysis (39)
•Simulators (3)
•Statistics (185)
•Topology (11)
•Typesetting (20)
 OpenStat is a general-purpose statistics package that you can
download and install for free. It was originally written as an aid in
teaching statistics to students enrolled in a social science
program. It has since been expanded to provide procedures useful in a wide
variety of disciplines. It is not a "finished" product; it is revised several
times a year. The version is denoted by the month, day and year of the revision.
 For example, 1.27.08 would indicate a revision released on January 27,
2008. The program is NOT to be used for commercial purposes, and
no warranty is implied. Most users check results by hand or
compare them with commercial packages to which they have access, to ensure
the results are correct.
 You can join the OpenStat discussion group hosted on Yahoo Groups
at http://tech.groups.yahoo.com/group/OpenStat. That
site gives users an opportunity to exchange ideas and discuss problems
they encounter in using OpenStat. From time to time there are
tutorials that may help you in your research or instructional activities.
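The month.day.year version convention described above can be decoded mechanically. The following is a minimal sketch in Python; the function name `openstat_release_date` is hypothetical and is not part of OpenStat.

```python
from datetime import datetime


def openstat_release_date(version):
    """Interpret an OpenStat-style version string (month.day.year) as a date.

    Hypothetical helper for illustration; OpenStat itself does not ship it.
    """
    return datetime.strptime(version, "%m.%d.%y").date()


release = openstat_release_date("1.27.08")
print(release.isoformat())  # 2008-01-27
```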
Web Mapping
 deegree
 geomajas
 GeoMoose
 GeoServer
 Mapbender
 MapBuilder
 MapFish
 MapGuide Open Source
 MapServer
 OpenLayers
Desktop Applications
 GRASS GIS
 Marble
 QGIS
Geospatial Libraries
 FDO
 GDAL/OGR
 GEOS
 GeoTools
 OSSIM
 PostGIS
Metadata Catalogs
 GeoNetwork
GIS Packages
Desktop
• Open Jump
• SVGIS
• Quantum GIS
Server
• Map Server
• Geo Server
•GRASS
 Mapserver
 Geo Server
 Mapguide
 Open Jump
 GRASS
 QGIS
MapServer Suite Products
Since the MapServer 6.0 release, the MapServer project includes a
suite of Open Source products that provide a full set of online
mapping tools to the community.
The MapServer Project Steering Committee maintains all of these
products under the single umbrella of MapServer.
MapServer Core
The MapServer core is written in C and is consistently one of the fastest and most configurable
online mapping engines in the world.
•Documentation home :
http://mapserver.org/documentation.html#documentation
•Download : http://mapserver.org/download.html#source
•Github home : https://github.com/mapserver/mapserver/
 MapCache
As of MapServer 6.0, MapServer also includes powerful tile caching
capabilities through the MapCache project.
 Documentation home :
http://mapserver.org/mapcache/index.html#mapcache
 Download : http://mapserver.org/download.html#source
 Github home : https://github.com/mapserver/mapcache/
 TinyOWS
As of MapServer 6.0, MapServer also includes the much needed
ability to perform transactional requests (online editing of
features) through the WFS specification, using the TinyOWS
project.
 Documentation home :
http://mapserver.org/tinyows/index.html#tinyows
 Download : http://mapserver.org/download.html#source
The basic architecture of MapServer applications
MapServer is a popular Open Source project whose purpose is to
display dynamic spatial maps over the Internet.
 Some of its major features include:
 support for display and querying of hundreds of raster,
vector, and database formats
 ability to run on various operating systems (Windows,
Linux, Mac OS X, etc.)
 support for popular scripting languages and development
environments (PHP, Python, Perl, Ruby, Java, .NET)
 on-the-fly projections
 high quality rendering
 fully customizable application output
 many ready-to-use Open Source application environments
In its most basic form, MapServer is a CGI
program that sits inactive on your Web server.
 When a request is sent to MapServer, it uses
information passed in the request URL and the
Mapfile to create an image of the requested
map.
 The request may also return images for
legends, scale bars, reference maps, and
values passed as CGI variables.
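To make the request mechanism concrete, here is a minimal Python sketch that assembles a MapServer CGI request URL. It assumes the standard CGI parameters `map`, `mode`, `layer` and `mapext`. The server address, Mapfile path and layer name are hypothetical placeholders, not values taken from these slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def mapserv_url(base_url, mapfile, layers, extent=None, mode="map"):
    """Build a MapServer CGI request URL (illustrative sketch only)."""
    params = [("map", mapfile), ("mode", mode)]
    # One "layer" parameter per layer that should be switched on.
    params += [("layer", name) for name in layers]
    if extent is not None:
        # mapext is "minx miny maxx maxy", separated by spaces.
        params.append(("mapext", " ".join(str(v) for v in extent)))
    return base_url + "?" + urlencode(params)


# Hypothetical local installation and data.
url = mapserv_url(
    "http://127.0.0.1/cgi-bin/mapserv.exe",
    "C:/data/demo.map",
    ["countries"],
    extent=(-180, -90, 180, 90),
)
print(url)
```

Requesting such a URL from a browser or an `<img>` tag would return the rendered map image, provided the Mapfile and layer actually exist on the server.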
 MapServer can be extended and customized
through MapScript or templating
 It can be built to support many different vector and
raster input data formats, and it can generate a
multitude of output formats
 Most pre-compiled MapServer distributions contain
almost all of its features
 Map File - a structured text configuration file for your MapServer application.
It defines the area of your map and tells the MapServer program where your data
is and where to output images. It also defines your map layers, including
their data source, projections, and symbology. It must have a .map extension
or MapServer will not recognize it.
 Geographic Data - MapServer can use many geographic data source types.
The default format is the ESRI Shapefile format. Many other data formats can be
supported; this is discussed further in "Adding data to your site".
 HTML Pages - the interface between the user and MapServer. They normally
sit in the Web root. In its simplest form, MapServer can be called to place a
static map image on an HTML page. To make the map interactive, the image is
placed in an HTML form on a page.
 CGI programs are ‘stateless’: every request they get is new, and they don’t
remember anything about the last time that they were hit by your
application. For this reason, every time your application sends a request to
MapServer, it needs to pass context information (what layers are on, where
you are on the map, application mode, etc.) in hidden form variables or URL
variables.
A simple MapServer CGI application may include two HTML pages:
◦ Initialization File - uses a form with hidden variables to
send an initial query to the web server and MapServer. This
form could be placed on another page or be replaced by
passing the initialization information as variables in a URL.
◦ Template File - controls how the maps and legends output
by MapServer will appear in the browser. By referencing
MapServer CGI variables in the template HTML, you allow
MapServer to populate them with values related to the
current state of your application (e.g. map image name,
reference image name, map extent, etc.) as it creates the
HTML page for the browser to read. The template also
determines how the user can interact with the MapServer
application (browse, zoom, pan, query)
 MapServer CGI - The binary or executable file
that receives requests and returns images, data,
etc. It sits in the cgi-bin or scripts directory of
the web server. The Web server user must have
execute rights for the directory that it sits in, and
for security reasons, it should not be in the web
root. By default, this program is called mapserv
 Web/HTTP Server - serves up the HTML pages
when hit by the user’s browser. You need a
working Web (HTTP) server, such as Apache or
Microsoft Internet Information Server, on the
machine on which you are installing MapServer.
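Of the components listed above, the Map File is the one most often written by hand. As a rough illustration, the Python sketch below generates the text of a minimal Mapfile with a single polygon layer. The map name, extent, layer name and shapefile are hypothetical, and a real Mapfile usually carries more settings (projection, web and output paths) than shown here.

```python
def build_mapfile(name, extent, size, layer_name, shapefile):
    """Return the text of a minimal Mapfile with one polygon layer (sketch)."""
    minx, miny, maxx, maxy = extent
    width, height = size
    lines = [
        "MAP",
        f'  NAME "{name}"',
        f"  EXTENT {minx} {miny} {maxx} {maxy}",  # area covered by the map
        f"  SIZE {width} {height}",               # output image size in pixels
        "  IMAGETYPE PNG",
        "  LAYER",
        f'    NAME "{layer_name}"',
        "    TYPE POLYGON",
        "    STATUS ON",
        f'    DATA "{shapefile}"',                # data source for the layer
        "    CLASS",
        "      STYLE",
        "        COLOR 200 200 200",
        "        OUTLINECOLOR 0 0 0",
        "      END",
        "    END",
        "  END",
        "END",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example values.
mapfile_text = build_mapfile(
    "demo", (-180, -90, 180, 90), (600, 300), "countries", "countries.shp"
)
print(mapfile_text)
```

The resulting text would be saved with a .map extension, for example demo.map, so that MapServer recognizes it.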
 Download OSGeo4W:
http://download.osgeo.org/osgeo4w/osgeo4w-setup.exe
 Execute (double-click) the .exe
 Choose the “Advanced” install type
Express contains options for higher-level
packages such as MapServer, GRASS, and
uDig.
Advanced gives you full access to choosing
command-line tools and applications for
MapServer that are not included in the
Express install.
Click on the “Default” text beside the higher-level packages (such as
Web) to install all of Web’s sub-packages, or click on the “Skip” text
beside the sub-package (such as MapServer) to install that package and
all of its dependencies.
•Run the apache-install.bat script to install the Apache Service.
Note
You must run this script under the “OSGeo4W Shell”. This is usually
available as a shortcut on your desktop
An apache-uninstall.bat script is also available to remove the Apache
service installation.
•Start Apache from the OSGeo4W shell by running apache-restart.bat, then
navigate to http://127.0.0.1 in your browser
 You need a working and properly configured Web
(HTTP) server, such as Apache or Microsoft Internet
Information Server, on the machine on which you
are installing MapServer
 OSGeo4W contains Apache already, but you can
reconfigure things to use IIS if you need to.
Alternatively, MS4W can be used to install
MapServer on Windows.
http://mapserver.org/
 GeoServer is an open source server written in Java that allows users to share and edit geospatial
data. Designed for interoperability, it publishes data from
any major spatial data source using open standards.
 Being a community-driven project, GeoServer is
developed, tested, and supported by a diverse group of
individuals and organizations from around the world.
 GeoServer is the reference implementation of the Open
Geospatial Consortium (OGC) Web Feature Service (WFS)
and Web Coverage Service (WCS) standards, as well as a
high-performance, certified compliant Web Map Service
(WMS). GeoServer forms a core component of the
Geospatial Web.
 http://docs.geoserver.org/stable/en/user/
 There are many ways to install GeoServer on your system. This
section will discuss the various installation paths available.
 Windows
◦ Windows Installer
◦ Windows Binary
 Mac OS X
◦ Mac OS X Installer
◦ Mac OS X Binary
 Linux
◦ Debian
 Web archive (WAR)
◦ Java
◦ Installation
◦ Running
◦ Uninstallation
 Upgrading
◦ Upgrade to 2.2
 Fully compliant with the WMS (1.1.1 and 1.3), WFS (1.0 and 1.1, transactions and locking) and WCS
(1.0 and 1.1) specifications, as tested by the CITE conformance tests. GeoServer additionally
serves as Reference Implementation for WCS 1.1 and WFS 1.0 and 1.1
 Implements WPS 1.0 (OGC does not provide a test suite providing proof of compliance at the
time of writing)
 Easy-to-use web-based configuration tool - no need to touch long, complicated config files.
 Mature support for PostGIS, Shapefile, ArcSDE, DB2 and Oracle.
 VPF, MySQL, MapInfo, and Cascading WFS are also supported formats.
 Native Java support for GeoTIFF, GTOPO30, ArcGrid, WorldImages, ImageMosaics and Image
Pyramids
 Support for MrSID, ECW, JPEG2000, DTED, Erdas Imagine, and NITF through the GDAL ImageIO
Extension. Any format that GDAL supports can be added with a bit of coding.
 On-the-fly reprojection, for WMS and WFS, with an embedded EPSG database supporting
hundreds of projections by default.
 Web Map output as JPEG, GIF, PNG, PDF, SVG, KML, GeoRSS.
 Excellent Google Earth support, including advanced features like super overlays (vector and
raster), 2.5D extrudes, Time, advanced template options for pop-ups and titles, and styling
with SLD.
 Ability to 'publish' data to Google's geo crawlers, so data from GeoServer can be exposed on
Google Maps and Earth searches.
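Since WMS support is central to the feature list above, a short sketch of what a WMS 1.1.1 GetMap request looks like may help. The parameter names follow the OGC WMS 1.1.1 specification. The endpoint `http://localhost:8080/geoserver/wms` and the layer `topp:states` are assumptions for illustration only and may not match a given installation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def wms_getmap_url(endpoint, layer, bbox, width, height,
                   srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL (illustrative sketch only)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                               # empty = default style
        "SRS": srs,                                 # spatial reference system
        "BBOX": ",".join(str(v) for v in bbox),     # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and layer name.
wms_url = wms_getmap_url(
    "http://localhost:8080/geoserver/wms",
    "topp:states",
    bbox=(-125, 24, -66, 50),
    width=800,
    height=400,
)
print(wms_url)
```

Changing `image_format` to another supported output type would request the same map in a different encoding.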
AJAX based MapGuide Viewer
The AJAX Viewer provides map display and
interaction in almost any browser, including
Safari, without having to download a browser
plug-in. This viewer ensures that any user on any
platform can access designs and maps without
requiring a specific browser.
DWF based MapGuide Viewer
The DWF Viewer uses an ActiveX control to provide
map display and interaction on Windows systems
running Internet Explorer. This gives users powerful
yet lightweight viewing of maps, designs, and
related data. Use of DWF technology also provides
high quality printing and plotting, as well as
support for a “disconnected mode” that makes it
easy to take spatial data into the field.
Spatial Analysis and Reporting
MapGuide Open Source includes a full suite of
geospatial analysis capabilities – here, creating
buffer zones around a selected parcel.
MapGuide Maestro
MapGuide Maestro is a free application that can ease
the management of spatial data in MapGuide Open
Source. It is an open source GUI tool and client.
Web Based Site Admin Application

MapGuide Open Source includes a browser-based tool that
allows remote administration and configuration of servers.
Autodesk MapGuide Studio
It is designed to work with MapGuide Open
Source. It is a complete authoring application
that can be used to load and configure spatial
data sources, produce attractive thematic maps,
define the user interface elements present in the
viewer, and integrate application logic written in
PHP, ASP.NET, or JSP.
Google Earth as a MapGuide Client
MapGuide Open Source can use Google Earth as
a client by taking advantage of Google Earth's
Network Links feature and the MapGuide Web
APIs. Here we see parcel boundaries served
from a MapGuide Open Source web service and
delivered as a KML file to Google Earth for
display with other map data.
Java Unified Mapping Platform
 JUMP came first, but its development has slowed
down.
 Some enthusiastic users took the initiative to
continue JUMP development on their own; their
version is called OpenJUMP.
 OpenJUMP is open source GIS software written
in Java. It is based on JUMP GIS by Vivid
Solutions.
Capabilities :
 It is a vector GIS but can read rasters as well
 It works on Windows, Linux & Mac Platforms,
but should work on any operating system that
runs Java 1.4 or later
 It works with medium size databases
 It provides a GIS API with a flexible plugin
structure, so that new features are relatively
easy to develop around the sound mapping
platform
 It uses standards such as GML (Geography
Markup Language) and WKT (Well Known Text).
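For readers who have not seen WKT before, the Python sketch below writes a point and a polygon as Well Known Text strings. It is plain string formatting and does not use OpenJUMP's Java API. The helper names and coordinates are hypothetical.

```python
def point_wkt(x, y):
    """Format a single coordinate pair as a WKT point."""
    return f"POINT ({x} {y})"


def polygon_wkt(ring):
    """Format one exterior ring as a WKT polygon, closing it if needed."""
    coords = list(ring)
    if coords[0] != coords[-1]:
        coords.append(coords[0])  # a WKT ring must end where it starts
    body = ", ".join(f"{x} {y}" for x, y in coords)
    return f"POLYGON (({body}))"


parcel = polygon_wkt([(0, 0), (10, 0), (10, 5), (0, 5)])
print(point_wkt(30, 10))  # POINT (30 10)
print(parcel)             # POLYGON ((0 0, 10 0, 10 5, 0 5, 0 0))
```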
 JUMP: Made by Vivid Solutions; it is the
mother of all JUMPs
 OpenJUMP: from the French Project SIGLE
 DeeJUMP : Made by Lat/Lon
 SkyJUMP: Made by ISA Inc.
 PirolJUMP: Made by German PIROL Project
 KosMo: Developed by SAIG company
WORKBENCH COMPONENTS
Layer Rendering
INSPECTING FEATURES
Drawing polygons
Layer Validation
Layer Statistics
 – The latest OpenJUMP FOSS edition, available at
http://www.projet-sigle.org,
 – The PostgreSQL database and PostGIS extension,
available at http://www.postgresql.org/,
 – Deegree tools (web services and web browser),
available at http://www.deegree.org,
 – OpenOffice 2.0, available at
http://www.openoffice.org/,
 – Inkscape, available at http://www.inkscape.org/.
 And spatial data, available at http://www.projet-sigle.org
• Geographic Resources Analysis Support System,
commonly referred to as GRASS GIS, is a Geographic
Information System (GIS) used for data management,
image processing, graphics production, spatial
modelling, and visualization of many types of data.
• It is Free (Libre) Software/Open Source released under the
GNU General Public License (GPL) >= V2.
• GRASS GIS is an official project of the Open Source
Geospatial Foundation
• Originally developed by the U.S. Army Construction Engineering
Research Laboratories (USA-CERL, 1982-1995) as a tool for land
management and environmental planning by the military
• GRASS GIS has evolved into a powerful utility with a wide range of
uses in many different areas of application and scientific
research
• GRASS is currently used in academic and commercial settings
around the world, as well as by many governmental agencies
including NASA, NOAA, USDA, DLR, CSIRO, the National Park
Service, the U.S. Census Bureau, USGS, and many environmental
consulting companies
• The GRASS Development Team has grown into a multi-national
team consisting of developers at numerous locations
 Raster analysis: Automatic raster line and area to vector conversion,
Buffering of line structures, Cell and profile data query, Color table
modifications, Conversion to vector and point data format,
Correlation / covariance analysis, Expert system analysis, Map
algebra (map calculator), Interpolation for missing values,
Neighbourhood matrix analysis, Raster overlay with or without
weight, Reclassification of cell labels, Resampling (resolution),
Rescaling of cell values, Statistical cell analysis, Surface generation
from vector lines
 3D-Raster (voxel) analysis: 3D data import and export, 3D masks,
3D map algebra, 3D interpolation (IDW, Regularised Splines with
Tension), 3D Visualization (isosurfaces), Interface to Paraview and
POV-Ray visualization tools
 Vector analysis: Contour generation from raster surfaces (IDW,
Splines algorithm), Conversion to raster and point data format,
Digitizing (scanned raster image) with mouse, Reclassification of
vector labels, Superpositioning of vector layers
 Point data analysis: Delaunay triangulation, Surface interpolation
from spot heights, Thiessen polygons, Topographic analysis
(curvature, slope, aspect), LiDAR
 Image processing: Canonical component analysis (CCA), Color
composite generation, Edge detection, Frequency filtering (Fourier,
convolution matrices), Fourier and inverse Fourier transformation,
Histogram stretching, etc.
 DTM-Analysis: Contour generation, Cost / path analysis, Slope /
aspect analysis, Surface generation from spot heights or contours
 Geocoding: Geocoding of raster and vector maps including (LiDAR)
point clouds
 Visualization: 3D surfaces with 3D query (NVIZ), Color assignments,
Histogram presentation, Map overlay, Point data maps, Raster maps,
Vector maps, Zoom / unzoom function
 Map creation: Image maps, Postscript maps, HTML maps
 SQL-support: Database interfaces (DBF, SQLite, PostgreSQL, MySQL,
ODBC)
 Geostatistics: Interface to "R" (a statistical analysis environment),
Matlab, ...
 Furthermore: Erosion modelling, Landscape structure analysis,
Solution transport, Watershed analysis.
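Several of the capabilities above, such as surface interpolation from spot heights, rest on inverse distance weighting (IDW). The sketch below shows the basic idea in plain Python. It is not GRASS code and does not call any GRASS module. The sample heights and the function name `idw` are made up for illustration.

```python
import math


def idw(samples, x, y, power=2.0):
    """Estimate a value at (x, y) by inverse distance weighting.

    samples is a list of (x, y, value) tuples, e.g. surveyed spot heights.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return float(value)  # exactly on a sample point
        weight = 1.0 / distance ** power  # nearer samples count more
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical spot heights at the corners of a 10 x 10 square.
spot_heights = [(0, 0, 100.0), (10, 0, 120.0), (0, 10, 110.0), (10, 10, 130.0)]
center_height = idw(spot_heights, 5, 5)
near_corner = idw(spot_heights, 1, 1)
print(center_height, near_corner)
```

At the centre of the square all four samples are equally distant, so the estimate is their plain average. Closer to a corner, that corner's height dominates.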
The GRASS GIS 6 release introduced a new topological 2D/3D vector
engine and support for vector network analysis
Attributes are managed in a SQL-based DBMS (PostgreSQL,
MySQL, SQLite, ODBC, ...), by default in DBF format. A new
display manager has been implemented
The NVIZ visualization tool was enhanced to display 3D vector
data and voxel volumes. Messages are partially translated (i18n),
with support for FreeType fonts, including multibyte Asian
characters
New LOCATIONs can be auto-generated, e.g. by EPSG code
number, using a location wizard. GRASS GIS is integrated with
the GDAL/OGR libraries to support an extensive range of raster and
vector formats, including OGC-conformal Simple Features.
 GRASS GIS 7 is under development, with a first release
expected (snapshots are already available)
 It offers large data support, an improved topological 2D/3D
vector engine and much improved vector network analysis
 Attributes are managed by default in SQLite format.
 The display manager has been improved for usability
 The NVIZ visualization tool was completely rewritten. Image
processing has also been improved
 http://www.qgis.org/en/site/
 For an introduction to GIS:
http://www.qgis.org/en/docs/gentle_gis_introduction/index.html
QGIS Server
Desktop
QGIS Browser
On Android
Open Geospatial Consortium, Inc
www.opengeospatial.org
TOP 5
DATAMINING
OPEN SOURCE TOOLS
JHepWork
Data mining…
Data mining, a branch of computer science, is the process of
extracting patterns from large data sets by combining methods from
statistics and artificial intelligence with database management.
Data mining is seen as an increasingly important tool by modern
businesses to transform data into business intelligence, giving an
informational advantage.
It is currently used in a wide range of profiling practices, such as
marketing, surveillance, fraud detection, and scientific discovery.
The premier professional body in the field is the Association for Computing
Machinery's Special Interest Group on Knowledge discovery and Data Mining
(SIGKDD). Since 1989 they have hosted an annual international conference and
published its proceedings, and since 1999 have published a biannual academic
journal titled "SIGKDD Explorations".
Other Computer Science conferences on data mining include:
•DMIN – International Conference on Data Mining;

•DMKD – Research Issues on Data Mining and Knowledge Discovery;
•ECML-PKDD – European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases;
•ICDM – IEEE International Conference on Data Mining;
•MLDM – Machine Learning and Data Mining in Pattern Recognition;
•SDM – SIAM International Conference on Data Mining
•EDM – International Conference on Educational Data Mining
•ECDM – European Conference on Data Mining
•PAKDD – The annual Pacific-Asia Conference on Knowledge Discovery and Data
Mining
4 Tasks of DATAMINING
There are four kinds of tasks that are normally involved in data mining:
* Classification - the task of generalizing known structure to apply
to new data.
* Clustering - the task of finding groups and structures in the data that
are in some way similar, without using known structures in
the data.
* Association rule learning - looks for relationships between variables.
* Regression - aims to find a function that models the data with the
least error.
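As a small illustration of the regression task, the sketch below fits a straight line by ordinary least squares, which minimizes the sum of squared errors. It is a plain Python example with made-up data and is not taken from any of the tools discussed in this deck.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With noisy data the fitted line would no longer pass through every point, but it would still be the line with the smallest total squared error.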
Orange is a component-based data mining and machine learning software
suite that features a friendly yet powerful, fast and versatile visual
programming front-end for explorative data analysis and visualization,
and Python bindings and libraries for scripting.
It contains a complete set of components for data preprocessing, feature
scoring and filtering, modeling, model evaluation, and exploration
techniques.
It is written in C++ and Python, and its graphical user interface is based
on the cross-platform Qt framework.
http://www.ailab.si/orange
Hierarchical clustering
Explorative analysis and classification trees
Linear projections (FreeViz)
Visualizing misclassifications
Data selection
Data sampling
Self-organizing maps
Classification tree viewer
Manual construction of classification tree
Data exploration by construction of analysis
schema
Distance map
Visualization of interactions of genetic pathways
A network of music performers
Parallel coordinates plot
Manual discretization of continuous features
Logistic regression and naive Bayesian
nomograms
Survey plot
Intelligent visualization with radviz
RapidMiner, formerly called YALE (Yet Another Learning Environment), is an
environment for machine learning and data mining experiments that is
used for both research and real-world data mining tasks.
It enables experiments to be made up of a huge number of arbitrarily
nestable operators, which are detailed in XML files and are created with the
graphical user interface.
RapidMiner provides more than 500 operators for all main machine
learning procedures, and it also integrates the learning schemes and attribute
evaluators of the Weka learning environment.
It is available as a stand-alone tool for data analysis and as a data-mining
engine that can be integrated into your own products.
http://www.rapidminer.com/
Written in Java, Weka (Waikato Environment for Knowledge Analysis) is a
well-known suite of machine learning software that supports several typical
data mining tasks, particularly data preprocessing, clustering, classification,
regression, visualization, and feature selection.
Its techniques are based on the assumption that the data is available as a
single flat file or relation, where each data point is described by a fixed number of
attributes.
It provides access to SQL databases using Java Database Connectivity and
can process the result returned by a database query.
Its main user interface is the Explorer, but the same functionality can be
accessed from the command line or through the component-based Knowledge
Flow interface.
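The "single flat file with a fixed number of attributes" assumption is easiest to see in Weka's ARFF text format, which the slides do not describe but which Weka reads natively. The Python sketch below writes a tiny ARFF document. The relation name, attribute names and rows are invented for illustration, and the helper `to_arff` is hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either a type
    keyword such as "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"  # nominal attribute
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # Every data point must supply exactly one value per attribute.
        if len(row) != len(attributes):
            raise ValueError("every row needs one value per attribute")
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Invented example data.
arff_text = to_arff(
    "weather",
    [("temperature", "numeric"), ("humidity", "numeric"), ("play", ["yes", "no"])],
    [(29, 85, "no"), (21, 70, "yes")],
)
print(arff_text)
```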
http://www.cs.waikato.ac.nz/~ml/weka
JHepWork
Designed for scientists, engineers and students, jHepWork is a free and
open-source data-analysis framework that is created as an attempt to
make a data-analysis environment using open-source packages with a
comprehensible user interface and to create a tool competitive with
commercial programs.
It is specially made for interactive scientific plots in 2D and 3D and
contains numerical scientific libraries implemented in Java for
mathematical functions, random numbers, and other data mining
algorithms.
jHepWork is based on the high-level programming language Jython, but
Java coding can also be used to call jHepWork numerical and graphical
libraries.
JHepWork
http://jwork.org/jhepwork/
KNIME (Konstanz Information Miner) is a user friendly, intelligible, and
comprehensive open-source data integration, processing, analysis, and
exploration platform.
It gives users the ability to visually create data flows or pipelines, selectively
execute some or all analysis steps, and later study the results, models, and
interactive views.
It is written in Java and based on Eclipse, and it makes use of Eclipse's extension
mechanism to support plugins that provide additional functionality.
Through plugins, users can add modules for text, image, and time series
processing and the integration of various other open source projects, such as
R programming language, Weka, the Chemistry Development Kit, and
LibSVM.
http://www.knime.org/
An Example Data Analysis Workflow
An Example Data Analysis Workflow - VIEWS
An Example Data Analysis Workflow - Hiliting
Extending KNIME with New Data Types
Meta Nodes: Turning a Workflow into a Reusable Node
Integration of 3rd Party Packages
Statistics Packages
R Statistics
PSPP
 R is a language and environment for statistical computing and
graphics. It is a GNU project which is similar to the S language
and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers
and colleagues. R can be considered as a different
implementation of S.
 There are some important differences, but much code written for
S runs unaltered under R.
 R provides a wide variety of statistical (linear and nonlinear
modelling, classical statistical tests, time-series analysis,
classification, clustering, ...) and graphical techniques, and is
highly extensible. The S language is often the vehicle of choice
for research in statistical methodology, and R provides an Open
Source route to participation in that activity
 One of R's strengths is the ease with which well-designed
publication-quality plots can be produced, including
mathematical symbols and formulae where needed. Great
care has been taken over the defaults for the minor design
choices in graphics, but the user retains full control
 R is available as Free Software under the terms of the Free
Software Foundation's GNU General Public License in
source code form
 It compiles and runs on a wide variety of UNIX platforms
and similar systems (including FreeBSD and Linux),
Windows and MacOS.
 R is an integrated suite of software facilities for data
manipulation, calculation and graphical display. It
includes
 an effective data handling and storage facility,
 a suite of operators for calculations on arrays, in
particular matrices,
 a large, coherent, integrated collection of
intermediate tools for data analysis,
 graphical facilities for data analysis and display either
on-screen or on hardcopy, and
 a well-developed, simple and effective programming
language which includes conditionals, loops, user-
defined recursive functions and input and output
facilities.
The term "environment" is intended to
characterize it as a fully planned and
coherent system, rather than an incremental
accretion of very specific and inflexible tools,
as is frequently the case with other data
analysis software.
 R, like S, is designed around a true computer
language, and it allows users to add additional
functionality by defining new functions
 Much of the system is itself written in the R dialect of
S, which makes it easy for users to follow the
algorithmic choices made
 For computationally-intensive tasks, C, C++ and
Fortran code can be linked and called at run time.
Advanced users can write C code to manipulate R
objects directly.
 Many users think of R as a statistics system, but it is better described as an
environment within which statistical techniques are
implemented. R can be extended (easily) via packages
 There are about eight packages supplied with the R
distribution, and many more are available through the
CRAN family of Internet sites, covering a very wide range
of modern statistics.
The Comprehensive R Archive Network is available at the following
URLs; please choose a location close to you.
0-Cloud
http://cran.rstudio.com/
Rstudio, automatic redirection to servers worldwide
Argentina
http://mirror.fcaglp.unlp.edu.ar/CRAN/
Universidad Nacional de La Plata
http://r.mirror.mendoza-conicet.gob.ar/
CONICET Mendoza
Australia
http://cran.csiro.au/
CSIRO
http://cran.ms.unimelb.edu.au/
University of Melbourne
Austria
http://cran.at.r-project.org/
Wirtschaftsuniversitaet Wien
Belgium
http://www.freestatistics.org/cran/
K.U.Leuven Association
Brazil
http://nbcgib.uesc.br/mirrors/cran/
Center for Comp. Biol. at Universidade Estadual de Santa Cruz
http://cran-r.c3sl.ufpr.br/
Universidade Federal do Parana
http://cran.fiocruz.br/
Oswaldo Cruz Foundation, Rio de Janeiro
http://www.vps.fmvz.usp.br/CRAN/
University of Sao Paulo, Sao Paulo
http://brieger.esalq.usp.br/CRAN/
University of Sao Paulo, Piracicaba
China http://cran.rapporter.net/
http://ftp.ctex.org/mirrors/CRA Rapporter.net, Budapest
N/
CTEX.ORG India
http://mirror.bjtu.edu.cn/cran http://ftp.iitm.ac.in/cran/
Beijing Jiaotong University, Indian Institute of Technology
Beijing Madras
http://mirrors.ustc.edu.cn/CRAN
/ Indonesia
University of Science and http://cran.repo.bppt.go.id/
Technology of China
Agency for The Application and
http://mirrors.xmu.edu.cn/CRA Assessment of Technology
N/
Xiamen University
Hungary
Many of these sites can also be accessed using
FTP. In addition, several StatLib mirrors around
the world provide a complete CRAN mirror.
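Choosing a nearby mirror amounts to a simple lookup by country. The sketch below is only an illustration in Python: the dictionary holds a few entries copied from the mirror list above, and the names CRAN_MIRRORS, DEFAULT_MIRROR and pick_mirror are made up for this example; they are not part of R or CRAN.

```python
# A few mirrors copied from the list above, keyed by country.
# The selection is illustrative, not complete.
CRAN_MIRRORS = {
    "Australia": ["http://cran.csiro.au/", "http://cran.ms.unimelb.edu.au/"],
    "Austria": ["http://cran.at.r-project.org/"],
    "Belgium": ["http://www.freestatistics.org/cran/"],
    "Hungary": ["http://cran.rapporter.net/"],
    "India": ["http://ftp.iitm.ac.in/cran/"],
    "Indonesia": ["http://cran.repo.bppt.go.id/"],
}

# The "0-Cloud" entry redirects automatically to servers worldwide,
# so it serves as the fallback when no local mirror is listed.
DEFAULT_MIRROR = "http://cran.rstudio.com/"


def pick_mirror(country):
    """Return the first listed mirror for a country, else the 0-Cloud URL."""
    mirrors = CRAN_MIRRORS.get(country)
    return mirrors[0] if mirrors else DEFAULT_MIRROR


if __name__ == "__main__":
    print(pick_mirror("India"))
    print(pick_mirror("Iceland"))  # not in the table: falls back to 0-Cloud
```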
Manuals
http://cran.r-project.org/manuals.html
The R Journal
http://journal.r-project.org/
image plot of a volcano
mathematical annotation in plots
All images are (C) R Foundation, from http://www.r-project.org
 GNU PSPP is a program for statistical analysis
of sampled data.
 It is a Free replacement for the proprietary
program SPSS, and appears very similar to it
with a few exceptions.
 The most important of these exceptions are that
there are no “time bombs”; your copy of PSPP will
not “expire” or deliberately stop working in the
future.
 Neither are there any artificial limits on the
number of cases or variables which you can use.
 There are no additional packages to purchase in
order to get “advanced” functions; all
functionality that PSPP currently supports is in
the core package.
 PSPP can perform descriptive statistics, t-tests,
ANOVA, linear and logistic regression,
cluster analysis, factor analysis,
non-parametric tests and more.
 Its backend is designed to perform its
analyses as fast as possible, regardless of the
size of the input data.
 We can use PSPP with its graphical interface
or the more traditional syntax commands.
 The 0.8.x release series includes many new
features and analysis options.
 Support for over 1 billion cases
 Support for over 1 billion variables
 Syntax and data files which are compatible with those of SPSS
 A choice of terminal or graphical user interface
 A choice of text, PostScript, PDF, OpenDocument or HTML output
formats
 Interoperability with Gnumeric, LibreOffice, OpenOffice.org and
other free software
 Easy data import from spreadsheets, text files and database
sources
 The capability to open, analyse and edit two or more datasets
concurrently. They can also be merged, joined or
concatenated
 A user interface supporting all common character sets and
which has been translated to multiple languages
 Fast statistical procedures, even on very large data sets
 No license fees, no expiration period and no unethical “end user
license agreements”
 A fully indexed user manual
 Freedom ensured: it is licensed under the GPLv3 or later
 Portability: runs on many different computers and many different
operating systems (GNU or GNU/Linux are the preferred platforms,
but we have had many reports that it runs well on other systems too)
 PSPP is particularly aimed at statisticians, social scientists and
students requiring fast convenient analysis of sampled data
 As with most GNU software, PSPP can be found on the main GNU ftp
server: http://ftp.gnu.org/gnu/pspp/ (via HTTP) and
ftp://ftp.gnu.org/gnu/pspp/ (via FTP). It can also be found on the
GNU mirrors; please use a mirror if possible
 There are some additional ways you can download or otherwise
obtain PSPP.
 Documentation for PSPP is available online, as is documentation for
most GNU software.
 More information about PSPP can be had by running info pspp or
man pspp, or by looking at /usr/share/doc/pspp/,
/usr/local/doc/pspp/, or similar directories on your system. A brief
summary is available by running pspp --help.
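To check what is available on a particular machine, the locations mentioned above can be probed from a short script. This is a minimal Python sketch: the function name find_pspp_resources is hypothetical, and the two directories are simply the ones named in the text, so other systems may keep the documentation elsewhere.

```python
import shutil
from pathlib import Path

# Documentation directories named in the text; other systems may differ.
DOC_DIRS = ["/usr/share/doc/pspp/", "/usr/local/doc/pspp/"]


def find_pspp_resources(doc_dirs=DOC_DIRS):
    """Report where the pspp executable and its local docs can be found."""
    return {
        # Full path of the pspp program on PATH, or None if not installed.
        "executable": shutil.which("pspp"),
        # Only those candidate directories that actually exist.
        "doc_dirs": [d for d in doc_dirs if Path(d).is_dir()],
    }


if __name__ == "__main__":
    info = find_pspp_resources()
    if info["executable"] is None:
        print("pspp was not found on PATH")
    else:
        print("pspp found at", info["executable"], "(try: pspp --help)")
    for directory in info["doc_dirs"]:
        print("documentation directory:", directory)
```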
PSPP can be used in several different modes,
depending on the requirements, experience and
preference of the user.
 Terminal Mode
This avoids cluttering the screen with a lot of
dialog boxes, menus and other windows. If you are
familiar with the PSPP syntax, then this is the simplest
way to use the program. If your terminal has cursor
keys, they behave in PSPP in an intuitive manner. You
can also use the HOST command to temporarily
return to the shell at any time. Your session is logged
to a file, so that you can review it later. PSPP is
designed to handle very large volumes of data, larger
even than the virtual memory of the computer.
Terminal Mode
 Graphic User Interface
If you prefer menus and interactive dialog boxes to
typing syntax commands, you can use the graphical
interface instead.
Graphic User Interface Mode
There is also a non-interactive mode of operation. This is useful for
longer analyses which you want to perform again and again. You can
choose many different formats to save the results of your analyses. This
mode can also be used as part of a wider system, such as automated
creation and processing of data files for online display.
Available output formats include:
Plain ASCII Text
Simple but effective, and very portable.
Unicode Text with UTF-8 encoding
Aesthetically pleasing and works on most modern computers.
PDF
Great for printed documents, but requires a reader to view it.
ODT
The OpenDocument text format used by office suites such as LibreOffice.
HTML
Great if you want to put your reports on a website.
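A repeatable, non-interactive run could be driven from a script along the following lines. This is a hedged Python sketch, not an official recipe: it assumes the pspp command accepts a syntax file plus -o OUTPUT-FILE and picks the output format from the file extension, and the file names, the sample data and the helper names build_command and run_batch are invented for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# File extensions for the output formats listed above (assumed mapping).
EXTENSIONS = {"text": ".txt", "pdf": ".pdf", "odt": ".odt", "html": ".html"}

# A tiny PSPP syntax job with made-up inline data.
SYNTAX = """\
DATA LIST LIST /height weight.
BEGIN DATA.
170 65
182 80
165 58
END DATA.
DESCRIPTIVES /VARIABLES=height weight.
"""


def build_command(syntax_file, output_stem, fmt):
    """Build the argument list for one non-interactive PSPP run."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt}")
    output_file = f"{output_stem}{EXTENSIONS[fmt]}"
    return ["pspp", "-o", output_file, str(syntax_file)]


def run_batch(fmt="html"):
    """Write the syntax file and run PSPP on it if PSPP is installed."""
    workdir = Path(tempfile.mkdtemp())
    syntax_file = workdir / "analysis.sps"
    syntax_file.write_text(SYNTAX)
    command = build_command(syntax_file, workdir / "report", fmt)
    if shutil.which("pspp") is None:
        return command, None  # PSPP missing: return the command only
    result = subprocess.run(command, capture_output=True, text=True)
    return command, result.returncode


if __name__ == "__main__":
    command, status = run_batch("html")
    print(" ".join(command), "-> exit status:", status)
```

Because the whole job is described by a syntax file and a command line, the same analysis can be rerun unchanged, for example from a scheduled task that regenerates an HTML report.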
 PSPP can generate high quality plots to help with
visualisation of the distribution of data. Among the
types of plots which can be displayed are box-and-whisker
plots, normal probability plots and
histograms. These complement descriptive statistics
and help determine the most appropriate type of
analysis for the data, and/or what transformations
are necessary. The data selection capabilities of PSPP
make it simple to generate plots from a subset of
variables or from data which match only certain
criteria
 Plots and graphs created by PSPP are formatted in
standard file formats such as postscript or PNG, so as
to allow easy import into reports or other documents
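The idea of plotting only the cases that match a criterion can be sketched outside PSPP as well. The Python example below is an illustration only: it selects a subset of made-up cases and computes the five numbers that a box-and-whisker plot displays. The data, the variable names and the helper functions are all hypothetical, and the quartile convention is that of Python's statistics module, which need not match PSPP's.

```python
import statistics


def select_cases(cases, predicate):
    """Keep only the cases for which the selection criterion holds."""
    return [case for case in cases if predicate(case)]


def five_number_summary(values):
    """Return the values a box-and-whisker plot is drawn from."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Made-up sample data: one dictionary per case.
cases = [
    {"sex": "f", "height": 160},
    {"sex": "m", "height": 182},
    {"sex": "f", "height": 165},
    {"sex": "f", "height": 170},
    {"sex": "m", "height": 178},
    {"sex": "f", "height": 175},
]

# Summarise one variable for the subset of cases matching a criterion.
subset = select_cases(cases, lambda case: case["sex"] == "f")
summary = five_number_summary([case["height"] for case in subset])

if __name__ == "__main__":
    print(summary)
```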
Recoding and manipulation of data can be
achieved rapidly using PSPP transformations.
Transformations enable you to specify operations
without needing an extra iteration through the
data. Operations may comprise simple boolean
criteria, arithmetic expressions and mathematical
functions. PSPP supports many math functions,
including random number distributions,
trigonometry and date-time conversions.
Transformations may be cascaded, so that many
operations can be applied concurrently. Like
other operations, the data manipulation features
can be performed using either the syntax
commands or through interactive dialog boxes.
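The point about cascading transformations without extra passes can be illustrated with lazy generators. The Python sketch below is an analogy, not a description of how PSPP is implemented: each transformation wraps the previous one, so every case is read from the source exactly once no matter how many transformations are stacked. The variable names (age, income) and all function names are invented for the example.

```python
import math


def read_cases(rows, counter):
    """Yield cases one at a time, counting how often the source is read."""
    for row in rows:
        counter["reads"] += 1
        yield row


def recode_age(cases):
    """Recode a numeric age into a two-level group (boolean criterion)."""
    for case in cases:
        group = "adult" if case["age"] >= 18 else "minor"
        yield {**case, "age_group": group}


def add_log_income(cases):
    """Add a derived variable using a mathematical function."""
    for case in cases:
        yield {**case, "log_income": math.log(case["income"])}


def cascade(cases, *transformations):
    """Chain transformations lazily; nothing runs until cases are consumed."""
    for transformation in transformations:
        cases = transformation(cases)
    return cases


rows = [{"age": 15, "income": 100.0}, {"age": 42, "income": 1000.0}]
counter = {"reads": 0}
result = list(cascade(read_cases(rows, counter), recode_age, add_log_income))

if __name__ == "__main__":
    print(result)
    print("source reads:", counter["reads"])  # one read per case
```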
