DATA MODELS IN GIS

OUTLINE
Overview of models
Data and levels of measurement
Raster and vector models
Conversion between models
Databases

DIGITAL INFORMATION
GIS requires that both data and maps be represented as numbers
GIS places data into the computer’s memory in a physical data structure (i.e. files and directories)
files can be written in binary or as ASCII text
  binary is faster to read and smaller
  ASCII can be read and edited by humans but uses more space
data are sent through a “pipe” consisting of 0s and 1s
data are stored on devices that can store only 0s and 1s
data are processed as 0s and 1s

DATA
locational and attribute data in a GIS
attribute type: discrete vs continuous
  discrete: presumed to occur at distinct locations, with empty locations having a value of zero for the attribute in question
  continuous: feature occurs throughout the geographical region; no locations are empty

DATA
Levels of Measurement:
four levels are commonly recognized: nominal, ordinal, interval and ratio
each subsequent level includes all characteristics of the preceding levels
data available at higher levels can be reduced to lower levels; the opposite is not true

LEVELS OF MEASUREMENT
Nominal Scale
objects are classed into groups; groups possess arbitrary labels (numbers/names)
e.g. religion, land use/cover
discrete variable

LEVELS OF MEASUREMENT
Ordinal Scale
categorization plus an ordering/ranking of data
e.g. country road, street, highway
can identify larger/smaller but cannot comment on the degree of difference between values
K=5, L=3, M=1 is equivalent to K=500, L=300, M=10
discrete variables

LEVELS OF MEASUREMENT
Interval Scale
measurements are arranged in rank and the distance between measurements is known
no “true” zero point
e.g. elevation/topographic lines, temperature in °C
discrete or continuous

LEVELS OF MEASUREMENT
Ratio Scale
like interval scaling: both rank and separation are known, but there is also a known, fixed starting point
e.g. temperature on the Kelvin scale; speed
continuous and discrete

DATA MODELS - REPRESENTING DATA
1. Reality: total phenomena as they actually exist
2. Conceptual Data Model: describes and defines included entities (how they will be represented)
3. Logical Data Model: logical organization of the database elements
4. Physical Data Model or File Structure: how information will be structured for access

DATA MODELS
the logical data model is how data are organized for use by the GIS
GISs have traditionally used either raster or vector for maps
  raster: based on pixels
  vector: based on points, lines and polygons
while most GIS systems can handle both raster and vector, only one is used for the internal organization of spatial data

DATA MODELS
rasters and vectors can be flat files … if they are simple
[Figure: a raster-based line and a vector-based line, each stored as a flat file]

Raster flat file:
0000000000000000
0001100000100000
1010100001010000
1100100001010000
0000100010001000
0000100010000100
0001000100000010
0010000100000001
0111001000000001
0000111000000000
0000000000000000

Vector flat file (two columns of coordinates):
4753456  623412
4753436  623424
4753462  623478
4753432  623482
4753405  623429
4753401  623508
4753462  623555
4753398  623634

RASTER DATA MODELS
basic unit is the cell or pixel; cells are uniformly spaced
each cell/pixel has spatial and spectral information
e.g. digital elevation data and digital images
spatially exhaustive sampling of the area of interest
every cell has a value, even if it is “missing”
each cell has a resolution, given as the cell size in ground units
higher resolution means smaller cell dimensions

RASTER DATA MODELS
[Figure: generic structure for a grid; labels: grid extent, rows, columns, grid cell, resolution]

RASTER DATA MODELS
[Figure: fining of resolution]

CREATING RASTER DATA MODELS
creating a raster is like laying a grid over a map
code each cell with a value representing the attribute
every cell has a value, even if null or zero (integers, ratios, etc.)
values for each cell are written into a file (spreadsheet, database, word processor)
the file is imported into the GIS so it can be reformatted
each pixel presumably has one value; in reality, is this correct?
mixed pixel issue

RASTER AND MISSING DATA
[Figure: GIS data layer as a grid with a large section of “missing data,” in this case the zeros in the ocean off New York and New Jersey]

MIXED PIXEL ISSUE
[Figure: grids of W, G and E cells illustrating three coding rules: water dominates, winner takes all, edges separate]

MIXED PIXEL ISSUE
[Figure: water/land cells coded by “largest share”, “central point”, “presence/absence” and “percent occurrence”; cell labels 35%, 70%, 80%, 100%]

CREATING RASTER DATA MODELS
raster data are visualized as map layers
map layer: data describing a single characteristic for each location
multiple items of information require multiple layers
this creates problems: raster databases can become enormous
each map layer has thousands of cells

RASTER DATA MODELS
Advantages
simple data structures
each cell can be owned by only one feature
overlay and combination of maps and remotely sensed images is easy
simulation is easy, because cells have the same size and shape
technology is cheap

RASTER DATA MODELS
Advantages
some spatial analysis methods are simple to perform
  local: cell-by-cell calculations
  focal: models cell value based on neighbours
  zonal: models cell value based on geographical areas
  global: models cell value based on all cells

RASTER DATA MODELS
Disadvantages
large volumes of graphic data
use of large cells to reduce data volumes
poor at representing points, lines and areas; good at surfaces
must often include redundant or missing data
network linkages are difficult to establish
projection transformations are time consuming

COMPRESSION TECHNIQUES
raster compression techniques used in GIS are run-length encoding and quadtrees
Run-length Encoding
more efficient
values often occur in runs across several cells
form of spatial autocorrelation
e.g. the array 0 0 0 1 1 0 0 1 1 1 0 0 1 1 1 would be entered as 3 0 2 1 2 0 3 1 2 0 3 1 (pairs of run length and value)

RUN-LENGTH CODING
Legend: A. Mixed Conifer   B. Douglas Fir   C. Oak Savannah   D. Grassland
Row-by-row coding:
  CCCCCBBDCCCCBBDCCCBBBDDCBBAADDDDBAADDBBBAADDDAAAADDDAAAA
  56 entries for 7x8 array
Run-length coding:
  5C 2B 1D 4C 2B 1D 3C 3B 2D 1C 2B 2A 4D 1B 2A 2D 3B 2A 3D 4A 3D 4A
  22 pairs (44 entries) for 7x8 array

COMPRESSION TECHNIQUES
Quadtree Compression
hierarchical data model using a variable-sized grid cell
finer subdivisions are used in areas requiring finer detail (higher resolution)
a pixel in each higher layer is derived from the average or majority of 4 pixels from the lower layer
not as efficient for more variable or complex data
used primarily as a way to store data for rapid retrieval on display devices

QUAD TREE STRUCTURE
[Figure: quadtree structure]

RASTER DATA FORMAT
most raster formats are digital image formats
most GISs accept TIF, GIF, JPEG or encapsulated PostScript, which are not georeferenced
DEMs are true raster data formats

VECTOR DATA MODELS
think of the world as a space populated by discrete features of various shapes and kinds: points, lines, areas
any location in space may be empty or occupied by one or more points, lines or areas

VECTOR DATA MODELS
point
zero-dimensional abstraction of an object, represented by a single X,Y co-ordinate
normally represents a geographic feature too small to be displayed as a line or area
stored by its real (earth) coordinates

VECTOR DATA MODELS
line
set of ordered co-ordinates that represent the shape of geographic features too narrow to be displayed as an area at the given scale, or linear features with no area
lines and areas are built from sequences of points in order
lines have a direction, given by the ordering of the points

VECTOR DATA MODELS
polygon
feature used to represent areas
defined by the lines that make up its boundary and a point inside its boundary for identification
polygons have attributes that describe the geographic feature they represent
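The point, line and polygon definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates, the `land_use` attribute and the helper names `line_length` and `ring_area` are hypothetical and do not come from the slides.

```python
from math import hypot

# Hypothetical coordinates in ground units (not taken from the slides).
point = (623412.0, 4753456.0)                  # a point: one X,Y pair

line = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]    # a line: ordered X,Y pairs

# A polygon: closed boundary (first vertex repeated at the end), a label
# point inside the boundary for identification, and attributes.
polygon = {
    "boundary": [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)],
    "label_point": (2.0, 1.5),
    "attributes": {"land_use": "grassland"},
}


def line_length(coords):
    """Sum the straight-line distances between consecutive vertices."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def ring_area(ring):
    """Signed area by the shoelace formula; positive for counter-clockwise rings."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:]))


print(line_length(line))                # 10.0
print(ring_area(polygon["boundary"]))   # 12.0
```

Reversing the order of the boundary vertices flips the sign of `ring_area`, which is one way the direction of a line's point ordering shows up in practice.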
VECTOR DATA MODELS
vector data evolved the arc/node model in the 1960s
an area consists of lines and a line consists of points
points, lines, and areas can each be stored in their own files, with links between them
the endpoint of a line (arc) is called a node; arc junctions are only at nodes
stored with the arc is the topology (i.e. the connecting arcs and left and right polygons)

TOPOLOGY
topological data structures dominate GIS software
topology stored explicitly allows automated error detection and elimination
maps are rarely topologically clean when digitized or imported
a GIS has to be able to build topology from unconnected arcs

TOPOLOGY
[Figure: Arc/Node Map Data Structure with Files. Polygon “A” is bounded by arcs 1 and 2, which run through points 1 to 13.]

Points File
1   x y
2   x y
3   x y
4   x y
5   x y
6   x y
7   x y
8   x y
9   x y
10  x y
11  x y
12  x y
13  x y

Arcs File
1   1,2,3,4,5,6,7
2   1,8,9,10,11,12,13,7

File of Arcs by Polygon
A:  1,2, Area, Attributes

TOPOLOGY
relationship between nodes, arcs and polygons
topologically structured database for ease of retrieval and implementation of spatial-relational operations
advantages:
  simple, elegant and efficient relational database construction and analysis
  complete topology makes map overlay feasible
  topology allows many GIS operations to be done without accessing the point files
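A minimal sketch of the arc/node files from the figure above, in Python. The arc and polygon records follow the figure; the point coordinates are hypothetical (the figure shows only "x y" placeholders), and the helper names are invented for illustration.

```python
# Points file: point id -> (x, y). Coordinates are hypothetical placeholders.
points = {i: (float(i), float(i % 3)) for i in range(1, 14)}

# Arcs file, as in the figure: arc id -> ordered point ids.
arcs = {
    1: [1, 2, 3, 4, 5, 6, 7],
    2: [1, 8, 9, 10, 11, 12, 13, 7],
}

# File of arcs by polygon, as in the figure: polygon "A" uses arcs 1 and 2.
polygons = {"A": [1, 2]}


def arc_nodes(arc_id):
    """Nodes are the two endpoints of an arc."""
    pts = arcs[arc_id]
    return pts[0], pts[-1]


def arcs_at_node(node):
    """Arcs that start or end at a node, found without reading any coordinates."""
    return sorted(a for a in arcs if node in arc_nodes(a))


def polygon_ring(poly_id):
    """Chain a polygon's arcs end to end into one closed ring of point ids."""
    remaining = list(polygons[poly_id])
    ring = list(arcs[remaining.pop(0)])
    while remaining:
        for a in remaining:
            pts = arcs[a]
            if pts[0] == ring[-1]:          # arc continues in stored direction
                ring += pts[1:]
            elif pts[-1] == ring[-1]:       # arc must be traversed in reverse
                ring += pts[::-1][1:]
            else:
                continue
            remaining.remove(a)
            break
        else:
            raise ValueError("arcs do not connect; topology is not clean")
    return ring


print(arcs_at_node(7))       # [1, 2]
print(polygon_ring("A"))     # [1, 2, 3, 4, 5, 6, 7, 13, 12, 11, 10, 9, 8, 1]
ring_xy = [points[p] for p in polygon_ring("A")]   # coordinates only when drawing
```

Both queries work on point and arc identifiers alone; the points file is read only in the last line, when coordinates are needed for display.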
VECTOR DATABASE CREATION
database creation involves several stages:
  input of the spatial data
  input of the attribute data
  linking spatial and attribute data
spatial data are entered via digitized points and lines, scanned and vectorized lines, or directly from other digital sources
once the spatial data have been entered, much work is still needed before they can be used

VECTOR DATABASE CREATION
Building Topology
once points are entered and geometric lines are created, topology must be "built"
this involves calculating and encoding relationships between the points, lines and areas
this information may be automatically coded into tables of information in the database

VECTOR DATABASE CREATION
Editing
during the topology generation process, problems such as overshoots, undershoots and spikes are either flagged for editing by the user or corrected automatically
automatic editing involves the use of a tolerance value, which defines the width of a buffer zone around objects within which adjacent objects should be joined

VECTOR DATA MODELS
Advantages
good representation of structures (points, lines, polygons)
compact and more efficient
topology can be completely described
accurate graphics
retrieval, updating and generalization of graphics and attributes possible
work well with pen and light-plotting devices and tablet digitizers

VECTOR DATA MODELS
Disadvantages
complex data structures
combination of several vector polygon maps, or of polygon and raster maps, through overlay creates difficulties
simulation is difficult
display and plotting can be expensive
technology is expensive
not good at continuous coverage or with plotters that fill areas
TIN must be used to represent volumes

VECTOR DATA FORMATS
vector formats are either page definition languages or preserve ground coordinates
page languages are HPGL, PostScript, and AutoCAD DXF
true vector GIS data formats include ArcView Shapefiles and ArcInfo Interchange Files (E00), which have topology
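Returning to the Editing slide in VECTOR DATABASE CREATION, the role of the tolerance value can be sketched as follows. This is a simplified assumption of how snapping might work, not the procedure of any particular GIS: it only joins line endpoints that fall within the tolerance of an earlier endpoint, and it does not trim overshoots or remove spikes. The function name and sample coordinates are hypothetical.

```python
from math import hypot


def snap_endpoints(lines, tolerance):
    """Move line endpoints onto an earlier node if they lie within `tolerance`."""
    nodes = []      # node positions accepted so far
    snapped = []
    for line in lines:
        new_line = list(line)
        for idx in (0, -1):                     # only endpoints are snapped
            x, y = new_line[idx]
            for nx, ny in nodes:
                if hypot(x - nx, y - ny) <= tolerance:
                    new_line[idx] = (nx, ny)    # join to the existing node
                    break
            else:
                nodes.append((x, y))            # nothing nearby: becomes a new node
        snapped.append(new_line)
    return snapped


# Hypothetical digitized lines: the second one starts just short of the first's end.
digitized = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.05, 0.02), (10.0, 10.0)],
    [(0.0, 0.0), (10.0, 10.4)],     # ends 0.4 units from (10, 10): left alone
]
cleaned = snap_endpoints(digitized, tolerance=0.1)
print(cleaned[1][0])    # (10.0, 0.0)
print(cleaned[2][-1])   # (10.0, 10.4)
```

A larger tolerance would also join the third line to the node at (10, 10), which is why the value has to be chosen with the digitizing accuracy in mind.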
VECTOR DATA MODELS
List of coordinates (“spaghetti”)
simple
easy to manage
no topology
lots of duplication, hence need for large storage space
very often used in CAC (computer assisted cartography)

VECTOR DATA MODELS
Vertex Dictionary
no duplication, but this model still does not use topology

VECTOR DATA MODELS
Dual Independent Map Encoding (DIME)
developed by the US Bureau of the Census
nodes (intersections of lines) are identified with codes
assigns a directional code in the form of a "from node" and a "to node"
both street addresses and UTM coordinates are explicitly defined for each link

VECTOR TO RASTER EXCHANGE
data exchange by translation (export and import) can lead to significant errors in attributes and in geometry
efficient data exchange is important for the future of GIS

ADVANCED DATA MODELS - TIN
a triangulated irregular network is a set of elevation points which have been connected to form a network of triangles
developed in the early 1970s as a simple way to build a surface
the sample points are connected by lines to form triangles; within each triangle the surface is usually represented by a plane
triangles fit together in a manner which simulates the face of the land
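Representing the surface inside one triangle by a plane can be sketched as below. The plane z = a*x + b*y + c is fitted through the three vertices; the vertex coordinates and the function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import atan, degrees, hypot


def plane_through(p1, p2, p3):
    """Return (a, b, c) of z = a*x + b*y + c through three (x, y, z) vertices."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = p1, p2, p3
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    if det == 0:
        raise ValueError("vertices are collinear; no unique plane")
    a = ((z2 - z1) * (y3 - y1) - (z3 - z1) * (y2 - y1)) / det
    b = ((x2 - x1) * (z3 - z1) - (x3 - x1) * (z2 - z1)) / det
    c = z1 - a * x1 - b * y1
    return a, b, c


def elevation_at(plane, x, y):
    """Elevation of the triangle's plane at (x, y)."""
    a, b, c = plane
    return a * x + b * y + c


# One hypothetical TIN triangle: (x, y, elevation) for each vertex.
triangle = [(0.0, 0.0, 100.0), (10.0, 0.0, 110.0), (0.0, 10.0, 120.0)]
plane = plane_through(*triangle)

print(plane)                           # (1.0, 2.0, 100.0)
print(elevation_at(plane, 2.0, 3.0))   # 108.0

# Slope of the facet in degrees, from the gradient magnitude of the plane.
slope_deg = degrees(atan(hypot(plane[0], plane[1])))
```

Because the plane coefficients are constant over the triangle, slope is a single value per facet, which is consistent with treating each triangle as a polygon carrying its own slope attribute.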
ADVANCED DATA MODELS - TIN
irregularly spaced sample points can be adapted to the terrain
  rough terrain: more points
  smooth terrain: fewer points
an irregularly spaced sample is more efficient

ADVANCED DATA MODELS - TIN
TINs can be seen as polygons having attributes of slope, aspect and area, with three vertices having elevation attributes
TIN models work best in areas with sharp breaks in slope

ADVANCED DATA MODELS - TIN
Advantages
  ability to describe the surface at different levels of resolution
  efficiency in storing data
  allows simple calculation of basin areas, slopes, channels, and many other geometric parameters
Disadvantages
  in many cases requires visual inspection and manual control of the network

DATABASES
a spatial database is a collection of spatially referenced data that acts as a model of reality
these selected phenomena are deemed important enough to represent in digital form
the digital representation might be for some past, present or future time period

DIGITAL DATABASES
scaleless: data can be stored at the level of detail found in the environment
cartographer is responsible for choosing the content and resolution
scale critical factor: level of resolution set by
  field instruments
  digitizing: resolution of instrument, and abstraction and production factors

DIGITAL DATABASES
problems when using data sets of different resolutions
  e.g. roads may not line up
  resolved using ancillary source materials
additional problems when using data sets of different themes
  e.g. combining elevation and drainage data: water running uphill or non-level lakes

DIGITAL DATABASES
Value of databases:
  Cost of creation: cheaper to get data from an existing database
  Appropriateness of use
  Lack of alternative data sources
  Graphic output

METADATA
“data about the data”
could include data elements that:
  identify the data
  identify the custodians and access conditions to the data
  describe projection, content, quality of data
  describe the action taken when handling databases of varying scale

Dataset information
Title: Ortofotos'95
Abstract: Ortofotos'95 is a collection of ortho-rectified aerial photographs. These aerial photographs cover Portugal and were obtained in August 1995 in false color infra red film at scale 1:40 000. CNIG, The Directorate General of Forests and The Paper Mill industry are the owners of the aerial photographs (in paper format).
Type of dataset: Airborne data > Aerial photos
Locations: Portugal
Temporal Range: 1995-
Dataset scales: 1:25 000-1:50 000
Dataset resolution: 1 - 3 meters
Dataset quality remarks: Acquisition of data: aerial photographs; the film is scanned at very high resolution and ortho-rectified using DTM derived from topographic cartography at scale 1:25 000
Information creation date: 1999-10-29

DATABASES
pre-1970s: command line based, with read and write to hard disk, tapes, diskettes
database approach: all reading and writing through a simple interface (no need to care about tapes, etc.)
for small GIS projects it is sufficient to store geographic information as simple files
with large data volumes and many data users it is best to use a database management system (DBMS)
relational design has been the most useful (since the 1980s)

DATABASE MANAGEMENT SYSTEMS
contain tables or feature classes in which:
  rows: entities, records, observations, features
    all information about one occurrence of a feature
  columns: attributes, fields, data elements, variables
    one type of information for all features
a key field is an attribute whose values uniquely identify each row

Parcel Table
Parcel #   Address        Block   $ Value
8          501 N Hi       1       105,450
9          590 N Hi       2       89,780
36         1001 W. Main   4       101,500
75         1175 W. 1st    12      98,000
(figure labels: each row is an entity, Parcel # is the key field, each column is an attribute)

DATABASES - RDBMS
tables are related or joined using a common record identifier (column variable) present in both tables
Example:
  goal: produce a map of values by district/neighbourhood
  problem: no district code available in the parcel table (Parcel Table as above)

DATABASES - RDBMS
solution: join the parcel table containing values with the geography table containing location codings, using Block as key field
Block is the secondary or foreign key in the Parcel Table (as above)

Geography Table
Block   District   Tract   City
1       A          101     Dallas
2       B          101     Dallas
4       B          105     Dallas
12      E          202     Garland

DATABASES - RDBMS
[Figure: relational linkages between spatial attributes and descriptive attributes, using water right locations]

DATABASES
Advantages
very flexible
export data to another system easily
enables simple operations, e.g. search for records satisfying some condition

Description                          Thickness     Code
New Ice                              <10 cm        1
Nilas, Ice Rind                      0-10 cm       2
Young Ice                            10-30 cm      3
Grey Ice                             10-15 cm      4
Grey-White Ice                       15-30 cm      5
First-Year Ice                       30-200 cm     6
Thin First-Year Ice                  30-70 cm      7
Thin First-Year Ice, first stage     30-50 cm      8
Thin First-Year Ice, second stage    50-70 cm      9
Medium First-Year Ice                70-120 cm     1.
Thick First-Year Ice                 120-200 cm    4.
Old Ice                                            7.
Second-Year Ice                                    8.
Multi-Year Ice                                     9.
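The join on Block and the "search for records satisfying some condition" from the database slides can be sketched with Python's built-in `sqlite3` module. The rows are those of the Parcel and Geography tables in the slides; the table names, column names and the 100,000 threshold are assumptions made for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE geography (
        block INTEGER PRIMARY KEY, district TEXT, tract INTEGER, city TEXT);
    CREATE TABLE parcel (
        parcel_no INTEGER PRIMARY KEY, address TEXT,
        block INTEGER REFERENCES geography(block),   -- foreign key
        value INTEGER);
""")
con.executemany("INSERT INTO geography VALUES (?, ?, ?, ?)", [
    (1, "A", 101, "Dallas"),
    (2, "B", 101, "Dallas"),
    (4, "B", 105, "Dallas"),
    (12, "E", 202, "Garland"),
])
con.executemany("INSERT INTO parcel VALUES (?, ?, ?, ?)", [
    (8, "501 N Hi", 1, 105450),
    (9, "590 N Hi", 2, 89780),
    (36, "1001 W. Main", 4, 101500),
    (75, "1175 W. 1st", 12, 98000),
])

# Join parcels to districts through Block, then summarize values per district.
by_district = con.execute("""
    SELECT g.district, COUNT(*), AVG(p.value)
    FROM parcel AS p JOIN geography AS g ON p.block = g.block
    GROUP BY g.district
    ORDER BY g.district
""").fetchall()
print(by_district)   # [('A', 1, 105450.0), ('B', 2, 95640.0), ('E', 1, 98000.0)]

# Simple search: parcels whose value exceeds a (hypothetical) threshold.
expensive = con.execute(
    "SELECT parcel_no FROM parcel WHERE value > ? ORDER BY parcel_no", (100000,)
).fetchall()
print(expensive)     # [(8,), (36,)]
```

The per-district averages are the kind of values that could then be mapped by district, which the parcel table alone could not support because it carries no district code.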