LaTiS - OPeNDAP

LaTiS
https://github.com/dlindhol/LaTiS
Doug Lindholm
Laboratory for Atmospheric and Space Physics
University of Colorado Boulder
ESIP – July 8, 2014
Motivation - Get Data Into Analysis Code/Tools
Disparate Data -> Unified Interface
LaTiS Server Architecture
[Diagram]
Native Data (Binary, JDBC, FITS, Web Service, Custom), each with a TSML Descriptor
-> Adapters -> LaTiS Data Model
-> Filters: Subset, Constrain (sst > 20), Convert Units, Missing Values, Derived Products
-> Writers: ASCII, CSV, JSON, DAP2, Image, code snippet, Custom
-> Client Applications: Web Browser, Excel, Analysis Tools, Programs, Custom Web Service
LaTiS Client Options
• Any OPeNDAP client. Clients are available for most
programming languages (Python, IDL, Matlab, ...).
• Analysis/visualization tools with built-in OPeNDAP
support.
• Web browser: enter an HTTP URL query directly.
• wget, curl: command line tools for making an HTTP
request.
• Custom Web Applications (Open Source coming
soon) that make AJAX requests to LaTiS to get
JSON output and make interactive plots.
• Custom programming APIs that wrap a LaTiS call.
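As a rough illustration of the browser, wget/curl, and AJAX options above, the sketch below assembles a DAP2-style request URL in Python. The host, path prefix, dataset name `my_dataset`, and variable names are hypothetical. The general form (dataset name, output suffix, then a constraint expression of projected variables and selections) follows the DAP2 convention; the exact URL layout of a given LaTiS server is an assumption here.

```python
from urllib.parse import quote


def build_query(base_url, dataset, suffix, projection=(), selections=()):
    """Assemble <base>/<dataset>.<suffix>?<projection>&<selection>... (DAP2 style)."""
    url = f"{base_url.rstrip('/')}/{dataset}.{suffix}"
    # Projection: comma-separated variable names. Selections: e.g. "time>2".
    constraint = ",".join(projection)
    for selection in selections:
        constraint += "&" + selection
    if constraint:
        # Percent-encode characters such as ">" but keep the DAP2 separators.
        url += "?" + quote(constraint, safe=",&=")
    return url


# Hypothetical server and dataset; "json" selects the output format (writer).
url = build_query(
    "http://localhost:8080/latis",
    "my_dataset",
    "json",
    projection=("time", "flux"),
    selections=("time>2",),
)
print(url)
```

The resulting string can be pasted into a browser or passed to wget or curl; changing the suffix would request a different output format.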
Related Technology Comparisons
• OPeNDAP
– Both implement DAP2 protocol (standard service API)
– OPeNDAP servers tend to be file centric
– LaTiS presents “virtual” dataset via aggregation
– LaTiS aims to be easier to install, configure, and extend
• NetCDF Common Data Model (CDM)
– Multidimensional array centric
– Coupled to NetCDF file format
– Climate and forecast model (simulation) emphasis
• THREDDS Data Server
– Built around NetCDF CDM
– Provides OPeNDAP and other service interfaces
• TSDS
– First generation of LaTiS built on NetCDF CDM
• VisAD
– Essentially the same logical data model as LaTiS with a clunkier implementation based on old Java capabilities
– LaTiS is implemented around modern paradigms like Functional Programming
What do I mean by Data Model
• NOT a simulation or forecast (climate model)
• NOT a metadata model (ISO 19115)
• NOT a file format (NetCDF)
• NOT how the data are stored (RDBMS)
• NOT the representation in computer memory (data structure)
• Logical model
• What the data represent, conceptually
• How the data are used
Data Abstractions
bits
10110101000001001111001100110011111110
bytes
00105e0 e6b0 343b 9c74 0804 e7bc 0804 e7d5 0804
int, long, float, double, scientific notation (Number)
1, -506376193, 13.52, 0.177483826523, 1.02e-14
array
1.2, 3.6, 2.4, 1.7, -3.2
Scientific Data Abstractions
Multi-dimensional Arrays
Key Features:
- Single data type
- Access by index
Relational Data
Relational Database
Table = Relation
Row = Tuple of Attributes
e.g. (0, 3.5, B)
Key Features:
- Supports different data types
- Well suited for access by value
e.g. time>2, class=A
time  flux  class
0     3.5   B
1     4.6   A
2     4.7   A
3     4.1   A
4     3.2   B
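To make the two access styles concrete, the sketch below holds the table above as a sequence of tuples in Python and selects rows by value (time>2, class=A) and by index. The `Row` type and its field names are illustrative assumptions, not part of LaTiS.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One tuple of the relation: (time, flux, class)."""
    time: int
    flux: float
    klass: str  # "class" is a reserved word in Python


relation = [
    Row(0, 3.5, "B"),
    Row(1, 4.6, "A"),
    Row(2, 4.7, "A"),
    Row(3, 4.1, "A"),
    Row(4, 3.2, "B"),
]

# Access by value: the relational style, e.g. time>2 and class=A.
selected = [row for row in relation if row.time > 2 and row.klass == "A"]

# Access by index: the array style, here the third row.
third = relation[2]

print(selected)
print(third)
```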
But the relation is limited to a sequence of tuples:
LaTiS Unified Data Model
• Extends the Relational Model to add Functional relationships.
• Represents multi-dimensional domain of data grids.
• Access by value or index.
Independent Variable (domain) -> Dependent Variables (range)
Example: time series of gridded surface winds
Time -> ((Lon, Lat) -> (U,V))
LaTiS Data Model
Only Three Variable Types:
Scalar: single Variable
Tuple: group of Variables
Function: mapping from one Variable to another
Extend to capture higher-level, domain-specific abstractions
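The three variable types can be sketched in a few lines of Python. This is only an illustration of the logical model under assumed class and field names. It is not the LaTiS Scala API. It builds the gridded surface winds example and prints its functional signature.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scalar:
    """A single Variable."""
    name: str


@dataclass(frozen=True)
class Tuple:
    """A group of Variables."""
    variables: tuple


@dataclass(frozen=True)
class Function:
    """A mapping from one Variable (domain) to another (range)."""
    domain: "Variable"
    range: "Variable"


Variable = Union[Scalar, Tuple, Function]


def signature(v):
    """Render a Variable in the 'domain -> range' notation used on the slides."""
    if isinstance(v, Scalar):
        return v.name
    if isinstance(v, Tuple):
        return "(" + ", ".join(signature(x) for x in v.variables) + ")"
    rng = signature(v.range)
    if isinstance(v.range, Function):
        rng = "(" + rng + ")"  # parenthesize a nested Function
    return signature(v.domain) + " -> " + rng


# Time series of gridded surface winds: Time -> ((Lon, Lat) -> (U, V))
grid = Function(Tuple((Scalar("Lon"), Scalar("Lat"))), Tuple((Scalar("U"), Scalar("V"))))
winds = Function(Scalar("Time"), grid)
print(signature(winds))
```

Nesting a Function in the range of another Function is what lets the model describe multi-dimensional grids while keeping only three variable types.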
Discipline Agnostic Data Access with LaTiS
Philosophy: Leave data in their native form
Expose via a common interface
Software:
• Reusable adapters (software modules) to read common
formats, extension points for custom formats
• XML dataset descriptors, map native data model to the
LaTiS data model
• Open Source, community
Web services:
• Standard service interfaces, currently OPeNDAP
• Server side processing and output format options
Implementing the Data Model
• The LaTiS Data Model is an abstract representation
• Can be represented several ways
– UML
– VisAD grammar
– Java Interface (no implementation)
• Need an implementation in code
• Scientific data Domain Specific Language (DSL)
– Expose an API that fits the application domain
• Scala programming language
– http://www.scala-lang.org/
Why Scala
• Evolution of Java
– Use with existing Java code
– Runs on the Java Virtual Machine (JVM)
– Command line (REPL), script, or compiled
– Statically typed (safer than dynamic languages)
– Industrial strength (Twitter, LinkedIn, …)
• Object-Oriented
– Encapsulation, polymorphism, …
– Traits: interfaces with implementation, multiple inheritance, mix-ins
• Functional Programming
– Immutable data structures
– Functions with no side effects
– Provable, parallelizable
• Syntactic sugar for Domain Specific Languages
• Operator “overloading”, natural math language for Variables
• Parallel collections
Scala Implementation
• Dataset as a Scala collection
• Functional Programming Paradigms:
– Function composition over object manipulation
– Functions as first class citizens
• a LaTiS Function can be used like a programming function
– Immutable data structures
– No side-effects: parallelizable, provable
– Lazy evaluation: scalable
• Math and resampling mixed in
– e.g. dataset3 = (dataset1 + dataset2) / 2
• Metadata encapsulated
– enforce data consistency: unit conversions ...
– track provenance
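The "math mixed in" idea can be illustrated outside Scala as well. The Python sketch below is an assumption-laden stand-in and is not the LaTiS implementation: a small immutable `TimeSeries` class (a hypothetical name) overloads `+` and `/` so that the expression from the slide reads naturally. It requires both datasets to share the same time values and does no resampling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeSeries:
    """Immutable sequence of (time, value) samples."""
    samples: tuple

    def _combine(self, other, op):
        # Dataset-with-dataset math: sample times must match (no resampling here).
        if isinstance(other, TimeSeries):
            if [t for t, _ in self.samples] != [t for t, _ in other.samples]:
                raise ValueError("time values differ; resampling would be needed")
            return TimeSeries(tuple(
                (t, op(a, b)) for (t, a), (_, b) in zip(self.samples, other.samples)
            ))
        # Dataset-with-number math: apply the operation to every value.
        return TimeSeries(tuple((t, op(a, other)) for t, a in self.samples))

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __truediv__(self, other):
        return self._combine(other, lambda a, b: a / b)


dataset1 = TimeSeries(((0, 1.0), (1, 2.0)))
dataset2 = TimeSeries(((0, 3.0), (1, 4.0)))

# Each operation returns a new object; the inputs are never modified.
dataset3 = (dataset1 + dataset2) / 2
print(dataset3.samples)
```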
LaTiS Server Implementation
• RESTful web service API (OPeNDAP +)
• Java Servlet, build and deploy war file
• XML dataset descriptor (TSML) for each dataset
– Specify Adapter to use
– Map native data source to LaTiS data model
– Define transformations as Processing Instructions
• Catalog to map dataset names to TSML
• Plugins: implement the Adapter, Filter or Writer interfaces or
extend existing ones
• Properties file to map filter and writer names to
implementing classes
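The flow described above (data passes through filters, then a writer chosen by name) can be sketched as follows. Everything here is hypothetical: the function names, the `WRITERS` dictionary standing in for the properties file, and the in-memory samples standing in for an adapter's output. It shows the shape of the design and none of the actual LaTiS interfaces.

```python
import json


def subset(predicate):
    """Filter: keep only the samples that satisfy the predicate."""
    return lambda samples: (s for s in samples if predicate(s))


def stride(n):
    """Filter: keep every n-th sample."""
    return lambda samples: (s for i, s in enumerate(samples) if i % n == 0)


def write_csv(samples):
    return "\n".join(",".join(str(v) for v in s) for s in samples)


def write_json(samples):
    return json.dumps([list(s) for s in samples])


# Stand-in for the properties file: maps writer names to implementations.
WRITERS = {"csv": write_csv, "json": write_json}


def serve(samples, filters, suffix):
    """Apply the filters in order, then hand the result to the named writer."""
    for apply_filter in filters:
        samples = apply_filter(samples)
    return WRITERS[suffix](samples)


# Stand-in for an adapter's output: (time, flux) samples.
samples = [(0, 3.5), (1, 4.6), (2, 4.7), (3, 4.1), (4, 3.2)]
filters = [subset(lambda s: s[0] > 0), stride(2)]

csv_out = serve(samples, filters, "csv")
json_out = serve(samples, filters, "json")
print(csv_out)
print(json_out)
```

Adding a new output format in this sketch means registering one more entry in `WRITERS`, which mirrors the plugin idea of mapping names to implementing classes.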
Example – Serving an ASCII File
Sunspot data for October 2003
2003 10 01  75
2003 10 02  72
2003 10 03  59
2003 10 04  60
2003 10 05  53
2003 10 06  51
2003 10 07  50
2003 10 08  56
2003 10 09  58
2003 10 10  50
2003 10 11  44
2003 10 12  22
2003 10 13  12
2003 10 14   4
2003 10 15  17
2003 10 16  24
2003 10 17  37
2003 10 18  43
2003 10 19  43
2003 10 20  64
2003 10 21  66
2003 10 22  72
2003 10 23  68
2003 10 24  81
2003 10 25  89
2003 10 26 102
2003 10 27 141
2003 10 28 161
2003 10 29 167
2003 10 30 171
2003 10 31 156
TSML Dataset descriptor
<?xml version="1.0" encoding="UTF-8"?>
<tsml>
  <dataset name="Sunspot_Number"
           history="Read by LaTiS">
    <adapter class="latis.reader.tsml.AsciiAdapter"
             url="file:/data/latis/ssn.txt" />
    <time units="yyyy MM dd" />
    <integer name="ssn" />
  </dataset>
</tsml>
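The descriptor above says each record holds a time in the form "yyyy MM dd" followed by an integer named ssn. The Python sketch below shows that mapping for the first few records of the sunspot listing. It illustrates what the descriptor declares and does not reproduce how the AsciiAdapter works internally. The `%Y %m %d` format is the assumed Python equivalent of the "yyyy MM dd" pattern.

```python
from datetime import datetime

# Assumed Python strptime equivalent of the descriptor's "yyyy MM dd" units.
TIME_FORMAT = "%Y %m %d"


def parse_line(line):
    """Split one whitespace-delimited record into (time, ssn)."""
    fields = line.split()
    time = datetime.strptime(" ".join(fields[:3]), TIME_FORMAT).date()
    ssn = int(fields[3])
    return time, ssn


# First three records of the October 2003 sunspot listing.
rows = ["2003 10 01 75", "2003 10 02 72", "2003 10 03 59"]
samples = [parse_line(row) for row in rows]

for time, ssn in samples:
    print(time.isoformat(), ssn)
```

In the terms of the LaTiS data model, the parsed result is a Function from time to ssn.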
Current Applications
• LASP Interactive Solar Irradiance Data Center (LISIRD)
– Uses LaTiS to read, subset, reformat data, metadata
– http://lasp.colorado.edu/lisird/
• Time Series Data Server (TSDS)
– Common RESTful interface to NASA Heliophysics data
– http://tsds.net/
Other LASP projects: MMS, MAVEN, database statistics, log files
External users?
Capabilities – Data Reader Modules
• Operational:
– ASCII (file, web service, system call), binary, NetCDF,
Relational database, data “generators”
– Time Series of scalars, vectors, and spectra
– Arbitrarily long time series
• Prototyped:
– HDF, CDF, FITS, GRIB, OPeNDAP (e.g. other LaTiS
servers), NoSQL (MongoDB)
– Nested 2D (gridded) data structures
• Planned:
– Arbitrarily complex data structures
Capabilities – Data Writer Modules
• Operational:
– OPeNDAP, ASCII (e.g. csv), binary, JSON, Image
(PNG), IDL code, HTML dataset landing page
• Prototyped:
– NetCDF, HDF, IDL save file, interactive plot
• Planned:
– GeoTIFF, …
Capabilities – Data Filter Modules
• Operational:
– Subset, aggregate, stride, thin, replace, integrate, bin
average
• Prototyped:
– FFT, min, max, unique, resampling, unit conversion
• Planned:
– Coordinate system transformations
– Make it easier to plug in custom computations
– Track provenance
Capabilities – Service Interface
• Operational:
– OPeNDAP
– Java Servlet, simply deploy war file (Tomcat, Glassfish)
• Prototyped:
– Authentication
– Single executable (jetty)
– THREDDS Data Server (TDS) integration
• Planned:
– Open Geospatial Consortium (OGC) standards
• Web Map Server
• Web Coverage Server
Capabilities - Metadata
• Operational:
– THREDDS catalog, static XML, browse
• Prototyped:
– Semantic Web triple store (RDF, SPARQL)
– Text search (Solr)
– Modeling RDF triples (subject, predicate, object)
– Track provenance, record Dataset modifications
• Planned:
– Serve metadata in various schema (e.g. ISO 19115,
SPASE)
– Unique IDs, Digital Object Identifiers (DOI) for publishing
Other Capabilities
• Operational:
– Time API with formatting
– Time conversions with leap seconds
• Prototyped:
– Caching, improve performance
– Parallel processing, multi-core
• Planned:
– Big Data, Hadoop, Map Reduce
– Workflow integration
Source Code Management – Open Source
• Time Series Server (a.k.a. TSS1)
– Core of Time Series Data Server (TSDS, tsds.net)
– Built around Unidata Common Data Model
– SourceForge: https://sourceforge.net/projects/tsds/
• LaTiS (a.k.a. TSS2)
– New LaTiS data model, Scala implementation
– GitHub: https://github.com/dlindhol/LaTiS
– LASP internal development branch
– Plug-ins as separate projects (e.g. data collections, math, custom readers/writers, …), keep core small
My Background (i.e. bias)
• Astrophysicist by degree, software engineer by
profession
• Data user and provider
• Scientific data applications developer:
– astrophysics, atmospheric science, space science
• Holy Grail: common data model
• Favorite scientific data models:
– VisAD (http://www.ssec.wisc.edu/~billh/visad.html)
– Unidata Common Data Model
(http://www.unidata.ucar.edu/software/netcdf-java/CDM/)
– OPeNDAP (http://www.opendap.org/)
Motivation – Stove Pipes
Single Data Access Interface
