Structured data on the web

Build community indexes from structured data to address FAIR data goals

Build search indexes for your community or domain of interest, focused and functional enough to address your specific needs. Gleaner is open source, written in Go, and easy to deploy. It is one part of the GleanerIO search architecture, described below.

Gleaner

Gleaner is a tool for extracting JSON-LD from web pages. You provide Gleaner with a list of sites to index, and it retrieves pages based on the sitemap.xml of each domain. Gleaner can then check documents for well-formed and valid structure and process the JSON-LD data graphs into a form that can drive a search interface. Gleaner is one part of a bigger picture, the GleanerIO approach.
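To make the harvesting idea concrete, here is a minimal sketch of the two steps described above: reading page URLs from a sitemap.xml document and pulling JSON-LD out of a page. It is not Gleaner's implementation (Gleaner is written in Go). It is an illustration using only the Python standard library, and the names `sitemap_urls`, `JSONLDExtractor`, and `extract_jsonld` are hypothetical. Fetching pages over the network and validating the graphs are deliberately left out.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


class JSONLDExtractor(HTMLParser):
    """Collect the bodies of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(html: str) -> list:
    """Parse every JSON-LD block in a page, skipping blocks that are not valid JSON."""
    parser = JSONLDExtractor()
    parser.feed(html)
    parser.close()
    docs = []
    for block in parser.blocks:
        try:
            docs.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # not well formed; a real harvester would log this
    return docs
```

In this sketch, a harvester would call `sitemap_urls` once per domain, fetch each returned URL, and pass the page body to `extract_jsonld`. Malformed blocks are skipped rather than raised, so one bad page does not stop a whole indexing run.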

Open Foundation

Communities of practice can leverage open schema (schema.org) along with web architecture approaches to build domain search portals. These can be enhanced and extended with community vocabularies to address specific domain needs. This foundation is also leveraged by Google Dataset Search and is complementary to that service. Using web architecture as the foundation allows a community to provide a more detailed community experience while still leveraging the global reach of commercial search indexes.
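As an illustration of extending schema.org with a community vocabulary, the sketch below builds a JSON-LD description of a dataset and wraps it in the script tag that a harvester would look for. The namespace `https://example.org/vocab/`, the `ex:` prefix, and the function names are assumptions made up for this example and do not come from any specific community.

```python
import json
from typing import Optional

# Hypothetical community vocabulary namespace, used for illustration only.
COMMUNITY_NS = "https://example.org/vocab/"

SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"


def dataset_jsonld(name: str, url: str, description: str,
                   community_terms: Optional[dict] = None) -> dict:
    """Build a schema.org Dataset record, optionally extended with community terms."""
    doc = {
        # Unprefixed keys resolve against schema.org; "ex:" keys resolve
        # against the (hypothetical) community vocabulary.
        "@context": {"@vocab": "https://schema.org/", "ex": COMMUNITY_NS},
        "@type": "Dataset",
        "name": name,
        "url": url,
        "description": description,
    }
    for term, value in (community_terms or {}).items():
        doc["ex:" + term] = value
    return doc


def as_script_tag(doc: dict) -> str:
    """Serialize the record as it would be embedded in a web page."""
    return SCRIPT_OPEN + json.dumps(doc, indent=2) + SCRIPT_CLOSE
```

Because the core terms stay in schema.org, a general-purpose index can still read the name, URL, and description, while a community portal can also use the prefixed terms it understands.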

Big Picture

Gleaner is part of the larger GleanerIO approach. GleanerIO includes approaches for leveraging spatial, semantic, full-text, or other indexes. Additionally, there is guidance on running Gleaner as part of a routinely updated index of resources, and a reference interface for searching the resulting graph. GleanerIO provides a full-stack approach that goes from indexing to a basic user interface for searching a generated Knowledge Graph (an example index). The whole GleanerIO stack can be run on a laptop (it uses Docker Compose files) or deployed to the cloud. Cloud environments used so far include AWS, Google Cloud, and OpenStack.

GleanerIO is also designed to play well with others. As long as packages work well in a web architecture framework, they likely can be integrated into the GleanerIO approach. The GleanerIO approach is modular and even Gleaner itself could be swapped out for other implementations.

Indeed, GleanerIO advocates _principles over project_. GleanerIO is really just a set of principles, for which reference implementations (projects) have been developed or external projects have been adopted. These have evolved and been implemented to serve communities such as Ocean InfoHub, Internet of Water, GeoCODES, and more. The results and approaches of these communities are openly maintained on the GleanerIO GitHub Organization pages. They provide guidance on how other communities could leverage this approach to address their own functional needs. See: The Big Picture

History


Principles

Where to Get Engaged

Get engaged with RDA, EarthCube and the ESIP Science on Schema group!