North Berwick
Craigleith in the background, where puffins and seals live - http://www.seabird.org/
The Fresnel Vocabulary for RDF provides a way to write down a set of instructions for transforming RDF statements into HTML for display. For some time, the ORDF library has included two implementations of Fresnel, one in JavaScript and one in Python. Recently added is a command line tool, simply called fresnel, for rendering HTML documents given a lens and an RDF graph.
Recently I wrote a first draft of a paper in LaTeX, a
document preparation system commonly used for academic papers. I
circulated it around a bit with some encouraging responses
and then MacTed in #swig IRC channel suggested
that I make it available in HTML form. As it turns out this ...
I've gone on a bit of a tangent from the investigations of the ONS geographic data, exploring how these newfangled nosql databases can be used to store RDF data. This experimenting is for trying out new back-ends for the Public Domain Works / Open Bibliographic Data project. I chose Mongo because the Mongokit python bindings seem to be the best available, providing a nice, convenient ORM to use.
Why store RDF? Because it makes extensible descriptions and document interlinking easy while maintaining a deterministic structure. On the other hand indexing and querying the data can be a right pain. All of the Python RDF ORMs (RDFAlchemy, SuRF, etc.) have their problems and none are at the level of maturity that one has come to expect from working with SQL ORM software (Django, SQLAlchemy, etc.). We don't want to use SQL because of schema rigidity. RDF suits better than simple JSON dictionaries (or hashes if you prefer) because identifiers have global scope, so it is easy to lay out your data so that it makes sense in the context of other datasets elsewhere.
Many of the nosql databases (CouchDB, Mongo, Riak) store documents which really are JSON dictionaries. Fortunately there is an unofficial JSON serialisation for RDF. With some slight modifications we can pretty much store the data directly. The core openbiblio mongo-rdf module provides some facilities for making sure that data is stored in Mongo properly serialised and that this data is available to Python with the familiar RDFLib datatypes.
We use the RDF subject as the document ID in Mongo. The document itself uses the predicates as keys in the familiar namespace:term form, with the caveat being that the namespaces must be defined in the mongo-rdf module. (This is an area for some refactoring in the future to make it more generally useful, perhaps storing the namespace mappings in the database itself in a system table.) The values are a list of objects which are stored as dictionaries and made available as rdflib.term.Literal and rdflib.term.URIRef instances. The result is that you can do things like this:
The Document here is actually a subclass of mongokit.Document that takes care of some housekeeping and makes available a method called to_rdf() which returns an rdflib.graph.Graph. This is useful for serialisation, inferencing operations such as with FuXi and anything else that one might want to do with a Graph. The to_rdf() method takes optional graph and identifier keyword arguments so that data can be added to an existing Graph or Store as a named graph.
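To make the document layout described above a bit more concrete, here is a minimal sketch in plain Python. It is not the mongo-rdf module's code: the NAMESPACES table, the curie() and to_documents() helpers, the "type" key and the example.org URIs are all assumptions made for illustration. Only the general shape (subject as document ID, namespace:term keys, lists of dictionaries with a value element) comes from the description.

```python
# Hypothetical prefix table; in the post these mappings live in the
# mongo-rdf module itself.
NAMESPACES = {
    "http://purl.org/dc/terms/": "dc",
    "http://xmlns.com/foaf/0.1/": "foaf",
}


def curie(uri):
    """Shorten a predicate URI to namespace:term form, or fail if unmapped."""
    for base, prefix in NAMESPACES.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    raise KeyError(f"no namespace defined for {uri}")


def to_documents(triples):
    """Group (subject, predicate, object) triples into one dict per subject.

    Objects are given as ("uri", value) or ("literal", value) pairs and are
    stored as small dictionaries in a list under the predicate key.
    """
    docs = {}
    for subject, predicate, (kind, value) in triples:
        doc = docs.setdefault(subject, {"_id": subject})
        doc.setdefault(curie(predicate), []).append({"type": kind, "value": value})
    return docs


# Made-up example data.
triples = [
    ("http://example.org/person/1", "http://xmlns.com/foaf/0.1/name",
     ("literal", "Smith, Adam")),
    ("http://example.org/work/1", "http://purl.org/dc/terms/title",
     ("literal", "The Wealth of Nations")),
    ("http://example.org/work/1", "http://purl.org/dc/terms/creator",
     ("uri", "http://example.org/person/1")),
]
docs = to_documents(triples)
```

Each value in docs is a dictionary that could be handed to a document store as-is; a predicate outside the prefix table raises KeyError, which mirrors the caveat that namespaces have to be defined up front.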
So if we are interested in what Adam Smith wrote we can do:
Notice the index. Mongo will index the value element of the embedded dictionaries (in this case representing rdflib.term.Literal) even though it is actually a list of these dictionaries. That call to ensure_index() only has to be done once, of course. The search is done with a regular expression only for illustrative purposes; in this case the simple string "Smith, Adam" would suffice.
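The matching behaviour described here, where a query on the value element applies to every dictionary in the list, can be imitated in plain Python. This is only a sketch of the semantics, not the query from the post: the foaf:name key, the sample documents and the find_by_value() helper are assumptions. Against a real Mongo collection the equivalent filter would look something like {"foaf:name.value": {"$regex": "^Smith, A"}}.

```python
import re

# Hypothetical documents, laid out with predicate keys holding lists of
# embedded dictionaries.
documents = [
    {"_id": "http://example.org/person/1",
     "foaf:name": [{"type": "literal", "value": "Smith, Adam"}]},
    {"_id": "http://example.org/person/2",
     "foaf:name": [{"type": "literal", "value": "Hume, David"}]},
]


def find_by_value(docs, key, pattern):
    """Return documents where any embedded dictionary under `key` has a
    `value` matching `pattern`, the way a Mongo query on "<key>.value"
    matches against any element of a list."""
    regex = re.compile(pattern)
    return [d for d in docs
            if any(regex.search(obj["value"]) for obj in d.get(key, []))]


matches = find_by_value(documents, "foaf:name", r"^Smith, A")
```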
This script would have output very similar to that which is found below. Notice how the frbr:Work documents are included. This is because to_rdf() will descend (only once) into any documents that are referenced in the local database and add them in.
There are so many NoSQL systems these days that it's hard to get a quick overview of the major trade-offs involved when evaluating relational and non-relational systems in non-single-server environments. I've developed this visual primer with quite a lot of help (see credits at the end), and it's still a work in progress, so let me know if you see anything misplaced or missing, and I'll fix it.
Without further ado, here's what you came here for (and further explanation after the visual).
Note: RDBMSs (MySQL, Postgres, etc) are only featured here for comparison purposes. Also, some of these systems can vary their features by configuration (I use the default configuration here, but will try to delve into others later).
As you can see, there are three primary concerns you must balance when choosing a data management system: consistency, availability, and partition tolerance.
- Consistency means that each client always has the same view of the data.
- Availability means that all clients can always read and write.
- Partition tolerance means that the system works well across physical network partitions.
According to the CAP Theorem, you can only pick two. So how does this all relate to NoSQL systems?
One of the primary goals of NoSQL systems is to bolster horizontal scalability. To scale horizontally, you need strong network partition tolerance which requires giving up either consistency or availability. NoSQL systems typically accomplish this by relaxing relational abilities and/or loosening transactional semantics.
In addition to CAP configurations, another significant way data management systems vary is by the data model they use: relational, key-value, column-oriented, or document-oriented (there are others, but these are the main ones).
- Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
- Key-value systems basically support get, put, and delete operations based on a primary key.
- Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
- Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
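As a rough illustration of two points from the list above, the sketch below shows the get, put and delete interface of a key-value system and what "joins must be handled within your application" means in practice. The KeyValueStore class, the sample records and join_books_with_authors() are all made up for this example and do not correspond to any particular product.

```python
class KeyValueStore:
    """Minimal in-memory stand-in for a key-value system."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is treated as a no-op.
        self._data.pop(key, None)


authors = KeyValueStore()
authors.put("a1", {"name": "Adam Smith"})

books = [
    {"title": "The Wealth of Nations", "author_id": "a1"},
    {"title": "Anonymous Pamphlet", "author_id": "a9"},
]


def join_books_with_authors(books, authors):
    """Application-side join: one lookup per book, since the store has no joins."""
    joined = []
    for book in books:
        author = authors.get(book["author_id"])
        joined.append({**book, "author": author["name"] if author else None})
    return joined


joined = join_books_with_authors(books, authors)
```

The cost of doing it this way is one round trip per referenced key, which is the trade-off these systems make in exchange for simpler distribution of data.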
Now for the particulars of each CAP configuration and the systems that use each configuration:
Consistent, Available (CA) Systems have trouble with partitions and typically deal with them through replication. Examples of CA systems include:
- Traditional RDBMSs like Postgres, MySQL, etc (relational)
- Vertica (column-oriented)
- Aster Data (relational)
- Greenplum (relational)
Consistent, Partition-Tolerant (CP) Systems have trouble with availability while keeping data consistent across partitioned nodes. Examples of CP systems include:
- BigTable (column-oriented/tabular)
- Hypertable (column-oriented/tabular)
- HBase (column-oriented/tabular)
- MongoDB (document-oriented)
- Terrastore (document-oriented)
- Redis (key-value)
- Scalaris (key-value)
- MemcacheDB (key-value)
- Berkeley DB (key-value)
Available, Partition-Tolerant (AP) Systems achieve "eventual consistency" through replication and verification. Examples of AP systems include:
- Dynamo (key-value)
- Voldemort (key-value)
- Tokyo Cabinet (key-value)
- KAI (key-value)
- Cassandra (column-oriented/tabular)
- CouchDB (document-oriented)
- SimpleDB (document-oriented)
- Riak (document-oriented)
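The categorisation above can also be written down as a small lookup table, which makes it easy to ask questions such as "which document-oriented systems are CP?". The data is taken from the lists above (with column-oriented/tabular shortened to column-oriented); the CLASSIFICATION name and the find_systems() helper are just one possible way to lay it out.

```python
# (system, CAP configuration, data model), transcribed from the lists above.
CLASSIFICATION = [
    ("Postgres", "CA", "relational"),
    ("MySQL", "CA", "relational"),
    ("Vertica", "CA", "column-oriented"),
    ("Aster Data", "CA", "relational"),
    ("Greenplum", "CA", "relational"),
    ("BigTable", "CP", "column-oriented"),
    ("Hypertable", "CP", "column-oriented"),
    ("HBase", "CP", "column-oriented"),
    ("MongoDB", "CP", "document-oriented"),
    ("Terrastore", "CP", "document-oriented"),
    ("Redis", "CP", "key-value"),
    ("Scalaris", "CP", "key-value"),
    ("MemcacheDB", "CP", "key-value"),
    ("Berkeley DB", "CP", "key-value"),
    ("Dynamo", "AP", "key-value"),
    ("Voldemort", "AP", "key-value"),
    ("Tokyo Cabinet", "AP", "key-value"),
    ("KAI", "AP", "key-value"),
    ("Cassandra", "AP", "column-oriented"),
    ("CouchDB", "AP", "document-oriented"),
    ("SimpleDB", "AP", "document-oriented"),
    ("Riak", "AP", "document-oriented"),
]


def find_systems(cap=None, model=None):
    """Return system names, sorted, filtered by CAP configuration and/or data model."""
    return sorted(
        name for name, system_cap, system_model in CLASSIFICATION
        if (cap is None or system_cap == cap)
        and (model is None or system_model == model)
    )


print(find_systems(cap="CP", model="document-oriented"))  # ['MongoDB', 'Terrastore']
print(find_systems(cap="AP", model="column-oriented"))    # ['Cassandra']
```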
Self promotion and Credits
- If you're a developer and looking for a job or if you're hiring developers and these data systems are important to you, consider coming to Hirelite: Speed Dating for the Hiring Process on Tuesday in NYC.
- This guide draws heavily from a recent Ruby meetup (by Matthew Jording and Michael Bryzek) and a recent MongoDB presentation (given by Dwight Merriman).
- Thanks to DBNess and ansonism for their help with validating system categorizations.
- Thanks to those who helped shape the post after it was written: Stan, Dwight, and others who commented here and on this Hacker News thread.
Update: Here's a print version of the Visual Guide To NoSQL Systems if you need one quickly (warning: it's not all that pretty and I may not keep it updated, but as of 3/17/2010, it's current).
Made sure PHP was installed and then headed over to http://sourceforge.net/projects/triplify/files/ to download the latest triplify (0.8 at the time of this writing).
After a first try at modifying triplify's config.inc.php, I found that PHP wasn't built with its PDO database driver. Had to add "pdo and odbc" to the USE flags for PHP and then rebuild. I've ended up with the following entry in /etc/portage/package.use:
dev-lang/php apache2 berkdb bzip2 cli crypt curl gdbm iconv imap ipv6 ncurses nls pcre readline reflection session spl sqlite ssl unicode xml zlib pdo odbc
The second try resulted in a segfault. Sprinkling some debugging print statements around (specifically in the function dbQuery()), I found that the segfault was caused by triplify doing a sub-select query. The dialect of SQL that is spoken by the mdbtools ODBC driver is not nearly that sophisticated -- though it is too bad it died with a segmentation fault.
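One general way around a driver that cannot cope with sub-selects is to issue only flat SELECT statements and do the filtering in the application. The sketch below shows the idea in Python, with an in-memory SQLite database standing in for the ODBC connection. It is not what triplify does, and the table names, column names and the changed_entity_names() helper are invented for the example.

```python
import sqlite3

# Stand-in database; table and column names are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE changes (id INTEGER PRIMARY KEY, entity_id INTEGER);
    INSERT INTO entity VALUES (1, 'Alpha');
    INSERT INTO entity VALUES (2, 'Beta');
    INSERT INTO entity VALUES (3, 'Gamma');
    INSERT INTO changes VALUES (10, 1);
    INSERT INTO changes VALUES (11, 3);
""")


def changed_entity_names(conn):
    """Equivalent of
        SELECT name FROM entity WHERE id IN (SELECT entity_id FROM changes)
    using only flat SELECT statements and filtering in Python."""
    changed_ids = {row[0] for row in conn.execute("SELECT entity_id FROM changes")}
    rows = conn.execute("SELECT id, name FROM entity")
    return sorted(name for entity_id, name in rows if entity_id in changed_ids)


names = changed_entity_names(conn)
```

This pulls whole tables across the connection, so it only makes sense for small datasets.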
The UK Office for National Statistics distributes some of its datasets as Microsoft Access databases. How to take that and turn it into linked data? This is the first of a three part series to try to answer this question. This first installment covers basic plumbing for accessing the database.
The first step is to install mdbtools. This is old software, hasn't seen much of an update in the past five years, but appears to work. The only hitch (on gentoo linux) is that by default you will get mdbtools-0.6_pre1-r1. This one is broken. You need to add,
=app-office/mdbtools-0.6_pre2-r2 ~x86
to /etc/portage/package.keywords to get a version that works. Make sure to build with odbc support as well.
Once mdbtools is installed you can look at the access database. For example to get a list of tables,
ww@mavrino$ mdb-tables ChangeHistoryDatabase_V1.mdb
ADMIN ADMIN_010109 CENSUS CH2008 ChangeHistory Changes CHDB_Update ELECTORAL Entity Table Equivalents Gazetteer InfoTable Lookup Name Changes SIDetails W01-LSOA WalesWards Welsh Equiv Year Table - SI Year Table
There is actually a fork in the path here. One strategy is to take the Access database and export it using PostgreSQL syntax, then import it into a PostgreSQL database. The problem is, there are some funny datatypes in some tables, specifically Memo/Hyperlink that the mdb-schema program doesn't know how to translate into PostgreSQL's dialect of SQL. This wouldn't be so hard to fix and will be our backup strategy, but how much easier would it be if we could talk to the Access database directly?
So we need to inform unixODBC about our mdbtools' driver. This is done by editing /etc/unixODBC/odbcinst.ini (or /etc/odbcinst.ini on some systems) and adding,
[MDBToolsODBC]
Description = MDB Tools ODBC
Driver = /usr/lib/libmdbodbc.so.0
Setup =
FileUsage =
CPTimeout =
CPReuse =
And then we need to add a data source for the change history database in /etc/unixODBC/odbc.ini,
[change_history]
Description = ONS Change History
Driver = MDBToolsODBC
Database = /some/where/ChangeHistoryDatabase_V1.mdb
Servername = localhost
Username =
Password =
port = 5432
Now you should be able to connect to the database with unixODBC's isql command simply by doing "isql change_history".
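If isql refuses to connect, a common cause is a mismatch between the two files: the Driver named in the data source has to match a section name in odbcinst.ini. The check below is a small sketch using Python's standard configparser. The file contents are embedded as strings so that it runs anywhere, and check_data_source() is a helper name invented for this example, not part of unixODBC.

```python
import configparser

# Contents as they would appear in odbcinst.ini and odbc.ini respectively.
ODBCINST_INI = """\
[MDBToolsODBC]
Description = MDB Tools ODBC
Driver = /usr/lib/libmdbodbc.so.0
Setup =
FileUsage =
CPTimeout =
CPReuse =
"""

ODBC_INI = """\
[change_history]
Description = ONS Change History
Driver = MDBToolsODBC
Database = /some/where/ChangeHistoryDatabase_V1.mdb
Servername = localhost
Username =
Password =
port = 5432
"""


def check_data_source(dsn, odbc_text, odbcinst_text):
    """Check that `dsn` exists and that its Driver is a registered driver.

    Returns the driver library path and database path on success and
    raises LookupError otherwise.
    """
    sources = configparser.ConfigParser()
    sources.read_string(odbc_text)
    drivers = configparser.ConfigParser()
    drivers.read_string(odbcinst_text)

    if not sources.has_section(dsn):
        raise LookupError(f"data source {dsn!r} is not defined")
    driver_name = sources.get(dsn, "Driver")
    if not drivers.has_section(driver_name):
        raise LookupError(f"driver {driver_name!r} is not registered")
    return {
        "driver_library": drivers.get(driver_name, "Driver"),
        "database": sources.get(dsn, "Database"),
    }


info = check_data_source("change_history", ODBC_INI, ODBCINST_INI)
```

To run the same check against a live system, read the text of the two files from disk instead of using the embedded strings.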
Now that the database is accessible to programs running on the system, the next step is to get triplify talking to it.
I noticed that 4store running on Linux hosts was rather slow
when connecting the client. Once connected it worked just
fine. On one host it was also failing to
connect about 50% of the time.