Europe Countries – DbPedia+SPARQL+Python

Introduction

DBpedia is a project that aims to extract structured information from Wikipedia. It publishes that information as RDF, which can be queried in several ways, one of which is the SPARQL query language.

In this post I’ll show how to query DBpedia using a Python SPARQL wrapper to obtain the coordinates of all the places in Europe that have a Wikipedia entry. The post explains it step by step; you can fork the code on my GitHub.


..
NOTE from the future:
The original code was Europe.py. The closest current code would be alldbpediapoints.py.
..


Results

The result will look something like this:

DBpedia Europe Points

This image is centered on continental Europe. The result also includes points in Africa and farther away.

Process

This is purely a “how to query DBpedia” problem. To query the data from Python I need the Python SPARQL wrapper:
sudo apt-get install python-sparqlwrapper

I split the problem into three parts:

  • Retrieve the DBpedia URIs for the countries
  • Retrieve the DBpedia points
  • Print the result to stdout

== Countries

The following query gets all the “articles” from DBpedia that have an rdf:type of both yago:EuropeanCountries and dbpedia-owl:Country. You can try it on DBpedia's web SPARQL endpoint.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

SELECT ?place WHERE {
    ?place rdf:type yago:EuropeanCountries .
    ?place rdf:type dbpedia-owl:Country
}

The next code shows how to run the query against the SPARQL endpoint from Python; the JSON result is converted and stored in the results variable.

from SPARQLWrapper import SPARQLWrapper, SPARQLExceptions, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

sparql.setQuery("""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

SELECT ?place WHERE {
    ?place rdf:type yago:EuropeanCountries .
    ?place rdf:type dbpedia-owl:Country
}
"""
)

results = sparql.query().convert()
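
The converted result follows the standard SPARQL JSON results layout, so the country URIs sit under results → bindings. Here is a minimal sketch of pulling them out, using a hand-written sample dictionary in place of a live response. The helper name extract_uris and the sample data are illustrative only; they are not part of the original script.

```python
def extract_uris(results, variable="place"):
    """Return the value bound to `variable` in every result row that has it."""
    bindings = results.get("results", {}).get("bindings", [])
    return [row[variable]["value"] for row in bindings if variable in row]


# Hand-written sample in the shape the endpoint returns (not real output).
sample = {
    "head": {"vars": ["place"]},
    "results": {
        "bindings": [
            {"place": {"type": "uri", "value": "http://dbpedia.org/resource/Malta"}},
            {"place": {"type": "uri", "value": "http://dbpedia.org/resource/France"}},
        ]
    },
}

uris = extract_uris(sample)
# The last path segment of the URI doubles as a readable country name.
names = [uri.rpartition("/")[-1] for uri in uris]
print(names)
```

Returning an empty list when the expected keys are missing keeps the later loop over countries from failing on an unexpected response.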

== Points

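The query I propose for retrieving the points is shown below. Note that no PREFIX declarations are included, to make the code easier to read; this relies on the endpoint already knowing the rdf:, dbpedia-owl:, foaf: and geo: prefixes. You can try the query for Malta.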

A few things to note here. I initially used the geo:geometry predicate, but some places do not have geo:geometry defined, so I decided to use the geo:lat and geo:long predicates instead.

You may also notice the MIN aggregate function applied to geolat and geolong. I added it because in some cases multiple points are defined for the same city, and I only want one. That is why I group the results by title and take the minimum latitude and longitude values (I used MIN, but MAX would do as well; nothing really scientific here).

           
SELECT ?title (MIN(?geolat) AS ?geolat) (MIN(?geolong) AS ?geolong)
    WHERE {
        ?place rdf:type <http://dbpedia.org/ontology/Place> .
        ?place dbpedia-owl:country <http://dbpedia.org/resource/Malta> .
        ?place foaf:name ?title .
        ?place geo:lat ?geolat .
        ?place geo:long ?geolong .
    }
    GROUP BY ?title
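
To make the grouping concrete, here is a small pure-Python sketch of what GROUP BY ?title with MIN on each coordinate does. The function name and the sample rows are made up for illustration. Note that the two minima are taken independently, so the resulting point is not necessarily one of the original points.

```python
def min_point_per_title(rows):
    """Collapse (title, lat, lon) rows to a single (lat, lon) per title.

    Mirrors GROUP BY ?title with MIN(?geolat) and MIN(?geolong): each
    coordinate is minimised on its own.
    """
    points = {}
    for title, lat, lon in rows:
        if title in points:
            old_lat, old_lon = points[title]
            points[title] = (min(old_lat, lat), min(old_lon, lon))
        else:
            points[title] = (lat, lon)
    return points


# Made-up rows: "Valletta" appears twice with slightly different coordinates.
rows = [
    ("Valletta", 35.90, 14.51),
    ("Valletta", 35.89, 14.52),
    ("Mdina", 35.88, 14.40),
]
grouped = min_point_per_title(rows)
print(grouped)
```

For points that are a few hundred metres apart the mixed coordinates make little difference on a continental map, which is presumably why the shortcut is good enough here.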

Given that the results variable (from the previous section) contains all the URIs for the countries, I can loop over it, changing the queried country each time:

import sys

for country in results["results"]["bindings"]:
    country_uri = country["place"]["value"]
    country_name = country_uri.rpartition('/')[-1]

    try:
        sparql.setQuery("""
          SELECT ?title (MIN(?geolat) AS ?geolat) (MIN(?geolong) AS ?geolong)
          WHERE {
            ?place rdf:type <http://dbpedia.org/ontology/Place> .
            ?place dbpedia-owl:country <""" + country_uri + """> .
            ?place foaf:name ?title .
            ?place geo:lat ?geolat .
            ?place geo:long ?geolong .
          }
          GROUP BY ?title
        """)

        country_results = sparql.query().convert()
    except Exception as inst:
        # Report what failed before stopping, instead of exiting silently.
        sys.exit("Query failed for " + country_name + ": " + str(inst))

That would do it! But there’s a little problem. Try to query it for France and you get a beautiful Bandwidth Limit Exceeded URI = '/!sparql/'. This endpoint is public and limited. What to do?

In this case I add the OFFSET and LIMIT keywords. The first defines where in the result listing to start; the second limits the number of results returned. The script queries in chunks until a query returns zero rows:

for country in results["results"]["bindings"]:
    country_uri = country["place"]["value"]
    country_name = country_uri.rpartition('/')[-1]
    total_results = 1
    offset = 0

    while total_results > 0:
        try:
            sparql.setQuery("""
              SELECT ?title (MIN(?geolat) AS ?geolat) (MIN(?geolong) AS ?geolong)
              WHERE {
                ?place rdf:type <http://dbpedia.org/ontology/Place> .
                ?place dbpedia-owl:country <""" + country_uri + """> .
                ?place foaf:name ?title .
                ?place geo:lat ?geolat .
                ?place geo:long ?geolong .
              }
              GROUP BY ?title
              OFFSET """ + str(offset) + """
              LIMIT 10000
            """)

            country_results = sparql.query().convert()
            total_results = len(country_results["results"]["bindings"])
            offset = offset + 10000

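        except Exception as inst:
            # Stop paging this country on failure; without the break the loop
            # would retry the same offset forever if the error persists.
            print(type(inst))
            print("EXCEPTION")
            break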

With this approach I limit the amount of data transferred per request, avoiding the Bandwidth Limit Exceeded error.
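
The paging logic can be separated from the SPARQL call, which makes it easy to try out without touching the endpoint. Below is a minimal sketch under that assumption: fetch_all, fake_fetch and the max_pages safety cap are hypothetical names introduced here, and in the real script fetch_page would wrap sparql.setQuery(...) and sparql.query().convert(). Also worth keeping in mind: the SPARQL specification notes that OFFSET and LIMIT only select predictable slices when the results are ordered, so adding an ORDER BY ?title to the query may be worth considering.

```python
def fetch_all(fetch_page, page_size=10000, max_pages=1000):
    """Collect rows page by page until an empty page comes back.

    `fetch_page(offset, limit)` must return a list of rows.
    `max_pages` is a safety cap so a misbehaving source cannot loop forever.
    """
    rows = []
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(offset, page_size)
        if not page:
            break
        rows.extend(page)
        offset += page_size
    return rows


# Stand-in for the endpoint: 25 fake rows served in slices.
FAKE_ROWS = ["row%d" % i for i in range(25)]


def fake_fetch(offset, limit):
    return FAKE_ROWS[offset:offset + limit]


all_rows = fetch_all(fake_fetch, page_size=10)
print(len(all_rows))
```

With a page size of 10 the fake source is queried four times: three pages with data and a final empty one that ends the loop.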

== STDOUT

Printing to standard output is the easy solution :-), not a lot of effort here.
Inside the while loop, I add a print statement that emits CSV-style lines with a WKT (Well-Known Text) point definition:

for result in country_results["results"]["bindings"]:
    print(result["title"]["value"].encode("utf-8") + ";POINT(" + result["geolong"]["value"] + " " + result["geolat"]["value"] + ");" + country_name)
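
The line above is written for Python 2. As a hedged sketch of the same formatting under Python 3, where strings are already Unicode and the explicit encode is not needed, the row can be built by a small helper. The name to_wkt_row and the sample binding are illustrative, not part of the original script.

```python
def to_wkt_row(binding, country_name, sep=";"):
    """Format one result binding as 'title;POINT(lon lat);country'."""
    title = binding["title"]["value"]
    lon = binding["geolong"]["value"]
    lat = binding["geolat"]["value"]
    # WKT puts x (longitude) before y (latitude).
    return sep.join([title, "POINT(%s %s)" % (lon, lat), country_name])


# Hand-written binding in the shape of one result row (not real output).
sample_binding = {
    "title": {"type": "literal", "value": "Valletta"},
    "geolat": {"type": "typed-literal", "value": "35.9"},
    "geolong": {"type": "typed-literal", "value": "14.5"},
}

row = to_wkt_row(sample_binding, "Malta")
print(row)
```

A title that itself contains the separator would break this simple format; the csv module from the standard library handles quoting if that turns out to matter.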

QGIS

Loading a CSV in QGIS is easily achieved using the “Add delimited text layer” plugin that ships with the software. Yay!

References

DBpedia
DBpedia – A Crystallization Point for the Web of Data
SPARQL Python Wrapper
RDF (W3C)
SPARQL (W3C)
Source at GitHub

Filed under code, gis, Maps

7 responses to “Europe Countries – DbPedia+SPARQL+Python”

  1. Very interesting dataset. Thanks for this post! Do you know why there seem to be no points in Czech Republic?

    • kxtells

      Hello!

      Indeed, there are points in the Czech Republic, around 5000 of them actually!

      It seems that the DBpedia resource for the Czech Republic does not have the rdf:type yago:EuropeanCountries, which is why it is not listed among my input countries. I should try another predicate, such as dcterms:subject category:Member_states_of_the_European_Union.

      Salut!

  2. Pingback: DBPedia World | Castells

  3. Pingback: DBpedia « Semantic Web & Databases in Information Technology

  4. fleur

    I want to extract the superclass of an entity using DBpedia, SPARQL and Python 2.7, e.g. France: country, Nokia: company… What query would let me get it?

  5. Pingback: Europe DbPedia 2017 | Castells
