Sept 3 2013 Update: Freebase is about to switch from its Turtle format to RDF Triples. This post will only work with Freebase RDF Turtle format. See the discussion at https://groups.google.com/forum/#!topic/freebase-discuss/AG5sl7K5KBE.

Freebase is a knowledge base (database) containing over a billion facts. Freebase has its own query language called MQL (pronounced "mickle"), which is a template-based matching language. Though MQL supports recursion, its expressive power is limited (see this discussion). A better alternative is SPARQL, the query language standardised by the W3C. Freebase does not provide a SPARQL endpoint (query interface), so you have to create your own endpoint to query Freebase in SPARQL. To do so, Freebase must be converted to a standardised RDF format and then loaded into a database that supports SPARQL. Thanks to the Freebase developers, most of the work of converting Freebase to RDF has already been done for us.

OpenLink Virtuoso is a great, freely available database that supports SPARQL/RDF. Although Freebase provides its complete dump in RDF Turtle format, the format it uses does not conform to W3C standards, which makes it harder to load into Virtuoso.

I have extracted a few domains from the Freebase dump and loaded them into Virtuoso. You can follow the same procedure for the complete dump.

  1. Download the dump from https://developers.google.com/freebase/data. Place it in a directory, say /completePath/freebase_dump
  2. Install Virtuoso (Ubuntu, Fedora, Debian, CentOS, Source). In virtuoso.ini, change the variables NumberOfBuffers and MaxDirtyBuffers to suit your RAM. My guess is that you will need at least 60 GB of RAM to load the complete Freebase dump. For my data, I worked with 40 GB of RAM. A machine with 8 GB of RAM froze when I tried loading 30 million facts (Freebase has over one billion facts). Also add the directory /completePath/freebase_dump to DirsAllowed.
  3. Start the Virtuoso server with the command virtuoso-t -f. Run it from the directory that contains virtuoso.ini.
  4. Download fix_freebase.py, which converts the Freebase Turtle format to a standardised version.
  5. Run the command zcat <dump.gz> | python fix_freebase.py | gzip > dump_fixed.gz
  6. Run the command isql-vt localhost:1111 dba dba to start the Virtuoso SPARQL interpreter.
    1. Run the commands below in the SPARQL interpreter.
    2. DB.DBA.TTLP_MT (gz_file_open ('/completePath/freebase_dump/dump_fixed.gz'), '', 'http://my_desired_name.com', 128);. If you are using uncompressed dump files, use file_to_string_output instead of gz_file_open. If loading fails, change 128 to a value suited to your error; see the flags parameter of DB.DBA.TTLP_MT at http://docs.openlinksw.com/virtuoso/fn_ttlp_mt.html. Loading takes time. You can open another SPARQL interpreter and count the facts loaded so far with the command SPARQL SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
    3. Once loading has finished, run the commands checkpoint; then commit WORK; then checkpoint; and finally exit;
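
For the buffer settings in step 2, here is a minimal Python sketch for picking starting values. It assumes a rule of thumb of roughly 85,000 buffers per GB of free RAM for NumberOfBuffers and three quarters of that for MaxDirtyBuffers, which is how I read the commented presets in the sample virtuoso.ini. The function names and the constant are mine, not part of Virtuoso, so treat the output as a starting point to tune rather than an official recommendation.

```python
# Assumption: about 85,000 buffers per GB of free RAM, based on the
# commented presets in the sample virtuoso.ini (not an official formula).
BUFFERS_PER_GB = 85_000


def suggest_buffers(free_ram_gb):
    """Return (NumberOfBuffers, MaxDirtyBuffers) for the given free RAM in GB."""
    if free_ram_gb <= 0:
        raise ValueError("free_ram_gb must be positive")
    number_of_buffers = int(free_ram_gb * BUFFERS_PER_GB)
    # Assumption: allow about three quarters of the buffers to be dirty.
    max_dirty_buffers = number_of_buffers * 3 // 4
    return number_of_buffers, max_dirty_buffers


def ini_lines(free_ram_gb):
    """Format the two settings as lines to paste into virtuoso.ini."""
    number_of_buffers, max_dirty_buffers = suggest_buffers(free_ram_gb)
    return (
        f"NumberOfBuffers = {number_of_buffers}\n"
        f"MaxDirtyBuffers = {max_dirty_buffers}"
    )
```

For example, print(ini_lines(40)) gives the two lines for a machine with 40 GB of free RAM. Both settings live in the same virtuoso.ini file as DirsAllowed.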

Congratulations, you have successfully loaded Freebase into Virtuoso. You can access the SPARQL endpoint at http://localhost:8890/sparql. I prefer using the SPARQL interpreter from the command line.
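
If you would rather query the HTTP endpoint from a script, here is a minimal Python sketch using only the standard library. It assumes the endpoint accepts the query as a query URL parameter and returns SPARQL JSON results when asked through the Accept header, as the SPARQL protocol describes. The function names are mine and the count query is a standard-SPARQL rewording of the per-graph count used during loading.

```python
import json
import urllib.parse
import urllib.request

# Standard-SPARQL form of the per-graph triple count.
COUNT_PER_GRAPH = (
    "SELECT ?g (COUNT(*) AS ?n) "
    "WHERE { GRAPH ?g { ?s ?p ?o } } "
    "GROUP BY ?g ORDER BY DESC(?n)"
)


def build_request(endpoint, query):
    """Build a GET request that carries the query in the URL."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def rows_from_results(doc):
    """Flatten a SPARQL JSON results document into {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


def run_select(endpoint, query, timeout=60):
    """Send the query to the endpoint and return the rows (needs a running server)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rows_from_results(json.load(response))
```

With the server running, run_select("http://localhost:8890/sparql", COUNT_PER_GRAPH) should return one row per graph, with the graph name under "g" and the triple count under "n" as a string.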

Let me know how it goes and feel free to add additional links describing your experiences.

Credits: Thanks to Spandana Gella for her help with the Virtuoso installation.

Comments

Thanks a lot Jan. Hopefully this works for others as well. It would be good if you can share your compiled virtuoso.db, and people can just download that and set up Freebase quickly without having to go through the process.

This must be a problem with your network, proxy, or file system. It is not a Virtuoso-related problem.

If you load the Freebase dump using Virtuoso, you can use the Jena library to run SPARQL queries on it. The code will look something like this.

import com.google.common.base.Preconditions; // assumed: Guava provides checkArgument
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.shared.PrefixMapping;
import com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP;

import virtuoso.jena.driver.*;

// httpUrl and timeOut are fields of the enclosing class.
public ResultSet runQueryHttpResultSet(String query) {
    Preconditions.checkArgument(httpUrl != null, "http endpoint not specified");
    if (query == null) {
        return null;
    }
    // Declare the xsd prefix so that queries can use typed literals
    query = "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> " + query;
    // Create the SPARQL query and run it over HTTP
    ResultSet results = null;
    try {
        QueryEngineHTTP vqe = new QueryEngineHTTP(httpUrl, query);
        vqe.addParam("timeout", this.timeOut.toString());
        results = vqe.execSelect();
    } catch (Exception e) {
        System.err.println("http exception: " + e.getMessage());
    }
    return results;
}

httpUrl is something like http://localhost:8890/sparql

You can extract the results from a result set as follows.

while (results.hasNext()) {
    QuerySolution result = results.nextSolution();
    System.out.println(results.getResultVars());
    System.out.println(result);
}
