DBpedia_GraphMeasures

Source Dataset

Recently, we have been working with the DBpedia / Wikipedia page links dataset, considering the English and German language versions. The DBpedia 2014 page links datasets for English and German contain about 19 million and 7 million entities, respectively, whereas the original DBpedia contains only about 4 million and 1 million distinct entities for the English and German versions.

This significant difference arises mainly because the DBpedia page links dataset includes redirect pages as well as page links to resources that are not considered entities (e.g. thumbnails and other images). We therefore cleaned up the page links dataset before computing statistical parameters (e.g. PageRank or HITS). For the cleanup we removed all unnecessary and redundant RDF triples from the page links dataset, i.e. all triples involving redirect pages (redirect pages are just URIs that automatically forward a user to another Wikipedia page and do not represent entities) as well as RDF triples involving resources that do not have their own rdfs:label (according to the DBpedia documentation, every entity has an rdfs:label).
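The cleanup rule described above can be sketched in Python as follows. This is only a minimal illustration, not the code used to build the datasets: the function name `clean_page_links`, the in-memory sets `redirects` and `labelled`, and the `dbr:` example URIs are all assumptions made for the example, as is the choice to check both ends of each link.

```python
def clean_page_links(links, redirects, labelled):
    """Keep only page links whose source and target are real entities.

    links     -- iterable of (source, target) URI pairs
    redirects -- set of URIs that are redirect pages (hypothetical input)
    labelled  -- set of URIs that have an rdfs:label (hypothetical input)
    """
    def is_entity(uri):
        # An entity has an rdfs:label and is not a redirect page.
        return uri in labelled and uri not in redirects

    return [(s, t) for s, t in links if is_entity(s) and is_entity(t)]


# Toy data, invented for illustration only.
links = [
    ("dbr:Berlin", "dbr:Germany"),
    ("dbr:Berlin", "dbr:File:Berlin_thumb.jpg"),  # image without rdfs:label
    ("dbr:Berlin_(city)", "dbr:Berlin"),          # redirect page
]
redirects = {"dbr:Berlin_(city)"}
labelled = {"dbr:Berlin", "dbr:Germany", "dbr:Berlin_(city)"}

cleaned = clean_page_links(links, redirects, labelled)
print(cleaned)
```

For the full dumps the two sets would not necessarily fit in memory this way; the sketch only shows the filtering logic.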

One benefit of the cleaned page links dataset is the faster computation of statistical graph measures. The cleanup should not influence the overall statistics, because redirect pages usually have no incoming links and the other removed resources (e.g. images) have no outgoing links. Based on this dataset we have computed PageRank, Hubs and Authorities (HITS), PageInLink Counts and PageOutLink Counts.
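As a rough illustration of the two count measures, the sketch below derives in-link and out-link counts from a list of page links. The function name `link_counts` and the toy edge list are assumptions for the example, and it assumes every triple in the cleaned dataset counts once.

```python
from collections import Counter


def link_counts(links):
    """Return (in_counts, out_counts) for an iterable of (source, target) pairs."""
    links = list(links)
    out_counts = Counter(source for source, _ in links)
    in_counts = Counter(target for _, target in links)
    return in_counts, out_counts


# Toy edge list, invented for illustration only.
links = [("A", "B"), ("C", "B"), ("A", "C")]
in_counts, out_counts = link_counts(links)
print(dict(in_counts))   # in-link count per target
print(dict(out_counts))  # out-link count per source
```

In this sketch a resource without any incoming link simply does not appear in `in_counts`, and a resource without outgoing links does not appear in `out_counts`.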

With the cleaned page links dataset, graph measures are much faster to compute, no scores need to be computed for URIs that are not actual entities and, importantly, there is much less need to worry about OutOfMemory errors :-). Details of the datasets are given below.

Citation
If you are using any of these datasets please cite as:

@misc{dbpedia-graphmeasures,
  author = {Dinesh Reddy and Magnus Knuth and Harald Sack},
  title = {DBpedia GraphMeasures},
  howpublished = {\url{http://semanticmultimedia.org/node/6}},
  note = {Dataset},
  publisher = {Hasso Plattner Institute},
  month = jul,
  year = {2014}
}

English DBpedia 2015 GraphMeasure Datasets and their details

File name                               Number of triples   File size   Format
page_links_cleaned_en.ttl.bz2           137,394,064         1.2 GB      Turtle
pagerank_scores_en.ttl.bz2              7,330,770           103 MB      Turtle
hits_scores_en.ttl.bz2                  7,330,770           94 MB       Turtle
page_inlink_count_cleaned_en.ttl.bz2    6,810,408           67 MB       Turtle
page_outlink_count_cleaned_en.ttl.bz2   4,799,262           48 MB       Turtle

Table 1: English DBpedia 2015 GraphMeasure Datasets and their details

German DBpedia 2015 GraphMeasure Datasets and their details

File name                               Number of triples   File size   Format
page_links_cleaned_de.ttl.bz2           45,218,682          385 MB      Turtle
pagerank_scores_de.ttl.bz2              2,223,591           32 MB       Turtle
hits_scores_de.ttl.bz2                  2,223,591           30 MB       Turtle
page_inlink_count_cleaned_de.ttl.bz2    2,051,663           19 MB       Turtle
page_outlink_count_cleaned_de.ttl.bz2   1,768,829           17 MB       Turtle

Table 2: German DBpedia 2015 GraphMeasure Datasets and their details

English DBpedia 2014 GraphMeasure Datasets and their details

File name                               Number of triples   File size   Format
page_links_cleaned_en.ttl.bz2           131,182,964         1.1 GB      Turtle
pagerank_scores_en.ttl.bz2              5,544,757           81 MB       Turtle
hits_scores_en.ttl.bz2                  5,544,757           76 MB       Turtle
page_inlink_count_cleaned_en.ttl.bz2    5,130,711           52 MB       Turtle
page_outlink_count_cleaned_en.ttl.bz2   4,582,685           45 MB       Turtle

Table 3: English DBpedia 2014 GraphMeasure Datasets and their details

The English DBpedia 2014 GraphMeasures are now accessible via the DBpedia SPARQL endpoint :-)

The English DBpedia 2014 GraphMeasure datasets have been imported into DBpedia, each under its own graph IRI. For instance, the DBpedia 2014 PageRank Scores dataset has been imported under the graph IRI http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pagerank_scores_en_2014.ttl.bz2

When writing SPARQL queries, please use the following named graph IRIs for the respective datasets.

PageRank Scores - http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pagerank_scores_en_2014.ttl.bz2
HITS Scores - http://dbpedia.semanticmultimedia.org/dbpedia2014/en/hits_scores_en_2014.ttl.bz2
PageInLink Counts - http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageinlinkCounts_en_2014.ttl.bz2
PageOutLink Counts - http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2

Example SPARQL queries over DBpedia Endpoint

Give me the top 100 DBpedia resources with the highest PageRank scores
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?pagerank
FROM <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pagerank_scores_en_2014.ttl.bz2>
WHERE
{
  ?resource dbpedia-owl:wikiPageRank ?pagerank .
}
ORDER BY DESC(?pagerank)
LIMIT 100

Give me the top 100 DBpedia resources with the highest HITS scores
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?hitsScore
FROM <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/hits_scores_en_2014.ttl.bz2>
WHERE
{
  ?resource dbpedia-owl:wikiHITS ?hitsScore .
}
ORDER BY DESC(?hitsScore)
LIMIT 100

Give me the top 100 DBpedia resources with the highest PageInLink counts
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?pageInLinkCount
FROM <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageinlinkCounts_en_2014.ttl.bz2>
WHERE
{
  ?resource dbpedia-owl:wikiPageInLinkCountCleaned ?pageInLinkCount .
}
ORDER BY DESC(?pageInLinkCount)
LIMIT 100

Give me the top 100 DBpedia resources with the highest PageOutLink counts
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?pageOutLinkCount
FROM <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2>
WHERE
{
  ?resource dbpedia-owl:wikiPageOutLinkCountCleaned ?pageOutLinkCount .
}
ORDER BY DESC(?pageOutLinkCount)
LIMIT 100
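Since the four queries differ only in the named graph and the predicate, they can be generated from a small lookup table. The following Python sketch shows one way to do this; the names `GRAPH_BASE`, `MEASURES` and `top_n_query` are assumptions for the example, while the graph IRIs and predicates are copied from the lists above. The sketch only builds the query text and does not contact the endpoint.

```python
GRAPH_BASE = "http://dbpedia.semanticmultimedia.org/dbpedia2014/en/"

# Named graph file and predicate for each measure (hypothetical short keys).
MEASURES = {
    "pagerank": ("pagerank_scores_en_2014.ttl.bz2", "wikiPageRank"),
    "hits": ("hits_scores_en_2014.ttl.bz2", "wikiHITS"),
    "inlinks": ("pageinlinkCounts_en_2014.ttl.bz2", "wikiPageInLinkCountCleaned"),
    "outlinks": ("pageoutlinkCounts_en_2014.ttl.bz2", "wikiPageOutLinkCountCleaned"),
}


def top_n_query(measure, limit=100):
    """Build a SPARQL query for the `limit` highest-scoring resources."""
    graph_file, predicate = MEASURES[measure]
    lines = [
        "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>",
        "SELECT DISTINCT ?resource ?value",
        "FROM <" + GRAPH_BASE + graph_file + ">",
        "WHERE { ?resource dbpedia-owl:" + predicate + " ?value . }",
        "ORDER BY DESC(?value)",
        "LIMIT " + str(int(limit)),
    ]
    return "\n".join(lines)


print(top_n_query("hits", limit=10))
```

The resulting string can be pasted into the endpoint's query form or sent with any HTTP client.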

German DBpedia 2014 GraphMeasure Datasets and their details

File name                               Number of triples   File size   Format
page_links_cleaned_de.ttl.bz2           50,286,771          419 MB      Turtle
pagerank_scores_de.ttl.bz2              1,934,904           28 MB       Turtle
hits_scores_de.ttl.bz2                  1,934,904           27 MB       Turtle
page_inlink_count_cleaned_de.ttl.bz2    1,781,642           17 MB       Turtle
page_outlink_count_cleaned_de.ttl.bz2   1,679,790           16 MB       Turtle

Table 4: German DBpedia 2014 GraphMeasure Datasets and their details

English DBpedia 3.9 GraphMeasure Datasets and their details

File name                               Number of triples   File size   Format
page_links_cleaned_en.ttl.bz2           119,884,298         1.05 GB     Turtle
pagerank_scores_en.ttl.bz2              4,984,633           75.3 MB     Turtle
hits_scores_en.ttl.bz2                  4,984,633           71 MB       Turtle
page_inlink_count_cleaned_en.ttl.bz2    4,622,470           47.9 MB     Turtle
page_outlink_count_cleaned_en.ttl.bz2   4,256,661           43.4 MB     Turtle

Table 5: English DBpedia 3.9 GraphMeasure Datasets and their details

German DBpedia 3.9 GraphMeasure Datasets and their details

File name                               Number of triples   File size   Format
page_links_cleaned_de.ttl.bz2           42,829,779          356 MB      Turtle
pagerank_scores_de.ttl.bz2              1,689,764           24 MB       Turtle
hits_scores_de.ttl.bz2                  1,689,764           23 MB       Turtle
page_inlink_count_cleaned_de.ttl.bz2    1,554,037           14 MB       Turtle
page_outlink_count_cleaned_de.ttl.bz2   1,537,293           14 MB       Turtle

Table 6: German DBpedia 3.9 GraphMeasure Datasets and their details

Implementation of PageRank and HITS

JUNG (the Java Universal Network/Graph Framework) is used to compute the PageRank and HITS scores. JUNG is a software library that provides a common and extensible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.

You can get the source code at JUNG PageRank and HITS Implementation

Parameters used while computing PageRank

Damping factor: 0.85 // Probability that, at any step, the random surfer continues by following a link (1 - Alpha)
No of iterations: 100 // Maximum number of iterations before terminating
Tolerance: 0 // Minimum change from one step to the next
Alpha: 0.15 // Random jump probability: the probability of jumping to an arbitrary vertex instead of following a link
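To show how these parameters interact, here is a minimal power-iteration sketch of PageRank in plain Python. It is not the JUNG implementation used for the datasets: the function name `pagerank`, the uniform handling of dangling nodes and the toy edge list are assumptions for illustration. Because the tolerance is 0 in the settings above, the sketch simply runs a fixed number of iterations.

```python
def pagerank(edges, alpha=0.15, iterations=100):
    """Power-iteration PageRank on a directed edge list.

    alpha is the random jump probability, so 1 - alpha is the damping factor.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: rank held by nodes without out-links is spread uniformly.
        dangling = sum(scores[v] for v in nodes if not out_links[v])
        new_scores = {v: alpha / n + (1 - alpha) * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                new_scores[target] += (1 - alpha) * scores[v] / len(out_links[v])
        scores = new_scores
    return scores


# Toy graph, invented for illustration only.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

The scores form a probability distribution over the vertices, so they sum to 1 up to floating-point error.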

Parameters used while computing HITS

No of iterations: 100 // Maximum number of iterations before terminating
Tolerance: 0 // Minimum change from one step to the next
Alpha: 0.15 // The probability of a hub giving some authority to all vertices, and of an authority increasing the score of all hubs (not just those connected via links)
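For orientation, the following Python sketch shows the basic HITS iteration with the same iteration count. It is a textbook variant and not the JUNG implementation: the function name `hits` and the toy graph are assumptions, and the Alpha term is deliberately left out because its exact handling inside JUNG is not described here.

```python
import math


def hits(edges, iterations=100):
    """Basic HITS: returns (hub, authority) score dictionaries."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = new_hub
        # Normalise both vectors to unit Euclidean length.
        for vector in (auth, hub):
            norm = math.sqrt(sum(x * x for x in vector.values())) or 1.0
            for key in vector:
                vector[key] /= norm
    return hub, auth


# Toy graph, invented for illustration only.
edges = [("A", "C"), ("B", "C"), ("A", "D")]
hub, auth = hits(edges)
print(hub)
print(auth)
```

In the toy graph, C receives links from both A and B and therefore ends up with the highest authority score, while A, which links to both C and D, is the strongest hub.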

Schema

The same schema is used for both the English and German DBpedia datasets, on the assumption that the two will be loaded into two different graphs.

Example triple from each dataset

Each dataset has only one distinct predicate. An example triple from each dataset is given below.

DBpedia PageLinks Cleaned Dataset
<http://de.dbpedia.org/resource/Anschluss_(Soziologie)>
<http://dbpedia.org/ontology/wikiPageWikiLink> <http://de.dbpedia.org/resource/Niklas_Luhmann>.

DBpedia PageRank Dataset
<http://dbpedia.org/resource/Category:Living_people>
<http://dbpedia.org/ontology/wikiPageRank> "0.002395068441893186"^^<http://www.w3.org/2001/XMLSchema#float> .

DBpedia HITS Dataset
<http://dbpedia.org/resource/List_of_University_of_Pennsylvania_people>
<http://dbpedia.org/ontology/wikiHITS> "0.005060736032883985"^^<http://www.w3.org/2001/XMLSchema#float> .

DBpedia PageInLinkCount Cleaned Dataset
<http://dbpedia.org/resource/American_History_High_School>
<http://dbpedia.org/ontology/wikiPageInLinkCountCleaned> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .

DBpedia PageOutLinkCount Cleaned Dataset
<http://dbpedia.org/resource/Changzhou_Ancient_Canal>
<http://dbpedia.org/ontology/wikiPageOutLinkCountCleaned> "13"^^<http://www.w3.org/2001/XMLSchema#integer> .
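A small Python sketch for reading triples of the two shapes shown above (URI object, or literal typed as float or integer) could look as follows. It is only an illustration: the names `TRIPLE_RE` and `parse_triple` are assumptions, and for the real files a full RDF parser is the safer choice.

```python
import re

# Matches <s> <p> <o> . and <s> <p> "lexical"^^<datatype> . (these two shapes only).
TRIPLE_RE = re.compile(
    r'<(?P<s>[^>]+)>\s+<(?P<p>[^>]+)>\s+'
    r'(?:<(?P<o_uri>[^>]+)>|"(?P<lex>[^"]*)"\^\^<(?P<dt>[^>]+)>)\s*\.'
)


def parse_triple(text):
    """Return (subject, predicate, object) with typed literals converted to numbers."""
    match = TRIPLE_RE.search(text)
    if match is None:
        raise ValueError("not a recognised triple: %r" % text)
    if match.group("o_uri") is not None:
        return match.group("s"), match.group("p"), match.group("o_uri")
    lexical, datatype = match.group("lex"), match.group("dt")
    value = int(lexical) if datatype.endswith("#integer") else float(lexical)
    return match.group("s"), match.group("p"), value


example = (
    '<http://dbpedia.org/resource/American_History_High_School>\n'
    '<http://dbpedia.org/ontology/wikiPageInLinkCountCleaned> '
    '"2"^^<http://www.w3.org/2001/XMLSchema#integer> .'
)
subject, predicate, value = parse_triple(example)
print(subject, predicate, value)
```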

SPARQL Queries

The following SPARQL queries can be used for both the English and German DBpedia datasets, provided they are loaded into two different graphs.

Give me resources and their pagelinks
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?resource ?pagelink
WHERE
{ ?resource dbpedia-owl:wikiPageWikiLink ?pagelink. }
LIMIT 100

Give me resources and their pagerank scores
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?resource ?pagerank
WHERE
{ ?resource dbpedia-owl:wikiPageRank ?pagerank. }
ORDER BY DESC (?resource)
LIMIT 100

Give me resources and their HITS (Hub and Authority) scores
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?resource ?hitsScore
WHERE
{ ?resource dbpedia-owl:wikiHITS ?hitsScore. }
ORDER BY DESC (?resource)
LIMIT 100

Give me resources and their pageInLink counts
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?pageInLinkCount
WHERE
{ ?resource dbpedia-owl:wikiPageInLinkCountCleaned ?pageInLinkCount. }
ORDER BY DESC (?resource)
LIMIT 100

Give me resources and their pageOutLink counts
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?pageOutLinkCount
WHERE
{ ?resource dbpedia-owl:wikiPageOutLinkCountCleaned ?pageOutLinkCount. }
ORDER BY DESC (?resource)
LIMIT 100

Creative Commons License
DBpedia GraphMeasure Datasets is licensed under a Creative Commons Attribution 4.0 International License.