Wikipedia XML Corpus
Introduction
We propose an XML corpus based on Wikipedia. This corpus can be used in a wide variety of XML IR tasks, such as ad-hoc retrieval, categorization, clustering, or structure mapping. It will be used for both
INEX 2007 and the
XML Document Mining Challenge.
You can find a description of the corpus in this
article (published in SIGIR Forum).
The BibTeX entry for this technical report is:
@article{wikipediaxml:2005,
author = {Ludovic Denoyer and Patrick Gallinari},
title = {{T}he {W}ikipedia {X}{M}{L} {C}orpus},
journal = {SIGIR Forum},
year = {2006}
}
If you use the corpus,
please cite this report in your articles.
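As a minimal sketch, assuming the entry above has been saved in a bibliography file (the file name `references.bib` and the sample sentence are hypothetical), the report could be cited from a LaTeX article using the key `wikipediaxml:2005`:

```latex
\documentclass{article}
\begin{document}
% Cite the corpus using the key from the BibTeX entry above.
Our experiments use the Wikipedia XML Corpus~\cite{wikipediaxml:2005}.

\bibliographystyle{plain}
% "references" is a hypothetical file name: references.bib holds the entry.
\bibliography{references}
\end{document}
```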
Browse the collections
You can browse the collections
here.
Getting a Login/Password
To download the corpus, you must first obtain a login and password
HERE.
Search Engine over the collection
A search engine for browsing the collections is available
HERE.
Information for the INEX participants
The INEX corpus is part of the whole Wikipedia XML collection. To access the INEX 2007 collection (the main English collection),
you have to obtain a specific login and password (different from the one used for INEX).
Download
If you have a login and a password, you can download the different parts of the corpus.
Ludovic DENOYER