News

New collection available for INEX 2007

Wikipedia XML Corpus

Introduction

We propose an XML corpus based on Wikipedia. This corpus can be used in a wide variety of XML IR tasks, such as ad-hoc retrieval, categorization, clustering, or structure mapping. It will be used for both INEX 2007 and the XML Document Mining Challenge. You can find a description of the corpus in this article (published in SIGIR Forum).

The BibTeX entry for this technical report is:

@article{wikipediaxml:2005,
author = {Ludovic Denoyer and Patrick Gallinari},
title = {{T}he {W}ikipedia {X}{M}{L} {C}orpus},
journal = {SIGIR Forum},
year = {2006}
}

If you use the corpus, please cite this report in your articles.

Browse the collections

You can access the collections here.

Getting a Login/Password

To download the corpus, you must first obtain a login and password HERE.

Search Engine over the collection

A search engine for browsing the collections is available HERE.

Information for the INEX participants

The INEX Corpus is a part of the whole Wikipedia XML Collection. To access the INEX 2007 Collection (the main English collection), you have to obtain a specific login and password, different from the one used for INEX.

Download

If you have a login and a password, you can download the different parts of the corpus.

From INEX 2006 to INEX 2007

If you have already indexed the 2006 INEX collection, you don't need to download this one.

English textual corpus - INEX 2007 Corpus

This corpus is composed of more than 600,000 XML documents in English.

Main corpus (INEX 2007 Corpus)
INEX 2007 Collection (same as old collection with Image IDs)
Categorization
All Categories
Multi-Label Corpus
Single-Label Corpus (also downloadable on the XMLMining Challenge web site)
Train Part
Test Part (available in May 2006)
Entity Corpus
Multimedia Corpus
Part 1/19
Part 2/19
Part 3/19
Part 4/19
Part 5/19
Part 6/19
Part 7/19
Part 8/19
Part 9/19
Part 10/19
Part 11/19
Part 12/19
Part 13/19
Part 14/19
Part 15/19
Part 16/19
Part 17/19
Part 18/19
Part 19/19 (XML Files)
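
Once the archives of a corpus have been downloaded and unpacked, the XML documents can be walked with standard tools. The following is only a minimal sketch in Python: the function names `iter_xml_documents` and `count_documents` are hypothetical, and it assumes that the documents are stored as individual `.xml` files somewhere under one local directory. The actual directory layout of the archives is not described on this page.

```python
import os
import xml.etree.ElementTree as ET


def iter_xml_documents(corpus_dir):
    """Yield (path, root_element) for each parseable .xml file under corpus_dir.

    Assumption: one document per .xml file; the real archive layout may differ.
    """
    for dirpath, _dirnames, filenames in os.walk(corpus_dir):
        for name in sorted(filenames):
            if not name.endswith(".xml"):
                continue
            path = os.path.join(dirpath, name)
            try:
                yield path, ET.parse(path).getroot()
            except ET.ParseError:
                # Skip files the standard parser cannot read, for example
                # documents that rely on entities declared in an external DTD.
                continue


def count_documents(corpus_dir):
    """Count the documents that could be parsed under corpus_dir."""
    return sum(1 for _ in iter_xml_documents(corpus_dir))
```

Calling `count_documents` on the unpacked directory gives a quick check that a download is complete. Files that fail to parse are skipped silently in this sketch, so a low count may point to a parsing problem rather than to missing files.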

German textual corpus

This corpus is composed of 350,000 documents in German.

Main corpus
German Corpus
Categorization
All Categories

French textual corpus

This corpus is composed of 110,000 documents in French.

Main corpus
French Corpus
Categorization
All Categories

Dutch textual corpus

This corpus is composed of 125,000 documents in Dutch.

Main corpus
Dutch Corpus
Categorization
All Categories

Spanish textual corpus

This corpus is composed of 350,000 documents in Spanish.

Main corpus
Spanish Corpus
Categorization
All Categories

Chinese textual corpus

This corpus is composed of 56,000 documents in Simplified Chinese.

Main corpus
Chinese Corpus
Categorization
All Categories

Arabic textual corpus

This corpus is composed of 11,000 documents in Arabic.

Main corpus
Arabic Corpus

Japanese textual corpus

This corpus is composed of 187,000 documents in Japanese.

Main corpus
Japanese Corpus
Categorization
All Categories




Ludovic DENOYER