Sequential version

To deal with a big corpus:

  • sequential computation of the dictionary
  • sequential building of the numerical corpus

Building the pre-processor


Example of use: each -sp NameOfTheClass option may be followed by -spargs and a list of attributes for that processor.

 java -jar v0_06_preprocessing.jar -fileout=SPbi.sp  -sp StringProcessor_MarkPunctuation 
 -sp StringProcessor_TreeTagger -spargs -path2tt=../../treetagger -path2ttmodel=../../treetagger/models/english.par 
 -sp StringProcessor_NGram -spargs -size=2
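The chained string processors above (punctuation marking, then n-gram extraction) can be pictured with a minimal Python sketch. The function names below are illustrative, not part of the tool; only the class names in the command line belong to the jar.

```python
import re

def mark_punctuation(text):
    # Separate punctuation marks from words, mimicking a
    # punctuation-marking preprocessing step.
    return re.sub(r"([.,!?;:])", r" \1 ", text)

def ngrams(tokens, size):
    # Build contiguous n-grams of the given size, joined with '_'
    # (size=2 corresponds to -sp StringProcessor_NGram -spargs -size=2).
    return ["_".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

tokens = mark_punctuation("Great phone, great price!").split()
print(ngrams(tokens, 2))
```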

For detailed options, see link.

Building the dictionary sequentially


Example of use:

 java -Xmx5g -jar v0_06_dicoBuilding.jar -dicoFile=tmp/dicospUnisp.txt -nFilter=3
 -path2data=data/reviewsNew.txt -sizeBatch=100000 -spFile=SPuni.sp
  • Processes the corpus reviewsNew.txt (Bing Liu's file format)
  • Blocks of 100000 documents
  • Removes all terms appearing fewer than 3 times
  • Uses the preprocessing file SPuni.sp
  • Each block produces its own dictionary file (suffixed dicoFile names)
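The per-block counting described above can be sketched in a few lines of Python. This is an assumption about the logic, not the jar's actual code; sizeBatch and nFilter play the roles of the command-line flags of the same names.

```python
from collections import Counter

def build_block_dictionaries(docs, size_batch, n_filter):
    # Count term frequencies per block of size_batch documents and
    # keep only terms appearing at least n_filter times in the block.
    dicts = []
    for start in range(0, len(docs), size_batch):
        block = docs[start:start + size_batch]
        counts = Counter(t for doc in block for t in doc.split())
        dicts.append({t: c for t, c in counts.items() if c >= n_filter})
    return dicts
```

Each entry of the returned list corresponds to one suffixed dictionary file on disk.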

This creates one dictionary file per block; those files then need to be merged:


 java -Xmx15g -jar v0_06_dicoFusion.jar -dicoFile=dicospBisp.txt -nFilter=20
  • NB: the dicoFile given to the fusion step (makeDicoFusion.jar) must be the same as the dicoFile used in the sequential building step (makeDicoSeq.jar)
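The fusion step presumably sums the per-block counts and re-applies a (typically stricter) global frequency filter. A minimal Python sketch of that idea, under that assumption:

```python
from collections import Counter

def merge_dictionaries(block_dicts, n_filter):
    # Sum the per-block term counts into a global dictionary,
    # then apply the global frequency filter (-nFilter).
    total = Counter()
    for d in block_dicts:
        total.update(d)
    return {t: c for t, c in total.items() if c >= n_filter}
```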

Building the numerical corpus sequentially (libsvm format)


Example of use:

 java -Xmx15g -jar v0_06_process.jar -dicoFile=tmp/dicospUnispR20.txt -spFile=SPuni.sp
 -path2data=data/reviewsNew.txt -sizeBatch=100000 -corpusNumFile=data/revuesUni.lib
  • corpusNumFile is the output file, in standard libSVM format
  • the last optional argument produces a sequential version of the numerical corpus (word references are kept in the original document order)
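In the (non-sequential) libSVM output, each document becomes one line of label and sparse feature:count pairs, with feature indices taken from the merged dictionary. A hedged Python sketch of that conversion (the function and the index mapping are illustrative, not the jar's code):

```python
def to_libsvm(doc_tokens, label, dico):
    # dico maps term -> 1-based feature index; libSVM lines list
    # index:count pairs in increasing index order.
    counts = {}
    for t in doc_tokens:
        if t in dico:
            idx = dico[t]
            counts[idx] = counts.get(idx, 0) + 1
    feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {feats}"

print(to_libsvm(["good", "phone", "good"], 1, {"good": 1, "phone": 2}))
```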