I want to process Wikipedia using gensim.corpora.WikiCorpus. The final objective is to train a word2vec model on it.
I have it working, but I have a problem with the accented vowels of Spanish: á, é, í, ó, ú.
I want to normalize them to a, e, i, o, u.
I have seen that there is a deaccent function in gensim that would do this, but I don't know how to apply it directly while building the corpus. Can this be done?
Here is a working example:
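To make the desired normalization concrete, here is a minimal sketch using only the standard library. The helper name `strip_accents` is hypothetical; it is a stand-in for gensim's deaccent, which I assume behaves similarly for these vowels. It decomposes each character and drops the combining marks.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks (accents) from a Unicode string."""
    # NFD splits "á" into "a" plus a separate combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into its canonical form.
    return unicodedata.normalize("NFC", stripped)

vowels = strip_accents("á é í ó ú")
# The tilde of "ñ" is also a combining mark, so it is stripped as well.
word = strip_accents("señor")
```

Note that this approach also turns ñ into n, which may or may not be acceptable for a Spanish corpus.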
    from gensim.corpora import WikiCorpus
    from gensim.models.word2vec import Word2Vec
    import logging

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # The normalization should be included here
    corpus = WikiCorpus('/users/jesusfbes/desktop/eswiki-latest-pages-articles.xml.bz2', dictionary=False)

    max_sentence = -1  # -1 means no limit on the number of articles

    def generate_lines():
        for index, text in enumerate(corpus.get_texts()):
            if index < max_sentence or max_sentence == -1:
                yield text
            else:
                break

    # Written against an older gensim API; in gensim 4.x, `size` is named
    # `vector_size` and train() also requires total_examples and epochs.
    model = Word2Vec(size=400, window=5, min_count=5)
    model.build_vocab(generate_lines())
    model.train(generate_lines(), chunksize=500)
    model.save('mymodel')
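I guess it should be something along these lines:

    from gensim.utils import deaccent

    def generate_lines():
        for index, text in enumerate(corpus.get_texts()):
            if index < max_sentence or max_sentence == -1:
                # get_texts() yields a list of tokens per article, while
                # deaccent() expects a single string, so apply it per token.
                yield [deaccent(token) for token in text]
            else:
                break

I would store the results of the call to generate_lines() for efficiency, as long as you have enough RAM to hold them, and reuse them in both the model.build_vocab() and model.train() calls.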
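The idea of storing the results can be sketched as follows. This is only an illustration under stated assumptions: `fake_get_texts` is a hypothetical stand-in for corpus.get_texts(), and the translation table replaces deaccent so that the sketch runs without gensim or the Wikipedia dump. It covers only the five vowels from the question.

```python
from itertools import islice

# Translation table for the five accented vowels only; a stand-in for
# gensim.utils.deaccent so that this sketch has no external dependencies.
VOWELS = str.maketrans("áéíóú", "aeiou")

def generate_lines(texts, max_sentence=-1):
    """Yield each tokenized article with its tokens normalized."""
    # islice(..., None) places no limit on the number of articles.
    limit = None if max_sentence == -1 else max_sentence
    for tokens in islice(texts, limit):
        yield [token.translate(VOWELS) for token in tokens]

def fake_get_texts():
    """Hypothetical stand-in for corpus.get_texts(): yields token lists."""
    yield ["canción", "útil"]
    yield ["árbol", "café"]
    yield ["río", "niño"]

# Materialize the generator once; a list can be iterated repeatedly,
# so the expensive parsing pass over the dump happens a single time.
lines = list(generate_lines(fake_get_texts(), max_sentence=2))
first_pass = [tokens for tokens in lines]   # where build_vocab would read
second_pass = [tokens for tokens in lines]  # where train would read again
```

With a real corpus, the cached `lines` list would be passed to both model.build_vocab() and model.train() in place of two separate generate_lines() calls. For the full Spanish Wikipedia this list may not fit in memory, in which case writing the normalized tokens to disk once and streaming them back is the usual fallback.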