
Creating semanticizest model is slow #11

Open
IsaacHaze opened this issue Nov 24, 2014 · 13 comments

@IsaacHaze
Contributor

I'm too impatient for this...

isijara1@zookst23 time python -m semanticizest.parse_wikidump nlwiki-20141120-pages-articles.xml.bz2 nlwiki_20141120.sqlite3 2>&1 | tee nlwiki_20141120.log
Processing articles:
1000 at 2014-11-24 10:09:05.971042
(...)

real    75m58.878s
user    51m38.563s
sys 24m3.278s
isijara1@zookst23 cat /proc/cpuinfo  | grep -i intel | sort | uniq
model name  : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
vendor_id   : GenuineIntel
@IsaacHaze
Contributor Author

I'll give it another try using an unpacked dump.

@IsaacHaze
Contributor Author

Same thing for an uncompressed dump.

isijara1@zookst23 time python -m semanticizest.parse_wikidump nlwiki-20141120-pages-articles.xml nlwiki_20141120.unpacked.sqlite3 2>&1 | tee nlwiki_20141120.unpacked.log
(...)

real    70m35.425s
user    46m24.507s
sys 23m54.258s

@larsmans
Contributor

Yes, it's slow. My line-by-line timings showed 80% of the time being spent in calls into sqlite3 methods. Maybe #10 can solve some of this?

Other options for optimization:

  • compute n-grams in Cython: a speedup of 1/3 in ngrams_with_pos, but that's not the bottleneck at model creation time
  • stronger normalization of tokens prior to computing n-grams, so we have to pass fewer of them to SQLite (maybe Unicode normalization, removing diacritics, lowercasing; see the sketch below)
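
A minimal sketch of what that normalization could look like, assuming NFKD decomposition, diacritic removal and lowercasing; the normalize_token helper below is hypothetical and not part of the codebase:

import unicodedata

def normalize_token(token):
    # Lowercase, decompose (NFKD), then drop combining marks (diacritics).
    decomposed = unicodedata.normalize('NFKD', token.lower())
    return u''.join(c for c in decomposed if not unicodedata.combining(c))

# normalize_token(u'Curaçao') == u'curacao'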

@IsaacHaze
Contributor Author

My first guess would be to try to reduce the number of sqlite3 calls by batching them up (esp. the insert+update pairs).
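
A rough illustration of that batching idea, using executemany inside a single transaction; the counts(ngram, n) table here is made up for illustration and need not match the real schema (it assumes ngram is declared UNIQUE):

import sqlite3

conn = sqlite3.connect('nlwiki_20141120.sqlite3')

# Accumulate (ngram, count) pairs in memory per page (or per chunk of pages),
# then flush them in one transaction instead of issuing an insert+update pair
# per n-gram.
batch = [(u'den haag', 3), (u'in den', 7)]
with conn:  # one transaction for the whole batch
    conn.executemany("INSERT OR IGNORE INTO counts (ngram, n) VALUES (?, 0)",
                     [(ng,) for ng, _ in batch])
    conn.executemany("UPDATE counts SET n = n + ? WHERE ngram = ?",
                     [(c, ng) for ng, c in batch])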

@IsaacHaze
Contributor Author

The other thing worth trying is not creating the index for redirection when we create the db. Meaning:

  1. create db
  2. perform inserts & updates
  3. create index
  4. handle redirects

That way we don't have to update that index after processing each page (and we still have it for the redirect handling). But this might also make the updates in step 2 slower...
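
A sketch of that ordering against the Python sqlite3 module; the links table and index names here are invented for illustration and need not match the actual schema:

import sqlite3

conn = sqlite3.connect('model.sqlite3')

# 1. create db, without the index on link targets
conn.execute("CREATE TABLE IF NOT EXISTS links"
             " (target TEXT, anchor TEXT, count INTEGER)")

# 2. perform all inserts & updates here (the expensive part) ...

# 3. build the index once, in a single pass over the finished table
conn.execute("CREATE INDEX IF NOT EXISTS links_target_idx ON links (target)")

# 4. handle redirects, which can now use the index
conn.commit()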

larsmans added a commit that referenced this issue Nov 24, 2014
Indexing the first 10000 articles from an nlwiki dump previously took:

real    1m17.655s
user    0m48.409s
sys     0m25.430s

Now:

real    0m46.042s
user    0m40.603s
sys     0m4.408s
@larsmans
Contributor

Yes, we can create the link_target index after doing all the inserts; we only need the target_anchor index during population, and we drop that afterwards. On second thought, no: we need both indices at insertion time, link_target for the redirects and the other one because we SELECT when updating.

Here are some more SQLite performance tricks. I just turned journaling off for a ~40% speedup.

I'm not yet sure how the Python bindings handle transactions, but the docs are here.
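
For reference, turning journaling off from the Python sqlite3 bindings looks roughly like this (the synchronous pragma is an additional setting that is often paired with it, not something measured in this thread):

import sqlite3

conn = sqlite3.connect('model.sqlite3')
# No rollback journal: faster writes, but a crash mid-build can corrupt the
# database. That's acceptable for a model we can rebuild from the dump.
conn.execute("PRAGMA journal_mode = OFF")
# Optionally also stop waiting for fsync (assumption, not benchmarked here).
conn.execute("PRAGMA synchronous = OFF")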

@piskvorky
Member

For what it's worth, I played with implementing an sqlite backend to the old semanticizer (not semanticizest!). It was too slow, the data access patterns were too random => lots of cache misses => lots of slow disk access.

Probably not related to the current version (this was for live ngram resolution IIRC, nothing to do with wiki preprocessing), and I assume you're using different data structures now and all.

Also, counting table rows is super slow in sqlite (full table scan, unless rows are never deleted), though I'm sure you're well aware of that, just saying :)

@larsmans
Contributor

The n-gram counts for non-links are going to be replaced by count-min sketches. Storing them explicitly takes too much space, and we only need to retrieve the frequencies.
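
For readers unfamiliar with the structure, a generic count-min sketch looks roughly like the following; this is an illustration of the idea, not the implementation being planned here:

import zlib

class CountMinSketch(object):
    """Approximate frequency counter: depth hash rows of width counters."""

    def __init__(self, depth=8, width=5000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            # Salt the key with the row number to get depth distinct hashes.
            h = zlib.crc32(('%d:%s' % (row, item)).encode('utf-8')) & 0xffffffff
            yield row, h % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def query(self, item):
        # The minimum over all rows is the least-inflated estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

Queries never underestimate the true count; depth and width trade memory against the probability and size of overestimates.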

@piskvorky
Member

Ah, ok. What implementation do you plan to use, Lars?

I'm asking because @mfcabrera wants to add min sketches to gensim, and @PitZeGlide is working on (practical) extensions. Maybe there's room for some collaboration there?

@IsaacHaze
Contributor Author

=D I like the probabilistic (approximate) counting. And it seems that the lower relative error of the log version fits our use case nicely, since we'll be using these counts primarily to compute probabilities.

Regarding @PitZeGlide's numbers, how do I read those? He counted 1.1e6 things (of which 60e3 unique ones)? And the resulting sketches are 8 × 5000 × 4 bytes = 160 kB?

Which implementation? My vote is for the simplest one that'll work; we can optimize after we have a working version.

@graus

graus commented Nov 27, 2014

(it might be slow, but I don't think this should be a high-priority thing -- parsing a Wikipedia dump overnight is totally acceptable for the typical use case, I can imagine)

@tgalery

tgalery commented Mar 29, 2015

Hi there, sorry to intrude, but I have a few questions about the performance of model creation. I downloaded an English Wikipedia dump and am running the extraction process. It processed around 3 million pages and then the disk started swapping. After around 0.5 million more pages, the process had consumed all virtual RAM and all the swap space. CPU usage is virtually nothing. Is there a minimum RAM requirement to run this? Any reason why so much RAM is being used?
Thanks in advance.

@larsmans
Contributor

larsmans commented Apr 9, 2015

That's because we're storing massive numbers of n-grams. We're switching to a hash-based implementation, but since we couldn't get that to perform well in Python, we're also rewriting the dumpparser in Go: https://github.com/semanticize/dumpparser

The client code still has to be adapted to work with the hash-based dumps.
