
Creating semanticizest model is slow #11

Open
IsaacHaze opened this issue Nov 24, 2014 · 13 comments

@IsaacHaze
Contributor

I'm too impatient for this...

isijara1@zookst23 time python -m semanticizest.parse_wikidump nlwiki-20141120-pages-articles.xml.bz2 nlwiki_20141120.sqlite3 2>&1 | tee nlwiki_20141120.log
Processing articles:
1000 at 2014-11-24 10:09:05.971042
(...)

real    75m58.878s
user    51m38.563s
sys 24m3.278s
isijara1@zookst23 cat /proc/cpuinfo  | grep -i intel | sort | uniq
model name  : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
vendor_id   : GenuineIntel
@IsaacHaze
Contributor Author

I'll give it another try using an unpacked dump.

@IsaacHaze
Contributor Author

Same thing for an uncompressed dump.

isijara1@zookst23 time python -m semanticizest.parse_wikidump nlwiki-20141120-pages-articles.xml nlwiki_20141120.unpacked.sqlite3 2>&1 | tee nlwiki_20141120.unpacked.log
(...)

real    70m35.425s
user    46m24.507s
sys 23m54.258s

@larsmans
Contributor

Yes, it's slow. My line-by-line timings showed 80% of the time being spent in calls into sqlite3 methods. Maybe #10 can solve some of this?

Other options for optimization:

  • compute n-grams in Cython: a speedup of 1/3 in ngrams_with_pos, but that's not the bottleneck at model creation time
  • stronger normalization of tokens prior to computing n-grams, so we have to pass fewer of them to SQLite (maybe Unicode normalization, removing diacritics, lowercasing; see the sketch below)
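
A minimal sketch of what that normalization could look like, assuming NFKD decomposition, diacritic removal and lowercasing; the normalize_token helper below is hypothetical and not part of the codebase:

import unicodedata

def normalize_token(token):
    # Lowercase, decompose (NFKD), then drop combining marks (diacritics).
    decomposed = unicodedata.normalize('NFKD', token.lower())
    return u''.join(c for c in decomposed if not unicodedata.combining(c))

# normalize_token(u'Curaçao') == u'curacao'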

@IsaacHaze
Contributor Author

My first guess would be to try to reduce the number of sqlite3 calls by batching them up (esp. the insert+update pairs).
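
A rough illustration of that batching idea, using executemany inside a single transaction; the counts(ngram, n) table here is made up for illustration and need not match the real schema (it assumes ngram is declared UNIQUE):

import sqlite3

conn = sqlite3.connect('nlwiki_20141120.sqlite3')

# Accumulate (ngram, count) pairs in memory per page (or per chunk of pages),
# then flush them in one transaction instead of issuing an insert+update pair
# per n-gram.
batch = [(u'den haag', 3), (u'in den', 7)]
with conn:  # one transaction for the whole batch
    conn.executemany("INSERT OR IGNORE INTO counts (ngram, n) VALUES (?, 0)",
                     [(ng,) for ng, _ in batch])
    conn.executemany("UPDATE counts SET n = n + ? WHERE ngram = ?",
                     [(c, ng) for ng, c in batch])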

@IsaacHaze
Contributor Author

The other thing worth trying is not creating the index for redirection when we create the db. Meaning:

  1. create db
  2. perform inserts & updates
  3. create index
  4. handle redirects

That way we don't have to update that index after processing each page (and we still have it for the redirect handling). But this might also make the updates in step 2 slower...
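
A sketch of that ordering against the Python sqlite3 module; the links table and index names here are invented for illustration and need not match the actual schema:

import sqlite3

conn = sqlite3.connect('model.sqlite3')

# 1. create db, without the index on link targets
conn.execute("CREATE TABLE IF NOT EXISTS links"
             " (target TEXT, anchor TEXT, count INTEGER)")

# 2. perform all inserts & updates here (the expensive part) ...

# 3. build the index once, in a single pass over the finished table
conn.execute("CREATE INDEX IF NOT EXISTS links_target_idx ON links (target)")

# 4. handle redirects, which can now use the index
conn.commit()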

larsmans added a commit that referenced this issue Nov 24, 2014
Indexing the first 10000 articles from an nlwiki dump previously took:

real    1m17.655s
user    0m48.409s
sys     0m25.430s

Now:

real    0m46.042s
user    0m40.603s
sys     0m4.408s
@larsmans
Contributor

Yes, we can create the link_target index after doing all the inserts; we only need the target_anchor index during population, and we drop that afterwards. On second thought, no: we need both indices at insertion time, link_target for the redirects and the other one because we SELECT when updating.

Here are some more SQLite performance tricks. I just turned journaling off for a ~40% speedup.

I'm not yet sure how the Python bindings handle transactions, but the docs are here.
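
For reference, turning journaling off from the Python sqlite3 bindings looks roughly like this (the synchronous pragma is an additional setting that is often paired with it, not something measured in this thread):

import sqlite3

conn = sqlite3.connect('model.sqlite3')
# No rollback journal: faster writes, but a crash mid-build can corrupt the
# database. That's acceptable for a model we can rebuild from the dump.
conn.execute("PRAGMA journal_mode = OFF")
# Optionally also stop waiting for fsync (assumption, not benchmarked here).
conn.execute("PRAGMA synchronous = OFF")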

@piskvorky
Member

For what it's worth, I played with implementing an sqlite backend to the old semanticizer (not semanticizest!). It was too slow, the data access patterns were too random => lots of cache misses => lots of slow disk access.

Probably not related to the current version (this was for live ngram resolution IIRC, nothing to do with wiki preprocessing), and I assume you're using different data structures now and all.

Also, counting table rows is super slow in sqlite (full table scan, unless rows are never deleted), though I'm sure you're well aware of that, just saying :)

@larsmans
Contributor

The n-gram counts for non-links are going to be replaced by count-min sketches. Storing them explicitly takes too much space, and we only need to retrieve the frequencies.
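
For readers unfamiliar with the structure, a generic count-min sketch looks roughly like the following; this is an illustration of the idea, not the implementation being planned here:

import zlib

class CountMinSketch(object):
    """Approximate frequency counter: depth hash rows of width counters."""

    def __init__(self, depth=8, width=5000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            # Salt the key with the row number to get depth distinct hashes.
            h = zlib.crc32(('%d:%s' % (row, item)).encode('utf-8')) & 0xffffffff
            yield row, h % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def query(self, item):
        # The minimum over all rows is the least-inflated estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

Queries never underestimate the true count; depth and width trade memory against the probability and size of overestimates.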

@piskvorky
Member

Ah, ok. What implementation do you plan to use, Lars?

I'm asking because @mfcabrera wants to add min sketches to gensim, and @PitZeGlide is working on (practical) extensions. Maybe there's room for some collaboration there?

@IsaacHaze
Contributor Author

=D I like the probabilistic (approximate) counting. And it seems that the lower relative error of the log version fits our use case nicely, since we'll be using these counts primarily to compute probabilities.

Regarding @PitZeGlide's numbers, how do I read those? He counted 1.1e6 things (of which 60e3 unique ones)? And the resulting sketches are 8 × 5000 × 4 bytes = 160 kB?

Which implementation? My vote is for the simplest one that'll work; we can optimize after we have a working version.

@graus

graus commented Nov 27, 2014

(it might be slow, but I don't think this should be a high-priority thing -- parsing a Wikipedia dump overnight is totally acceptable for the typical use case, I can imagine)

@tgalery

tgalery commented Mar 29, 2015

Hi there, sorry to intrude, but I have a few questions about the performance of model creation. I downloaded an English Wikipedia dump and am running the extraction process. It processed around 3 million pages and then the disk started swapping. After around 0.5 million more pages, the process had consumed all virtual RAM and all the swap space. CPU usage is virtually nothing. Is there a minimum RAM requirement to run this? Any reason why so much RAM is being used?
Thanks in advance.

@larsmans
Contributor

larsmans commented Apr 9, 2015

That's because we're storing massive numbers of n-grams. We're switching to a hash-based implementation, but since we couldn't get that to perform well in Python, we're also rewriting the dumpparser in Go: https://github.com/semanticize/dumpparser

The client code still has to be adapted to work with the hash-based dumps.
