Re: Cyoc down
Just to give a quick update on this small project: I have temporarily stopped scanning for new pages on the web archive. I'm now running a separate script that goes through and updates the author of what I already have, since I've decided it would take way too freaking long to wait for the initial scan to finish and then start adding authors.
I'm currently at 170,000+ entries in my db, so even updating all of the authors is going to take a long-ass while, probably the rest of the week by my estimates. Once I have all of the authors for what's already in my db, I'll update my viewer to include the ability to search by author. Maybe a checkbox or something.
Now, onto an interesting problem I've run into that, in hindsight, I probably should have expected. At 170,000+ rows in the sqlite db, a full search takes about 4-5 minutes because of the way I'm searching. Here's why: I'm not using simple string searches, I'm using regex on top of synonym phrases. So if someone searches "Breasts shrinking", it's also searching for: breasts getting smaller, chest shrinking, bra getting loose, etc., and it tries to match those exact phrases. So I'm performing one giant regex search on each row, and it takes a sec. I've already sped this WAY up since version 0.5, when I increased the accuracy of results, by adding multithreading and a few other tricks.
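For anyone curious what "regex on top of synonym phrases" can look like, here is a minimal sketch in Python. It is not the actual code from this project. The `SYNONYMS` dict, the `pages` table, and its `id` / `contents` columns are all made-up names for illustration. It folds the query and its synonym phrases into one big alternation and runs it against every row, which is exactly why the cost grows with the row count.

```python
import re
import sqlite3

# Hypothetical synonym list; the real phrase lists are not shown in this post.
SYNONYMS = {
    "breasts shrinking": [
        "breasts getting smaller",
        "chest shrinking",
        "bra getting loose",
    ],
}

def build_pattern(query):
    """Compile one case-insensitive regex matching the query or any of its synonym phrases."""
    key = query.strip().lower()
    phrases = [key] + SYNONYMS.get(key, [])
    # Escape each word and allow any run of whitespace between words.
    alternatives = [
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in phrases
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

def search(conn, query):
    """Scan every row of the (assumed) pages table and return the ids that match."""
    pattern = build_pattern(query)
    hits = []
    for row_id, contents in conn.execute("SELECT id, contents FROM pages ORDER BY id"):
        if contents and pattern.search(contents):
            hits.append(row_id)
    return hits
```

Every search walks the whole table and runs the regex in Python, so the time scales with the number of rows no matter how the pattern is tuned.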
I'll be releasing version 1 like this, but I'll eventually put out a version 2 with an interesting feature that should speed up searches. The idea actually came from a buddy of mine. When someone does a synonym search, version 2 will apply a new tag to the row it finds the synonym in. It'll also save the id in a separate json array that's basically saying, "I found a synonym of breast shrinking in the contents of this row id." This way, even if new entries are added to the db in the future, it'll always bring those search results up quickly. It doesn't need to rescan contents it has already been through either, since I plan to create a sort of checkpoint in the array. You just search "breasts shrinking" and you should get results within seconds instead of minutes. But that's for version 2. I'll eventually make all of this code public and easily accessible. I'm still adding comments to the code and reorganizing it to make it easier to read. Right now it looks more like spaghetti. lol
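A rough sketch of how the checkpoint idea could work, again in Python and again with made-up names (`pages`, `id`, `contents`, and the cache layout are assumptions, not the planned version 2 code). Each query keeps a list of matching ids plus the highest row id it has scanned so far, so a repeat search only has to look at rows added since then. It assumes ids only ever go up and that old rows are not edited after they've been scanned. The row-tagging part is left out.

```python
import json
import re
import sqlite3
from pathlib import Path

def load_cache(path):
    """Load the JSON cache from disk, or start empty if the file doesn't exist yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}

def save_cache(path, cache):
    """Write the cache back to disk as JSON."""
    Path(path).write_text(json.dumps(cache))

def cached_search(conn, cache, query, pattern):
    """Return matching row ids, scanning only rows newer than the stored checkpoint."""
    key = query.strip().lower()
    entry = cache.setdefault(key, {"checkpoint": 0, "ids": []})
    rows = conn.execute(
        "SELECT id, contents FROM pages WHERE id > ? ORDER BY id",
        (entry["checkpoint"],),
    )
    for row_id, contents in rows:
        if contents and pattern.search(contents):
            entry["ids"].append(row_id)
        # Everything up to and including this id has now been scanned.
        entry["checkpoint"] = row_id
    return list(entry["ids"])
```

The first search for a phrase still pays the full scan. After that, the cost depends only on how many rows were added since the last checkpoint.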
__________________
Last edited by godleydemon; 1 Week Ago at 09:38 PM.