Author Topic: Improving search  (Read 7899 times)

Offline knnn

  • Special Collections Division
  • Posty McPostington
  • ****
  • Posts: 4946
    • View Profile
Improving search
« on: June 30, 2015, 06:20:23 PM »
I think it's not blasphemy to declare that the search function on this site is sorely lacking: using it yields very limited (and sometimes completely wrong) results.

Now I can totally understand why (e.g.) Google search might have been disabled from the forums -- it drives a lot of unnecessary traffic, and possibly increases the cost of running the site (having to pay for page-views every time a web-crawler goes through every page).

That said, I think it might be within my capability to put together a private, local search for the forums.  Depending on the technology used, it would not necessarily be generally accessible and could be updated only as often as requested. 

This is mainly just me having fun playing around with some tech tools and maybe coming up with something remotely useful.  The things I am asking permission for are:

1) Any fast websearch needs to "crawl" every post on the forums periodically to maintain its list of searchable keywords.  This means a potentially large amount (tens of thousands?) of queries to the forum server every time the list gets updated.  Is it ok to do this (say once a month)?

2) While anyone can post a link online to any topic on the (publicly accessible parts of the) forums, this would be essentially making large parts of the forum a lot more accessible to the public in a targeted manner.  Is there a problem with this?

---

Note that I only have vague, random thoughts about how to go about doing this.  I am asking first just to make completely sure there isn't any problem.  Otherwise, I can always go and play with other toys.   ;)
DV Geek code:

DV knnn v1.2 YR4 FR3 BK++ RP+ JB+ TH WG+ CL(+) SW++++ BC- MC---(+) SH[Murphy+, Molly+]

Find out your Dresden Files "Purity" score: http://knnn.x10.mx/purity2/purity.html

Offline Iam that kemmler

  • Conversationalist
  • **
  • Posts: 206
    • View Profile
Re: Improving search
« Reply #1 on: July 07, 2015, 01:23:45 AM »
Using toys or creating crawlers would just add to the cost of hosting. If you had access to the actual database and if all the proper indexing was already done - then a page with an interface to search on would actually be pretty simple to do.

Many folks have wanted to help shape stuff around here - but Iago rarely acknowledges or goes with tech advice.

If you really wanted to build a tool - you could just load up SMF and check out the data structure and write the page as proof of concept. I'm pert damn sure it'll get shot down because of the hidden forums on this site.

(I own a company that does custom programming and I have offered to help before)

Offline Serack

  • Special Collections Division
  • Posty McPostington
  • ****
  • Posts: 7745
  • WoJ Rock Star!
    • View Profile
Re: Improving search
« Reply #2 on: July 09, 2015, 02:57:31 PM »
Edit:  Doh, accidentally hit post without writing anything

Yah, Iago resists making changes here.

That said, he has commented in the past that on top of how heinous it is on the user side, the search function is a huge use of server resources, which is why he limits it to one search per 60 seconds, and only 25 hits. 
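For anyone curious what that kind of throttle looks like, here is a minimal Python sketch. It is not the forum's actual code: the `SearchThrottle` class, the per-user key, and the injectable clock are all made up for illustration. Only the 60-second gap and the 25-hit cap come from the description above.

```python
import time

SEARCH_INTERVAL_SECONDS = 60  # one search per 60 seconds (from the post)
MAX_HITS = 25                 # at most 25 results per search (from the post)


class SearchThrottle:
    """Hypothetical per-user search throttle; not SMF's real implementation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_search = {}  # user -> time of the last accepted search

    def allow(self, user):
        """Return True and record the search if the user waited long enough."""
        now = self._clock()
        last = self._last_search.get(user)
        if last is not None and now - last < SEARCH_INTERVAL_SECONDS:
            return False
        self._last_search[user] = now
        return True


def cap_hits(results):
    """Trim a result list to the first MAX_HITS entries."""
    return list(results)[:MAX_HITS]
```

A throttle like this caps how often the expensive query runs, but it does nothing to make each query cheaper.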

I'm not quite sure what knnn is proposing, but it kinda sounds like he might be asking to crawl the site and host a separate search function, which could eliminate the need for the current search function and its toll on the server...

Change is unlikely.
« Last Edit: July 09, 2015, 03:10:28 PM by Serack »
DF WoJ Compilation
Green is my curator voice.
Name dropping "Serack" in a post /will/ draw my attention to it

*gnaws on the collar of his special issue Beta Foo long-sleeved jacket*

Offline knnn

  • Special Collections Division
  • Posty McPostington
  • ****
  • Posts: 4946
    • View Profile
Re: Improving search
« Reply #3 on: July 09, 2015, 05:51:43 PM »

Exactly.  The idea would be to crawl the site infrequently (say once a month), and cache all the posts on a different site with links to the original posts.  That way, any search would use only the alternate site's resources.  Sure, people would still be following the links to posts on this site, but that shouldn't be any different than some poor user trying to search through the last 100 pages of posts looking for something specific.
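To make the "cache plus links" idea concrete, here is a rough Python sketch of a keyword index that maps words to the URLs of the posts containing them. It assumes the posts have already been cached locally as (url, body) pairs. The function names and the example URLs are placeholders I made up, not real forum links, and a real search would want stemming, ranking, and so on.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(posts):
    """Map each word to the set of post URLs that contain it.

    `posts` is an iterable of (url, body) pairs from the local cache.
    """
    index = defaultdict(set)
    for url, body in posts:
        for word in tokenize(body):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of posts containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Made-up cached posts; the URLs are placeholders, not real forum links.
cached_posts = [
    ("https://forum.example/topic/1#msg1", "Search on this site is sorely lacking"),
    ("https://forum.example/topic/1#msg2", "Crawling adds to the cost of hosting"),
    ("https://forum.example/topic/2#msg3", "A local search could link back to this site"),
]
index = build_index(cached_posts)
print(search(index, "search site"))
```

The lookup never touches the forum server. The forum only sees traffic when someone follows one of the returned links.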

The only real disadvantage I can see with this scheme (other than the aforementioned need to crawl every post on this site periodically) is that with a search that doesn't suck people might begin to expect they can actually find old posts and maybe use the site more often than we want them to.
« Last Edit: July 09, 2015, 05:54:07 PM by knnn »

Offline knnn

  • Special Collections Division
  • Posty McPostington
  • ****
  • Posts: 4946
    • View Profile
Re: Improving search
« Reply #4 on: July 09, 2015, 06:04:05 PM »
Frankly, I'm tempted to just go and test my theories on the DF reference subforum.  It changes quite infrequently and is an order of magnitude smaller than the full site, so I can probably do a one-time "one thread an hour" indexing that would barely even show up as background noise.   

This would also be much easier to host separately/locally (*way* less space to take up), thus allowing me to play with stuff safely.  And if it actually worked, it would be a reasonable proof-of-concept test for the full treatment, so we wouldn't just be yakking about pie-in-the-sky ideas.

Offline Iam that kemmler

  • Conversationalist
  • **
  • Posts: 206
    • View Profile
Re: Improving search
« Reply #5 on: July 10, 2015, 04:42:20 AM »
Crawling the site vs indexed searches: I can assure you that crawling uses up more server time. For every page, the server builds the web page and you index it (writing all of the information to your database), then you move to the next page and index that too - rinse/lather/repeat. Not only are you pulling data, you are also using resources to build the page.

If you were doing that to any of my sites, I'd bounce your bot after 20 requests with a redirect to a randomly chosen search engine. Essentially, you want to take all of the data generated by making every page show up. That is far more invasive than just running a search query, because each page request runs a query to fetch the data needed to generate the web page. If you are concerned about server load - a proper search tool on a website is less resource intensive than a bot using the current search tool to get the data plus actually creating the page.
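As a rough illustration of the "bounce after 20 requests" idea, here is a small Python sketch. It is only a guess at the general shape, not code from any real site: the `BotBouncer` class, the client identifier, and the list of redirect targets are all hypothetical, and a real setup would reset the counts over a time window and identify clients by IP or user agent.

```python
import random
from collections import Counter

MAX_REQUESTS = 20  # threshold from the post above

# Placeholder redirect targets; any list of search engines would do.
SEARCH_ENGINES = [
    "https://www.google.com/",
    "https://www.bing.com/",
    "https://duckduckgo.com/",
]


class BotBouncer:
    """Hypothetical per-client request counter that redirects heavy clients."""

    def __init__(self, rng=None):
        self._counts = Counter()
        self._rng = rng or random.Random()

    def handle(self, client_id):
        """Return (status, location) for one request from client_id."""
        self._counts[client_id] += 1
        if self._counts[client_id] > MAX_REQUESTS:
            # Over the limit: send the client somewhere else.
            return 302, self._rng.choice(SEARCH_ENGINES)
        return 200, None
```

The first 20 requests from a client get a normal response. Request 21 and every later one gets a 302 to one of the listed sites.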




« Last Edit: July 10, 2015, 04:53:12 AM by Iam that kemmler »

Offline knnn

  • Special Collections Division
  • Posty McPostington
  • ****
  • Posts: 4946
    • View Profile
Re: Improving search
« Reply #6 on: July 10, 2015, 11:48:05 AM »
Yes, I have to build a page and load up all the overhead that comes with it, but in terms of server load I'd argue this is ultimately no different than me browsing the actual threads/posts with my web browser.  So even though the amount of work I'm putting on the server is larger than if I'd used an indexed search, would my creating such a bot really be that problematic?

Remember that I'd be setting it to read one thread per hour.  That way it's not pulling up the webpages any faster than a normal person who wanted to read the entire contents of the reference section, which is 117 threads in all.  We're talking about reading in the entire reference section once over the course of a week.  Would that really be that much of an imposition, or even noticeable over the background noise?
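Just to put numbers on that, here is the arithmetic as a tiny Python sketch. The 117 threads and the one-hour delay are the figures above; the assumption that each thread costs exactly one request is mine, and the function name is only for illustration.

```python
from datetime import timedelta

THREADS = 117               # size of the reference section (from the post)
DELAY = timedelta(hours=1)  # one thread per hour (from the post)


def crawl_duration(requests, delay=DELAY):
    """Total wall-clock time if each request is followed by one delay."""
    return requests * delay


total = crawl_duration(THREADS)
print(total)                      # 4 days, 21:00:00
print(total / timedelta(days=1))  # 4.875
```

So a single pass takes just under five days at that rate. If a thread needs more than one request, the total scales up in proportion.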

Offline Serack

  • Special Collections Division
  • Posty McPostington
  • ****
  • Posts: 7745
  • WoJ Rock Star!
    • View Profile
Re: Improving search
« Reply #7 on: July 10, 2015, 04:02:36 PM »
Yah, I checked some of Iago's old posts on this subject, and it sounds like it took nearly a week to build the indexes for the search engine when he started from scratch several years ago.

Offline Iam that kemmler

  • Conversationalist
  • **
  • Posts: 206
    • View Profile
Re: Improving search
« Reply #8 on: July 11, 2015, 01:54:43 AM »
Yep, you are correct - if you slow the bot to one request per hour, you keep the server hit to a minimum. Assuming that Iago pays for usage thresholds with his hosting service, it wouldn't directly impact his pocket.

Your ultimate goal is to create an offsite search engine. If I were helping you do that, I would have you create the queries to get the data you would be accessing, then export the results to a CSV or other shared data source type and simply let you have them. Which is what I posted previously.
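A minimal Python sketch of that "query, then export to CSV" route is below. It uses an in-memory SQLite database as a stand-in for whatever database the forum actually runs on. The `messages` table, its columns, and the `export_csv` helper are invented for the example and are not SMF's real schema.

```python
import csv
import io
import sqlite3

# In-memory stand-in database with a made-up schema (not SMF's real tables).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, topic_id INTEGER, "
    "poster TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (topic_id, poster, body) VALUES (?, ?, ?)",
    [
        (1, "knnn", "Improving search"),
        (1, "Serack", "Change is unlikely."),
    ],
)


def export_csv(connection, query, out):
    """Run a SELECT query and write the rows, with a header line, as CSV."""
    cursor = connection.execute(query)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return out


buffer = export_csv(
    conn,
    "SELECT id, topic_id, poster, body FROM messages ORDER BY id",
    io.StringIO(),
)
print(buffer.getvalue())
```

One export like this hands over the post text in a single pass, with no page rendering involved. Restricting the query to public boards would be the site owner's job.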

You are only after 117 threads, which contain what, 20 to 30 replies on average? That's a lot of hours. I liked your Dresden Game, KNN. Just offering some hints or tips for success. Fu.