Feature wish: Add a search function (this is a race ;-)) #146
Comments
Nice avatar; great album. :) I think the short answer is that I don't have time to do this myself, but would welcome a pull request. The longer answer is, I completely understand the need, it's just not clear to me how much of this falls within the mission of Frog per se. I'd start by presuming people will:
Maybe there's some missing 2% that Frog could/should provide (like, I don't know, helping the indexer know what files to search, and/or where to store its output). If so, I'd be happy to add (or accept a PR for) that. |
Thank you. ;-) Well, I don't really speak Racket yet, so I was hoping I could delegate this. But your idea seems reasonable too. Would Sphinx work? |
I don't know much about full-text indexers. If you know (or could research) some good options, that would be a huge help, regardless of how well you know Racket. I guess the criteria would be:
How is Sphinx in that regard? Also: Google has some search-for-your-site thing, right? Why not use that? Even if that's inadequate, it would be nice to explain why (so people understand why it's worth the hassle of installing X to do this). |
Sphinx - the best solution I, personally, know - runs "everywhere", at least on Frog's target platforms. The front-end is to be designed by whoever feels like it; Sphinx "only" provides the server-side stuff. As a free (there are some commercial ones too) alternative, there's also Apache's Lucene project with the Sphinx-like software Solr. re Google/DDG: I could use them as a workaround but they have serious drawbacks:
|
re Google/DDG I understand/empathize. Thanks for articulating why. re Sphinx:
|
re Sphinx (too): Oh, that makes things harder. Generating the index needs some server-side logic (of course) as the indexer would have to actively search through the existing files. I'm not sure if Sphinx supports a "static" (schemaless) data output. Solr does. Else, there's still tipue search, entirely client-side and coded in jQuery. Maybe that's the last resort here? |
Well, I'm assuming that most people choosing to use a static blog generator, who don't want to run an HTTP server, would also not want to run some index query server. At least, I'd be in that camp. I was hoping that one would run the indexer after rebuilding the blog, and it would produce some sort of "database" file(s), that the JS could read and use to do fast queries. Again I know little about full-text search, and I only scanned the tipue docs very quickly. But it looks like its Static Mode wants JSON data in a I don't know, what do you think? |
How do you want to deploy your generated HTML files without an HTTP server?
I see what you mean. One of the two major reasons why I want to drop WordPress is that its server components are known for nasty security holes. Still, most of the search functions I can imagine could be realized as optional plug-ins/scripts for those who could live with that.
The closest you can have here -- at least if we're still talking about "real" search servers -- is running Solr in schemaless mode and storing its indexes locally, I guess.
The existing Pelican script looks like there's not much involved; yes, the HTML files are "scanned" and transformed into JSON. |
Many (most?) people using static blog generators use an HTTP server someone else is responsible for running 24/7. That's part of the appeal. They push to GitHub Pages, or copy to Amazon S3, or similar. So, if you're OK with the tipue UI/UX, it looks like Pelican already supports what you want! :) Generating that kind of JSON would be easy to do in Racket for Frog, as well. I can imagine adding that, either in Frog or simply as a stand-alone repo/tool. It would be handy. There are times I'd use it to find a post on my own blog more quickly, not to mention helping others. Oh heck, I'll assign this to myself, and try to get to it in the next day or two. Thanks again for the suggestion and for helping talk through the options. |
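As a rough illustration of the "scan the HTML files and transform them into JSON" step discussed above, here is a minimal sketch in Python rather than Racket. It assumes the content file is simply an array of page maps; the field names `title`, `text` and `url`, and the helper names `page_map` and `build_content_json`, are hypothetical and are not taken from Tipue's, Pelican's or Frog's documentation.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.chunks.append(data)


def page_map(url, html):
    """Turn one generated page into a map (hypothetical field names)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the stored text stays compact.
    text = " ".join(" ".join(parser.chunks).split())
    return {"title": parser.title.strip(), "text": text, "url": url}


def build_content_json(pages):
    """pages: dict mapping URL -> HTML source. Returns a JSON string."""
    return json.dumps([page_map(u, h) for u, h in sorted(pages.items())], indent=2)
```

A build step could call `build_content_json` on the freshly generated pages and write the result next to them, so the whole thing stays deployable to a plain file host.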
Ah, I misunderstood you there, yes. Pelican is quite wordpressy indeed, even feature-wise. But I don't like Python too much after having used it for a while and I'm not entirely happy with its theming, so I thought I'll push the alternatives a bit. Thank you a lot! |
I'm still interested in this, but having looked at tipue more, I'm not so sure about it, specifically. Its content file JSON is just an array of page maps, and the Maybe I'm misunderstanding the JavaScript and it converts that raw data into something better, but I don't think so. Hmmm.... |
Well, if you want to achieve a client-side full-text index, you'll have to deploy that full-text index in a way. Someone tested the Tipue performance and he seemed to be impressed though. Tipue basically works with JSON, yes. That's the main problem with JavaScript-based stuff: JavaScript has horrible data formats. |
In that issue you linked to (thanks!), this comment expresses another concern I have:
That he followed through with a solution is awesome -- kudos. That it requires running your own server, is not awesome (for me). |
The "solution" is the aforementioned Sphinx (and I guess it could also work with Solr), yes. Searching in a static site basically only leaves you with those two options: Simulate a database layer or keep a full-text index anywhere. But I guess Tipue's index is not really large, it only contains your pure text. Today's connections probably won't have much trouble with that...? |
I don't think you need to download the whole site, or the text of it, if you are willing to live with some limits. For instance, if you simply compute and store statically a table which maps from each unique word to references to the pages it occurs in, then you can fetch that table and have some client-side code which then dynamically fetches those pages, and searches linearly in them, on the fly. That means the initial fetch of stuff is not the whole site, but its unique words, which for large sites will be much smaller. (Of course, in real life you'd want to be smarter than this, for instance by not indexing 'the' and so on. But I think there should be non-pessimal solutions, where 'pessimal' is either 'running some fancy, and therefore vulnerable, server-side thing' or 'downloading the whole site, each time'.) |
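The word-to-page table described in the comment above could look roughly like this. It is only a sketch under assumptions: the stop-word list is a tiny made-up sample, the function names are hypothetical, and the lookup is written in Python for illustration even though on a real site that half would run as client-side code.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "it"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_table(pages):
    """pages: dict URL -> plain text. Returns word -> sorted list of URLs."""
    table = {}
    for url, text in pages.items():
        for word in set(tokenize(text)) - STOP_WORDS:
            table.setdefault(word, set()).add(url)
    return {word: sorted(urls) for word, urls in sorted(table.items())}


def candidate_pages(table, query):
    """Pages containing every non-stop word of the query.

    The client would fetch only these pages and search them linearly.
    """
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return []
    hits = set(table.get(words[0], []))
    for word in words[1:]:
        hits &= set(table.get(word, []))
    return sorted(hits)
```

The table is computed once at build time and stored as a static file; a query then costs one fetch of the table plus one fetch per candidate page.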
Exactly. That's one of two things I did last night. Against my better judgment I started reading more about IR. Assume a positional index like this. How does JS on a static web site avoid downloading the whole thing? Normally it queries a server, which someone has to maintain 24/7. I suppose that could be AWS Dynamo, but, not sure how to handle credentials. Also requires $. Would it be crazy to store this with the "posting lists" sharded across objects on an AWS S3 bucket? It seems that using S3 as a key/value store like this could be reasonably performant. [Using anonymous access simplifies creds. As for $, there are some, but fewer of them than say Dynamo.] So that could be an interesting project for someone to try (probably someone already has?). OTOH the second thing I did last night? I configured Google Custom Search Engine for my blog. The search query/results can be embedded in one of my normal web pages -- it didn't feel like "leaving my site". It looked decent and worked really well. I didn't push this to my site for real, yet. But.... Google CSE was ridiculously easy. I like easy. Does this make me a bad person? |
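To make the "posting lists sharded across objects" idea a little more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: a local directory stands in for the S3 bucket (no AWS API is used), the `postings/NN.json` key scheme and the CRC32-based sharding are invented here, and the function names are hypothetical.

```python
import json
import re
import zlib
from pathlib import Path


def shard_key(term, shard_count=64):
    """Object key holding the posting list for a term (hypothetical scheme)."""
    return "postings/%02d.json" % (zlib.crc32(term.encode("utf-8")) % shard_count)


def build_positional_index(pages):
    """pages: dict URL -> plain text. Returns term -> {URL: [word positions]}."""
    index = {}
    for url, text in pages.items():
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(term, {}).setdefault(url, []).append(pos)
    return index


def write_shards(index, out_dir, shard_count=64):
    """Write one JSON object per shard; out_dir stands in for the bucket."""
    shards = {}
    for term, postings in index.items():
        shards.setdefault(shard_key(term, shard_count), {})[term] = postings
    for key, terms in shards.items():
        path = Path(out_dir) / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(terms, sort_keys=True), encoding="utf-8")
    return sorted(shards)


def lookup(term, out_dir, shard_count=64):
    """Read only the shard that can contain the term, as a client would."""
    path = Path(out_dir) / shard_key(term, shard_count)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8")).get(term, {})
```

Because the key is computed from the term itself, a client can go straight to the one object it needs without downloading the rest of the index; whether that is fast or cheap enough on a real bucket is exactly the open question raised above.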
Yes, it pretty much does. Google is The Evil! Seriously, using a Static Site Generator usually means that you want to gain full control over your website and adding third-party components from servers you don't own is not really a better and/or more secure idea than running a dynamic server daemon yourself, is it? |
People do static web sites for different reasons. Some do it for more control, like you. Whereas I'm with The Dead Kennedys, Give Me Convenience or Give Me Death. Seriously, if I get into the server-running business my "users" won't be happy about availability and I won't be happy about fire drills. I do understand the trade-off and have misgivings; I feel slightly ashamed if that makes you feel better. So for example I use Disqus to add comments and I use Google Analytics to see if my tree falling in the woods makes any sound. |
Because I feel slightly ashamed I'd gladly use a search system that hosted its database intelligently on Amazon S3, for example. If that doesn't exist, I'd find that really fun to develop. I just don't have time to, now. I'm already close to over-extended on open source projects. |
You could also run Sphinx on Amazon S3, would that validate my point then? ;) -e- Oh, good timing. |
I'm sorry if I overlooked that option in the discussion. I'll try to find time to look at that. Thanks for pointing it out. |
If there's anything I can help you with, I'm happy to do so. :) |
Honestly? I'd like a "dummies guide" to Sphinx, specifically how to make some JS front-end access the Sphinx database file(s) from a plain file server like FTP or S3. Whether that exists already, or you write it. It would be great if you could contribute the parts I don't know, and I can contribute what I know about how to integrate things smoothly into the Frog build process. |
After reading a bit: The database file(s) need to be generated first (see the "Indexes" part). Sphinx can even do real-time indexes without a database, according to the internet. However, having it expose a "human readable" database seems to be undocumented without accessing the Sphinx daemon. Similar to Solr, the API primarily exposes the server handle, it seems. However (I like that word), it seems Sphinx can be instructed to generate index files. This is ... interesting. |
For what it's worth I'm in at least three camps here.
So, my point here, as far as I have one, is that I'm definitely in the extreme-static camp, and while search would be interesting I'm completely happy to not have it if it's not practical in a static system, and I'd like it to be the case that Frog continues to support that position. I'm not really worried that it won't, I just want to make sure it does. Sorry for the rant! |
Python is a very fast and quite reliable way to prototype things although its syntax is... weird. But I don't understand your last point: How does Googlebook know more about you when the publicly visible HTML pages are generated dynamically? No one forces you to use Google/Facebook things with WordPress, for example. |
The point I didn't make clearly was that I don't want search to rely on google, or if it does I want it to be easy to disable. (I have no problem with Python's syntax. I have a number of problems with its semantics which are just inexcusable: if it had been designed in the early 60s I could forgive it, but it wasn't. But this is very much not the forum to get into a fight about Python (apart from anything else writing Python is my day job and I don't want to think of it outside that).) |
@greghendershott Howdy!
I smiled when I read that. :) As a follow up to that thread you linked to, I'm considering turning what I built into a hosted service. One caveat is it means trusting my servers to do what I claim they do. My goal, however, is letting people hide the fact that my service is in the mix at all. Would building this service be a solution to this issue or not so much? |
I guess the whole point of a static website is that you can't trust a server if I got it correctly? |
@MTecknology Hi!
Speaking only for myself, for my personal blog? I don't think so. If I wanted to add search as a service, I'd probably use an existing search provider. If such a provider served ads, I wouldn't love that but I'd understand that it needs to be paid for, somehow. Speaking generally? There very well might be a market for such a service. |
Hi,
(sorry, this one will be longish, but I want to make my points clear…)
having been in the process of transitioning from WordPress to some static (not just flat-file) blog for years now (I’m really lazy), I still haven’t settled on which system to use. Actually, I had found one which I thought would be perfect, then I noticed that the provided solution for searching articles was not working as intended, especially since there was no way to use it without JavaScript. The most important feature of a blog system is a good search functionality, followed by a decent comment solution (but that’s a different thing).
So I’m back on track, looking for the perfect static blog solution. I already have a list of such systems which failed to work well for me (mostly theming- or feature-related issues), so I loosened my requirements a bit. I don’t even care which programming language is used anymore as long as it just works (as in it provides a good search function) and it’s a cool one (as in it’s not JavaScript).
Now here’s what I want:
The perfect static blog solution should, while generating the pages, keep some full-text index of the posts and provide a search function which could be accessed through the front-end (like the article listing but filtered by contents). In case this is already possible, please tell me how - I actually searched the docs and sources but I haven’t found such a functionality.
As I want you to actually consider this - I know - frequent feature wish within the near future, I posted it to several interesting generators’ issue trackers, including yours. I’ll probably use the Static Site Generator which comes up with a sufficient search functionality first.
Thank you in advance.