Thursday, October 20, 2011

HTSQL answers

HTSQL slides

Thanks to the great audiences at my HTSQL talks at Ohio Linuxfest and Dayton IEEE! (And to the folks who will come see me at Columbus Code Camp on Saturday.) I've promised you answers to some of your questions that stumped me, and (for OLF people) been criminally slow at getting them to you. So here you are! Some of the answers come from my own research, but I've also inserted quotes directly from the creator of HTSQL, Clark Evans... the embedded quotes are from Clark.

1. How do you restrict access via HTSQL?

http://htsql.org/doc/install.html#security

First, consider carefully which database user account you use to run htsql-ctl serve, and assign only the rights that user (representing your HTSQL users) should legitimately have.

Second, you can (and probably should) close down port 8080 (or whatever port you're serving HTSQL on) on your machine's firewall, and route all traffic through a webserver like Apache. (My HTRAF directions tell how to do that.) Then you can apply whatever authentication, IP limits, etc. you need at the webserver level.

If you need multiple groups to access your data with varying levels of permission, it's easy to run multiple instances of HTSQL as multiple
database users, route those instances through Apache, and restrict them at the Apache level appropriately.
This is a great answer. A few more items:

If the database is static (updated periodically), want to put varnish or something on the front. When you make another "data push" you could run though common queries to warm the cache. This is what we do for demo.htsql.org so that queries in our tutorial don't even hit the server.

For PostgreSQL, there is also a ``select timeout`` you can set using the "tweak.timeout" plugin, it can help a little bit with accidental denial of service. Basically, it cancels a query if your query runs over a particular number of seconds. If other databases have this ability, we could add a similar feature.

There is a also a ``autolimit`` that you can apply, this adds a LIMIT X to every query. In the current HTSQL implementation, all the results have to fit into memory: so, you can kill the backend process by creating a large result set. We'll fix this problem sometime next year... if you have a friendly audience, this generally isn't a problem. This plugin helps ensure users don't "accidently" create a big result though.

Even with these two enabled, you can still make queries that bring down either the HTSQL server (via memory exhaustion on big result) or the Database (via memory or cpu denial). So, some caution is advised if you give *direct* HTSQL access since you're letting arbitrary queries be created and such.

One solid way to handle this is separate the "trusted" users who need to create queries from "untrusted" users who are just running canned reports and dashboards. There's a "ssi" demo for doing this and we'll improve on it later. Basically, you have .htsql files server side with canned queries in them. You then limit users to only access .htsql saved queries. It'd be great to have this more automated... the demo code is just that: a proof of concept.

2. How can you paginate results?

You can request a "page" of results with HTSQL's limit() function. The optional second argument is an offset:

http://demo.htsql.org/course.limit(10)
http://demo.htsql.org/course.limit(10,10)
http://demo.htsql.org/course.limit(10,20)

There is a tweak.autolimit available to keep users from killing off their browsers with mistakenly broad queries. For example:

htsql-ctl -E tweak.autolimit:limit=10 serve pgsql://user:pwd@host/database

Users may also want to consider a browser like Chrome, where a runaway tab won't lock up the entire browser.

There's no way currently to have HTML results automatically insert "Next Page"-type links. Keep in mind that users aren't likely to genuinely want to page manually through very long result sets anyway; they'd probably be better off narrowing their queries rather than searching long lists by eyeball.

There are plans for HTRAF to generate automatically paginated tables at some point in the future.

3. Performance-wise, how does HTSQL respond under intense loads?

The best way to perform under intense loads is to not perform at all (see varnish above). The HTSQL server is also stateless: you can load balance as many copies as you need to meet demand.

As far as a single-process goes, the time spent converting HTSQL-to-SQL isn't really that significant compared to the query execution. The result handling isn't bad... although we've not done any load testing.

The SQL generation is probably the most important part. I think HTSQL generates quite clear/clean SQL, although your mileage may vary. To make any real judgments you need test queries and test data and profile it. Often times a 30 line SQL query that looks a bit ugly/repetitive will outperform a "hand-optimized" SQL query that is 12 lines.

We (Prometheus Research) [would] be delighted to help figure out why HTSQL performs badly if you have a specific query and test data set for us!

Anecdotally, for big (1-3 page) queries we have converted from SQL to HTSQL, the HTSQL equivalent is ~40% less source code and an order of magnitude more readable and maintainable. The SQL we then generate is typically bigger, sometimes 2x the size, but so far, the performance in the samples we have has been as-good-if-not-better than the original.

4. What about outer joins in HTSQL?

All HTSQL joins are LEFT OUTER joins - rows from the "driving" table are always included in the results, whether or not there are also records in the joined tables... it's a natural consequence of the driving table always determining the size of the result set. (If you specifically want to exclude rows that don't have counterparts in the joined tables, you can use ?exists(joined-table). More about that here.

No comments: