#POSTGIS IRC Log - 2008-08-03

For logs after Feb 3, 2007, all times are GMT-8. Prior logs are GMT-9.
Back to Logs
05:14:25 milovanderlinden: Paul Ramsey pointed to a nice presentation about mapserver ( the presentation with the oxes? ) anyone knows a url to that?
05:24:24 mloskot: milovanderlinden: http://s3.cleverelephant.ca/geoweb-mapserver.pdf
05:37:00 milovanderlinden: mloskot: Great! Thanks for the link!
07:06:09 CIA-6: pramsey * r2913 /trunk/ ( 6 files in 2 dirs ): Spelling changes from Mark Kolybabi.
11:03:06 iggy: so, I've got a rather large database of addresses with lat/lon and I've added a geom column with POINTs for the lat lon, I've got indexes on the zip column and on the geom column, but querying for a Buffer around a point still takes nearly 30 minutes
11:04:49 iggy: I need it to be a little faster
11:04:55 iggy: any ideas?
11:33:47 jlivni: iggy: paste your sql query
11:34:09 jlivni: hint: if you haven't indexed on the buffer, then it's not using the index
11:36:33 iggy: select count( * ) from addresses where zip = '77006' and addr_point && ST_Buffer( ST_SetSRID( ST_MakePoint( 29.742028, -095.399748 ), 32661 ), .0014482259 );
11:36:51 iggy: the buffer is what I'm searching for, how would I index that?
11:38:05 jlivni: wow thats a tiny buffer
11:38:51 jlivni: epsg:32661 is in meters, no? your buffer untis should be meters too. thats aside the point of qry speed tho
11:39:20 iggy: yes, I'm pretty sure I'm using meters
11:39:35 jlivni: try: select count( * ) from addresses where addr_point && st_expand( ST_SetSRID( ST_MakePoint( 29.742028, -095.399748 ), 32661 ), .0014482259 )
11:39:43 jlivni: you're buffering by .0014 meters
11:39:43 iggy: that should be about .01 miles converted to meters
11:39:46 jlivni: which is not a big space
11:39:55 iggy: oh
11:39:57 iggy: huh
11:40:05 iggy: can't be
11:40:14 jlivni: select st_area( ST_MakePoint( 29.742028, -095.399748 ), 32661 ), .0014482259 )
11:40:27 jlivni: that should be results in sq. meters
11:40:30 iggy: that query returns 144 addresses
11:41:07 jlivni: well, you haven't asked to only find addresses within the buffer - just those that match the bbox ( rtree index ) of your buffered pt
11:41:48 iggy: I think my head will explode soon
11:41:56 jlivni: you probably want: select * from addresses where st_within( addr_point, somebuffer ) and addr_point && somebuffer
11:42:29 jlivni: oh wait. i am sorry - i am lying to you a little here, i forgot the st_stuff negates the need for the whole && business. please ignore my above - i am just waking up
11:42:32 jlivni: all you need is st_within
11:42:38 jlivni: it will do the index automatically if you ahve an index
11:43:10 jlivni: so try just select * from addresses where st_within( addr_point,yourbuffer )
11:43:29 jlivni: where yourbuffer is the whole st_buffer( st_setsrid.... ) )
11:43:58 jlivni: and of course we are assuming you have a GiST index ( not btree ) on yuor addr_point geometry field
11:44:05 jlivni: and that you ran vacuum analyze
11:44:31 jlivni: it should be very quick
11:45:13 jlivni: if it is slow, please try replace the st_buffer with st_expand and see if it is a lot quicker
11:46:44 iggy: I'll run a vacuum analyze real quick ( well, relatively speaking, 200 mil addresses doesn't do anything quick I think ), I've run it a couple of times, but I don't remember if I did it after my last changes to the table
11:47:01 iggy: then I'll try your other suggestions and let you know
11:47:53 jlivni: as long as your last changes weren't adding the rtree index, or adding a relatively huge amt of new pts, you should not need to rerun
11:48:59 iggy: the last changes were adding the regular index on the zip column and the gist index on the addr_point column
11:49:41 jlivni: yeh, you want to run vacuum anaylze after adding the gist index
11:50:16 jlivni: given your # of points, that might be the prob
11:50:53 jlivni: back to the buffer, note that select st_area( st_buffer( st_setsrid( ST_MakePoint( 29.742028, -095.399748 ), 32661 ), .0014482259 ) ) is a very small area ( result in meters )
11:51:13 iggy: for some reason I was thinking it was km
11:51:41 jlivni: no - results are always in units of the projection
11:51:43 iggy: since it was returning insane numbers when I tried making it meters
11:52:04 jlivni: i am assuming your points are also in the same projection, not in lat/lng?
11:52:16 iggy: they are lat/lon
11:52:16 jlivni: or, rather, not in geographic coords
11:52:23 jlivni: then you won't get a match...
11:52:43 jlivni: your buffer also needs to be in the same projection
11:53:08 iggy: I don't get it
11:53:27 jlivni: well, you are trying to compare location in projection A with location in projection B. that won't work
11:53:34 jlivni: you need to transform one of them to the others projection
11:54:20 iggy: I don't follow
11:54:51 iggy: how can you convert a radius around an address into lat/lon coordinates
11:55:20 iggy: or I guess I thought that's what st_buffer did
11:55:57 iggy: and how could you have points designated by anything other than coordinates?
11:56:06 jlivni: st_buffer just gives you the buffer of the coords you are in ... wait a sec. i am cofusing you again for no reason
11:56:18 jlivni: why are you using st_setsrid in there
11:56:27 jlivni: let me see your orig query
11:56:33 iggy: becase it complained that my srid's were different
11:56:38 jlivni: yeah
11:56:39 jlivni: they are
11:56:43 jlivni: your points are probably in srid 4326
11:56:55 iggy: select count( * ) from addresses where zip = '77006' and addr_point && ST_Buffer( ST_MakePoint( 29.742028, -095.399748 ), .0014482259 );
11:57:17 jlivni: so you need to setsrid to 4326, not to 32661 or whatever it was
11:57:21 iggy: update addresses set addr_point = GeomFromText( 'POINT( lat lon )', 32661 );
11:57:46 jlivni: where are you getting 32661 from?
11:58:08 iggy: when I was testing that was the only thing I could get to work
11:58:31 jlivni: ok .. but 32661 refers to a projection that is in meters -- how did you decide to use 32661 instead of say, 4326?
11:59:08 iggy: because out of the 14 or so that I tried, that was the first one that worked
11:59:27 iggy: 4326 didn't work
11:59:29 jlivni: ah .. guess and check method. 32661 isnt well defined for most of the earth http://spatialreference.org/ref/epsg/32661/
11:59:30 sigq: Title: WGS 84 / UPS North: EPSG Projection -- Spatial Reference ( at spatialreference.org )
11:59:49 jlivni: whats your error with 4326
12:00:26 iggy: I'm guessing with what you told me the 4326 wasn't working because they were different projections or something
12:00:38 jlivni: well anyway, for this particular case of checking within a buffer, using the totally wrong projection MAY not be an issue. i would recommend getting your points in there correctly, and then running the right qry
12:00:50 jlivni: yes - that is why my stuff was not working, i was assuming, incorrectly, your points were in the database as 4326
12:01:11 jlivni: which they should be, assuming again of course that is your source data projection ( a reasonable guess )
12:01:24 iggy: they are US addresses
12:01:31 iggy: if that helps
12:01:49 jlivni: well, that doesn't mean anything regarding the projection of the lat/lng, but again i'm making a reasonable guess.
12:01:56 jlivni: certainly they're not in 32661
12:02:15 jlivni: you should be able to add them doing GeomFromText( 'POINT( lat lon )', 4326 )
12:02:53 iggy: but I can just change the srid in geometry_columns and yeah, just update the points the same way I did the first time
12:03:24 jlivni: yes - you should do that
12:03:39 jlivni: then you can run the qry using 4326
12:03:51 iggy: should I drop the gist index first?
12:03:54 jlivni: this is probably unrelated to your speed issue, but given you have a problem it is best to get things set up right from the start
12:04:03 iggy: agreed
12:04:05 jlivni: yes, you will need to recreate it anyway, since all your geoms will have changed
12:05:10 jlivni: i dont know if geomfromtext( .. ) is faster or slower than setsrid( st_makepoint( .. ) ) but if it takes a long time to update your addresses you could test a little and see if one saves much time over the other
12:05:50 jlivni: anyway. after updating, create the GiST index, and then run vacuum analyze, and then you should have quick results
12:06:26 iggy: yeah, it took nearly 2 hours last time to update the addr_point column
12:06:48 iggy: I've still got my test database sitting around that is only 50k or so records that I can test on
12:06:53 iggy: it'll be a bit faster
12:08:47 jlivni: i've gotten subsecond results with ~5M pts ... so 30 minutes is just WAY off and i think it wasn't using your indexes right for some reason. definitely check back here if you have troubles after getting your srids in order
12:09:20 jlivni: remember if you need it within your actual buffer, use st_within
12:10:47 jlivni: as an aside, i find that "st_distance( yourpts,buffer ) = 0" is also slightly faster than "st_within( yourpts, buffer )", but not to a huge degree
12:19:58 iggy: I just gave it about 8+ hours of work to do, so you probably won't hear back from me till tomorrow
15:20:55 iggy: jlivni: on my test data, I went through and made all those changes we talked about earlier, but I still have to set the radius of st_expand/st_buffer to that very small number
15:22:21 iggy: and I remember where that number came from, it's $miles / 69.05
15:23:55 iggy: so it's supposed to represent .1 miles in degrees
15:48:54 jlivni: yes - thats true now because you are using epsg:4326 which is in decimal degrees. previously you said ( unintentially ) your data was in units of meters
15:49:21 jlivni: note that distance in decimal degrees varies depending where you are, so it's not at all exact
15:49:45 jlivni: maybe +/- 30% difference north vs south of US ... something like that
15:50:09 jlivni: so if you need it exact, transform to something other than decimal degrees, but anyway focus on that after getting everything fast
15:51:21 iggy: I'm really starting to admire the people that do this for a living and know what they are doing
15:51:37 iggy: like my brother does this stuff all day long for some oil company
15:52:45 iggy: in any case, I was having to use that small number before because I thought it was in degrees too
15:53:32 iggy: I get the same results with both srids
15:55:25 iggy: I'm also wondering if there are memory settings I can change to help out
15:55:53 iggy: the biggest postgres process is only using ~40M of memory
15:56:12 iggy: I haven't found much info about tuning the memory settings in postgres
15:56:28 tazle: iggy: what are you doing?
15:57:00 iggy: pretending to be a dba with a geographical specialty
15:57:36 iggy: I have a database with 200 million addresses in it that I need to query to find addrs within a radius of other addrs
15:58:31 iggy: my first try took over 30mins to run one query
15:58:45 iggy: which isn't really going to work for what we are trying to do
15:59:57 tazle: did you already get the quer to use the index?
16:00:09 iggy: that's what I'm waiting on right now
16:00:15 tazle: "waiting"?
16:00:21 iggy: we think it wasn't before
16:00:38 tazle: use explain to see the query plan - there's no waiting to be done
16:00:39 iggy: well, it takes a little while to index 200 mil records
16:00:53 tazle: oh, you're reindexing
16:00:59 iggy: oh, I'm redoing the index because I had messed up something else
16:01:22 tazle: ah right, the SRID error
16:02:41 iggy: yeah, so I dropped the geom column, recreated it, refilled it using the lat and lon columns, now it is indexing, then it has to do a vacuum analyze
16:02:51 iggy: then I'll know whether it helps
16:03:20 iggy: this database is pretty large
16:05:11 iggy: it was sitting at 68G now it says it's 204G... I have the feeling that I'm going to have to reload all the data from scratch
16:11:28 darkblue_B: hmm thats bigger than OpenStreetMap
16:12:40 darkblue_B: sensor data.. or cell phones or somesuch, I'd guess
16:13:45 tazle: well, if it's one row per address, it sounds reasonable
16:14:01 tazle: however, that's a silly way to store addresses :P
16:14:53 tazle: well, a silly way to store just addresses anyway - if there's some other data involved it's probably not so silly
16:15:35 iggy: there's other data
16:15:46 iggy: demographics, names, phone numbers, etc
16:20:13 iggy: tazle: what would be a better way to store just addresses?
16:21:14 tazle: iggy: normally addresses are stored by storing the first and last address numbers on each side of a road segment
16:21:55 darkblue_B: like TIGER
16:23:36 tazle: iggy: however, that is really useful just for converting an address text into a point, or a point into nearest address - I'm not sure if you're trying to do either
16:23:47 iggy: interesting
16:25:09 iggy: we have users that go on to our site, they type in an address, we look that up and get the lat/lon then find either the closest N addresses or we find the number within a certain mile radius
16:27:17 tazle: iggy: but that is of addresses that you have some data for?
16:27:33 tazle: iggy: ie. not just any valid addresses
16:28:46 iggy: yes
16:29:23 iggy: it's very common that they will limit the results based off of demographics that we have
16:30:44 iggy: i.e. the 20 closest address to 123 anywhere street that the head of household makes more than 150k/yr and haven't moved within the last 3 years
16:31:07 iggy: it's a marketing tool for realtors
16:31:55 iggy: those goofy little cards you get in the mail trying to sell houses or letting you know that such and such realtor just sold a house in your neighborhood for the high price of X
16:32:40 iggy: and yes, I know I'm going to hell for helping them fill people's mail boxes with junk mail
16:48:28 darkblue_B: well.. targeted marketing mailers is annoying, but more efficeinet than blankets of *un*targeted marketing mailers :- )
16:49:18 iggy: all the docs I can find consider 4G of ram to be huge... I think they are a little outdated
16:49:18 darkblue_B: I'm interested in writing things like you are, but for sustainability/efficiency/carbon pfftprint reasons
16:49:41 darkblue_B: RAM is moving along
16:49:57 darkblue_B: there has been talk of caharcterizing your server usage.. for example
16:50:13 darkblue_B: you can pump up the RAM for a search, if its just you searching
16:50:30 darkblue_B: but probably a bad idea if you get a few dozen searches going at once, ever
16:50:33 iggy: unfortunately, this data ended up being a lot bigger than I thought, I was expecting ~30G of data, so the server has 48G
16:50:48 iggy: the data ended up being 70G, so it won't all fit in linux's fs cache
16:50:50 darkblue_B: 'cause you wil cause the server to begin disk thrashing
16:51:15 darkblue_B: you are in the 'data warehousing' size of DBs there I'd say
16:51:47 darkblue_B: so.. at OSCon I went to one of the Pg data warehousng talks
16:52:10 darkblue_B: they said that for their system, they built a queue, so that no more than one query can execute, ever
16:52:24 iggy: we have a seperate server that does all our transactional stuff ( i.e. people buying this data that we lookup, etc. )
16:52:25 darkblue_B: based on that, they can tune their memory footprint
16:52:37 iggy: I considered that
16:53:14 darkblue_B: it seems worthwhile for your situation
16:53:53 darkblue_B: there is a Pg day on Tuesday at Linux World SF
16:54:02 darkblue_B: if you are in the California Bay Are
16:54:05 darkblue_B: Area
16:54:35 darkblue_B: Pg is holding its own against MySQL for just this sort of purpose
16:54:41 iggy: in texas unfortunately
16:54:43 darkblue_B: .. by any objective measure
16:55:08 darkblue_B: ah so.. well IRC it is then
16:55:15 iggy: from what I saw when doing my research pg was doing better than mysql for this type of stuff
16:55:21 darkblue_B: yes
16:55:26 iggy: that's why I went with it
16:55:34 darkblue_B: confirmed by industry gossip ;- )
16:55:52 iggy: our other db is actually mysql because that's what everybody was familiar with
16:56:17 darkblue_B: super sized servers... strict data constraints.. fancy DB language tricks.. and &^&*( ^* huge datasets
16:56:38 darkblue_B: MySQL is catching up on a lot of other fronts..
16:57:09 darkblue_B: ..and of course, PostGIS.. which is neat-o p )
16:57:38 iggy: seemed to be more mature than mysql's spatial stuff ( although I'm not sure which is actually older )
16:57:49 darkblue_B: indeed it is
16:58:12 darkblue_B: I dont gather that spatal is a big target for MySQL now.. I saw no menton of it in 6.x stuff
16:58:28 darkblue_B: PostGIS works rather well
16:59:23 darkblue_B: iggy: you can see some things I have written here htt://24.7.79.159/portfolioMisc/
16:59:30 darkblue_B: http
16:59:52 darkblue_B: also the video "Dynamic KML" filed at Google
16:59:56 darkblue_B: filmed
17:03:02 iggy: darkblue_B: nice, lots of stuff there that I'll be reading for the next week or so
17:03:24 iggy: should print some of it out for my flight Wed.
17:03:38 tazle: I'm not sure if fitting all the data into memory is that big of an issue, as long as you can fit all the necessary indexes in memory
17:04:19 darkblue_B: good point tazle
17:05:20 iggy: which it may not considering my dataset is ~70G and the tablespace dir with indexes is ~208G so far
17:13:14 tazle: did you already vacuum the old index away?
17:13:51 iggy: nein, it's indexing now then it will vacuum analyze
17:16:48 tazle: well, given some characteristics of the tree used for indexing, and the number of objects, the space taken by the index would fairly easy to estimate
17:17:35 iggy: I'm patient enough to wait
17:17:46 iggy: any more knowledge and my head might explode
18:14:26 jlivni: iggy: your db size will go down after vacuum ( actually a vacuum full will definitely compact it .. i think vacuum analyze does too but not sure offhand )
18:14:57 jlivni: once indexing is done, try your query. it should be fast. if its not, repost your sql and someone will be able to help i am sure
18:15:40 jlivni: iggy: you can read the postgis tuning guide, an old copy of which is here http://postgis.refractions.net/pipermail/postgis-users/2006-March/011539.html for hints on upping mem for your use case
18:15:42 sigq: Title: [postgis-users] Problem with Intersection ( at postgis.refractions.net )
18:16:59 iggy: jlivni: thanks for that url, looks very helpful
18:17:36 iggy: I'm sure you'll hear from me every once in a while now that I've found out this place exists
19:10:06 bitnerd: jlivni, iggy vacuum and vacuum analyze just mark areas to be reused -- it takes a vacuum full to actually compact the db
19:10:17 iggy: rgr
19:12:20 bitnerd: http://www.postgresql.org/docs/8.3/static/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY
19:12:22 sigq: Title: PostgreSQL: Documentation: Manuals: PostgreSQL 8.3: Routine Vacuuming ( at www.postgresql.org )
19:14:42 iggy: I think I'm going to try clustering based on the zip code too
19:16:47 bitnerd: clustering should do the same as vacuum full too
19:17:11 iggy: that's what the documentation seemed to indicate
19:21:00 jlivni: bitnerd: ah right, thx for clarifying that.
21:31:07 darkblue_B: TheSteveMonkey: hey