Archive for the 'Work' Category

Goodbye image servers

Saying goodbye to a departed image server

A week ago tomorrow IDX decommissioned the last of its image servers. Over the last two and a half months I migrated a little over 20 million images, about 480 gigabytes, from our servers to Amazon’s S3 service. Most of that time was spent just occasionally checking in on the migration scripts that I had written or rewriting our image acquisition scripts to work with S3. We download images from about 190 sources every night as we gather MLS data on behalf of our clients.

The best part of the whole image migration and overhaul is that image acquisition is now tied into our data balancer system. Each MLS in our system has a time stored in the database that is the earliest we can reliably download data from that source. Once we reach that time in the day the MLS goes through a series of steps triggered by a cronjob that runs once a minute. First the data is downloaded from whatever source makes it available. This can be FTP, HTTP, SOAP, RETS, or even direct SQL connections. Next the data is parsed and made ready for insertion into our database. Once processing is done the data is geocoded so that we can easily map all the properties.
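To make the flow concrete, here is a rough Python sketch of that once-a-minute pass (the real system is PHP, and none of this is the actual code). The names `MlsJob`, `tick`, and `handlers` are made up for illustration, and advancing one stage per pass is an assumption; the only details taken from the description above are the order of the stages and the per-MLS earliest time.

```python
from dataclasses import dataclass
from datetime import time

# Stage order as described in the post; the identifiers are hypothetical.
STEPS = ("download", "parse", "geocode")


@dataclass
class MlsJob:
    name: str
    earliest: time            # earliest time of day the source is reliable
    step: int = 0             # index into STEPS of the next stage to run
    image_ready: bool = False  # set once the data stages are finished


def tick(jobs, now, handlers):
    """One cron pass: advance each eligible MLS by a single stage.

    Assumption: one stage per pass. The real balancer may run several.
    """
    ran = []
    for job in jobs:
        # Skip sources that are not ready yet or that already finished.
        if now < job.earliest or job.step >= len(STEPS):
            continue
        stage = STEPS[job.step]
        handlers[stage](job)
        job.step += 1
        if job.step == len(STEPS):
            job.image_ready = True  # hand off to image acquisition
        ran.append((job.name, stage))
    return ran
```

Keeping the current stage on the job itself means a pass that dies halfway simply resumes from the same stage a minute later.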

This was where the process used to stop. When the app was first written, image scripting was rushed as we were trying to meet our launch deadline. The image scripts were on different servers, so the data balancer couldn’t act on them directly. Instead each was launched as its own cronjob on one of the image servers. Every MLS is unique in the way we acquire images and is constantly changing, so each must have its own acquisition script. Now each of those scripts is defined in our database.

Once per minute a script runs on our EC2 server that looks for image ready flags in our data balancing system. When it finds one it checks the database for the specific file that should be run to gather images. The script runs and then resets the image ready flag. As with data, our image sources are varied. In some cases we generate URLs based on a known syntax; in others we’re given URLs by the MLS. In these cases we don’t need to store anything. Often we get images from some FTP source, via RETS, or in one case we download binary data stored as BLOBs on a remote SQL server. Needless to say it’s complex to get all these images from 190 disparate sources, so anything we can do to automate things better is good.
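The flag-and-dispatch loop can be sketched roughly like this in Python (again, the real thing is PHP backed by database tables). Plain dictionaries stand in for the flag and script tables here, and `run_image_pass` is a hypothetical name.

```python
def run_image_pass(flags, scripts):
    """Run the acquisition script for every MLS whose image flag is set.

    flags:   dict of MLS name -> bool (stand-in for the image ready flags)
    scripts: dict of MLS name -> callable (stand-in for the script lookup)
    """
    done = []
    for mls, ready in flags.items():
        if not ready:
            continue
        script = scripts.get(mls)
        if script is None:
            # Assumption: leave the flag set so a missing script gets noticed.
            continue
        script()
        flags[mls] = False  # reset only after the script has run
        done.append(mls)
    return done
```

Resetting the flag after the script finishes, rather than before, means a crashed script is retried on the next pass.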

My next project is building WSDL web services using NuSoap. This is uncharted territory for me, so I’m sure I’ll have more to say on this subject later.



I spoke a bit too soon

This post is a bit late as it describes things that happened last Thursday, but it was a busy weekend and all my “posting to stuff” energy got sucked up by Twitter.

I knew it was risky to make a self-congratulatory post about a feature that had just launched. All in all it was pretty successful and took a decent load off the server, but two bugs cropped up that were severe enough to force me to revert the code.

The first issue will be easy enough to fix. Our app has several features, including a property slideshow, that are called remotely via JavaScript includes which also rely on the results class. Because the caching mechanism needs a PHP session ID to avoid having one user contaminate another’s search results, these tools stopped returning any properties. Luckily I wrote the constructor of the results class to have an all-purpose override array as one of its parameters. So all I need to do to fix this issue is generate a session ID for the JavaScript includes to pass through to the results class and they should work again.

The second issue is going to take more work and creativity. When I wrote the results class I built out the featured properties function to be generic. My thinking was that any property listing that is owned by one of our clients is a featured property and thus belongs on the featured properties page. Our clients, however, seem to have disagreed. They’ve figured out that they can append search variables to their featured property URLs and do things like make featured properties pages that are only for million+ dollar homes, or just commercial listings, or… whatever they want. This is all fine and good when the featured property search is reperformed each time the page is called and is user agnostic, but not so much when my caching mechanism was in place. The caching mechanism treats all featured properties searches the same, effectively ignoring any search terms added on.
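One common way around this kind of collision is to fold the search terms into the cache key. A minimal Python sketch of the idea follows (the app is PHP, and this is not necessarily the fix that will ship). `cache_key` and the parameter names in the usage note are hypothetical.

```python
import hashlib
from urllib.parse import urlencode


def cache_key(search_type, params):
    """Build a cache key that changes whenever the search terms change.

    Sorting the parameters first means the same search always yields the
    same key, regardless of the order the terms appear in the URL.
    """
    canonical = urlencode(sorted(params.items()))
    raw = f"{search_type}?{canonical}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()
```

With a key like this, `cache_key("featured", {})` and `cache_key("featured", {"min_price": "1000000"})` land in different cache entries, so a million-dollar-homes page no longer picks up the generic featured list.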

I reverted to the previous version of the results class from our SVN repository and all went back to normal. Once I tie up these last couple of loose ends I’ll be able to push the caching mechanism back out as part of Wednesday’s doubledot release. Here’s hoping it goes better this time. The caching mechanism did seem to take a noticeable load off the server, so it seems like a worthy endeavor to retry.



Fun with caching

In the last couple of days I did some work to complicate the IDX application a bit. I applied the patch today that contained the changes and so far all seems well. Here’s the story.

About nine months ago I completed a reworking (aka complete rewrite from the ground up) of the application’s results class. This is the code that assembles all the properties that meet the criteria of the search that has been performed and makes them available to whatever part of the app needs them. Once all the various data tables had been queried the matching results were placed in a temporary heap table so that they could be sorted, filtered (based on client preferences and/or MLS rules), and truncated if need be. I decided to use temporary heap tables because they’re fast, and since they’re session specific I knew that I wouldn’t have to worry about one user contaminating another’s results.

The system has been working beautifully for these last nine months, but as our traffic has grown (now upwards of 44,000 hits a day) MySQL was having trouble keeping up. All the heap tables were using a lot of the server’s RAM, and since the heap tables were being destroyed as soon as the page was delivered, searches had to be rerun completely just to move from page to page.

Today’s patch changed things. The heap tables are gone in favor of a searchCache table (one for each client in our system) where all search results end up. When the same search is run again (like when switching pages) the results can be pulled from the cache instead of all the data tables needing to be queried again. All results are tagged with the user’s PHP session ID to prevent result contamination, and every 4 hours the cache is cleaned to prevent the tables from getting too large. Featured property searches are also cached in our system and, because they are the slowest queries we perform*, they are cached for 24 hours until we get new data.
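The general shape of the cache can be sketched in a few lines of Python (the real one is a MySQL table driven from PHP). An in-memory dictionary stands in for the searchCache table, and the class and method names are invented for illustration. Only the session tagging and the 4 hour cleanup come from the description above.

```python
class SearchCache:
    """Toy stand-in for a per-client searchCache table."""

    def __init__(self, ttl_seconds=4 * 3600):
        self.ttl = ttl_seconds
        # (session_id, search_key) -> (stored_at, results)
        self.rows = {}

    def put(self, session_id, search_key, results, now):
        self.rows[(session_id, search_key)] = (now, results)

    def get(self, session_id, search_key, now):
        """Return cached results for this session, or None if absent or stale."""
        entry = self.rows.get((session_id, search_key))
        if entry is None or now - entry[0] >= self.ttl:
            return None
        return entry[1]

    def clean(self, now):
        """Periodic cleanup: drop every row older than the TTL."""
        stale = [k for k, (t, _) in self.rows.items() if now - t >= self.ttl]
        for k in stale:
            del self.rows[k]
        return len(stale)
```

Because the session ID is part of the lookup key, one visitor can never be handed another visitor’s rows, which is the same guarantee the temporary heap tables used to give.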

I’m pleased so far. The patch was uploaded to our server 8 hours ago and thus far there are no reports of problems.

Thanks to bob the lomond for the photo.

*Featured results are the slowest because of the number of tables that have to be queried. Normal results only have to query one table per MLS being searched because they are property type specific. Featured properties are property type independent, and thus upwards of nine tables per MLS may need to be queried.



Want a job? Learn PHP.

It never ceases to surprise me just how hard it is to find decent PHP developers in Eugene, OR. I never intended to make a career out of PHP; I learned the language on a whim because I was tired of having to treat all my scripts as CGI, as must be done with Perl on the web. Now I not only code PHP for a living but I spend part of my time looking for people with PHP skills.

My company is looking to fill two positions, Palo Alto has at least one open, EngineWorks recently moved to Portland in part for a better employee base, and I know at least one other company is looking too. I never would have thought that PHP would be the big in-demand skill I would have. Glad I have that biology degree to fall back on.



Bishma FTW

Teh Winner

I beat Amazon!

Okay… not so much beat as figured out, and not so much Amazon as my own scripting. As mentioned in my last post, I was having issues with corrupted images getting stored on S3 during my migration process. I determined that errors were being introduced during the transfers from our old image server via FTP AND during the REST upload to AWS.

I implemented the MD5 check I mentioned in my last post and added a step to the S3 upload. After transferring the file to S3 I perform a HEAD request on the object, which sends me back a header containing, among other things, content-type and content-length. I can then make sure that the content-length matches the size of the image I downloaded and that the content-type is some type of image (useful since all errors are delivered as application/xml).
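Here is a rough Python sketch of those two checks (the real class is PHP). It assumes the MD5 check compares against a hash computed on the old image server before the FTP transfer, which this post doesn’t spell out, and it takes the HEAD response headers as a plain dictionary instead of calling S3. The function names are made up.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def upload_looks_good(local_bytes, expected_md5, head_headers):
    """Check a transferred image against its source hash and the S3 HEAD reply.

    expected_md5: hash computed at the source (assumed workflow).
    head_headers: headers returned by a HEAD request on the uploaded object.
    """
    # 1. Did the file survive the trip from the old server intact?
    if md5_hex(local_bytes) != expected_md5:
        return False
    # Header names are case-insensitive, so normalize before looking them up.
    headers = {k.lower(): v for k, v in head_headers.items()}
    # 2. Does S3 report the same size as the file we sent?
    if headers.get("content-length") != str(len(local_bytes)):
        return False
    # 3. Errors come back as application/xml, so require an image type.
    return headers.get("content-type", "").startswith("image/")
```

Any image that fails a check can simply be queued for another transfer attempt.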

Little by little I’m developing a rock solid PHP class for S3 file handling.

Thanks to Lumaxart for the image



