Today I needed to pull a web page down from the internet and extract some specific content from it in PHP. Sounds like a crawler, huh? Not a real crawler, actually; I was just pulling our own content. I did it this way because it’s not convenient for me to access the database directly.
I’m not very familiar with PHP, but with version 5 on my local dev machine I was able to do this very quickly: just use file_get_contents to get the whole page as a string, and then use preg_match_all to search for the parts I want.
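For the record, the PHP 5 version looked roughly like the sketch below. The function names extract_items and fetch_items and the <h2> pattern are made up for illustration; they are not the real page or the real pattern.

```php
<?php
// Hypothetical helper: collect the text of every <h2>...</h2> in a page.
// The <h2> pattern is only an illustration; use whatever marks your content.
function extract_items($html)
{
    $matches = array();
    // 's' lets '.' span newlines, 'i' ignores tag case, '.*?' is non-greedy.
    preg_match_all('/<h2[^>]*>(.*?)<\/h2>/si', $html, $matches);
    return $matches[1]; // contents of the first capture group
}

// Hypothetical wrapper: fetch a page and extract from it.
// Needs file_get_contents and URL-aware fopen wrappers to be available.
function fetch_items($url)
{
    $html = file_get_contents($url);
    if ($html === false) {
        return array();
    }
    return extract_items($html);
}

// Example with an inline string, so no network is needed:
$sample = "<h2>First</h2>\n<p>text</p>\n<h2 class=\"x\">Second</h2>";
print_r(extract_items($sample));
```

Keeping the extraction separate from the download means the pattern can be tried on a saved copy of the page first.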
Unexpected things happened after I uploaded the script to the server. It said the function file_get_contents was not defined. Then I realized that the server was running Red Hat 9, and the PHP I was using was version 4.2.2, bundled with RH9 (file_get_contents was only added in PHP 4.3.0). OK. I rewrote the code to use fopen/fread directly. This time, it complained that it couldn’t handle the scheme (I don’t remember the exact error message).
I don’t know whether it was because of my configuration or because version 4.2.2 doesn’t support the URL wrappers. It drove me crazy. I didn’t want to upgrade, because all the packages on that machine are old; it would take time and might cause more problems. I couldn’t even find the apxs binary to compile PHP from source.
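One setting that may be worth checking in a situation like this is allow_url_fopen, the php.ini switch that controls whether fopen and friends may open URLs. I never confirmed that this was the cause on my server, so treat the sketch below as a guess; the helper name url_fopen_enabled is made up.

```php
<?php
// Hypothetical helper: report whether URL-aware fopen wrappers are enabled.
function url_fopen_enabled()
{
    // ini_get returns the setting as a string, e.g. "1", "On", "0" or "".
    $value = strtolower((string) ini_get('allow_url_fopen'));
    return in_array($value, array('1', 'on', 'true', 'yes'));
}

echo url_fopen_enabled() ? "allow_url_fopen is on\n" : "allow_url_fopen is off\n";
```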
Finally, I found a workaround: first use exec to call wget to download the URL to a file in /tmp, and then use fopen/fread to read this temp file. It really works.
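Here is a minimal sketch of that workaround, assuming wget is on the server’s PATH and exec is not disabled. The function names read_whole_file and fetch_with_wget are made up, and this is a reconstruction rather than the exact script I uploaded.

```php
<?php
// Read a local file with plain fopen/fread (no file_get_contents needed).
function read_whole_file($path)
{
    $fp = fopen($path, 'rb');
    if (!$fp) {
        return false;
    }
    $data = '';
    while (!feof($fp)) {
        $data .= fread($fp, 8192); // read in 8 KB chunks until end of file
    }
    fclose($fp);
    return $data;
}

// Hypothetical helper: download $url with wget, then read the temp file back.
// Returns the page as a string, or false on failure.
function fetch_with_wget($url)
{
    $tmp = tempnam('/tmp', 'page'); // creates a unique empty file
    if ($tmp === false) {
        return false;
    }
    // -q: quiet, -O: write to the given file. escapeshellarg guards the shell.
    $cmd = 'wget -q -O ' . escapeshellarg($tmp) . ' ' . escapeshellarg($url);
    $output = array();
    $status = 1;
    exec($cmd, $output, $status);
    if ($status !== 0) {
        unlink($tmp);
        return false;
    }
    $data = read_whole_file($tmp);
    unlink($tmp); // do not leave temp files behind
    return $data;
}
```

Calling it is then just $html = fetch_with_wget($url); followed by the same preg_match_all as before. Using tempnam instead of a fixed file name keeps two requests from overwriting each other’s download.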
Another problem was that preg_match_all doesn’t accept the last parameter, $offset, in PHP 4.2.2, but that’s simple to fix, I think.
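One way to get by without $offset is to cut the string with substr before matching. The helper name match_all_from and the sample data are made up, and the two approaches are not exactly equivalent: with substr, anchors like ^ and lookbehinds treat the cut point as the start of the string.

```php
<?php
// Hypothetical helper: like preg_match_all with a start offset, for PHP
// versions whose preg_match_all has no $offset parameter.
function match_all_from($pattern, $subject, $offset)
{
    $matches = array();
    // Search only the tail of the string instead of passing $offset.
    preg_match_all($pattern, substr($subject, $offset), $matches);
    return $matches;
}

$m = match_all_from('/\d+/', 'id 10, id 20, id 30', 6);
print_r($m[0]); // the numbers found after skipping the first 6 characters
```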
This took me some time, but it made me realize how much the development of software and language tools has eased our daily work.