We have a server running a Python application in Apache with mod_wsgi. In httpd.conf we configured 4 processes to run this application, each with multiple threads.
This configuration caused us some trouble. We made heavy use of memory to cache results from very complicated, time-consuming computations, along with some key-value pairs for efficient reading. Since different processes don’t share memory, we ended up with multiple in-memory copies of the same data. That is, we needed around 4x the memory of the cache size. Of course, this can be solved by using something like memcached.
Another problem is logging. We use Python’s built-in logging module for this job (we needed extra logging besides Apache’s access log). At first we used the basic FileHandler and everything seemed OK. Later we decided to use TimedRotatingFileHandler to rotate the log at midnight so that we could analyze the logs daily. But on the second day, we found that the previous day’s log was missing. For example, today is 2011-08-10 and we expect to see yesterday’s log in the file “customlog-2011-08-09”, but this file is full of today’s log entries! Both files, “customlog” (the non-rotated file generated today) and “customlog-2011-08-09”, are growing, which means that different processes are appending to both of them.
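To make the setup concrete, here is a minimal sketch of the kind of handler configuration described above. It is not our actual code: the `build_logger` helper, the logger name, the format string and `backupCount=7` are all assumptions for illustration. Note also that with `when="midnight"` the standard library names rotated files with a `.YYYY-MM-DD` suffix by default, so the file names differ slightly from the ones quoted above.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_logger(path="customlog"):
    """Return a logger that writes to `path` and rotates it at midnight.

    Hypothetical helper: each mod_wsgi process that calls this gets its
    own handler object pointing at the same file on disk.
    """
    logger = logging.getLogger("customlog")
    logger.setLevel(logging.INFO)

    # Rotate once a day at midnight; keep a week of old files (assumed value).
    handler = TimedRotatingFileHandler(path, when="midnight", backupCount=7)

    # Including the process id makes it visible which process wrote a line.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(process)d %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

When every process runs code like this, each one holds its own handler on the same file and decides on its own when to rotate, which is consistent with the symptom of two files growing at the same time.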
Continue reading Python Logging from Multiple Processes
Many sites use Facebook Connect today, hoping to share in Facebook’s popularity by pulling Facebook users’ profiles and feeds and publishing their own site-generated content to users’ feeds and walls. So most Facebook Graph tutorials, including Facebook’s own documentation, tell us how to add a Facebook Connect button to our sites, how to set the redirect URL, how to grab the access token, and how to request users’ permissions.
But I wanted to rely on Facebook instead of implementing my own user login and registration processes. I’m still investigating and I don’t have a working example yet, so this post is just my rough thoughts. I will come back and revise it once I have some implementation experience.
Continue reading Using Facebook (and Other Sites) as User Authentication System
Yesterday a friend told me that he could not use wget to download a web page, which was protected by HTTP authentication.
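HTTP/1.1 has two auth methods: Basic and Digest. Basic auth sends the username and password Base64-encoded, which is effectively plain text, while Digest sends a hash computed from the password and a server-supplied nonce instead of the password itself.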
Basically, the process involves two round trips. The client first requests the resource, and the server returns a 401 response, so the client now knows that authentication is required (the server names the auth method, Basic or Digest, in the WWW-Authenticate response header). Then the client sends the same request again, this time adding an “Authorization” header field. The server checks this header and, if authentication succeeds, responds with the requested resource and a 200 status code. If authentication fails, it sends another 401 response; normally the client then stops trying and tells the user about the failure.
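To illustrate what goes into that “Authorization” header, here is a small Python sketch of the two computations as I understand them from RFC 2617. The function names are my own, and the Digest part assumes the MD5 algorithm with `qop=auth`; it is meant as an illustration of the scheme, not as what wget does internally.

```python
import base64
import hashlib


def basic_authorization(username, password):
    """Build the value of the Authorization header for Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    # Base64 is an encoding, not encryption: anyone can reverse it.
    return "Basic " + base64.b64encode(raw).decode("ascii")


def _md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the `response` field for Digest auth (MD5, qop=auth).

    realm and nonce come from the server's 401 response; nc (request
    counter) and cnonce (client nonce) are chosen by the client.
    """
    ha1 = _md5_hex(f"{username}:{realm}:{password}")  # who the user is
    ha2 = _md5_hex(f"{method}:{uri}")                 # what is requested
    return _md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
```

The password itself never goes over the wire in the Digest case; the server recomputes the same hash from its own copy of the credentials and compares the result.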
Continue reading wget and HTTP Digest Authentication
Earlier today, I wanted to set up a PHP development environment on my MacBook Pro. I expected it to be very simple. As many tutorials said, I just uncommented the “LoadModule php5_module libexec/apache2/libphp5.so” line, enabled web sharing in System Preferences, and expected to see PHP working by visiting http://localhost/~myusername/.
But no luck. Firefox just couldn’t connect to the web server. Then, instead of clicking the checkbox in the System Preferences window, I tried typing the command in Terminal: “sudo apachectl start”. But the command completed with no output or warnings. Also, nothing appeared in /var/log/apache2/error_log. That was really weird: on Linux, there would be some message if the server failed to start. But by appending an argument to the command:
Continue reading Mac OS X 10.6.5 PHP Problem
Having used fastcgi in lighttpd for a while, I wondered if there was a way to restart one fastcgi site/application without affecting other sites. I have many fastcgi sites hosted in a single lighttpd instance, and I don’t want them all to go down for a little while each time I restart one site after a deployment.
The solution turned out to be very simple –
- Find out the process of the application that you want to restart and kill it.
- Access your web site from a web browser and the restart is done.
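The first step above can be scripted. Below is a rough Python sketch that assumes the application's processes can be identified by a pattern in their command line and that the `pgrep` tool is available; the pattern in the usage note and the helper names are made up for illustration.

```python
import os
import signal
import subprocess


def parse_pids(pgrep_output):
    """Turn pgrep's output (one pid per line) into a list of ints."""
    return [int(token) for token in pgrep_output.split()]


def find_pids(pattern):
    """Find processes whose full command line matches `pattern`."""
    result = subprocess.run(
        ["pgrep", "-f", pattern], capture_output=True, text=True
    )
    return parse_pids(result.stdout)


def kill_site(pattern, sig=signal.SIGTERM):
    """Send `sig` to every matching process except this script itself."""
    pids = [pid for pid in find_pids(pattern) if pid != os.getpid()]
    for pid in pids:
        os.kill(pid, sig)
    return pids
```

Calling something like `kill_site("/var/www/mysite/dispatch.fcgi")` (a hypothetical path) would stop only that application's processes, and the next request to the site then brings it back up as described in the second step. Choose a pattern specific enough that it cannot match processes belonging to other sites.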
Continue reading Lighttpd fastcgi – How to Restart a Specific Site Without Affecting Others
Sometimes you need to crawl a web page and extract specific data from it. It doesn’t have to be for “evil” purposes like crawling competitors’ data. For example, you may want to periodically check your own web pages for unreachable links: extract all the links on a page and see if they’re still accessible.
For extracting specific parts from some text, we tend to think about regular expressions first. For example, the regular expression for extracting links might look like:
Continue reading Ruby HTML Parsers