We have a server running a Python application in Apache with mod_wsgi. In httpd.conf we configured 4 processes to run this application, each with multiple threads.
This configuration caused us some trouble. We made heavy use of memory to cache results of very complicated, time-consuming computations, as well as some key-value pairs for fast reads. Since different processes don’t share memory, each process holds its own copy of the same data; in other words, we need roughly four times the memory a single cache would take. Of course, this can be solved by moving the cache out of process, using something like memcached.
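As a sketch of the cache-aside pattern that an out-of-process cache enables, assuming a memcached-style client with get/set methods (a dict-backed stub stands in here so the example runs without a memcached server; the function names are illustrative):

```python
class StubClient:
    """Dict-backed stand-in for a memcached client,
    e.g. memcache.Client(['127.0.0.1:11211']) from python-memcached."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, time=0):
        self._data[key] = value


def cached(client, key, compute, ttl=3600):
    """Return the value for key, computing and caching it on a miss."""
    value = client.get(key)
    if value is None:
        value = compute()
        client.set(key, value, time=ttl)
    return value


client = StubClient()
calls = []

def expensive_computation():
    calls.append(1)          # track how many times we actually compute
    return 42

first = cached(client, "answer", expensive_computation)   # miss: computes
second = cached(client, "answer", expensive_computation)  # hit: served from cache
```

With a real memcached client, all four worker processes would share the single cached copy instead of computing and storing it four times.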
Another problem is logging. We are using Python’s built-in logging module for this job (we needed extra logging besides Apache’s access log). At first we used the basic FileHandler and everything seemed OK. Later we decided to use TimedRotatingFileHandler to rotate the log at midnight so that we could analyze the logs daily. But on the second day, we found that the previous day’s log was missing. For example, today is 2011-08-10 and we expect to see yesterday’s log in the file “customlog-2011-08-09”, but this file is full of today’s log entries! Worse, both “customlog” (the non-rotated file created today) and “customlog-2011-08-09” keep growing, which means both are still being appended to by processes.
Have a look at TimedRotatingFileHandler’s doRollover method:
```python
t = self.rolloverAt - self.interval
if self.utc:
    timeTuple = time.gmtime(t)
else:
    timeTuple = time.localtime(t)
dfn = self.baseFilename + "." + time.strftime(self.suffix, timeTuple)
if os.path.exists(dfn):
    os.remove(dfn)
os.rename(self.baseFilename, dfn)
...
```
Before renaming the “customlog” file, it first checks whether “customlog-2011-08-09” exists. The first process (A) went through normally: it renamed “customlog” to “customlog-2011-08-09” and created a new “customlog” file to write new logs to. When the second process (B) wanted to write its first log entry after midnight, it found that “customlog-2011-08-09” already existed, so it deleted that file (yesterday’s log was destroyed) and renamed the “customlog” that A had just created to “customlog-2011-08-09”. Note that A kept writing to its file regardless of the rename, so at that point A and B were writing to different log files. And then came C, D… each going through the same check-delete-rename sequence. What a mess.
Multiprocessing vs. multithreading
Obviously, if we had only one process, we would never have run into this mess: whatever we cache, there is only one copy of it in memory. So I wondered, if multithreading is enough, why bother with multiprocessing? I asked this question on Stack Overflow and got two answers. Graham’s answer indicates that using multiple processes does yield better performance than multithreading alone.
So let’s think about some solutions.
Write separate files
We can include the current process id in the log file name, so that different processes write to different log files and never interfere with each other.
But when the web server is restarted, the processes get new ids and write to new files instead of appending to the old logs, and those old log files will never be renamed into the timestamped style. To fix this, we can write a shell script.
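A minimal sketch of the pid-in-the-filename idea (the temp directory, format string, and backupCount are illustrative, not from our actual setup):

```python
import logging
import logging.handlers
import os
import tempfile

logdir = tempfile.mkdtemp()   # stand-in for the real log directory
# Each worker process gets its own file, e.g. customlog.12345,
# so rollovers in one process never touch another process's file.
logfile = os.path.join(logdir, "customlog.%d" % os.getpid())

handler = logging.handlers.TimedRotatingFileHandler(
    logfile, when="midnight", backupCount=30)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("worker %d started", os.getpid())
```

The rotation race disappears because no two processes ever share a base filename; the cost is one file (per day) per worker to stitch back together afterwards.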
Combine WatchedFileHandler with TimedRotatingFileHandler?
Here is WatchedFileHandler’s emit method:
```python
def emit(self, record):
    """
    Emit a record.

    First check if the underlying file has changed, and if it
    has, close the old stream and reopen the file to get the
    current stream.
    """
    if not os.path.exists(self.baseFilename):
        stat = None
        changed = 1
    else:
        stat = os.stat(self.baseFilename)
        changed = (stat[ST_DEV] != self.dev) or (stat[ST_INO] != self.ino)
    if changed and self.stream is not None:
        self.stream.flush()
        self.stream.close()
        self.stream = self._open()
        if stat is None:
            stat = os.stat(self.baseFilename)
        self.dev, self.ino = stat[ST_DEV], stat[ST_INO]
    logging.FileHandler.emit(self, record)
```
I once thought of combining this idea with TimedRotatingFileHandler but soon realized that locks must be used and it could be complicated.
Just use WatchedFileHandler
One method is to use only WatchedFileHandler, and set up a cron job that renames the file at midnight. All processes will notice that the file’s device/inode has changed, so they’ll start writing to a new file.
Actually I didn’t try this so I don’t know if it will work.
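Here is a quick simulation of the idea, with os.rename standing in for the midnight cron job (the temp directory and file names are illustrative):

```python
import logging
import logging.handlers
import os
import tempfile

logdir = tempfile.mkdtemp()            # stand-in for the real log directory
logfile = os.path.join(logdir, "customlog")

handler = logging.handlers.WatchedFileHandler(logfile)
logger = logging.getLogger("watched")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("before rotation")

# Simulate the cron job: mv customlog customlog-2011-08-09
os.rename(logfile, logfile + "-2011-08-09")

# On the next emit, the handler notices the inode change and
# reopens "customlog", leaving the rotated file untouched.
logger.info("after rotation")
```

After the rename, “customlog-2011-08-09” holds only the pre-rotation entries and a fresh “customlog” receives the new ones, with no delete-rename race between processes.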
At the very least, we need to guarantee that each log line is smaller than 4K. See the discussion here. It also showed another approach:
Send logs to a single process
Python logging provides a SocketHandler, which means you can send log records over the network. Python’s documentation gives an example implementation of the log server. Someone has also written a server based on the Twisted framework (event-driven, non-blocking).
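A condensed sketch of this setup, adapted from the pattern in Python’s documentation (the receiver here is a thread handling a single connection, just to keep the example self-contained; a real deployment would run it as a separate process):

```python
import logging
import logging.handlers
import pickle
import socket
import struct
import threading

received = []

def receiver(server_sock):
    """Accept one connection and unpickle incoming log records."""
    conn, _ = server_sock.accept()
    while True:
        header = conn.recv(4)             # SocketHandler prefixes each
        if len(header) < 4:               # record with a 4-byte length
            break
        length = struct.unpack(">L", header)[0]
        data = b""
        while len(data) < length:
            data += conn.recv(length - len(data))
        record = logging.makeLogRecord(pickle.loads(data))
        received.append(record.getMessage())
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))             # any free port
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=receiver, args=(server,))
t.start()

# Each worker process would configure a handler like this one.
logger = logging.getLogger("netlog")
logger.addHandler(logging.handlers.SocketHandler("127.0.0.1", port))
logger.setLevel(logging.INFO)
logger.info("hello from worker")

logging.shutdown()                        # flush and close the socket handler
t.join()
server.close()
```

Since a single process owns the file, it can rotate with TimedRotatingFileHandler safely; the workers only ever hold a socket.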
This way we have to maintain another server in addition to the web server. What if this logging server process crashes?
Send logs to web server
Actually, this still means sending logs to a single process. mod_wsgi provides a way to print messages to Apache’s error log. But this approach is definitely not recommended, because the application logs would get mixed in with Apache’s error logs.
I personally prefer a single process with multithreading over the multiprocess solutions. Graham’s answer mentioned that since part of the server’s processing happens outside Python’s GIL, multithreading can still make partial use of a multi-core architecture. If I ran a site with a high volume of traffic, I would start with a single process and watch the load on each CPU core.