Docker Development Environment for Everyone

One of the biggest challenges when collaborating with others on software and websites is setting up the development environment. The good ol’ “it works on my machine…” problem.

Well, this is no panacea for development, but it does a good job of setting up a basic environment pretty quickly.

You’re in for a special treat, because I’m going to show you not one (1), but two (2) different development environments: one for PHP, MySQL, Apache and phpMyAdmin, and one for Python (Flask) and PostgreSQL with pgAdmin. Each runs in Docker containers for ease of use.

Pre-requisites

For any of this to work, make sure you have Docker Desktop installed and running.

We’ll be using a terminal application for running some commands, so you’ll need some familiarity with that too.

Git is used to copy the files from the GitHub repo, but you can also download them as a zip file.

PMAMP

We’ll tackle the phpMyAdmin, Apache, MySQL, PHP (PMAMP) environment first.

After setting this up, we’ll have a place to put PHP code, a running Apache web server, a MySQL server and a running instance of phpMyAdmin.

The quickest way to get this going is to download the files from this GitHub repo https://github.com/ammonshepherd/pmamp

git clone https://github.com/ammonshepherd/pmamp.git

Change into that directory.

cd pmamp

And start the Docker containers

docker-compose up -d

You can view the website at http://lvh.me. lvh.me is just a handy domain that points back to your local machine (127.0.0.1, aka localhost). It makes it look like you are using a real domain name.

You can view phpMyAdmin at http://pma.lvh.me.

You can even use a real domain name. Just edit the docker-compose.yml file. There is a line like this:

- "traefik.http.routers.php-apache.rule=Host(`lvh.me`, `pmamp.lvh.me`, `example.com`)"

Just add your domain to the list (or remove the others). Each entry must be wrapped in backticks, not single quotes.

Now you just need to let your computer know to redirect all traffic for that domain name back to itself.

You’ll need to edit the /etc/hosts file (Linux or Mac), or c:\windows\system32\drivers\etc\hosts (Windows). Then you can develop for any domain name right on your computer as if it were using the actual domain name.
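For example, assuming you added example.com to the Traefik rule above, the hosts file entry would look like this:

127.0.0.1   example.com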

Put all of your website files in the ‘www’ folder and you’re ready to develop!

Check the README at https://github.com/ammonshepherd/pmamp for more details on how it works and things to change.

To stop the services (turn off Apache, MySQL and phpMyAdmin) run

docker-compose down

in the same directory where the docker-compose.yml file lives.

pFp

The setup for Python (using a Flask app) and PostgreSQL follows exactly the same process.

Grab the files from https://github.com/ammonshepherd/pfp.

git clone https://github.com/ammonshepherd/pfp.git

cd pfp

docker-compose up -d

You now have a running Flask app at http://lvh.me, or http://pfp.lvh.me and a running pgAdmin application at http://pga.lvh.me.

The same trick for custom domain names applies here too.

And also check out the README for more details: https://github.com/ammonshepherd/pfp

Run the same command as above (docker-compose down) to shut down the Python, PostgreSQL and pgAdmin containers.

Grab all of the domain names from the Apache vhost files

Quick script I whipped up today to grab all of the domain names on a server.

#!/bin/bash

# Start fresh: remove the output file from any previous run
if [ -e alldomains ]
then
  rm alldomains
fi

# Gather every vhost config file (quote the pattern so the shell doesn't expand it)
alldomains=( $(find /etc/httpd/conf.vhosts/ -name "*.conf") )

for domain in "${alldomains[@]}"
do
  # Pull the ServerName/ServerAlias lines, drop comments, strip the directive
  # names, 'www.' prefixes, and ':80' suffixes, then write one domain per line
  egrep "ServerName|ServerAlias" "$domain" | egrep -v "#" | sed -e 's|ServerName||' -e 's|ServerAlias||' -e 's|www.||' -e 's|:80||' | tr -s ' ' '\n' | tr -d ' ' | sed -e '/^\s*$/d' >> alldomains
done

# Sort and remove duplicates, writing the result back to the same file
sort alldomains | uniq | sort -o alldomains

 

This gets all of the domains from ServerName and ServerAlias lines, takes out all of the white space and empty lines, and creates a file with just a list of the unique domain names.

This accounts for domains that use ‘www’ or have port ‘:80’ on the end.

For instance, www.somedomain.com and somedomain.com are the same site, so the script strips the ‘www.’, which leaves two copies of somedomain.com; the final sort and uniq keeps just one. The same goes for ‘:80’.
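To make that concrete, here’s what the pipeline does to a hypothetical vhost snippet:

# Input lines in a vhost .conf file:
#   ServerName www.somedomain.com:80
#   ServerAlias somedomain.com otherdomain.com
#
# After stripping the directives, 'www.', and ':80', one per line:
#   somedomain.com
#   somedomain.com
#   otherdomain.com
#
# After sort | uniq:
#   otherdomain.com
#   somedomain.com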

 

Upgrading Omeka and Neatline

One of my first projects at my new job at the Scholar’s Lab at UVA was to update some old Omeka/Neatline sites.


I wrote a script to take care of the process now and in the future.

https://github.com/mossiso/onus

I perhaps went a little overboard and made it pretty robust. I was going to take the opportunity to learn some Ruby, but ended up writing it in Bash. One thing I learned is that Bash does not handle comparing floating point numbers, and that was a big part of the requirement.

I’ll run through how to use the script as well as go through some of the logic found in the script.

Running the Script

Rather than repeat everything on the github page, just take a look there for steps on how to set up and run the script.

Basically, you just run the script on the command line. It prompts for the path to the Omeka install (or you can give it after the command), and automatically upgrades Omeka and Neatline to the next higher version number.

./onus.sh /path/to/omeka/install

You can add some flags/options to the command to upgrade Omeka and Neatline to a specific version, or exclude the upgrading and just make a backup copy of the database and files into a zip file.

About the Script

The purpose of the script is to upgrade Omeka and Neatline. One big problem arose when upgrading sites running versions older than 2.0.0.

Omeka and Neatline both go through some significant database (and code) changes from 1.5.x to 2.x.x. The biggest issue seemed to be that the upgrade script for Neatline didn’t “take” and the migration needed to be done manually. Here are the steps to do it by hand (the script will take care of this if you use it).

Upgrading Omeka and Neatline from 1.5.x to 2.x.x

The first step is always to make a backup copy of the database and files. That way if anything goes awry, you can easily put things back together.

  1. To back up the database, simply take a MySQL dump.
    mysqldump -uuser -ppassword databasename > databasename.sql

    Do this in the main directory of Omeka. Then make a zip file of the entire Omeka directory.

    zip -r omeka-backup.zip /path/to/omeka/
  2. Next, deactivate any plugins you have installed, including Neatline and NeatlineMaps. One of the big changes with 2.x.x version is that NeatlineMaps is rolled into Neatline.
  3. Grab a 2.0.x version of Omeka. Either do this with git
    git clone https://github.com/omeka/Omeka NewOmeka

    or with a zip file

    wget http://omeka.org/files/omeka-2.0.4.zip
    unzip omeka-2.0.4.zip
  4. Add the 2.0.0 version of the Neatline plugin to the NewOmeka/plugins directory, along with any other plugins you may need. NeatlineText, NeatlineSimile and NeatlineWaypoints may be needed if you used that functionality in the previous version.
  5. Copy the db.ini file from the old installation to the NewOmeka/ directory.
  6. Now load the admin page for NewOmeka/ in the browser: http://domain/NewOmeka/admin/. Upgrade the database, then log in to upgrade and reactivate the Neatline plugin and other plugins as needed.
  7. You may notice things go smoothly, except the existing Neatline exhibits may not transfer. To get them into the new database tables, add the following two lines at line 80 in the NewOmeka/plugins/Neatline/migrations/2.0.0/Neatline_Migration_200.php file:
    $fc = Zend_Registry::get("bootstrap")->getPluginResource("FrontController")->getFrontController();
    $fc->getRouter()->addDefaultRoutes();
  8. Run the following database command to allow the background process to run:
    mysql -uuser -ppassword database --execute="UPDATE prefix_processes SET status='starting' WHERE id=1;"

     

  9. Finally, run the following php command to get the processes started.
    /path/to/bin/php /path/to/NewOmeka/application/scripts/background.php -p 1

     

Some Script Logic

Initially, I used the script to upgrade both Omeka and Neatline to the next higher version, going through every single minor version incrementally. When upgrading from Omeka 1.5.1 and Neatline 1.0.0 to the latest versions (2.2.2 for Omeka and 2.3.0 for Neatline), I had to run the script over 20 times!

That was way too intensive, so next I added some logic to just skip to the next major release. That dropped the number of runs needed down to four. But I could do better than that! I added some command line options/flags that let you upgrade to any Omeka or Neatline version you specify. Now you can upgrade from Omeka 1.5.x and Neatline 1.x.x directly to Omeka 2.0.4 and Neatline 2.0.0, then right to Omeka 2.2.2 and Neatline 2.3.0. Two steps!

Bash and floating points

As mentioned above, Bash does not work with floating point numbers, so I had to create a function to deal with that. Dealing with version numbers, especially minor version numbers, pretty much requires comparing floating point numbers…

In the script I use two different functions:

# Compare two floating point numbers.
#
# Usage:
# result=$( compare_floats number1 number2 )
# if $result ; then
# echo 'number1 is greater'
# else
# echo 'number2 is greater'
# fi
#
# result : the string 'true' or 'false'
# number1 : the first number to compare
# number2 : the second number to compare
function compare_floats() {
    echo | awk -v n1="$1" -v n2="$2" '{if (n1<n2) printf ("false"); else printf ("true");}'
}

This function basically compares two numbers. It outputs true if the first number is greater than (or equal to) the second number, and false if the first number is less than the second. Another way to think about it: is the second number less than the first?
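A quick check at the prompt, assuming the function has been sourced into the current shell:

result=$( compare_floats 2.2 1.5 )
if $result ; then echo 'first is greater'; else echo 'second is greater'; fi
# prints: first is greater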

# Pass the current version first, then the array
# the function echoes the version just greater than the current version,
# i.e., the next version to upgrade to.
#
# Usage:
# variable=$( get_next_version $num array[@] )
#
# variable : the next version greater than $num
# $num : the current version
# array[@] : an array of all possible versions
function get_next_version() {
    num=$1
    # Copy the array passed by name into a local array
    declare -a ARRAY=("${!2}")
    for i in "${ARRAY[@]}"
    do
        # awk exits 0 (true) when the current version is less than this candidate
        if awk -v n1="$num" -v n2="$i" 'BEGIN{ if (n1<n2) exit 0; exit 1}'; then
            echo $i
            break
        else
            continue
        fi
    done
}

For this function, you pass the current version and an array of possible version numbers. The function compares the number you pass against each entry in the array and echoes the next highest version.

Both functions use the same awk comparison, just in slightly different forms. They test whether one number is greater than the other, and return ‘true’ or ‘false’.
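For example, assuming both functions are sourced, finding the next version to upgrade to from 1.5 looks like this:

versions=( 1.5 2.0 2.2 2.3 )
next=$( get_next_version 1.5 versions[@] )
echo $next
# prints: 2.0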

Setting up a Hosting Environment, Part 5: Apache and PHP

Figuring out the possibilities for Apache and PHP reminds me of a Dr. Seuss book, “Fox in Sox”. It’s a favorite of mine. I love reading it to the kids. In it, Mr. Fox tries to get Mr. Knox to say all kinds of ridiculous (in meaning and hard to say) tongue twisters. At one point Mr. Knox exclaims:
“I can’t blab such blibber blubber!
My tongue isn’t made of rubber.”

That’s what my brain felt like after trying to figure all of the options for Apache and PHP. To combat my rubber brain, I created this flow-chart to help me keep track of the options, the pros and cons for each, and the path I finally chose.

First off, a list of requirements and goals:

  1. Chroot each vhost to its own directory, and have Apache and PHP run as that vhost’s server account
  2. Speed, run Apache and PHP at their most effective and efficient levels
  3. Utilize an opcode cache, APC, to speed up PHP pages
  4. Use trusted repositories to make installation and upgrading easier

Here’s what I eventually figured out about Apache and PHP:

[Flow chart: the Apache and PHP options, pros and cons, and the path I chose]

These sites were helpful for the initial setup of PHP as CGI with mod_fcgi and Apache in chroot (mod_fcgi sends one request to each PHP process regardless of whether PHP children are available to handle more, and there is no sharing of the APC opcode cache across PHP processes).

This site was helpful for setting up PHP as CGI with mod_fastcgi and Apache in chroot (mod_fastcgi sends multiple requests to a PHP process, so the process can hand them off to child processes, and having one PHP process for each site allows the APC opcode cache to be usable).

These sites helped me learn about php-fpm and how it is not quite ready for what I have in mind.

I ended up going with Apache’s mod_fastcgi for using PHP as a CGI, and NOT using PHP-FPM, while running Apache in threaded mode with apache.worker enabled.

Getting this set up is pretty easy. I already had Apache and PHP installed and running (with PHP as CGI using mod_fcgi), so here are the steps I used to convert it to run mod_fastcgi and apache.worker. I’m running CentOS 6.3.

Install the RPMForge repo for installing mod_fastcgi.

  • Get latest from http://repoforge.org/use/ : rpm -Uvh http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
  • yum --enablerepo=rpmforge install mod_fastcgi

Edit the /etc/httpd/conf/httpd.conf file

  • ServerTokens Prod
  • KeepAlive On
  • Edit the worker section. I still need to do some testing to figure out the best configuration
    <IfModule worker.c>
        StartServers         8
        MaxClients         300
        MinSpareThreads     25
        MaxSpareThreads     75
        ThreadsPerChild     25
        MaxRequestsPerChild  0
    </IfModule>
  • If present, make sure to comment out or delete the lines for mod_php: LoadModule php5_module modules/libphp5.so
  • and this line as well: AddType application/x-httpd-php .php
  • The last line should be: Include conf/virtual_hosts.conf

 

Create a /etc/httpd/conf/virtual_hosts.conf file

Each virtual host needs to have an entry similar to this in the httpd.conf file; I like to create a separate virtual_hosts.conf and include that in the main httpd.conf.

# Name-based virtual hosts
#

# Default
NameVirtualHost *:80

# Begin domain-name.com section
<VirtualHost *:80>
    DocumentRoot /var/domain-name/home/html/
    ServerName domain-name.com
    ServerAlias www.domain-name.com

    # Rewrite domain name to not use the 'www'
    RewriteEngine On
    RewriteCond %{HTTP_HOST}    !^domain-name\.com$ [NC]
    RewriteRule ^/(.*)  http://domain-name.com/$1 [R=301]

    # Specify where the error logs go for each domain
    ErrorLog /var/logs/httpd/current/domain-name.com-error_log
    CustomLog /var/logs/httpd/current/domain-name.com-access_log combined

    <IfModule mod_fastcgi.c>
        SuexecUserGroup domain-name domain-name
        ScriptAlias /cgi-bin/ "/var/www/cgi-bin/domain-name/"
        <Directory "/var/domain-name/home/html">
            Options -Indexes FollowSymLinks +ExecCGI
            AddHandler php5-fastcgi .php
            Action php5-fastcgi /cgi-bin/php-fastcgi
            Order allow,deny
            Allow from all
        </Directory>
    </IfModule>
</VirtualHost>
# End domain-name.com section

Things to note:

  • The line with SuexecUserGroup should have the user/group for the project.

Create the php-fastcgi file

Add a /var/www/cgi-bin/projectname/php-fastcgi file for each project. This allows php to run as FastCGI, and use suEXEC. The php-fastcgi file needs to be under suexec’s default directory path /var/www/cgi-bin/.

  • #!/bin/bash
    #  Set PHPRC to the path for the php.ini file. Change this to
    #  /var/projectname/home/ to let projects have their own php.ini file
    PHPRC=/var/domain-name/home/
    export PHPRC
    export PHP_FCGI_MAX_REQUESTS=5000
    export PHP_FCGI_CHILDREN=5
    exec /usr/bin/php-cgi

Things to note:

  • The directory and file created above must have the user/group of the project (the same as the user/group of the /var/projectname/ directory)
  • The directory and file must be executable and writable by the owner ONLY.
  • If you get Apache Internal Server errors, check /var/log/httpd/suexec.log
  • For each site, you can specify how much RAM the APC module can use. For large, busy sites, you set this higher. Not setting this defaults to 64MB, which is a bit more than needed for the average WP site. Change the last line in the /var/www/cgi-bin/projectname/php-fastcgi file:
    • exec /usr/bin/php-cgi -d apc.shm_size=128M

Change php.conf

Comment out everything in the /etc/httpd/conf.d/php.conf file so PHP is not loaded as a module when Apache starts.
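For reference, a fully commented-out php.conf ends up looking something like this (the exact contents vary by PHP version, so check your own file):

# <IfModule prefork.c>
#   LoadModule php5_module modules/libphp5.so
# </IfModule>
# AddHandler php5-script .php
# AddType text/html .php
# DirectoryIndex index.php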

Apache multi-threaded

Edit the /etc/sysconfig/httpd file to allow Apache to use multi-threaded mode (httpd.worker), which handles basic HTML files much more efficiently (less RAM). Uncomment the line with HTTPD=/usr/sbin/httpd.worker

Config Check

Check the Apache configuration files to see if there are any errors.

  • service httpd configtest

If all good, restart Apache

  • service httpd restart This will stop the running httpd service, and then start it again. Use this command after installing or removing a dynamically loaded module such as PHP. OR
  • service httpd reload This will cause the running httpd service to reload the configuration file. Note that any requests being currently processed will be interrupted, which may cause a client browser to display an error message or render a partial page. OR
  • service httpd graceful This will cause the running httpd service to reload the configuration file. Note that any requests being currently processed will use the old configuration.

Install APC

  • pecl install apc

Set up log rotation for Apache

  • Add a file /etc/logrotate.d/httpd.monti
  • /var/logs/httpd/*log {
        daily
        rotate 365
        compress
        missingok
        notifempty
        copytruncate
        olddir /var/logs/httpd/archives/
        sharedscripts
        postrotate
            /bin/kill -HUP `cat /var/run/httpd/httpd.pid 2>/dev/null` 2> /dev/null || true
        endscript
    }

Atop – Apache Top, for keeping tabs on the web servers

When I first became a systems administrator of a large web server, I wanted to know what the current traffic to all of the virtual hosts (vhosts) looked like. I wanted to see which domains were getting the most traffic and where that traffic was coming from. So began my long search for a sufficient tool. There are many out there (apache-top, Apachetop, wtop, htop, IPTraf, etc). But they didn’t do all of the things I wanted. Basically they were just command line versions of the output of Apache mod_status, or they did complex log analysis.

I wanted more: the ability to search, or show only a certain domain name; to see a list of IP addresses and how many connections each has made (to detect botnet attacks); and more.

So in true sys admin fashion, I built the tool myself. It is stable and usable enough to warrant a blog post and hopefully engender some usage by others, which will hopefully encourage ideas and improvements from the community. Go ahead and grab a copy from the github repo, https://github.com/mossiso/atop

My idea is not much different than some of the tools I linked to. I’m basically writing a wrapper around the Apache mod_status output, but this tool has the ability to do more. So here’s a little walk through of what this tool does.

Requirements

  • Apache with mod_status: This tool is built around the Apache mod_status output, so that obviously has to be installed and set up. The ExtendedStatus option has to be enabled in the httpd.conf file (see the sketch after this list).
  • links: This is a command line based web browser of sorts. Using the -dump flag, it just spits out the page to the command line.
  • netstat: This is used for one of the options to display all of the IPs connected to the webserver (via port 80).
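A minimal sketch of the mod_status setup in httpd.conf (the location name and access rules here are just an example; restrict access to taste):

ExtendedStatus On
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>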

 

This tool is just a BASH script, so once you download the “atop” file, just plop it anywhere in your home directory on your web server, change the permissions so it is executable

chmod 700 atop

and run it

./atop

There are now several options you can sort the results by:

==============================================================
a = sort all threads by time
c = sort by CPU, no GCRK_
i = list IPs connected to port 80 (uses Apache Server Status)
k = sort by K (Keep alives)
l = list IPs connected to all ports (uses netstat)
n = list IPs connected to port 80 (uses netstat)
o = sort open connections by CPU
p = sort only POST threads by time
r = raw apache status output (good with limit of at least 50)
s = search for a term, returns raw Apache Server Status results
w = sort by inactive workers
q = quit

To see the list of options while the command is running, just type any key on the keyboard.

Getting the BASH script to be responsive to the keyboard was tricky, and took me the longest time to figure out. For a while I could get the results to display and refresh every N seconds; I could even get it to do the sort options, but only if I started the script with that option. So I was super excited when I figured out the logic to get the script to respond to input.

The trick lies in putting the output commands in an infinite while loop. At the end of the loop the script does a regular bash prompt using “read”. Normally this waits for a response, but the timeout feature lets you cap that at one second, after which the while loop runs again. If a key is pressed, it breaks the while loop and prints the options message. When an option is selected, it goes back through the main while loop.
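Here’s a minimal sketch of that loop structure (not the actual atop code, just the trick itself):

#!/bin/bash
# Redraw the status every second, but react immediately to a keypress
while true
do
    clear
    date    # stand-in for the real Apache status output
    # Wait up to 1 second for a single key; read times out and returns non-zero
    if read -t 1 -n 1 key; then
        echo ""
        echo "You pressed '$key' -- show the options menu here"
        break
    fi
done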

Features

Some of the sort options I use most often are POST (p), CPU (c), IPs according to Apache (i), and IPs according to the server (n). I’ll walk through those one by one.

POST


This is probably the most helpful of the options. Usually, when a website is getting hammered, it’s because it is getting comment spam or login attempts. These all require POST requests. If you see a large number of POST requests for a single vhost, look at the IP addresses sending the requests; you can bet that if all the requests come from the same IP, it should be blocked.

CPU


This is a pretty good overview of what Apache traffic your server is handling. It shows GET and POST requests and sorts them with the most heavy CPU usage requests on the bottom. It filters out open processes with no connections, and a few other things like closing connections.

IPs (Apache)


This one is great, too. It shows the IP addresses that are connected to Apache, and sorts them by how many connections are being made. The IPs with the most connections are at the bottom. If you see an IP address with over 10 connections for a few minutes, you can bet they are up to no good. Double check with the POST option to see if they are spamming.

IPs (Netstat)


This option gets all traffic to port 80 using netstat. It filters out local traffic (and GMU traffic, but you can edit that out), and then sorts and organizes the results by how many connections each IP address is making. This gives a little more detail than the other IP option.
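For the curious, the kind of netstat pipeline behind an option like this looks something like the following sketch (not the exact command the script uses): count connections per remote IP on local port 80.

netstat -nt | awk '$4 ~ /:80$/ {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn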

If you find any bugs in the script or have a great idea for other options, feel free to fork or submit patches, or report bugs on the github repo.

Four Steps to a Personal Website

There are four basic steps to creating a personal website.

1. Content

You may want to start out with a cool design, or a fun idea for how to interact with visitors, or what have you. But really, the most important thing a website has going for it is its content. Design is a close second (but we’ll talk about that last), because people tend to shy away from “ugly” sites. But they won’t even visit in the first place if the content isn’t relevant.

You’ll need to ask yourself a few questions to get an idea of what kind of website you need. The answers will even help define the design, and determine the platform, or website technology, that you use.

  • What information do you want to share?
  • Why do you want to make a website?
  • Do you want conversations to take place on your website?
  • Do you want a blog, a simple website with information about you or a topic, or something else?

2. Domain Name

All computers on the Internet or World Wide Web have a unique number associated with them, called an IP (Internet Protocol) Address. Kind of like a Social Security Number. In order to get data from a server (a computer that “serves” content: data, websites, videos, pictures, etc.), you would need to type that specific number into your web browser. IP Addresses are in the format XXX.XXX.XXX.XXX. If you connect to the Internet at home, you might see your laptop get an IP Address like 192.168.1.20.

Since humans remember letters and words better than numbers, there is a system set up to translate words into the IP Address for a server. It is kind of like the old fashioned telephone directory. You can remember the telephone number to a person’s house, or look up the person in the phone directory to get their number. This also allows for multiple names to be pointed at one IP Address, like multiple people living in one house, sharing a phone number.

This set of characters or words is called a domain name. A domain name allows for an almost unlimited number of unique identifiers to point to a limited number of IP Addresses. The domain name plays an important role in search engine rankings. If this is your personal site, try to get your name, or part of it, as the domain name. It can be all you need for “brand” identification.

Shop around before you buy a domain name. There are plenty of options out there, just do a search for domain registrar. Often a hosting provider will sell domain names as well. As of this writing, you should be able to get a domain name for around $10-$11 a year. Make sure the registrar includes DNS management.

.org, .net, .com, .info, .us, … What type of domain name should you buy? That depends on a few things, the most important being which one is available. There are 19 top-level domains (TLDs: the very last part of a domain name, the part after the final period), and over 250 country code top-level domains (.us, .me, .de, .uk, etc.). Generally, .com and .org are the most sought after. Here is a list of some top-level domains and their intended purposes (from Wikipedia).

.com (commercial): An open TLD; any person or entity is permitted to register. Though originally intended for use by for-profit business entities, for a number of reasons it became the “main” TLD for domain names and is currently used by all types of entities, including nonprofits, schools and private individuals. Domain name registrations may be challenged if the holder cannot prove an outside relation justifying reservation of the name, to prevent “squatting”.
.info (information): An open TLD; any person or entity is permitted to register.
.name (individuals, by name): An open TLD; any person or entity is permitted to register; however, registrations may be challenged later if they are not by individuals (or the owners of fictional characters) in accordance with the domain’s charter.
.net (network): An open TLD; any person or entity is permitted to register. Originally intended for use by domains pointing to a distributed network of computers, or “umbrella” sites that act as the portal to a set of smaller websites.
.org (organization): An open TLD; any person or entity is permitted to register. Originally intended for use by non-profit organizations, and still primarily used by some.

Country code top-level domains can be used as well, often to create clever domain names (called domain hacks) like del.icio.us, bit.ly, instagr.am, pep.si, and redd.it.

 


3. Hosting

A hosting provider is the company that owns the servers where your website lives. There are many free options. Look for a hosting provider that offers “easy” installations of common software like WordPress, Drupal, etc.

Paid options:

You can find a hosting provider for anywhere from $5/month to $100/month.


4. Design

Regardless of the platform you choose, there are usually thousands of free themes available for easy download and installation. When you pick a platform, look on their site for places to find free themes.

Designing a website yourself takes a lot of work to make it look nice. The better the design, the more resources are needed (be they time or money).

Filling in the missing dates with AWStats

Doh!

Sometimes AWStats will miss some days when calculating stats for your site, and that leaves a big hole in your records. Usually, as in my case, it’s because I messed up. I reinstalled some software on our AWStats machine and forgot to reinstall cron, the tool that gets the server to run things on a schedule. I didn’t notice this until several days later, leading to a large gap in the stats for April.

What to do?

Fortunately, there is a fix. Unfortunately, it’s a bit labor intensive, and depends on how you rotate your Apache logs (if at all, which you should). The AWStats documentation (see FAQ-COM350 and FAQ-COM360) has some basic steps to fix the issue, outlined below:

  1. Move the AWStats data files for the months after the gap to a temporary directory.
  2. Copy the Apache logs for the month with the missing days to a temporary directory.
  3. Run the AWStats update tool, using AWStats’ logresolvemerge tool and a few changed parameters, to re-create the AWStats data file for that month.
  4. Replace the AWStats data files for the following months (undo step 1).

The Devil’s in the Details

Again, depending on how you have Apache logs set up, this can be an intensive process. Here’s how I have Apache set up, and the process I went through to get the missing days back into AWStats.

We have our Apache logs rotate each day for each domain on the server (or sub-directory that is calculated separately). This means I’ll have to do this process about 140 times. Looks like I need to write a script…

Step 1. Move the data files of newer months

AWStats can’t run the update on older months if there are more recent months in the data directory, so we’ll need to move the more recent months’ stats to a temporary location out of the way. If the missing dates are in June, and it is currently August, you’ll need to move the data files for June, July, and August (they look like awstatsMMYYYY.domain-name.com.txt, where MM is the two-digit month and YYYY is the four-digit year) to a temporary directory so they are out of the way.
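Something like this, assuming the data files live in /var/lib/awstats and the gap is in June 2012 (adjust paths, months, and domain to your setup):

mkdir -p /tmp/awstats-hold
mv /var/lib/awstats/awstats062012.domain-name.com.txt /tmp/awstats-hold/
mv /var/lib/awstats/awstats072012.domain-name.com.txt /tmp/awstats-hold/
mv /var/lib/awstats/awstats082012.domain-name.com.txt /tmp/awstats-hold/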

Step 2. Get the Apache logs for the month.

The first step is to get all of the logs for each domain for the month. This works out to about 30 or 31 files (if the month is already past), or however many days have passed in the current month. For me, each domain archives the day’s logs in the format domain-name.com-access_log-X.gz and domain-name.com-error_log-X.gz, where X is a sequential number. So the first problem is how to get the correct files without having to look in each one to see if it covers the right day. Fortunately for me, nothing touches these files after they are created, so their mtime (the timestamp of when they were last modified) is intact and usable. Now, a quick one-liner to grab all of the files within a certain date range and copy them to a working directory.

We’ll use the find command to find the correct files. Before we construct that command, we’ll need to create a couple of files to use for our start and end dates.

touch --date YYYY-MM-DD /tmp/start
touch --date YYYY-MM-DD /tmp/end

Now we can use those files in the actual find command. You may need to create the /tmp/apachelogs/ directory first.

find /path/to/apache/logs/archive/ -name "domain-name.com-*" -newer /tmp/start -not -newer /tmp/end -exec cp '{}' /tmp/apachelogs/ \;

Now unzip those files so they are usable. Move into the /tmp/apachelogs/ directory, and run the gunzip command.

gunzip *log*

If you are doing the current month, then copy in the current apache log for that domain.

cp /path/to/apache/logs/current/domain-name.com* /tmp/apachelogs/

This puts all of the domain’s log files for the month into a directory that we can use in the AWStats update command.

Things to note: you need to make sure that each of the log files you have just copied uses the same format. You also need to make sure they only contain data for one month. You can edit the files by hand or throw some fancy sed commands at them to remove any extraneous data.

Step 3. Run the AWStats logresolvemerge and update tool

Now comes the fun part. We first run the logresolvemerge tool on the log files we created in the previous step to create one single log file for the whole month. While in the /tmp/apachelogs/ directory, run:

perl /path/to/logresolvemerge.pl *log* > domain-name.com-YYYY-MM-log

Now, we need to run the AWStats update tool with a few parameters to account for the location of the new log file.

perl /path/to/awstats.pl -update -configdir="/path/to/awstats/configs" -config="domain-name.com" -LogFile="/tmp/apachelogs/domain-name.com-YYYY-MM-log"

Step 4. Move back any remaining files

If you moved any of the AWStats data files (awstatsMMYYYY.domain-name.com.txt, like those for July and August in our example), now’s the time to move them back where they belong.

 

Yeah, that fixed it!

 

Phew! The missing dates are back!

 

The Past in Color

This week’s installment of history found on the web includes links to a few sites with something special: color photographs from the early days of color photography. Color somehow brings a photograph more to life, adds more detail, and helps give a better understanding of the time period. Sure, you can see the style of clothes, for example, in a black and white, but did you know it was bright green?

Color images of Russia from 1910

The first site comes from the Boston Globe. These pictures are from Russia over 100 years ago! Absolutely amazing detail.

World War II films in color

Second, we have a bunch of color moving pictures from World War II from a blog at salon.com. Color and moving pictures just make it all the more real.

Historic Test Films

The third site is an archive of films from nuclear testing by the U.S. Department of Energy. Crazy the amount of destruction those armaments produced.

Goddard and a rocket

Fourth is a link to NASA’s Flickr account. Here Flickr is working with a number of U.S. Government departments to archive some of their images and provide a more publicly accessible way for these public images to be… accessible. Kind of neat.

The Past meets the Present

Finally, the best for last. This site is all in Russian, so I’m not too sure what he’s saying, but Sergey Larenkov has some neat images. They show a juxtaposition of World War II photos with current photos of the same place. It’s a really neat way to see how the damage would look if it were to happen today.

Many Mechanical Machines

Back again with another roundup of websites promoting some history. This week’s focus is on computers and other machines.

Technologizer has come through in the past year or so with some really fun looks at technology of the past. Here are three:

15 Classic PC Design Mistakes

Weird Laptop Designs

132 Years of the videophone

It’s amazing how ugly and non-functional computers were in the early stages. They don’t seem to be anything like cars. Old cars, some of them anyway, become classics. They were made to look good. Somehow, I guess, computer manufacturers didn’t think computers would need any style. Sure, they were made for businesses, but beige… for everything? One of Apple’s biggest successes has been to transform the look of personal computers. No matter what you think about Apple as a company or Steve Jobs as a person, at least their stuff has some style (which has its own interesting history, in that many of their designs echo old Braun products by Dieter Rams).

Old Computer Database

Small Gallery of Old Computers

Speaking of old computers… The Obsolete Technology Website has a plethora of information, a veritable archive, of old technology. It’s good to see someone keeping the history of our tech junk. New Scientist also steps in with a small gallery of ancient (read: older than 30 years) technology.

Macintosh Startup Chimes

Finally, a trip down memory lane with all of the old Macintosh start up sounds at Geekology.

The history of abandoned things

Buried in sand: The abandoned Rubjerg Knude Lighthouse

I came across the site Artificial Owl this week. Artificial Owl finds images of long-forgotten, man-made objects, locates them on a map, and tells a bit of the story behind each object when possible. I was initially struck by the beautiful images of buildings, ships, airplanes, and automobiles left to deteriorate and crumble back to nature. I love the imagery of nature reclaiming her elements. It’s a definite reminder that man and his creations will not last longer than mother earth.

There were a number of images of airplanes, which I was glad to see, since I like them most. One of them was particularly interesting for a couple of reasons. First of all, it’s a picture of a B-29. I love B-29s and B-17s from World War II. The reason, besides their being beautiful airplanes, is that my grandfather was a belly gunner in a B-17 during WWII. He and an uncle had numerous models and paintings of B-17s and other WWII fighter planes in my grandpa’s basement. I even put together a model of a B-17 myself as a kid. So, naturally, my interest was piqued. Another reason I was interested in this picture is that I remember hearing about it. The very brief history of this airplane goes like this.

B29 Kee Bird, on frozen lake near Thule, Greenland

In 1947 the Kee Bird (as it was called) was on a top secret spy mission when it made an emergency landing on a frozen lake near Thule, Greenland. The crew were all rescued unharmed, but the airplane was left behind. In the 1990s, a few older gentlemen decided they would rescue the airplane. After a lot of money and time (nearly a year), they had repaired the engines and the minor damage to the plane, and prepared to take off. As they taxied around the bumpy frozen lake, one of the generators used for power broke loose and started a fire in the airplane. All of the crew escaped, but the airplane was destroyed by the fire. When the lake melted in the summer, the plane sank to the bottom, never to be seen again.

All of this got me thinking that this would be an awesome way to do history. There is so much information connected to this one airplane that it could easily fill a book or a documentary. I haven’t read or watched either, but both would definitely be interesting. It would be fun to research the history of the plane, the details and reasoning behind the flight, the biographies of the crew, and all that jazz. There is so much history that can be incorporated into the story of this airplane.

This could be done for all of the images on Artificial Owl, in fact for any abandoned man-made object. As I reflect on it, this is precisely what I want to do with my dissertation. I want to focus on one abandoned tunnel in Halberstadt, Germany, used by the Germans before, during, and after WWII. I think telling the story of this tunnel can incorporate many aspects of the German history around it. Time will tell how that works out.

Well, I’ll leave you with a few more pics of possible historical tales…

B29 Kee Bird, abandoned plane, near Thule, Greenland

Abandoned old planes at La Paz - Jfk International (El Alto) Airport - Bolivia

Shipwreck of the Galant Lady on Bimini island, Bahamas

Antonov An-8 at rest in Russian woods.

 

(All images courtesy Artificial Owl, used without permission – thanks!)