CPE 401/601 Computer Communication Networks

Spring 2010

Network Lab 4 : CGI Search Engine

Due on Wednesday, Apr 28 at 1:00 pm

Your assignment is to write a CGI based web search engine! Since we don't really have a database that could be used to find all web pages with a given phrase, we will cheat and use an external search engine to do the work. Your CGI program should receive a search query and then issue an HTTP request to your favorite search engine, gather the results (the HTTP response), do a little parsing and reformatting, and send the results back to the client. As far as the client is concerned, you are providing a search engine. In reality, your program is simply using an existing search engine to do the work.


Deliverables

You must submit all the source code used for your search engine and a link to your live CGI program (so we can test it out). We can use a script that will extract the link from your README file, so make sure you provide the link according to the exact instructions below.

You must include in your submission a file named README that includes your name and a brief description of your submission, including the name of each file submitted along with a one line description of what is in the file. The first line of this file should contain an absolute URL that indicates where your CGI program is (so we can test it out). This first line must not contain anything other than the URL!

If your code is not complete, tell us what works and what doesn't. If you are submitting code that does not compile, please tell us that as well. If any of your code was written by someone else, you are required to tell us about it (this must also be documented in the code itself).

Finally, feel free to include a description of any problems you had or anything else you think might be helpful to us.

Note: (Security) Your CGI program can be run with any kind of query, not just the queries you expect (based on forms you've created). It's often a good idea to build in some default behavior, so that your CGI does something intelligent (like return an HTML error message) if the query is not valid.


Grading

Your project will be tested to make sure it works properly. We will use a standard web client to test your CGI program and will try various search strings using the form you provide. We will also send requests (using a browser) with nonsense query strings, so make sure that your CGI program does not do something dangerous based on the query string entered.

Your grade will depend on the functionality (7.5) and the code quality (2.5). Checking for security issues is +2 points.


Submitting your files

Submission of your homework is via WebCT. You must submit all the required files in a single tar or zip file containing all the files for your submission.



Overview of CGI

CGI stands for Common Gateway Interface, a standard mechanism used by many Web servers to support the creation of dynamic documents by external programs.

There are many issues involved with the creation of CGI programs:


How does a browser tell the Web server to start your program?

Before the answer, there are a few things that we need to assume are understood.

  1. By Web Server we mean an HTTP server running on a machine that you have access to. There are many general purpose Web servers available, including servers from Netscape, Microsoft and a free server commonly used on Unix machines named Apache. These servers handle requests for static HTML documents (or images, or whatever content is stored in files accessible to the server process) in addition to providing access to CGI programs.

  2. Before you can start thinking about creating and using a CGI program, you need to know how to make your files available through the Web server. On most Unix systems, the Web server will direct HTTP requests for URLs that start with /~yourname to a special place within the home directory for the user whose name is yourname. Typically the special place is a directory named public.html or public_html. If you want a file (or CGI program, or image, or whatever) to be available through your Web server, you have to put the file in this special place.

    Most Web servers are configured to automatically respond to a request that maps to a directory name with either:

    • the contents of the file named index.html in the requested directory, if such a file exists, or
    • an HTML rendering of the contents of the directory (a list of files and subdirectories).

    To create a home page for username, create the file ~username/public_html/index.html, put some HTML "stuff" in it, and make sure that the file (and the directory public_html) are readable by anyone.

    > cd ~
    > mkdir public_html
    > emacs public_html/index.html
    > chmod go+rx public_html
    > chmod go+r public_html/index.html
    

    Now anyone in the world can view this home page by requesting the document at http://yourmachine/~username/

OK, back to the original question. A browser will send your Web server an HTTP request (GET or POST) in which the resource name specified corresponds to your executable CGI program. So if you have a CGI program in your public_html directory named mycgi.cgi, the browser would send a request that asks for /~username/mycgi.cgi. Some folks configure their Web server to only allow requests for CGI programs in the directory /cgi-bin. In this case you need to be able to put your program there, otherwise the Web server will simply send back the contents of your program (the file itself) rather than running your program and sending back its output.

Most servers that are configured to allow users to have CGI programs in their home directory (really somewhere in ~/public_html) require that the file name of a CGI program ends in the suffix .cgi. The Web server looks for this suffix and decides whether to run the program in the requested file, or whether to simply return the contents of the file.

The web server cgi.cs.rpi.edu looks for your files in ~/public_html and requires that your filenames end in .cgi


How does a browser format a query?

There are two major ways the query is constructed:
1) If the query is created by the browser based on an ISINDEX tag (where the user can enter a single line of text and press Enter to submit the request), the browser submits a GET request specifying a resource (filename) that is either:

  • specified in the ISINDEX tag as the ACTION

  • if no ACTION is specified in the ISINDEX tag the browser will specify the name of the current document as the resource (the document that contains the ISINDEX tag).

In both cases the browser will append a '?' to the resource name, followed by the string the user typed in (possibly encoded - see the next section for details on the encoding).

Examples:

The following HTML contains an ISINDEX tag with an ACTION property:

<H2>Enter a search string and I'll find what you are looking for</H2>
<ISINDEX ACTION=http://foo.com/search.cgi><BR>
<CENTER>press Enter to submit your query</CENTER>

If the user types in "blahblah" and presses Enter the browser will connect to the web server on foo.com and submit something like this:

GET /search.cgi?blahblah HTTP/1.0

The following HTML contains an ISINDEX tag with no ACTION property. The generated request will use the same resource name that the document itself came from. The document containing this HTML was retrieved from the URL http://foo.org/count.

<H2>Enter a string and I'll count the letters for you</H2>
<ISINDEX ><BR>
<CENTER>press Enter to submit your query</CENTER>

If the user types in "abcdef" and presses Enter the browser will connect to the web server on foo.org and submit something like this:

GET /count?abcdef HTTP/1.0

In this case the resource /count seems to refer to both a document and to a CGI program. This can be accomplished by having a CGI program that simply returns a document if no query is submitted (an empty query).


2) If the query is constructed based on the content of an HTML form, the form itself specifies whether the request will be a GET or a POST. GET is usually used only for small requests because the mechanism used by the web server to pass the query from a GET request to the CGI program has size limitations (more on this later).

If a GET method is specified in the HTML form, the browser creates a query string based on the values the user typed in the form fields and appends it to the resource name just like we saw with an ISINDEX tag.

If a POST method is used, the browser creates a query string based on the values the user typed in the form fields and sends this string (which may be large) as the content part of an HTTP POST query.

The query string itself is more complicated than with an ISINDEX based query since there may be many fields in the FORM. Each field in an HTML form has a name which is specified in the form itself (whoever created the form has to specify a name for each field in the form). Each field also has a value that the user can change by typing in a new value or by clicking on checkboxes or radio buttons or whatever. Once the user presses the SUBMIT button, the browser constructs a query string that contains a sequence of name=value strings separated by the '&' character. A few issues arise:

  • Since the '=' character separates the name of a field from the user specified value, the '=' obviously can't be part of the name or part of the value. We can control the name (if we create the form), but we can't control the value the user types in.

  • Ditto for the '&' character.

  • If the query is submitted as part of a GET request, we can't have any spaces in the query or it will confuse the web server (the web server will think it has reached the end of the resource in the HTTP GET request).

The above issues are handled by having the browser encode the query in a way that avoids the problems. This is what happens:

  • All spaces are replaced by the '+' character.

  • If the character '&' is part of any field value it is replaced with the string "%26". This string is used because the ASCII '&' character has hexadecimal value 26.

  • If '=' is part of anything, it is replaced by "%3D" (the ASCII hex code again).

  • Just about any non-alphanumeric character is replaced by its ASCII hex equivalent in the same manner.

The encoded string is now sent to the web server, which will pass this mess on to your CGI program. Your CGI program will have to undo all this encoding!
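The decoding step described above could be sketched as follows. This is a minimal illustration, not a required design: the function name url_decode and the choice to copy a malformed '%' sequence through unchanged are assumptions made for this example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Convert one hexadecimal digit to its value (0-15). The caller must
   already have checked the character with isxdigit(). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/* Decode src into a newly allocated string: '+' becomes a space and
   "%XY" becomes the byte with hex value XY. A '%' that is not followed
   by two hex digits is copied unchanged (an assumption of this sketch).
   Returns NULL if memory cannot be allocated; the caller frees the result. */
char *url_decode(const char *src)
{
    size_t n = strlen(src);
    char *out = malloc(n + 1);   /* decoded text is never longer than the input */
    char *p = out;

    if (out == NULL) return NULL;

    while (*src != '\0') {
        if (*src == '+') {
            *p++ = ' ';
            src++;
        } else if (*src == '%' && isxdigit((unsigned char)src[1])
                               && isxdigit((unsigned char)src[2])) {
            *p++ = (char)(hex_value(src[1]) * 16 + hex_value(src[2]));
            src += 3;
        } else {
            *p++ = *src++;
        }
    }
    *p = '\0';
    return out;
}
```

Note that a field should be decoded only after the query has been split at '&' and '=', otherwise a decoded "%26" or "%3D" would be mistaken for a separator. A decoded "%00" also ends the C string early, so a careful program may want to reject it.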

NOTE: The encoding described above is the default; you can override it by specifying an alternative encoding type in the form itself. To do this, set a value for the ENCTYPE attribute of the FORM tag. Another supported encoding is the MIME type multipart/form-data. This results in the browser sending you the form field names and values unencoded, although wrapped in a MIME multipart document. The major use of multipart/form-data is when an entire file is sent from the browser to the server, as can happen with an INPUT field of type FILE. See the getfile CGI program and HTML form for an example of how to do this.

Examples:

The following HTML form contains 2 fields, one named fname that we hope the user will use to submit his first name, and a field named lname for his last name.

<FORM METHOD=GET ACTION=http://www.foo.com/register.cgi>

First Name: <INPUT TYPE=TEXT NAME=fname><BR>
Last Name:  <INPUT TYPE=TEXT NAME=lname><BR>
<INPUT TYPE=SUBMIT VALUE="press to submit">
</FORM>

If the user types "dave or mehmet" as the first name and enters "lastname=foo" as the last name (remember that users can and will enter anything!), the browser will connect to the web server on www.foo.com and submit something like this:

GET /register.cgi?fname=dave+or+mehmet&lname=lastname%3Dfoo HTTP/1.0
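One way a CGI program might pull the individual fields out of a query string like the one above is sketched here. The struct field type and the split_query function are hypothetical names invented for this example; the sketch splits the string in place and leaves the names and values still encoded.

```c
#include <stddef.h>
#include <string.h>

/* One name=value pair; both pointers point into the caller's buffer. */
struct field {
    char *name;
    char *value;
};

/* Split an encoded query string in place at each '&', then split each
   piece at its first '='. Stores at most max_fields pairs and returns
   the number stored. A piece with no '=' gets an empty value.
   The names and values are still encoded after this call. */
int split_query(char *query, struct field *fields, int max_fields)
{
    int count = 0;
    char *piece = query;

    while (piece != NULL && *piece != '\0' && count < max_fields) {
        char *next = strchr(piece, '&');
        char *eq;

        if (next != NULL) *next++ = '\0';   /* terminate this piece */

        eq = strchr(piece, '=');
        if (eq != NULL) {
            *eq = '\0';
            fields[count].value = eq + 1;
        } else {
            fields[count].value = piece + strlen(piece);  /* empty string */
        }
        fields[count].name = piece;
        count++;
        piece = next;
    }
    return count;
}
```

Because the function writes into the string it is given, pass it a private copy of the query rather than the pointer returned by getenv().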


The following HTML form contains the same 2 fields, but the method specified in the form is POST.

<FORM METHOD=POST ACTION=http://www.foo.com/register.cgi>
First Name: <INPUT TYPE=TEXT NAME=fname><BR>

Last Name:  <INPUT TYPE=TEXT NAME=lname><BR>
<INPUT TYPE=SUBMIT VALUE="press to submit">
</FORM>

If the user types "John" as the first name and enters "Doe a Deer" as the last name the browser will connect to the web server on www.foo.com and submit something like this:

POST /register.cgi HTTP/1.0
content-length: 27
http-headers: whatever

fname=John&lname=Doe+a+Deer

In this case the same encoding takes place, but the query string is submitted as the content of the request, not as part of the resource name. You might also notice the request includes an HTTP header specifying the length of the content - this is important as we'll see soon...


How does the web server pass the query to a CGI program?

The answer depends on whether the CGI program is referenced with a GET request, or with a POST request.

GET method: When a GET request is received by the web server and the resource specified (everything before the '?') is your CGI program, the web server will grab everything after the '?' and stuff it into the environment variable named QUERY_STRING. The web server will also set the environment variable REQUEST_METHOD to the value "GET". Then the web server will start up your CGI program, connecting STDOUT of your program to a pipe the server can read. Your program should get the query by reading the environment variable QUERY_STRING, and then process the query and send the results to the web server by simply writing to STDOUT.
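As a rough sketch of the GET case, a program could fetch the query like this. The helper name get_query_string is made up for this example; only the environment variable names come from the description above.

```c
#include <stdlib.h>
#include <string.h>

/* Return the encoded query of a GET request, or NULL if the request
   is not a GET or the server did not set QUERY_STRING.
   The returned pointer refers to the environment; do not free it. */
const char *get_query_string(void)
{
    const char *method = getenv("REQUEST_METHOD");

    if (method == NULL || strcmp(method, "GET") != 0)
        return NULL;
    return getenv("QUERY_STRING");
}
```

A NULL result is a natural place to fall back to some default behavior, such as returning an HTML error page.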

Many operating systems have limitations on the size of the environment variables - this might get in the way if you have large queries, since the entire query must be able to fit in QUERY_STRING. Most non-trivial queries are submitted using the POST method.

POST method: When a POST request is received by the web server and the resource specified is your CGI program, the web server will read the HTTP headers (including the one specifying the content length) and set the environment variable CONTENT_LENGTH. The REQUEST_METHOD environment variable will be set to POST. Now your CGI program will be started up with STDIN and STDOUT attached to pipes going back to the web server. The server will now write the entire query string (the content of the POST) to the pipe connected to your STDIN.

Your program should get the length of the query string from the CONTENT_LENGTH environment variable so it knows how much to read from STDIN (coming from the web server). BE CAREFUL! Don't use a static array unless you are willing to refuse to read the entire query (it might be larger than your array and could screw up your program and make it possible for bad guys to break into your machine, delete all your files, send mail from you to the FBI suggesting that you might be someone they are looking for, etc ...).
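One possible way to follow this advice is to allocate the buffer only after CONTENT_LENGTH has been checked. The sketch below is illustrative: the function name read_post_body and the 64 KB limit MAX_POST_BYTES are assumptions chosen for this example, not values given in the assignment.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_POST_BYTES 65536L   /* hypothetical limit chosen for this sketch */

/* Read a POST body of the length announced in CONTENT_LENGTH from `in`
   (stdin in a real CGI program). Returns a NUL-terminated buffer that
   the caller must free, or NULL if the length is missing, malformed,
   negative, larger than MAX_POST_BYTES, or the body is shorter than
   announced. */
char *read_post_body(FILE *in)
{
    const char *cl = getenv("CONTENT_LENGTH");
    char *end;
    long len;
    char *body;

    if (cl == NULL || *cl == '\0') return NULL;

    len = strtol(cl, &end, 10);
    if (*end != '\0' || len < 0 || len > MAX_POST_BYTES) return NULL;

    body = malloc((size_t)len + 1);          /* +1 for the terminator */
    if (body == NULL) return NULL;

    if (fread(body, 1, (size_t)len, in) != (size_t)len) {
        free(body);
        return NULL;
    }
    body[len] = '\0';
    return body;
}
```

In a CGI program the call would be read_post_body(stdin). Taking the stream as a parameter just makes the function easy to try out with a file.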


How does a CGI program send content back to the client (browser)?

For starters, you have to tell the browser what kind of document you are sending. In most cases you will be sending HTML, and you need to tell the browser this by sending the string:

Content-type: text/html

before sending the content. This is actually an HTTP header you are sending, so assuming it is the only header you want to send you need to also send a blank line (and remember, all header lines should end with \r\n).

To send this header and the content back to the browser you simply write to STDOUT, which actually goes back to the web server via a pipe, and the web server forwards it to the browser. The web server will probably add a bunch of headers as well. Generally we don't need to worry about this, although there are ways to configure the server to not send any extra headers.

In short - just use printf to send the content back to the browser.
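A minimal sketch of such a response is shown here. The function name send_html_page and the page layout are invented for this example; it uses fprintf on a stream parameter so the same code works with stdout or with a file while experimenting.

```c
#include <stdio.h>

/* Write a minimal CGI response to `out` (stdout in a real CGI program):
   one header line, the blank line that ends the headers, then the HTML.
   `title` and `message` must already be safe to embed in HTML. */
void send_html_page(FILE *out, const char *title, const char *message)
{
    fprintf(out, "Content-type: text/html\r\n");
    fprintf(out, "\r\n");                      /* blank line ends the headers */
    fprintf(out, "<HTML><HEAD><TITLE>%s</TITLE></HEAD>\n", title);
    fprintf(out, "<BODY><P>%s</P></BODY></HTML>\n", message);
}
```

For instance, send_html_page(stdout, "Error", "Invalid query") could serve as the default reply to a query the program does not understand.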


Security

When writing a CGI program you need to realize you are allowing anyone to run your program any time they want. This doesn't sound terribly harmful, but you should keep in mind that they can (and will) send all kinds of crazy stuff as the request. You must make sure that you don't allow an unexpected request to screw up your system.

Just about the worst thing you can do is to blindly construct a Unix command line based on a request and give the command to a Unix shell to run (using popen(), system(), etc). For example, let's assume we have a CGI program that expects a keyword as a request, then greps a dictionary for all words that contain the keyword. So the intent is that if a user sends the request "foo", the Unix command grep foo /usr/dict/words is constructed and run (using popen). However, if a user enters the query "; rm *" the resulting command would look like this:

grep; rm * /usr/dict/words

and you might lose a bunch of files.
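One defensive approach, sketched below, is to accept only keywords made entirely of letters and digits and to reject everything else before any command line is built. The function name is_safe_keyword and the letters-and-digits rule are assumptions made for this example; a real program might allow a slightly larger set, or avoid the shell altogether.

```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 if keyword is non-empty and consists only of letters and
   digits (as classified by isalnum() in the default C locale), so it
   cannot contain shell metacharacters such as ';', '*', '|', or
   spaces. Return 0 otherwise. */
int is_safe_keyword(const char *keyword)
{
    size_t i;

    if (keyword == NULL || keyword[0] == '\0') return 0;

    for (i = 0; keyword[i] != '\0'; i++) {
        if (!isalnum((unsigned char)keyword[i])) return 0;
    }
    return 1;
}
```

With this check in place the request "; rm *" is refused and never reaches popen(). Listing what is allowed is generally safer than trying to list every dangerous character.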

One common theme among well-known cracks (of many Unix services) is to overflow an input buffer in the server. If all servers were written correctly, this would not be a problem. You must make sure that read is never called with an input buffer smaller than the maximum size given to the read system call.

Here is a classic example of the problem. This code (from a CGI program) is handling a POST request, so it checks to see how large the request is (by getting the environment variable CONTENT_LENGTH) and then reads that much stuff from STDIN:

char buff[1000];
int len;
char *cl;

cl = getenv("CONTENT_LENGTH");
if (cl==NULL) {
  /* Error */
  exit(1);
}
len = atoi(cl);

read(0,buff,len);   /* file descriptor 0 is STDIN */

This code never makes sure that len is no larger than 1000! At a minimum this makes it easy to crash your server; in the worst case some clever hacker (with lots of spare time) could use this to break into your computer.
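A corrected version of this fragment might look like the sketch below, which keeps the fixed-size buffer but refuses any request that does not fit. The function name read_request and the BUFF_SIZE constant are names introduced for this example.

```c
#include <stdlib.h>
#include <unistd.h>

#define BUFF_SIZE 1000

/* Fill buff (BUFF_SIZE bytes) from file descriptor fd with the number of
   bytes announced in CONTENT_LENGTH, and NUL-terminate it. Returns the
   number of bytes read, or -1 if the length is missing, malformed, or
   too large to fit along with the terminator. */
long read_request(int fd, char *buff)
{
    const char *cl = getenv("CONTENT_LENGTH");
    char *end;
    long len, got = 0;

    if (cl == NULL || *cl == '\0') return -1;
    len = strtol(cl, &end, 10);
    if (*end != '\0' || len < 0 || len >= BUFF_SIZE) return -1;  /* refuse */

    while (got < len) {                   /* read() may return less than asked */
        ssize_t n = read(fd, buff + got, (size_t)(len - got));
        if (n <= 0) return -1;            /* error or early end of input */
        got += n;
    }
    buff[got] = '\0';
    return got;
}
```

The important line is the bounds check before the first read: the length comes from the client, so it has to be treated as untrusted input. Using strtol instead of atoi also lets the code reject values such as "12abc".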

For more information about CGI security check out:

  • The CGI section of the WWW Security FAQ

HTML Forms

Here are some links to get you started writing HTML forms:


Acknowledgement: This assignment is adapted from one by Dave Hollinger.