Friday 3 May 2013

PHP: solving a real problem

G'day:
As you might have read, I discovered PHP has an internal web server, just like CF does. It seems a lot more basic than ColdFusion's one... especially the one that CF10 uses which is pretty much production-quality.

One thing I have not been able to find out how to do is to get it to list the contents of a directory, as one can enable on Apache / IIS etc. I'm pretty sure it doesn't support this functionality at all, as the docs mention a few options, but not that.


As I generate a lot of small files when doing my explorations or writing stuff for this blog, it's a right pain in the bum to have to type in the whole URL myself each time, so I want to solve this. I am sure someone else has already done so, but I'm not going to google one up, I'm gonna work out how to do one myself. And then google afterwards to see how other people do it.

And I am going to document the entire exercise here, as I undertake it (so there'll be cock-ups and standard Adam coding stupidity going on here. Sorry about that in advance).

First up, I'm pretty sure I have in my head what I need to do, but I'm gonna knock together a CFML solution as proof-of-concept, and then use that as a basis to implement a PHP solution.

What I need to do is this:
  • take a directory on the URL, which by default will be the current directory;
  • list the files / directories;
  • for each entry, convert the filesystem path to a URL, using the current HTTP host and port;
  • also provide a "parent directory" link too, for browsing upstream.
That's about it, I think. I could output file sizes and dates and all that sort of shite, but I have never cared about that on the listings IIS or Apache generate, so I'm not going to bother here.

Right... a CFML solution (gimme 10min...)


OK, wow, that took a while longer than 10min, and is actually quite a bit more code than I expected it to be. It was about an hour altogether I guess. My initial implementation of it had rather a large security hole in it which took a while to sort out, and was the basis for my "quick quiz" article the other day. Initially, to check whether the directory was within the webroot (as opposed to just "C:\" or something), I was just checking whether the user input started with the webroot dir, and that was that. But this approach wasn't comprehensive, because one could pass this on the URL: "C:\inetpub\wwwroot\..\..\windows" (for example). The comparison would see that it starts with "C:\inetpub\wwwroot", so it passed validation, which of course it shouldn't. This is why I'm using the rather handy Java method getCanonicalPath(), which does this:

Returns the canonical pathname string of this abstract pathname.

A canonical pathname is both absolute and unique. The precise definition of canonical form is system-dependent. This method first converts this pathname to absolute form if necessary, as if by invoking the getAbsolutePath() method, and then maps it to its unique form in a system-dependent way. This typically involves removing redundant names such as "." and ".." from the pathname, resolving symbolic links (on UNIX platforms), and converting drive letters to a standard case (on Microsoft Windows platforms).

Every pathname that denotes an existing file or directory has a unique canonical form. Every pathname that denotes a nonexistent file or directory also has a unique canonical form. The canonical form of the pathname of a nonexistent file or directory may be different from the canonical form of the same pathname after the file or directory is created. Similarly, the canonical form of the pathname of an existing file or directory may be different from the canonical form of the same pathname after the file or directory is deleted.
So the canonical path for "C:\inetpub\wwwroot\..\..\windows" is "C:\windows", which clearly fails the comparison with the webroot directory, so is rejected by the logic. Cool.

Anyway, this might not be the ideal code, but it's good enough for the purposes of this exercise: a guide for the logic I need to write in PHP. So here I go. I'm just gonna work from the top to the bottom of the file, working out what I need to do as I go.

The first thing I need to do is to work out:
  1. how URL variables work;
  2. how to get the current directory.
URL variables in PHP are implemented as $_GET, which makes sense, and is nice and clear. And to get the current directory one uses the getcwd() function ("cwd" being "current working directory"). And the last piece of that puzzle is the <cfparam> equivalent in PHP, which I could not find, so instead just use isset() in an if statement. isset() is like parameterExists(), I guess. Remember parameterExists()? (I hope you're not bloody using it in your code still! ;-).

So here's the code so far:


<?php
    if (isset($_GET["dir"])){
        $dirCurrent = $_GET["dir"];
    }else{
        $dirCurrent = getcwd();
    }
    echo $dirCurrent;
?>

My mate Chris told me off for using isset() precisely because it is like parameterExists(), and I've changed this to use array_key_exists() instead, in the final code.
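For what it's worth, here is a minimal sketch of what the array_key_exists() variant might look like. The requestedDir() helper is a made-up name for illustration, not something from the final script; it just wraps the "use the URL value if there is one, otherwise fall back to a default" logic.

```php
<?php
// Hypothetical helper (not from the original script): returns the "dir"
// value from a query array, or $default when the key is absent.
function requestedDir(array $query, $default)
{
    // array_key_exists() only asks whether the key is present;
    // isset() would also report false for a key holding null.
    return array_key_exists("dir", $query) ? $query["dir"] : $default;
}

$dirCurrent = requestedDir($_GET, getcwd());
echo $dirCurrent, PHP_EOL;
```

Passing the array in, rather than reading $_GET inside the function, makes the fallback easy to try out without a web request.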

Next I need to find out how to do these things:
  1. do the equivalent of expandPath(), or use some other mechanism to get the webroot directory;
  2. the PHP equivalent of getCanonicalPath();
  3. and getting the file system's slash character;
  4. and the equivalent of the CGI scope, so I can build the URL.
One thing I need to get used to with PHP is that when I can't do something in CFML, I just use Java instead. One cannot do this in PHP. I only realise now how powerful ColdFusion is in this regard, and is probably something that isn't marketed heavily enough. I don't know Java much beyond "G'day World", but I can use all the functionality Java offers via CFML code. Cool. But this does not help me with my PHP.

However it's all easy enough. And hey, finding help on Google for PHP is a breeze. I thought I'd point that out. The solution to this step is as follows:

$dirCurrent = realpath($dirCurrent);
$dirBase    = realpath($_SERVER["DOCUMENT_ROOT"]);
$urlBase    = "http://" . $_SERVER["HTTP_HOST"]; 

There's a predefined constant for the slash character: DIRECTORY_SEPARATOR, so I won't bother setting a variable holding it.

Setting the $urlBase variable in PHP is easier... the HTTP_HOST value includes the port. I don't think this is actually correct for it to do this, but it does make life easier. I'm just using the inbuilt server for this, so I will check that Apache and IIS behave the same when I get home.
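A quick sketch of what realpath() does with a ".." path, in the spirit of the getCanonicalPath() example earlier. The system temp directory is used purely as a convenient directory that is known to exist; the variable names are illustrative. One difference from the Java method worth knowing about: realpath() returns false for a path that does not exist.

```php
<?php
// A real directory to play with (the system temp dir, purely for illustration).
$base = realpath(sys_get_temp_dir());

// Climb out of $base and straight back in again via "..".
$roundabout = $base . DIRECTORY_SEPARATOR . ".." . DIRECTORY_SEPARATOR . basename($base);

// realpath() collapses the ".." segment, so this should match $base again.
var_dump(realpath($roundabout) === $base);

// A path that does not exist cannot be resolved, so realpath() gives false.
$missing = $base . DIRECTORY_SEPARATOR . "no-such-dir-" . uniqid();
var_dump(realpath($missing));
```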

The next bit starts demonstrating the incoherence of PHP slightly. The check for whether the directory is legit is this in PHP:

if (strpos($dirCurrent, $dirBase) !== false && file_exists($dirCurrent)){

strpos() returns either a numeric (a zero-indexed position of where a match was found), or a boolean (if no match was found it returns false). It makes my skin crawl to think of a function which can return different data types depending on the result. Also note that it can actually return two possible values which will be interpreted as false: either a false for no match, or a zero if there's a match at the beginning of the string. How bloody stupid is that?

Also we see that PHP doesn't have an internally-consistent coding standard: one of the functions I use there has an underscore word separator, the other doesn't. Messy. There's something to be said for CFML being developed by one team, with one idea of how things should be named. file_exists() works for both files and directories though, which is good.
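As an aside, the zero-versus-false distinction matters for exactly this sort of check. A stricter variation, sketched below, insists that the match is at position 0 and is followed by a directory separator, so a sibling directory that merely shares a prefix does not slip through. The isWithin() helper is a hypothetical name, and it assumes both paths have already been through realpath().

```php
<?php
// Hypothetical helper: true only when $candidate is $base itself or sits
// somewhere beneath it. Assumes both paths are already canonical.
function isWithin($candidate, $base)
{
    $base = rtrim($base, "/\\");
    if ($candidate === $base) {
        return true;
    }
    // Require the match at offset 0, followed by a separator, so that
    // ".../rootkit" is not mistaken for a child of ".../root".
    return strpos($candidate, $base . DIRECTORY_SEPARATOR) === 0;
}

$s = DIRECTORY_SEPARATOR;
var_dump(isWithin("{$s}www{$s}root{$s}blog", "{$s}www{$s}root"));
var_dump(isWithin("{$s}www{$s}rootkit", "{$s}www{$s}root"));
```

The comparison uses === 0 rather than !== false, because a match further along the string is not a match on the start of the path.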

The PHP messiness gets more obvious with the next bit of code:

$relativePath = str_replace($dirBase, "", $dirCurrent);
$urlPath = str_replace("\\", "/", $relativePath);

Notice two of the string functions we've encountered: strpos() and now str_replace(). This seemingly random approach to underscores (which I dislike anyhow) is going to irk me.

I came unstuck for about 5min here as I misread the docs for str_replace():

mixed str_replace ( mixed $search , mixed $replace , mixed $subject [, int &$count ] )

I thought its last argument (which I am not using in the code above, as it turns out) was a count of how many replacements to do (like replace() in CFML has "ONE" or "ALL"). But no, it returns the number of replacements made. And this is actually very clear once I RTFM (and searched StackOverflow):
If passed, this will be set to the number of replacements performed.
That's quite cool, actually.
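A small sketch of that fourth argument in action, using a made-up relative path. The $count variable is passed by reference, so it does not need to exist beforehand.

```php
<?php
// An illustrative Windows-style relative path with three backslashes in it.
$relativePath = "\\blog\\listing\\files";

// $count is filled in by str_replace() with the number of replacements made.
$urlPath = str_replace("\\", "/", $relativePath, $count);

echo $urlPath, " (", $count, " replacements)", PHP_EOL;
// prints: /blog/listing/files (3 replacements)
```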

Now I had a bit of an issue... here's the CFML logic I'm wanting to re-task as PHP:

parentDir = listDeleteAt(dirCurrent, listLen(dirCurrent, "\/"), "\/");

PHP doesn't have list functions per se, so I needed to take a different approach. I've decided to use a regex to lop off the last directory name, using preg_replace(). This took me about 20min to work out how to use. Initially I tried this:

$parentDir = preg_replace("[/\\][^/\\]+$", "", $dirCurrent);

Which is what I'd use in CF (the regex pattern is correct for what I want to do), but it errored with:

Warning: preg_replace(): No ending matching delimiter ']' found in D:\websites\php.local\blog\listing.php on line 16

After RTFMing a bit, I noticed that one needs to delimit the pattern like one would with JS, eg:

/regex_here/

So I tried that, and quickly detected I was gonna have problems with slashes: the regex processor is going to see my / in the pattern as a delimiter. I tried all sorts of things to escape everything that needed to be escaped, and ended up with this:

/[\\\\\/][^\\\\\/]+$/

That works, but it's a mess. Note there's so many back-slashes in there because not only do I have to escape the backslashes in my pattern to escape them (so the regex processor doesn't think they're an escape character), but backslash is also an escape character in a PHP string, so I need to escape each of them again. So the first four backslashes are a PHP-escaped regex-escaped backslash, and the fifth one is a regex-escaped forward slash. I'd actually expected to need to double-escape that backslash too, but don't need to for some reason.

Then looking more into PHP regex syntax, I see that one doesn't need to use "/" as the delimiter, one can use "#" too, so I can simplify things a bit:

#[\\\/][^\\\/]+$#

Again, I think I have too few backslashes in there: a backslash in a regex needs to be escaped (so that's two) and each of those two needs to be escaped in a PHP string, making a total of four. I'm misunderstanding something here, and will research this a bit later.
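A sketch that may explain the backslash count. In a PHP string literal, a backslash followed by a character that is not a recognised escape (such as "/") is left in the string as-is, so "\\\/" ends up as a backslash, a backslash and a slash by the time the regex engine sees it. That is the same text one gets from writing four backslashes and a plain slash. The variable names and sample paths below are illustrative.

```php
<?php
// The pattern as written above, in a double-quoted string.
$asWritten = "#[\\\/][^\\\/]+$#";

// The version with every backslash doubled, and the slash left alone.
$fullyEscaped = "#[\\\\/][^\\\\/]+$#";

// Both literals produce the same 15-character pattern text.
echo $asWritten, PHP_EOL;
var_dump($asWritten === $fullyEscaped);

// Either way, the last path segment is lopped off, whichever slash is used.
echo preg_replace($asWritten, "", "D:\\websites\\php.local\\blog"), PHP_EOL;
echo preg_replace($asWritten, "", "/var/www/blog"), PHP_EOL;
```

With "#" as the delimiter, the forward slash inside the character class needs no escaping at all, which is why both spellings work.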

So, anyway, the next bit of code is this:

// provide a link to go up a level
$parentDir = preg_replace("#[\\\/][^\\\/]+$#", "", $dirCurrent);
$parentLinkUrl = $_SERVER["SCRIPT_NAME"] . "?dir=" . $parentDir;
echo "<a href=\"$parentLinkUrl\">[Parent Directory]</a><br>";
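One thing to watch with links like this: a filesystem path dropped straight into a query string can contain backslashes, colons and spaces. A hedged sketch of a safer variation is below; dirLink() is a hypothetical helper and the script name and path are made up, but urlencode() and htmlspecialchars() are the standard PHP functions for this job.

```php
<?php
// Hypothetical helper: builds an HTML anchor that passes $dir in the query string.
function dirLink($script, $dir, $label)
{
    // urlencode() makes the path safe as a query-string value;
    // htmlspecialchars() makes the results safe inside HTML.
    $url = $script . "?dir=" . urlencode($dir);
    return '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($label) . '</a>';
}

echo dirLink("/listing.php", "D:\\websites\\my site", "[Parent Directory]"), PHP_EOL;
// prints: <a href="/listing.php?dir=D%3A%5Cwebsites%5Cmy+site">[Parent Directory]</a>
```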

Now we need to get the file listing of the current directory. PHP works slightly differently than ColdFusion does here. In CF we call directoryList() and that returns the listing from the whole directory. Cool. That's how I just assumed PHP would work too, but it doesn't. And the PHP approach goes to show how CF simplifies stuff nicely sometimes.  Here's some code looping over a directory in PHP:

opendir($dirCurrent); 
while ($entry = readdir()) {
    echo "$entry<br>";
}
closedir();

First we create a directory handle resource with opendir(). I don't actually need the resource variable, so I don't bother setting it; however, the general syntax for it is thus:

resource opendir ( string $path [, resource $context ] )

I then use readdir() to read the next entry from the directory, assigning it to $entry, which I use as a loop condition. When readdir() runs out of things to read, it returns false, which is an exit condition for the loop. The syntax for readdir() is:

string readdir ([ resource $dir_handle ] )

Conveniently if one does not specify the directory handle to use, it assumes you mean the most-recently created one. That's handy.
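For comparison, here is a sketch of the same loop with the handle kept in a variable and passed around explicitly, which is the form the PHP manual's own examples use. It also compares the result of readdir() against false explicitly, because an entry named "0" is falsy and would otherwise end the loop early. The listEntries() function is a hypothetical name for illustration.

```php
<?php
// Hypothetical helper: returns every entry name that readdir() reports for $dir.
function listEntries($dir)
{
    $entries = array();
    $handle = opendir($dir);
    if ($handle === false) {
        return $entries; // could not open the directory
    }
    // Strict comparison with false, so a file called "0" does not stop the loop.
    while (($entry = readdir($handle)) !== false) {
        $entries[] = $entry;
    }
    closedir($handle);
    return $entries;
}

foreach (listEntries(getcwd()) as $entry) {
    echo $entry, PHP_EOL;
}
```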

Finally I close the directory off with closedir(). Normally if I was doing file operations when I need to open/use/close a file I'd error trap it along these lines:

try {
    // open it
    // use it
}
catch (any e){
    // possibly handle it, but at a minimum...
}
finally {
    if (open){ // it might have failed on the open process
        // ...close it
    }
}

And I tried to do this here, but after a while of staring at my code wondering why I was getting this error:

Parse error: syntax error, unexpected '{' in D:\websites\php.local\blog\listing.php on line 37

I RTFMed about exception-handling more closely and realised that finally only got added to PHP in 5.5, and the most recent Windows version of PHP is 5.4. So I dispensed with all that, and just threw caution to the wind. This is just an exercise after all, not production code.
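For anyone on PHP 5.5 or later, a sketch of how the open/use/close pattern might look with finally. Note that opendir() signals failure by returning false rather than throwing, so this sketch throws its own exception in that case. The countEntries() function is a hypothetical example, not part of the listing script.

```php
<?php
// Requires PHP 5.5 or later, where the finally keyword is available.
// Hypothetical helper: counts the entries readdir() reports for $dir.
function countEntries($dir)
{
    // @ suppresses the warning opendir() raises on failure; we check for false instead.
    $handle = @opendir($dir);
    if ($handle === false) {
        throw new RuntimeException("Could not open $dir");
    }
    try {
        $count = 0;
        while (readdir($handle) !== false) {
            $count++;
        }
        return $count;
    } finally {
        // Runs whether the loop finished normally or something threw.
        closedir($handle);
    }
}

echo countEntries(getcwd()), PHP_EOL;
```

Because the open happens before the try block, the finally clause never has to wonder whether there is anything to close.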

Anyhow, that code above produces this sort of result:

.
..
blog
controlStatements
gdayworld.php
index.php
variables

Note the . and .. at the top; otherwise it just returns the file or directory name.

Putting this into practice, I can now finish up the loop and output:

echo "<ul>";
opendir($dirCurrent);
while ($entry = readdir()) {
    if (!in_array($entry, [".", ".."])){    // skip these two
        $linkText = $entry;
        $wholePathToFile = $dirCurrent . DIRECTORY_SEPARATOR . $entry;
        if (is_dir($wholePathToFile)){
            // we just want to link back to this file, passing the dir
            $linkUrl = $_SERVER["SCRIPT_NAME"] . "?dir=" . $wholePathToFile;
            $linkText .= "/"; // just to make it more clear it's a dir
        }else{
            // we want a link to the actual file
            $linkUrl = $urlPath . "/" . $entry;
        }
        echo "<li><a href=\"$linkUrl\">$linkText</a></li>";
    }
}
closedir();
echo "</ul>";

The only other new thing here is how I use in_array() in PHP where I might use listFind() in CFML. Oh, and I've got a call to is_dir() in there too. There really seems to be no rhyme or reason as to when PHP decides to pepper its function names with underscores, does there?

Chris also pointed out some sloppy logic there: I've revised the in_array() check to be like this:

if (in_array($entry, [".", ".."])){    // skip these two
    continue;
}
// rest of code here


The last thing to do is to handle the 404 situation if the directory is invalid. This is easier than in CFScript (because I have to break out of script to use <cfheader>); I just need to do this:

header("HTTP/1.0 404 Not Found");

Cool!

And that's it. Here's the whole thing:


Interestingly, sorting out the logic when doing the CFML version took longer than converting it to PHP. I guess that's reasonable as working out how to do something is going to take longer than just typing it in, but even with all the googling for how to do stuff, knocking this out in PHP was pretty quick.

If you're more along the PHP path than I am, and see any glaring errors in what I have done here, please let me know. This is all very learning-curve for me, so I expect it to be sub-optimal, and the best way to improve is to be told where I'm wrong.

Cheers.

--
Adam