Filtering and Mapping with SPL Iterators

Four or five years ago, the most popular talk that I gave at conferences was entitled “A Functional Guide to Cat Herding with PHP Generators” (the cats proved a very enjoyable talking point at every event where I gave it, because who doesn’t enjoy a technical talk featuring cats); and I even wrote a blog post here about that self-same topic. That presentation described how I was building up enormous volumes of GPX tracking data showing where my cats roamed each day. In order to analyse that data without using equally enormous amounts of memory, I read it directly from the GPX files, using Generators to process each trackpoint one at a time; and and took a map, filter, reduce approach to resolving the questions I wanted to answer about my cats movement habits. Because PHP’s standard array_map(), array_filter() and array_reduce() functions only work with arrays of data, which are expensive in terms of their memory usage, I wrote my own versions of those functions to work with Traversable objects (like Iterators and Generators) as well as with arrays. And the technical body of the talk was how I wrote and used my versions of these functions.

I chose to write the functions myself as a learning exercise, to better understand how they (and Traversables like Generators) worked; but for map and filter, I could just as easily have used the Iterators that already existed in the Standard PHP Library (SPL).

Note that all of the code for these examples, as well as my original implementations of Traversable-friendly filter(), map() and reduce() functions is available online in a github repository. This includes the Generator used to traverse the GPX data, and some of the filter and mapper classes that I used for the data analysis, and they can be viewed there rather than my duplicating all of that code in this post.

The tracking data that I read from the GPX files was returned by a Generator, with the DateTime as the key, and the geospatial data (and the DateTime) as a simple value object.


namespace GpxReader;

class GpxElement {
    /**
     *  @var    \GpxReader\GpxLatLong    $position    Latitude/Longitude position
     **/
    public $position;

    /**
     *  @var    float    $elevation    Elevation above the surface of the WGS84 reference ellipsoid
     **/
    public $elevation = 0;

    /**
     *  @var    \DateTime    $timestamp    Timestamp
     **/
    public $timestamp;
}

class GpxLatLong {
    /**
     *  @var  float  $latitude  Latitude
     **/
    public $latitude;

    /**
     *  @var  float  $longitude  Longitude
     **/
    public $longitude;
}

Filtering

One of my basic queries was to identify where a cat had been within a certain timespan. Cats are creatures of habit, so if I knew where they had been sheltering from the weather on a wet and windy afternoon between 13:20 and 13:30, there was a reasonably good chance that they would be sheltering in the same location on another wet and windy afternoon at that time. So if I could filter the data between those times, I would know where I was likely to find that cat. Sadly, GPX data doesn’t record what the weather was like; but I could still write a filter based on the recorded timestamps.
SPL provides an abstract FilterIterator class that can be extended to build a filter, which will work against an iterated dataseries, and that’s a good starting point.

The basic FilterIterator accepts the Iterator that we want to filter as its constructor argument; then as we iterate over the records, it calls the accept() method that should return a simple boolean true/false to determine if that record should be returned, or not. We overload that accept method with our own code to identify whether we should return true or false; but we do need to retrieve the key or current value from the “inner” iterator (the Iterator that we passed in through the constructor) explicitly to match against our filtering logic.


class DateTimeRangeFilter extends FilterIterator
{
    private $startTime;
    private $endTime;
   
    public function __construct(Iterator $iterator, DateTime $startTime, DateTime $endTime)
    {
        parent::__construct($iterator);
        $this->startTime = $startTime;
        $this->endTime = $endTime;
    }
   
    public function accept()
    {
        $timestamp = $this->getInnerIterator()->key();

        return $timestamp >= $this->startTime && $timestamp <= $this->endTime;
    }
}

In this case, I’ve also overloaded the constructor as well, so that it will accept a start and end DateTime as arguments rather than embedding them directly in the filter logic, which makes the DateTimeRangeFilter reusable for different DateTime ranges. I still need to remember to call the parent constructor with just the $iterator argument (my GpxTracker Generator), which will become the “inner” iterator. Then the filtering logic is written in the accept() method. As my GpxTracker Generator returns the DateTime that I want to filter against in the key, I retrieve the key from that “inner” iterator, and compare that against my start and end DateTimes to determine whether it falls within the range that I’m interested in.

Running it requires iterating over the instantiated DateTimeRangeFilter with my GpxReader Iterator and the date/time range that I want to filter against, and only the entries from the GPX file that match the filter are returned.


// Define the date/time filter parameters
$startTime = new DateTime('2015-11-23 13:20:00Z');
$endTime = new DateTime('2015-11-23 13:30:00Z');

// Iterate over the trackpoint set from the gpx file, displaying each point detail in turn
foreach (new DateTimeRangeFilter($gpxReader->getElements('trkpt'), $startTime, $endTime)
         as $time => $element) {
    printf(
        '%s' . PHP_EOL . '    latitude: %7.4f longitude: %7.4f elevation: %d' . PHP_EOL,
        $time->format('Y-m-d H:i:s'),
        $element->position->latitude,
        $element->position->longitude,
        $element->elevation
    );
}

The result is something like:


2015-11-23 13:24:40
    latitude: 53.5441 longitude: -2.7416 elevation: 124
2015-11-23 13:24:49
    latitude: 53.5441 longitude: -2.7416 elevation: 122
...
2015-11-23 13:29:45
    latitude: 53.5441 longitude: -2.7414 elevation: 122
2015-11-23 13:29:55
    latitude: 53.5442 longitude: -2.7413 elevation: 123

with the displayed timestamps showing that the filter has been applied.

As the above example demonstrates, writing custom filter classes that extend FilterIterator is fairly straightforward; but it does feel a bit awkward having to retrieve the key/current element from the “inner” iterator, and overloading the constructor to pass in filtering arguments. But there is an alternative approach. In addition to the abstract FilterIterator class, SPL also provides two concrete filter classes that extend FilterIterator. One of these, the RegexIterator, I’ll save for another later post, but the second, the CallbackFilterIterator, is more intuitive to use than the FilterIterator, being very similar to the array_filter() function in that it accepts a callback as an argument, and can work with both keys and with values.

Using the same DateTime range filtering that I implemented above when creating my own concrete filter class extending the FilterIterator, I can implement the same logic for the CallbackFilterIterator as:


// Define the date/time filter parameters
$startTime = new DateTime('2015-11-23 13:20:00Z');
$endTime = new DateTime('2015-11-23 13:30:00Z');

// Create the filter callback with the date/time parameters we've just defined
$timeFilter = function(\GpxReader\GpxElement $element, DateTime $timestamp) use ($startTime, $endTime) {
    return $timestamp >= $startTime && $timestamp <= $endTime; };

// Iterate over the trackpoint set from the gpx file, displaying each point detail in turn
foreach (new CallbackFilterIterator($gpxReader->getElements('trkpt'), $timeFilter)
         as $time => $element) {
    printf(
        '%s' . PHP_EOL . '    latitude: %7.4f longitude: %7.4f elevation: %d' . PHP_EOL,
        $time->format('Y-m-d H:i:s'),
        $element->position->latitude,
        $element->position->longitude,
        $element->elevation
    );
}

I don’t need to define the concrete class, just the callback to implement the filtering logic, and I can inject my date range directly into the callback using use.
The callback itself accepts up to three arguments: the element value, the key, and the “inner” iterator itself; so my filtering logic has direct access to all of those if I need it to. This means that unlike the accept() logic in the basic FilterIterator, I don’t need to retrieve the value and key from the “inner” iterator in my code, they’ll be injected directly into my callback. And my callback can give them more meaningful names like $element and $timestamp (and even use type-hinting for them) rather than simple $value and $key, so it can improve readability (and IDE usage).

Another filter that I used extensively when analysing the cats’ movement patterns was a bounding box filter. I could use this to determine how much time the cats were spending in the house and/or garden, or whether they had been anywhere near the main road, or when I was most likely to find them away from the house. I implemented this as a class with methods that I could use as callbacks. (Code for the BoundingBox class can be found in the github repo.) Another benefit of the CallbackFilterIterator when working against the element value is that callbacks written to work with array_filter(), or with my own Traversable filters, is that they will also work directly (without any code changes to the callback) with the CallbackFilterIterator.


// Create a bounding box defining the coordinates we want to test each point against
$boundaries = new GpxReader\Helpers\BoundingBox();
$boundaries->setLatitudes(53.54382, 53.54340);
$boundaries->setLongitudes(-2.74059, -2.74005);
// We want to set the filter to include only points inside the bounding box
$boundingBoxFilter = [$boundaries, 'inside'];

// Iterate over the trackpoint set from the gpx file, displaying each point detail in turn
foreach (new CallbackFilterIterator($gpxReader->getElements('trkpt'), $boundingBoxFilter)
         as $time => $element) {
    printf(
        '%s' . PHP_EOL . '    latitude: %7.4f longitude: %7.4f elevation: %d' . PHP_EOL,
        $time->format('Y-m-d H:i:s'),
        $element->position->latitude,
        $element->position->longitude,
        $element->elevation
    );
}

It isn’t always quite as straightforward to convert an array_filter() callback using the key to a callback for CallbackFilterIterator, because of the differences in the callback arguments, and the way that array_filter() changes the callback arguments depending on the flag used when filtering.

array_filter($array, function($value) {...}));
array_filter($array, function($key) {...}, ARRAY_FILTER_USE_KEY));
array_filter($array, function($value, $key) {...}, ARRAY_FILTER_USE_BOTH));

Whereas the callback arguments for the CallbackFilterIterator are consistent – and always provide the option for your filter logic to use either the element value, or the key, or both – because the CallbackFilterIterator always passes both value and key (as well as the iterator itself) to the callback.

It’s probably also worth a reminder at this point that keys for Traversables can be duplicated, and can be of an datatype including objects (as in the case of my GpxReader where the key is a DateTime object), unlike keys for arrays which must be unique, and can only ever be integers or strings.

Mapping

Although the GPX tracking data was good to tell me where any of the cats was at any given time, and I could use that data to plot out their movements on a map, it provided no data for the distances that they travelled between those time plots. And I wanted to know just how far they had walked when out hunting, or how quickly they had been running. But from any known pair of times and locations, it’s possible to calculate distance and speed. I chose to use a mapper function to add that information to my basic GpxElement object, so each object also provided the distance between itself and the previous object. When working with arrays, array_map() (or the array_walk() function) provide the functionality to modify all the values of an array iteratively, and my original map() function served the same purpose. From SPL, the IteratorIterator can serve that same purpose: by allowing us to overload the current() (or key()) methods and modify the value.


class DistanceMapper extends IteratorIterator {
    private $distanceCalculator;

    public function __construct(Traversable $iterator, \GpxReader\Helpers\DistanceCalculator $distanceCalculator) {
        parent::__construct($iterator);
        $this->distanceCalculator = $distanceCalculator;
    }

    public function current() {
        $gpxElement = parent::current();
        return $this->distanceCalculator->setDistance($gpxElement);
    }
}

As we iterate over this DistanceMapper object, every call to current() triggered through the foreach() loop modifies the GpxElement object value retrieved from the parent Iterator (adding a calculated distance property to that object) before returning it so that it can be processed by the foreach() loop. The actual code to calculate the distance is implemented in the DistanceCalculator object using the basic Haversine formula, and that code also modifies the GpxElement before it is returned. (That code isn’t shown here, but is available in the github repository).


// Provide a distance calculator to calculate the distance between a trackpoint and the previous trackpoint
$distanceCalculator = new GpxReader\Helpers\DistanceCalculator();

// Iterate over the trackpoint set from the gpx file, mapping the distances as we go, displaying each point detail in turn
foreach (new DistanceMapper($gpxReader->getElements('trkpt'), $distanceCalculator) as $time => $element) {
    printf(
        '%s' . PHP_EOL . '    latitude: %7.4f longitude: %7.4f elevation: %d' . PHP_EOL .
            '    distance from previous point:  %5.2f m' . PHP_EOL,
        $time->format('Y-m-d H:i:s'),
        $element->position->latitude,
        $element->position->longitude,
        $element->elevation,
        $element->distance
    );
}

The result is something like:


2015-11-23 11:05:07
    latitude: 53.5440 longitude: -2.7404 elevation: 146
    distance from previous point:   0.00 m
2015-11-23 11:05:15
    latitude: 53.5440 longitude: -2.7403 elevation: 141
    distance from previous point:   7.66 m
...
2015-11-23 14:58:47
    latitude: 53.5439 longitude: -2.7426 elevation: 103
    distance from previous point:   2.56 m
2015-11-23 14:58:57
    latitude: 53.5438 longitude: -2.7426 elevation: 110
    distance from previous point:   2.66 m

So with a small change to the logic in my loop to sum the distances, I can calculate the total distance that cat has walked


$totalDistance = 0.0;
// Iterate over the trackpoint set from the gpx file, mapping the distances as we go, summing each distance in turn
foreach (new DistanceMapper($gpxReader->getElements('trkpt'), $distanceCalculator) as $time => $element) {
    $totalDistance += $element->distance;
}

printf(
    'Total distance walked:  %5.2f m' . PHP_EOL,
    $totalDistance
);

giving


Total distance walked:  4333.88 m

Complex Queries

As with my original map() and filter() functions, it is possible to nest multiple filters and mappers to apply multiple rules and transformations to our underlying data, to run more complex queries against our dataset.


// Iterate over the trackpoint set from the gpx file, displaying each point detail in turn
//     applying both a time filter (12:00:00 - 12:20:00 on 2015-11-23)
//     and the bounding box filter for inside the house/garden
foreach (new CallbackFilterIterator(new CallbackFilterIterator($gpxReader->getElements('trkpt'), $timeFilter), $boundingBoxFilter)
         as $time => $element) {
    printf(
        '%s' . PHP_EOL . '    latitude: %7.4f longitude: %7.4f elevation: %d' . PHP_EOL,
        $time->format('Y-m-d H:i:s'),
        $element->position->latitude,
        $element->position->longitude,
        $element->elevation
    );
}

You do need to take care of the order in which you nest filters and mappers if you have filters on data that requires the mapper to execute a data transformation first, such as using a mapper to calculate speed of movement between trackpoints, and then a filter to find when a cat was running (e.g. speed in excess of 5m over the 10 seconds between trackpoints).


To summarise, hopefully this article has demonstrated some of the power and flexibility available through the SPL Iterators, and perhaps helped to demystify some of the obscure naming and sparse documentation, and to show valid use cases where Iterators can be of benefit in your code.

This entry was posted in PHP and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s