# So, What's Going On Here?

Over the summer of 2018 three interns working for a collaboration between two Cambridge companies, Argon Design and Dovetailed, produced a fun and interactive tool they called the Gumpifier.

The Gumpifier is effectively an automatic Photoshop, a green screen without the green screen: given a photo of a person, and a photo of some background scene, it will automatically cut out the photo of the person, analyse the background picture, and attempt to place the person into it. The name 'Gumpifier' is, of course, inspired by the impressive visual effects in the film Forrest Gump, in which the eponymous hero was layered into historical footage.

In the Gumpifier, this is primarily done using artificial intelligence techniques. A 'convolutional neural network' performs the cutting out, and colour correction algorithms are run to match the person's brightness and colour temperature to the background before the final image is composited.

In the following three sections of this blog post, each of the interns describes some of their work.

Hai-Dao, from Dovetailed, worked on designing the user interface with the target of providing both usability and transparency for the end user in light of the heavy, and potentially opaque, use of artificial intelligence.

Patrick, from Argon Design, worked on automatic colour correction of the picture of the person to make it match the background image.

Mohammed, also from Argon Design, worked on segmentation techniques for identifying and cutting out objects in photos.

# Hai-Dao -- Dovetailed -- User Experience

## Brief

The goal of the joint project was to use the strengths of Argon Design's software skill and Dovetailed's user experience (UX) expertise to create a fun tool, powered by artificial intelligence (AI) and informed by UX principles. The idea for the Gumpifier was born of an interest in cutting-edge AI research and a desire to make AI approachable and human-centered.

## Process

After a joint brainstorm, we at Dovetailed began to research similar AI experiments and tools: Google's library of fun AI experiments was a great source of inspiration and fun around the office. Argon Design provided a proof of concept which gave an idea of how the Gumpifier would function; however, the user interface (UI) of the Gumpifier was far from decided. Our goal with the Gumpifier was to maintain a simple and playful UI that used familiar patterns and gestures. The Gumpifier is a fun tool that shouldn't require a steep learning curve or loads of text to use.

We went through four iterations of the Gumpifier, using internal feedback for the initial designs and testing with real users for the final two iterations.

Figure 1: Our initial user interface, as designed by Argon Design.

Figure 2: The first iteration of the Gumpifier.

Figure 3: One of the many mock-ups of the Gumpifier.

Figures 1 through 3 show the biggest change in visual design - this is when the concept of the Gumpifier was still in flux. We experimented with different functions and interaction patterns in order to achieve an optimal tool. Prototypes were evaluated on ease of understanding, efficiency in achieving the goal task, and visual playfulness. I defined 'playful' in this use case as something that elicited a giggle or a smile, whether through allusions to Forrest Gump or fun colours that added brightness to the tool. If our users were able to understand the goal of the tool, effectively utilize the Gumpifier to manipulate images and have fun while doing it, that was the gold standard of a user-centered process.

Figure 4: The design that inspired the current design of the Gumpifier.

Figure 4 is the closest to the design that you see today on the Gumpifier website - selection patterns and button placements have changed but the key features of the Gumpifier (moving, editing and saving) are all present.

Interactive prototypes were especially helpful for testing, as whiteboard sketches don't have quite the same functions digital products do. Once we began testing with digital prototypes it was important to test for any confusion traps by allowing users to make their own judgments of how the Gumpifier would function and accommodating those uses with the next design iteration. With more testing and time away from the project, I think the next iteration could focus on refining visual elements and having more user input on the AI side (for example, by allowing users to select which elements of an image are part of the foreground or background). Even after the launch of the Gumpifier, we are still looking for user feedback and would love to hear what you have to say!

## User Experience of AI

AI is a hot topic that can garner reactions both positive and negative. Rather than the dystopian view of AI as a tool to replace people and eliminate humanity, we believe in a user-centered AI that empowers human flourishing by automating the boring stuff and giving people time to solve the larger issues. In the field of AI ethics, AI that cannot be explained is known as black-box AI. Black-box AI gives us answers without being able to trace how or why it made the decision it did.

User-centered AI is AI that is explainable and seeks to enhance rather than replace humans. During the initial brainstorm for the Gumpifier, we sought to create something that used AI techniques without eliminating user control. The Gumpifier depends on you for photo choices and allows users to edit the Gumpified image to whatever parameters they wish. The Gumpifier is merely an assistant, making the process of editing people in and out of photos much quicker and simpler. We have also included an explanation of what the Gumpifier does when it processes photos to keep users in the loop. In the future, tools that could help unite UX and AI development must be collaborative and creative in nature - maybe a design tool like Sketch that could link to dev tools like Sublime Text in order to demonstrate what AI tools are like to both designers and developers. True user-centered AI is about thinking about the needs of users from the very beginning of a project and ensuring that humans are the ones who will benefit most from your use of AI.

# Patrick -- Argon Design -- Colour Correction

## Introduction

“The Gumpifier will allow the user to input an image of themselves, along with a photo of a background scene, and it will automatically insert this image into the scene in a realistic manner”. Over my few weeks of interning at Argon Design, I had become well practised at explaining this brief to colleagues and relatives alike. Our remit posed an interesting problem: it was naturally and intentionally open-ended; it was defined only in high-level natural-language terms; there was no obvious starting point and no defined success criteria. Our only constraint was that we must use Machine Learning (ML)/Artificial Intelligence (AI).

Our first port of call, having completed some crash courses in standard ML techniques and libraries, was to brainstorm areas which we thought feasible given our time constraints and resources. We came up with the following:

Figure 1: The blue boxes indicate slightly 'easier' areas. The green boxes show areas which we anticipated would cause us more problems, and the orange boxes indicate topics we thought could be very difficult. The orange line around some of the blue boxes hints that, although it is trivial to apply matrix transformations to an image, knowing which transformations to apply could be tricky, and may have to draw on themes from the orange boxes.

The most prominent aspects of the figure related to the positioning of the foreground image. This included translation, scale, rotation as well as segmentation (automatically cutting out the picture of the person in the foreground from the background against which they were taken). In another section of this post Mohammed, the other Argon intern, describes how he went about tackling these areas, however my task related to the top right blue box: colour correction.

## Hue Correction

The term 'colour correction' seems to be wide-ranging and encompass many areas. I started working on the 'hue' channel of a Hue-Saturation-Lightness (HSL) colour space. This seemed like a good idea at the time for a couple of reasons:

1. It seemed sensible to work with images in a colour space which has some physical meaning. Practically, this meant working in HSL or Hue-Saturation-Value (HSV).

2. Tensorflow had some very tempting built in functions with names like adjust_hue() and rgb_to_hsv().

Thus, I began, naively, to construct a convolutional neural network (CNN) which attempted to predict the correct shift in hue to apply to the foreground to make it match the background.

Unfortunately, it did not. In fact, the predictions consistently collapsed to a mean value, taking no account of the contents of the inputs.

The most sensible debugging route seemed to be to build up from a similar toy CNN and to find the point at which it had failed. This took the form of attempting to perform similar hue correction of a single, randomly coloured pixel to another one, then building up to a 2x2 pixel block, then 4x4, and at some point, switching to photographic, not random, data.

The debugging worked. Unfortunately, it highlighted the inevitable failure of working on 'hue correction'.

Let's think about the three channels with which we're working: hue, saturation and lightness. Conceptually, saturation and lightness are both linear scales, ranging from, say, 0 to 255. Hue is different. Hue is a 'circular' scale. We often think about it in degrees, where $0^{\circ}$ is red, $120^{\circ}$ is green and $240^{\circ}$ is blue, before we wrap round to red again. This creates a discontinuity whereby the same colour is represented at $20^{\circ}$, $380^{\circ}$, $740^{\circ}$, ..., for example.

Figure 2: Multiple distinct values of hue map to the same colour.

This is not an easy function for a neural net (NN) to learn. We can conceive of an architecture which should learn such a modulo function, or we could implement the modulo in the loss function itself, removing responsibility from the NN, but nothing seemed to get around the inherent problems caused by the discontinuity.
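One way to fold the wrap-around into the loss itself is to measure hue error "the short way round" the circle. A minimal NumPy sketch, assuming hue is normalised to $[0, 1)$ (the function name is illustrative, not from the original code):

```python
import numpy as np

def circular_hue_error(predicted, target):
    """Smallest signed difference between two hues on the colour circle.

    With hue normalised to [0, 1), a naive subtraction can report an
    error of up to 1.0, but the 'short way' round is never more than 0.5.
    """
    diff = (predicted - target) % 1.0
    return np.where(diff > 0.5, diff - 1.0, diff)

# 0.95 and 0.05 sit just either side of the red discontinuity:
# the circular error is about -0.1, not the naive 0.9.
print(circular_hue_error(0.95, 0.05))
```

A loss built on such a wrapped difference removes the numerical cliff at $0^{\circ}/360^{\circ}$, though as noted above it did not solve the underlying problem.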

The neural net does make a passable attempt at predicting what hue shift to apply to correct a foreground at values far from the discontinuity, but starts to break down close to it.

We also started to wonder whether applying a uniform shift in hue to all pixels in the foreground was really a sensible thing to do. Simply rotating the hue of each pixel is, perhaps, an unusual situation to have to correct, and is not necessarily the same as a related and more common correction, that of removing a colour cast, which is more in the domain of white balance.

## Luminance Correction

With this in mind, I moved onto a more promising line of enquiry: luminance correction. This seemed a more realistic goal both on a conceptual level (it is often obvious when the foreground is too dark or too light) and a practical level (the discontinuity of the modulo function present in hue correction no longer applies).

This time, I began tentatively, building up from the simple problem of matching the luminance of two randomly generated pixels to those of large images.

Unfortunately, a different set of problems soon arose, this time, due to limitations in the training datasets I used.

## Datasets

Consider the ideal dataset for training. Each example would comprise:

• Feature: A picture of a background scene.
• Feature: A picture of a person in different lighting, along with a mask indicating where the person is.
• Label: The same person, in the same pose, inserted into the background scene, with everything else the same, except the luminance of the person has been corrected.

This dataset does not exist. It is unlikely to exist in the future, and we did not have the time or resources to produce it.

A dataset that does, however, exist, is the second bullet point above: a dataset of pictures of people with their associated masks. This I modified to try to fit our purposes, giving the following dataset:

• Label: a randomly generated float representing a shift in luminance.
• Feature: cut out of the person with the luminance shift applied.
• Feature: the remainder of the image from which the person has been cut out.

We try to train the network to predict the label value given the background and the luminance-shifted foreground.
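A sketch of how one such example might be synthesised from an image and its mask (hypothetical NumPy code; the shift range and helper name are my own, not from the project):

```python
import numpy as np

def make_example(image, mask, rng):
    """Turn an (image, person-mask) pair into one training example:
    label = a random luminance shift, features = the shifted cutout
    plus the background with the person removed."""
    shift = rng.uniform(-0.3, 0.3)                        # the label
    fg = image.astype(np.float32)                          # copy as float
    fg[mask] = np.clip(fg[mask] + shift * 255.0, 0, 255)  # shifted person
    bg = image.copy()
    bg[mask] = 0                                          # person cut out
    return (fg, bg), shift

# Tiny demonstration on a flat grey image with a 2x2 'person'.
rng = np.random.default_rng(0)
img = np.full((4, 4, 3), 128, dtype=np.uint8)
person = np.zeros((4, 4), dtype=bool)
person[1:3, 1:3] = True
(fg, bg), label = make_example(img, person, rng)
```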

We might hope that the neural net somehow learns to match the luminance of the foreground with the background.

We hope in vain.

Consider, for a moment, the following example. We have a picture of someone with skis, and are trying to insert them into a picture of a snow-covered ski slope. This throws up a whole tranche of questions and problems:

• A naive approach would slam the luminance of the foreground up to a high value because of the brightness of the white snow behind. But what if the skier is wearing black clothing? This would be an inappropriate correction.
• How do we know if the skier is wearing black clothing? He may be wearing grey, or even white clothes, but the photo is horribly underexposed. To better gauge the overall exposure of the foreground, we would need its surroundings: impossible from our modified dataset as we already use the surroundings as the background image!
• What if the skier is wearing black trousers and a white top? What is the most appropriate luminance shift to apply? Should we apply a single value across the whole foreground or multiple values?

These are questions which a human may find difficult to answer, given the same dataset, so it seems unreasonable to expect a neural net to learn effectively.

Indeed, the issue raised by the last bullet point was a first-rate example of something going wrong when we tried running the trained network on practical examples.

## Deception

The problems above may appear obvious, but it took some time to arrive at the reasons for the networks' failures.

In the early stages of developing the toy networks operating on a single pixel of data, I was very keen to track the progress of the network whilst it was training to ensure it was heading in the correct direction. This led to a series of graphs such as the following:

Each line on the graph represents a different network architecture or variation of training data. This shows, on the x-axis, the iteration number during training, and on the y-axis, the average $R^2$ correlation value between predicted luminance shifts and the actual luminance shifts (the 'labels') for all values so far in the training. To unpack that a bit, take a look at these four graphs of predicted values against the label values at various stages throughout training:

We see that, near the beginning, there is little correlation between the two, but as training goes on, the correlation, and thus the $R^2$ value, as hoped for, increases. This is reflected in the above graph which, even for the complex examples of real life images, often reached $R^2$ values of $>0.8$, which seemed perfectly reasonable.

So how does such close monitoring still lead to poor results on a validation or test set? If you've read this far, you probably know the answer already: overfitting. It would have been sensible to implement a second piece of monitoring: a graph showing the accuracy of the network on a set of images not used for training. It is when this graph begins to level off, or decrease, indicating that no more generalisable learning is taking place, that training should be stopped.
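The stopping rule itself is simple; a minimal patience-based sketch (illustrative only, not the monitoring code from the project):

```python
def should_stop(val_losses, patience=3):
    """True once the validation loss has failed to improve on its best
    earlier value for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_so_far

# Loss improves early on, then plateaus: time to stop.
print(should_stop([1.0, 0.9, 0.8, 0.81, 0.82, 0.83]))  # -> True
print(should_stop([1.0, 0.9, 0.8, 0.7]))               # -> False
```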

## Future Work

Overall, trying to train a neural network to perform colour correction with the datasets we had did not work. After a few weeks of attempts, I moved on to working with Dovetailed to implement the user interface, in the form of a web app, and some server-side code. In the end, some colour correction functionality was implemented using standard algorithmic techniques, and not artificial intelligence, as described in Mohammed's section.

I believe, however, that there is still hope. It may be possible to apply other techniques in AI to achieve our desired result. Perhaps taking in more of the semantic meaning of the images used in a dataset could be utilised by the network: one paper to which we referred used this technique in other areas of colour correction. Another promising lead would be to use the recent development of 'Generative Adversarial Networks' (GANs), which have had startling success in many areas of AI image processing.

# Mohammed -- Argon Design -- Segmentation

My part of the project focused primarily on the segmentation (cutting out) of humans from an image. As mentioned previously, this project was constrained to use as much machine learning as appropriate. Due to this, my first port of call was to gain a deeper understanding of convolutional neural networks (CNNs). These were among the current techniques for state-of-the-art image segmentation, as shown by recently published papers on models such as MaskRCNN and DeepMask, both of which scored very highly on benchmarks such as MSCOCO.

Initially, I took a short online course to gain some knowledge on important concepts such as 'regularisation', 'gradient descent optimisers' and 'convolutional layers'. This provided me with a baseline level of knowledge to design and develop a series of practice runs, the idea being to gain an appreciation of what factors can affect model architecture and accuracy.

Each toy program focused on a different image-related problem, such as the well-known 'MNIST' digit recognition problem and face recognition. Working on small images with fewer features provided a much better starting ground than attempting to head straight to large images, as models trained and evaluated far faster. This resulted in a short turnaround time between models and quickly gave me the understanding required to approach the larger problem of full image segmentation.

## Evaluating Various State-of-the-art Image Segmentation Neural Nets

I tried five CNN image segmentation implementations and qualitatively measured the performance and usability of each network for the given task. This was not an in-depth study into their respective qualities, but instead acted as a bridge to design and develop my own, and hence quantitative data was not obtained (the papers contain sufficient quantitative data for comparison against the respective datasets).

My findings were as follows:

DeepMask (Paper, Implementation): DeepMask performed well for the task at hand. It extracted the instances of the objects correctly and provided the necessary labels for the project. However, it does not run out of the box, instead relying on NVIDIA libraries and graphics cards. This wasn't a problem during testing due to the availability of sufficient equipment, but it made the model less suitable for deployment in the final product.

SharpMask (Paper, Implementation): SharpMask is an improved version of DeepMask. It performed very well for the task, providing much sharper and more accurate results than DeepMask. It is a little slower due to its higher complexity, but the difference isn't critical ($0.5$s on DeepMask per COCO image vs $0.8$s on SharpMask).

BlitzNet (Paper, Implementation): BlitzNet, although fast, performed poorly in comparison with the other models. It didn't perform very well in cases with overlapping objects, and didn't extract fine-grained information like MaskRCNN did. It included far too much noise around the body for use in this project.

CRFasRNN (Paper, Implementation): CRFasRNN suffered similar issues to BlitzNet in that it performed poorly with overlapping objects. It also produced a lot of jagged noise around the edges of the objects.

MaskRCNN (Paper, Implementation): MaskRCNN performed very well on the small sample set. The implementation used here provided masks for every object, in addition to their bounding boxes, confidence and labels.

MaskRCNN was the model chosen for this project, because there was already an accessible implementation in Keras with an excellent API, as well as very good published results. It also provided additional semantic information with the masks that the others did not, giving space for more fine-grained analysis when computing parameters such as position and scale.

## Implementing the API: Colour Correction, Scaling, Positioning, Shadows and Refinement

At this point, we were approaching the halfway mark of our time at Argon Design. The focus for me now was to implement the neural network and combine it with a series of graphics algorithms to produce an acceptable end result.

One goal for me was to design the API such that it was easily extensible, allowing future interns to customise how the software was computing parameters such as scaling. Thus, each parameter's computation is self-contained, and the approaches I took are described below:

Colour correction: Colour correction was determined to play a key role in whether an image was suitably realistic. My approach was to split colour correction into two key areas: luminance correction and white-balancing.

In order to correct luminance, it is important to understand what it is, as well as the factors that influence it. Luminance is a measure of the intensity of light emitted from a surface per unit area. However, this in itself would be an incredibly difficult measure to obtain, especially due to variations between monitors that would need to be taken into account. Thus, it was best to focus on the concept of relative luminance, which would have a consistent yet noticeable effect on the users' perception of colour within the image.

As our computer screens display images using an RGB colour format, it was clear at this point that the various levels of each colour channel would be the controllable influences of luminance in the images. However, the proportion that each channel influences the relative luminance is dependent on how the human eye perceives the wavelength of light emitted for each colour. It turns out that green has almost double the influence of red as two out of every three cones in the eye have relatively high sensitivities to wavelengths commonly associated with green light. Red has double the influence of blue for similar reasons.

As a normalised value providing a standard measure across any set of images, I decided to use the vector $(0.299, 0.587, 0.114)$ (note that the components sum to 1), which is referenced in a W3C document. To calculate the relative luminance value per pixel, each pixel would compute the dot product of its RGB vector with the above luminance vector.
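In NumPy this per-pixel dot product is a one-liner (a sketch assuming 8-bit RGB input; `relative_luminance` is an illustrative name):

```python
import numpy as np

LUMA = np.array([0.299, 0.587, 0.114])  # the coefficients above

def relative_luminance(image):
    """Per-pixel relative luminance of an HxWx3 RGB image (values 0-255),
    computed as the dot product of each RGB vector with LUMA."""
    return image.astype(np.float64) @ LUMA

# Pure white scores the maximum; pure green scores 0.587 of it.
pixels = np.array([[[255, 255, 255], [0, 255, 0]]], dtype=np.uint8)
print(relative_luminance(pixels))
```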

My initial approach here was to take the average luminance vector of the entire background, and use that as the new luminance value. This suffered from similar issues mentioned in Patrick's post, particularly those mentioned in the ski slope example.

I then decided that, given the knowledge of where the cutout of the person is, I can take the average luminance vector of the background pixels that are to be covered by the person. This approach gave much better results due to better localisation, but still suffered from issues in extreme circumstances.

My third approach was to take the average luminance of the cutout, and then to take the average of that and the value of the local background segment and use that as the new luminance value. This was significantly more successful, even in the average case, as it took into consideration the original lighting and features of the cutout too, as well as limiting any extreme values that could be held by either the background or foreground.
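A sketch of that third approach (illustrative NumPy; it assumes luminance maps computed as described above, and applies one scalar shift to the whole cutout):

```python
import numpy as np

def correct_luminance(cutout, cutout_lum, covered_bg_lum):
    """Shift the cutout so that its mean luminance moves to the average
    of its own mean and the mean luminance of the background pixels it
    is about to cover."""
    target = 0.5 * (cutout_lum.mean() + covered_bg_lum.mean())
    shift = target - cutout_lum.mean()
    return np.clip(cutout.astype(np.float64) + shift, 0, 255)

# A mid-grey cutout placed over a much brighter region is brightened
# only half way towards the background's level.
cut = np.full((2, 2), 100.0)
out = correct_luminance(cut, np.full((2, 2), 100.0), np.full((2, 2), 200.0))
```

Averaging with the cutout's own luminance is what limits the extreme corrections seen in the ski-slope example.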

Having achieved a suitable luminance colour correction result, white-balance was then the next important metric to automatically configure.

White balance is the technique of adjusting the colour balance of light so that it appears a neutral white. This is used to compensate for cases where there is something that would change the natural colour of objects, such as the yellowish colour of artificial light or a shadow from shade.

This colour balance is done on a per-channel basis, unlike the luminance correction, which uses a single scalar value to modify the entire colour. White balance is often measured in Kelvin, a measure of temperature: each temperature corresponds to the colour output by a black-body radiator at that specific temperature.

The idea behind white balance algorithms is to sample a space that is known in the world to be a certain colour and modify the entire image such that the sampled area matches the known colour. This requires an invariant that we couldn't satisfy; namely having an area of an image with a known colour. This is because the images are being provided by a user with no additional contextual data that a device such as a camera would know when doing an internal white balance.

In order to match the white balance of the foreground with the background without this additional context, I first attempted to simply take the average colour of the background image and use those values in an algorithm that modifies the colour per-pixel. This did not work at all, instead producing an output that was either completely washed out or completely blacked out. This meant that certain colours were having a significant weight on the resultant value, which needed to be accounted for.

My solution to this problem was to determine the colours that provided these extreme values, and discard them. To do this, I computed the values in the $0..5$ percentile range and those in the $95..100$ range. I then clipped them from the respective colour channels, and finally took the mean. This was vastly more successful, and was used in the final implementation. From here, I took the three colour channel values that were returned and determined the colour temperature value by first converting the colour space from sRGB to XYZ, then to xy, and then finally used the Hernandez 1999 correlated colour temperature algorithm to produce a final result.
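The clipping step might look like this (a NumPy sketch; the bounds follow the $0..5$ and $95..100$ percentile ranges above, and the function name is mine):

```python
import numpy as np

def robust_channel_means(image):
    """Per-channel mean of an HxWx3 image after discarding values below
    the 5th or above the 95th percentile, so that extreme colours cannot
    dominate the average."""
    means = []
    for c in range(3):
        channel = image[..., c].astype(np.float64).ravel()
        lo, hi = np.percentile(channel, [5, 95])
        kept = channel[(channel >= lo) & (channel <= hi)]
        means.append(kept.mean())
    return np.array(means)
```

The three means then feed the sRGB to XYZ to xy conversion and the Hernandez correlated colour temperature calculation mentioned above.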

Scaling: In comparison to the colour correction techniques implemented, the scaling function is trivial. I manually created a lookup table, where each cell contains a scaling factor that represents the average object's height (from Wikipedia) to the average person's height. These map to the MSCOCO API lookup table of object names. Objects that are to be ignored contain a value of zero as their scaling factor.
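An illustrative fragment of such a table (the heights and factors here are examples, not the project's actual values):

```python
AVG_PERSON_M = 1.7  # assumed average human height in metres

# label -> (average object height in metres) / AVG_PERSON_M;
# a factor of 0.0 marks objects the scaler should ignore.
OBJECT_SCALE = {
    "person": 1.0,
    "car": 1.5 / AVG_PERSON_M,
    "stop sign": 2.1 / AVG_PERSON_M,
    "frisbee": 0.0,
}

def person_pixel_height(label, object_pixel_height):
    """Pixel height the inserted person should be, given a reference
    object's label and on-screen height; None if the object is ignored."""
    factor = OBJECT_SCALE.get(label, 0.0)
    if factor == 0.0:
        return None
    return object_pixel_height / factor
```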

Positioning: The general idea behind the repositioning algorithm is to choose a random object and place the person in front or near that object.

Here I use a custom weighted distribution to determine which object to place the person around. This distribution is generated by computing the softmax of a list containing the sizes of all the objects within the image (which can also implicitly give information about the depth of said object within the image). This gives a random variable that can be sampled to obtain the ID of the object.
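The sampling step can be sketched as follows (NumPy; in practice the raw pixel areas would need scaling before the softmax to avoid saturating the exponentials):

```python
import numpy as np

def sample_object(areas, rng):
    """Draw an object index with probability softmax(areas), so larger
    (and therefore probably nearer) objects are picked more often."""
    a = np.asarray(areas, dtype=np.float64)
    a = a - a.max()                     # standard softmax stabilisation
    p = np.exp(a) / np.exp(a).sum()
    return rng.choice(len(a), p=p)

rng = np.random.default_rng(0)
# A dominant object is almost always the one selected.
print(sample_object([40.0, 2.0, 1.0], rng))
```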

Once we have the random object, we determine its location in the image and scale the person to be the correct size. We then place the person a few pixels below the centre of the object to provide a more realistic look.

If the randomly selected object is a human, the algorithm instead tries to place the person beside it, to avoid obscuring those already in the image. This in turn provides an automatic way of adding absent friends and family to important photos, like a Christmas family reunion.

The method used here is to continuously sample the various bounding boxes with which the sampled object shares a horizontal plane. It starts by sampling to the right, trying to find a gap between two neighbouring bounding boxes that can fit the person in. If there is none to the right, sample again from the object but instead going left. If there is still none, slot the person right next to the left edge of the image.
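A sketch of that gap search on one horizontal plane, with boxes reduced to (left, right) intervals (the names and the exact fallback behaviour are my own simplification of the description above):

```python
def place_person(boxes, anchor, person_w, image_w):
    """Return an x position for the person's left edge: the first gap to
    the right of the anchor wide enough to fit them, else the nearest gap
    to the left, else flush against the image's left edge."""
    others = sorted(b for b in boxes if b != anchor)
    # treat the image borders as zero-width 'boxes'
    edges = [(0, 0)] + others + [(image_w, image_w)]
    right_gaps, left_gaps = [], []
    for (_, r1), (l2, _) in zip(edges, edges[1:]):
        if l2 - r1 >= person_w:  # gap between neighbours fits the person
            (right_gaps if r1 >= anchor[1] else left_gaps).append(r1)
    if right_gaps:
        return right_gaps[0]
    if left_gaps:
        return left_gaps[-1]
    return 0
```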

Shadows: Research also suggested that shadows play a critical role in determining whether an image looks realistic. However, shadows had to be implemented correctly otherwise they would instead have the opposite effect: making the resulting image look completely out of place. Unfortunately, my attempt at shadows was not realistic enough for it to be used in the final product, primarily because the position of the shadows did not match up with the lighting. The semi successful approach I did take is detailed here for the sake of completeness.

The idea was to use a perspective transform of the cutout to generate the shadow.

The first step was to generate an entirely black copy of the cutout. As shadows are not completely uniformly black, I applied a continuous (modelled as a stepwise function with a change small enough to appear continuous) transformation to the cutout based on its y-coordinate to achieve a fade effect. This transformation used a computed value of $255 / (0.75 \times \text{height})$ to determine the change of colour between successive y-values. This was then applied to the cutout, achieving a result in which the bottom $25\%$ was a sharp black and the shadow became increasingly lighter further up the image.
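The fade can be generated as a per-row matte (a NumPy sketch treating the value as 'darkness', with row 0 at the top of the image; not the original implementation):

```python
import numpy as np

def shadow_fade(height, width):
    """Darkness matte for the shadow: fully dark (255) over the bottom
    25% of rows, fading linearly to 0 at the top, with a per-row change
    of 255 / (0.75 * height)."""
    step = 255.0 / (0.75 * height)
    rows = np.arange(height, dtype=np.float64)
    darkness = np.clip(rows * step, 0.0, 255.0)
    return np.repeat(darkness[:, None], width, axis=1)

matte = shadow_fade(100, 2)
```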

I then had to perspective transform the shadow so that it would appear to be lying on the ground. This was done by generating two points in a new plane, each of which was created by rotating a respective top point on the current plane by an arbitrary factor, then diagonally stretching them by an arbitrary factor, and finally translating them each again by a set of arbitrary coordinates.

Finally, as shadows are soft in the real world, I applied a series of Gaussian blurs on the shadow to provide a soft looking result. This, like the fade effect transformation, was done using a stepwise transformation that progressively made the shadow softer as you venture up the image along the y-axis. This was done after merging the shadow with the background image as that captured the underlying colour of the scene in the blur.

Another approach I took to creating shadows was generating a shadow map using an artificial scene and ray tracing. As there is no depth information, and both the scene and the foreground are 2D, this did not work even when I approximated the dimensions of a person.

Refinement: The resulting cutout of the person from the neural network was not perfect. The neck area in particular suffered from poor bounding, as the network would include a large chunk of surrounding background there. To overcome this, I attempted to refine the cutout using an algorithmic approach.

The algorithmic approach uses the idea that the person is distinct from the background.

We generate a weighted graph of the image in which each pixel is a vertex, neighbouring pixels are joined by an edge, and the weight of each edge is the difference between a function of the colours of the two pixels it joins. We then run a variation of Kruskal's algorithm that adds an edge if and only if its weight is below a predefined threshold. This yields multiple trees, each with the property that all of its vertices have a similar colour. Here we assume that the background has a uniform colour scheme, so that we can then filter out the background pixels using a threshold defined as a function of the weights in the largest tree.
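The tree-building step amounts to a union-find pass over the pixel grid. A small sketch on a greyscale image with a fixed threshold (deriving the threshold automatically was the hard part, as noted below):

```python
import numpy as np

class DSU:
    """Disjoint-set (union-find) over pixel indices."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def colour_components(grey, threshold):
    """Join neighbouring pixels whose intensity difference is below the
    threshold; each resulting tree is a region of similar colour."""
    h, w = grey.shape
    dsu = DSU(h * w)
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w and abs(int(grey[y, x]) - int(grey[y, x + 1])) < threshold:
                dsu.union(i, i + 1)
            if y + 1 < h and abs(int(grey[y, x]) - int(grey[y + 1, x])) < threshold:
                dsu.union(i, i + w)
    return np.array([dsu.find(i) for i in range(h * w)]).reshape(h, w)
```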

This worked extremely well when we used a test case of a person against a bright green screen. However, in practice the two assumptions made (the background being distinct from the person and the background having a uniform colour scheme) do not hold, and thus the algorithm fails to extract anything useful. The threshold is also difficult to compute - a manually assigned threshold worked very well for the green screen case, but generating the threshold from the weights themselves proved to be unsuccessful.

## Designing a New State-of-the-art CNN to Segment Humans

Having had about nine weeks of hands-on experience with machine learning models and some background reading of various papers on the area, I started designing and developing a machine learning model with a goal to outperform MaskRCNN on segmenting humans from the image (a subset of the task that MaskRCNN was designed to do).

## Building the training set

The first task in creating the model was to clean up the dataset that I was using. The data fed in for predictions (supplied by an end user) would be a single three-channel (RGB) image containing a human figure.

The MSCOCO dataset contains a large number of images from 80 categories. This was a superset of all the images I needed to train on. Due to processing and time constraints, it was necessary to filter out unrelated images.

I used the following pipeline process for working with this dataset:

1. Use the Python COCOAPI to match labels to images and extract an image list of only those that contained labelled human figures.
2. Use Pillow to determine the number of channels in each image, discarding those that have fewer than three channels. In the case of four-channel images, use OpenCV to flatten the alpha channel.
3. After a few trial runs of the model below, also filter out images that do not contain a prominently placed person (measured as roughly 20% of the total image area).
4. Create a DataGenerator class (a subclass of keras.utils.Sequence) which acts as a generator, yielding processed images and masks to the consumer in a random order. The processing resizes all images and masks to a consistent size (512x512) and applies random transformations: horizontal flips, random crops, Gaussian blurs, contrast normalisation, Gaussian noise, random per-channel colour adjustments, and affine transforms such as scale, rotation and shear.
5. Plot the images and masks using matplotlib. This helps identify any processing errors made in the above steps.
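A stripped-down sketch of step 4 is below. It only implements resizing and the random horizontal flip (the real pipeline also applied crops, blurs, noise, contrast and affine transforms), and it falls back to a stub base class so the sketch runs even without Keras installed:

```python
import numpy as np

try:
    from keras.utils import Sequence
except ImportError:  # stub so the sketch runs without Keras
    class Sequence:
        pass

def _resize(arr, size):
    """Nearest-neighbour resize to (size, size) via index selection."""
    ys = np.arange(size) * arr.shape[0] // size
    xs = np.arange(size) * arr.shape[1] // size
    return arr[ys][:, xs]

class DataGenerator(Sequence):
    """Yields batches of (images, masks) in a random order, resized to
    a consistent size, with a random horizontal flip applied to both."""

    def __init__(self, images, masks, batch_size=4, size=512, seed=0):
        self.images, self.masks = images, masks
        self.batch_size, self.size = batch_size, size
        self.rng = np.random.default_rng(seed)
        self.order = self.rng.permutation(len(images))

    def __len__(self):
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.order[idx * self.batch_size:(idx + 1) * self.batch_size]
        xs, ys = [], []
        for i in batch:
            img = _resize(self.images[i], self.size)
            msk = _resize(self.masks[i], self.size)
            if self.rng.random() < 0.5:  # random horizontal flip
                img, msk = img[:, ::-1], msk[:, ::-1]
            xs.append(img)
            ys.append(msk)
        return np.stack(xs), np.stack(ys)

    def on_epoch_end(self):
        # reshuffle so each epoch sees the images in a new order
        self.order = self.rng.permutation(len(self.images))
```

The key point of the Sequence interface is that Keras can pull batches by index (and in parallel), while the generator owns all the per-batch randomisation.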

## Model design and outcomes

The model I designed was based on the U-Net architecture, chosen for its simplicity and prediction speed. The model uses a series of convolutional, skip, max-pooling and up-sampling layers to generate a per-pixel probability map. The architecture is shown in the diagram below:

The number of filters in the convolutional layers is constant along each row of the diagram, and doubles each time the image size is halved.

My loss function for this model summed the binary cross-entropy between the prediction and the mask with a Dice term. The Dice coefficient is very similar to the Mean Intersection over Union (MIoU) measure used in a wide variety of papers, but penalises a single bad output less harshly than MIoU would.
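In NumPy the combination looks roughly like this. Note this is a sketch, not the project's Keras implementation, and it uses the common formulation where the summed term is (1 − Dice), so that better overlap lowers the loss:

```python
import numpy as np

def bce_dice_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy plus a Dice term.
    `y_true` is a 0/1 mask, `y_pred` a per-pixel probability map."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    dice = (2 * np.sum(y_true * y_pred) + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
    return bce + (1 - dice)  # Dice coefficient turned into a loss
```

Because the Dice term is a ratio over the whole mask, one badly predicted pixel shifts it far less than it shifts a hard intersection-over-union count, which is the "penalises a single bad output less" property mentioned above.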

I also included an early-stopping mechanism to prevent overfitting: training stops if the loss does not decrease by at least 0.0001 over two epochs.
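That rule can be expressed as a small standalone function (in Keras the equivalent is an `EarlyStopping` callback with `min_delta=0.0001` and `patience=2`; this sketch just makes the logic explicit):

```python
def should_stop(losses, min_delta=1e-4, patience=2):
    """Return True when the last `patience` epoch losses have all
    failed to improve on the earlier best loss by at least `min_delta`."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return all(l > best_before - min_delta for l in losses[-patience:])
```

After each epoch the training loop appends the epoch loss and stops as soon as `should_stop` returns True.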

The results were as follows:

| Model parameters | Dice coefficient |
| --- | --- |
| 512x512, starting at 4 filters | 0.53 |
| 512x512, starting at 8 filters | 0.59 |
| 512x512, 32 filters | 0.67 |
| 512x512, 64 filters | N/A (out of memory) |
| 512x512, 32 filters, dropout | 0.73 |
| 1024x1024, 16 filters | 0.65 |
| 1024x1024, 16 filters, dropout | 0.69 |
| 256x256, 64 filters | 0.64 |
| 256x256, 64 filters, dropout | 0.71 |

Note that results may differ on future runs of the same architecture due to the random transformations and random image order. These are single-run results, not averages.

## Thoughts and lessons learned

Although some of the models achieved good Dice coefficients, all of their results shared a fundamental problem: while the outline of the person was reasonably clear, a large gap was missing from within the person's body. The position of this gap varied with the properties of each image. This made the model unsuitable for the project as it stood, since realism dictates that it is better to include slightly too much of the person and their background than to leave a gaping hole in their body.

A few things I'd do differently if I were to design a neural network model for a similar project:

• Don't train on the test set! Despite this being one of the core mantras of developing models, it is far too easy to accidentally let features of the test set be implicitly learned by the model. This can happen when you play around with hyperparameters and change them in whatever direction increases the test-set score. I realised this was happening a little too late in model creation, and it caused a lot of overfitting.
• Select transformations that are appropriate to the use case. It is unlikely you will ever be supplied with an image of a rotated and distorted human (although if you are, the more generalised model will be able to cope somewhat). This trades generalisability for performance in a specific case, which goes slightly against the mantra of not overtraining a model; it all comes down to the use case.
• Try a considerably larger number of architectures. Designing these models involves a fair amount of guesswork over hyperparameters and layers, so it is important to try several architectures to see which captures the data best.
• Build a custom dataset if possible. This is probably the most difficult task: creating a sufficiently large dataset takes time and resources that are not always available. The issue with the MSCOCO dataset is that its images and masks have jagged areas around the neck, and those jagged areas were a primary reason for attempting to build a corrective model in the first place. That premise is flawed, however, when the model is learning from data that exhibits the very feature it is meant to fix.