Imagine that images could help create texts that buyers can’t refuse: texts that help guarantee higher quality, including 100% consistency and targeting of the right buyers. Well, this may not remain in our imagination for long.
At Esoft, we have a highly skilled research and development team working on image enhancement and text generation. In this article we give a short introduction to our work on Natural Language Generation, carried out primarily by our master’s student, Juan, whose main focus is automating parts of the text generation process done by copywriters.
The future lies in Natural Language Generation
Today, our copywriters are given a set of images, instructions and other sources of information, from which they collect the important features of a given house and write an engaging description of it.
But what about building a model that writes a description of a house automatically?
We already know of systems that caption images accurately, trained on huge datasets with thousands of annotated images. Their architecture is similar to one commonly used for machine translation, which consists of two Recurrent Neural Network (RNN) blocks: the encoder and the decoder. In broad terms, the encoder receives an input that represents a text sequence and generates a fixed-length vector representation, which is then decoded into a text sequence in a different language.
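To make the encoder/decoder idea concrete, here is a minimal sketch in plain Python. It is not the architecture of any particular captioning or translation system: the weights are random rather than trained, and the names and sizes (`HIDDEN_SIZE`, `SRC_VOCAB`, `TGT_VOCAB`, `EOS`) are assumptions made for illustration. It only shows that inputs of different lengths are folded into a vector of the same fixed length, which the decoder then unrolls into an output sequence.

```python
import math
import random

HIDDEN_SIZE = 4   # length of the fixed-size vector (arbitrary for this sketch)
SRC_VOCAB = 6     # hypothetical source vocabulary size
TGT_VOCAB = 5     # hypothetical target vocabulary size
EOS = 0           # hypothetical end-of-sequence token id


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def rnn_step(w_in, w_h, x, h):
    """One recurrent update: h' = tanh(W_in x + W_h h)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(w_in[i], x))
            + sum(w * v for w, v in zip(w_h[i], h))
        )
        for i in range(len(h))
    ]


def encode(tokens, w_in, w_h):
    """Fold a variable-length token sequence into one fixed-length vector."""
    h = [0.0] * HIDDEN_SIZE
    for token in tokens:
        h = rnn_step(w_in, w_h, one_hot(token, SRC_VOCAB), h)
    return h


def decode(context, w_in, w_h, w_out, max_len=10):
    """Greedy decoding: emit the best-scoring token until EOS or max_len."""
    h, output = list(context), []
    for _ in range(max_len):
        scores = [sum(w * v for w, v in zip(row, h)) for row in w_out]
        token = scores.index(max(scores))
        if token == EOS:
            break
        output.append(token)
        h = rnn_step(w_in, w_h, one_hot(token, TGT_VOCAB), h)
    return output


rng = random.Random(0)
enc_in = random_matrix(HIDDEN_SIZE, SRC_VOCAB, rng)
enc_h = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE, rng)
dec_in = random_matrix(HIDDEN_SIZE, TGT_VOCAB, rng)
dec_h = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE, rng)
dec_out = random_matrix(TGT_VOCAB, HIDDEN_SIZE, rng)

short_vec = encode([1, 2], enc_in, enc_h)
long_vec = encode([3, 1, 4, 1, 5, 2], enc_in, enc_h)
translation = decode(long_vec, dec_in, dec_h, dec_out)
```

Because the decoder only ever sees the fixed-length vector, the encoder can be replaced by any model that produces a vector of the same size.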
Image captioning systems benefit from this modular encoder/decoder approach: they plug in a different encoder, this time a convolutional neural network (CNN), and train the CNN encoder and the RNN decoder together. These systems are conceptually simple but really hard to train. For instance, the COCO dataset contains 200K labeled images with 5 captions each, and the encoder was pretrained on ImageNet, a dataset of 14 million manually classified pictures.
Splitting the pipeline into independent modules
In his master’s project, Juan takes a similar approach but splits the pipeline into independent modules. The process comprises three stages.
The first stage receives a floorplan image as input and builds a graph representation. For this step we use two different neural network models: YOLOv3 for object detection and U-Net for image segmentation. For object detection, we have defined an architecture based on YOLOv3-tiny that detects objects in a floorplan such as doors, windows, and texts. The walls are detected through segmentation. We then only need to combine everything to build a graph representation and apply OCR to the text.
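As a rough illustration of the last step, the sketch below builds a room graph from already-detected elements. It makes several simplifying assumptions that are not part of the project description: each room is reduced to an axis-aligned bounding box labeled with its OCR'd name, each door is reduced to a single point, and a door links two rooms when it lies within a small `margin` of both boxes. The names `Box` and `build_graph`, and all coordinates, are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical stand-in for a segmented room)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def expanded(self, margin):
        return Box(self.x1 - margin, self.y1 - margin,
                   self.x2 + margin, self.y2 + margin)

    def contains(self, x, y):
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


def build_graph(rooms, doors, margin=0.5):
    """rooms: {name: Box}, doors: [(x, y)] -> {name: set of connected rooms}."""
    graph = {name: set() for name in rooms}
    for x, y in doors:
        # A door sits on a wall, so it touches the slightly enlarged boxes
        # of the rooms on both sides of that wall.
        touching = [name for name, box in rooms.items()
                    if box.expanded(margin).contains(x, y)]
        for a, b in combinations(touching, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Made-up detections for a small floorplan.
rooms = {
    "kitchen": Box(0, 0, 4, 4),
    "living room": Box(4, 0, 10, 4),
    "bathroom": Box(0, 4, 4, 7),
    "balcony": Box(10, 0, 12, 4),
}
doors = [(4, 2), (10, 2), (2, 4)]

graph = build_graph(rooms, doors)
```

With these inputs, the kitchen ends up connected to the living room and the bathroom, and the living room to the balcony. A real floorplan would need the segmented wall geometry rather than boxes, but the resulting data structure is the same kind of graph.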
The second module receives the graph, which can be enriched with data given by the user (e.g. “the kitchen has been renovated recently”), performs a smart partition, selects the nouns, verbs and adjectives that best represent each partition, and creates a list of meaning representations. These meaning representations are then queued to the writer, which uses computational linguistics to generate the final statements.
The final text contains statements describing the number of rooms or bathrooms, how the different rooms are connected, whether a particular room has access to a balcony or a terrace, and so on.
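The path from a room graph to statements like these can be sketched as follows. This is a deliberately simple stand-in: the writer described above uses computational linguistics, whereas this sketch uses fixed string templates, and the meaning-representation format (plain dictionaries), the names `plan`, `realise`, `TEMPLATES` and `OUTDOOR`, and the example graph are all assumptions made for illustration.

```python
OUTDOOR = {"balcony", "terrace"}

# One fixed template per kind of meaning representation.
TEMPLATES = {
    "count": "The home has {value} {noun}.",
    "connection": "The {subject} connects to the {object}.",
    "access": "The {subject} has access to the {object}.",
    "fact": "The {subject} {text}.",
}


def plan(graph, user_facts=()):
    """Turn a room graph plus user-supplied facts into meaning representations."""
    indoor = sorted(room for room in graph if room not in OUTDOOR)
    mrs = [{"type": "count", "noun": "room", "value": len(indoor)}]
    for room in indoor:
        for other in sorted(graph[room]):
            if other in OUTDOOR:
                mrs.append({"type": "access", "subject": room, "object": other})
            elif room < other:  # emit each indoor connection only once
                mrs.append({"type": "connection", "subject": room, "object": other})
    for subject, text in user_facts:
        mrs.append({"type": "fact", "subject": subject, "text": text})
    return mrs


def realise(mr):
    """Render one meaning representation as a sentence."""
    fields = dict(mr)
    if mr["type"] == "count" and mr["value"] != 1:
        fields["noun"] = mr["noun"] + "s"  # naive English plural
    return TEMPLATES[mr["type"]].format(**fields)


# Hypothetical graph, as the first stage might produce it.
graph = {
    "kitchen": {"living room", "bathroom"},
    "living room": {"kitchen", "balcony"},
    "bathroom": {"kitchen"},
    "balcony": {"living room"},
}
facts = [("kitchen", "has been renovated recently")]

sentences = [realise(mr) for mr in plan(graph, facts)]
```

Keeping the planning step (`plan`) separate from the wording step (`realise`) mirrors the modular split described above: the list of meaning representations is the only interface between the two.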
Written by Juan Francisco Marin Vega
. . .
Connect with Juan here:
- LinkedIn: https://www.linkedin.com/in/hermetico/