CN110070483B - Portrait cartoonization method based on a generative adversarial network - Google Patents
Portrait cartoonization method based on a generative adversarial network
- Publication number
- CN110070483B (application CN201910235651.1A)
- Authority
- CN
- China
- Prior art keywords
- cartoon
- network
- face
- image
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/04—Context-preserving transformations, e.g. by using an importance map
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a portrait cartoonization method based on a generative adversarial network. A generative network separates the face from the background, converts the face into a cartoon face and the background into a cartoon background, and composites the two into a cartoon image; a discriminative network then judges the cartoon image. The generative and discriminative networks are trained through loss functions, and finally the face image to be processed is input into the trained generative network to generate the corresponding cartoon image. The invention makes it possible to generate a portrait cartoon fully automatically from an input face photo, or to provide a recommended cartoon scheme based on the photo a user inputs, so that the user can select or modify the scheme, saving the time the user would otherwise spend selecting and splicing materials.
Description
Technical Field
The invention relates to the field of computer vision, and in particular to a portrait cartoonization method based on a generative adversarial network.
Background
Face cartoonization takes a face photo as input and produces a corresponding cartoon face with the person's characteristics. The technology has broad application scenarios and significance, and in recent years many popular applications such as "face lovely" and "magic diffusion camera" have emerged. However, face cartoonization is a complex problem spanning multiple image fields, and quickly and automatically generating a cartoon with the portrait's characteristics is challenging.
The difficulty of face cartoonization is as follows: many cartoon styles are designed to preserve the hairstyle, face shape and the like while changing structures such as the eyes and nose, removing details such as hair strands and facial skin texture, and keeping the overall color. That is, face cartoonization involves changing part of the structure, removing detail texture and retaining color. This differs greatly from traditional style transfer, so some classical style-transfer algorithms cannot be used on this problem.
In the portrait cartoonization field, commercial products mainly adopt material splicing. Taking "face lovely" as an example, a large number of cartoon material pictures must be drawn in advance by artists, the placement positions of the materials are fixed, and the user must manually select and splice materials to form a cartoon. Such commercial software therefore cannot generate portrait cartoons automatically, and the effect is limited by the fixed materials, so the generated cartoon often does not resemble the input face.
In the research field, a large share of the fully automatic cartoon-generation methods can be classified as component-based: they cut out structures such as the facial features and hair, match the closest materials in a material library, and fuse the materials into a cartoon avatar through deformation and other processing. Such methods first require preparing a large amount of hair, face and facial-feature material, which takes considerable time and expertise. The resulting cartoon avatars look stiff, and the methods require the structures of the cartoon-face parts in the material library to be similar to those of the input face.
In recent years, generative adversarial networks (GANs) have gained great attention in cross-domain image transformation and provide a new idea for image cartoonization. At present there is little research on face cartoonization with GANs, and the existing problems are: (1) studies that cartoonize with GANs often use standard CycleGAN or its variants; CycleGAN works well in scenes such as converting horses into zebra pictures or daytime into night pictures, but performs poorly when the structural difference between the two domains is large, as when a face becomes a cartoon; (2) many GAN-based face-cartoonization efforts can only generate something that merely resembles a cartoon avatar: personal characteristics are lost, and the hairstyle, face shape and colors are poorly preserved.
One existing method is the Domain Transfer Network (DTN). For images of the two domains (face photos and cartoon face images), the generative model G contains two networks: a pre-trained encoder f and a decoder g. A face photo x is fed into f, which extracts a code f(x) carrying the high-level semantic information of the input image, such as hairstyle and color. The decoder g is responsible for decoding this code into a cartoon. The discriminative model D judges whether an input picture is a cartoon sample from the dataset or one produced by the generative model. Unlike a conventional GAN, the network also feeds cartoon avatar pictures y into the generative model, producing a generated cartoon G(y), and a loss function constrains G(y) to be consistent with the input cartoon y. In addition, a loss function constrains f(G(x)) to be consistent with f(x), ensuring that the cartoon G(x) generated from the face photo x keeps the high-level semantic features of x.
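For concreteness, the two DTN consistency terms described above can be written compactly. The following is a minimal sketch, assuming f and G are callable networks (e.g. PyTorch modules) and using mean-squared error as the distance; the distance choice is an illustrative assumption, not a statement of DTN's exact formulation:

```python
import torch.nn.functional as F

def dtn_consistency_losses(f, G, x, y):
    # Constancy term: the code of the generated cartoon should match
    # the code of the input face, f(G(x)) ≈ f(x).
    l_const = F.mse_loss(f(G(x)), f(x))
    # Identity term: a real cartoon should be (nearly) a fixed point
    # of the generator, G(y) ≈ y.
    l_tid = F.mse_loss(G(y), y)
    return l_const, l_tid
```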
The disadvantages of this method are as follows:
As can be seen from the DTN design, the same encoder f extracts the high-level semantic information of images from both domains, which requires that the two domains not differ too much. Experiments show that when the images of the two domains differ greatly, i.e. the cartoon style deviates far from the photo style, the method cannot achieve a satisfactory effect.
In addition, the generated image is only constrained to agree with the input face image in high-level semantic information, so the supervision on the generated image is very weak: its color, face shape, hairstyle and the like cannot always be kept consistent with the input face image, i.e. the personalization effect is not strong.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing methods and provides a portrait cartoonization method based on a generative adversarial network. The problems solved by the invention are mainly two. First, maintaining only high-level semantic consistency makes the generated cartoon avatar lose personalized features, failing to preserve lower-level semantic information such as hairstyle, face shape, hair color and skin color. Second, existing work handles the three subtasks of segmenting the face, converting the face into a cartoon face, and converting the background into the background of the cartoon image with the same network, so the quality of the generated cartoon is poor.
In order to solve the above problems, the present invention provides a portrait cartoonization method based on a generative adversarial network, the method comprising:
step one, acquiring a face data training set and a cartoon data training set;
step two, preprocessing the face data training set to obtain a hair mask, a face mask and a facial-features mask;
step three, constructing a generative network, converting face images in the face data training set into cartoon images and converting cartoon images in the cartoon data training set into face images;
step four, constructing a discriminative network to judge the converted cartoon images and converted face images respectively;
step five, calculating loss function values and optimizing the generative network and the discriminative network according to the masks generated in step two, the face and cartoon images generated in step three and the discrimination results obtained in step four;
step six, repeating step three to step five, iterating for a number of rounds to obtain a trained cartoon-image generative network;
and step seven, inputting the face image to be processed into the finally obtained cartoon generative network to obtain a corresponding cartoon image with personal characteristics.
Preferably, converting the face images in the face data training set into cartoon images specifically comprises:
acquiring a foreground mask using a segmentation network;
encoding the face image using an encoding network;
decoding the face-image code using a background decoding network to obtain the background;
decoding the face-image code using a foreground decoding network to obtain the foreground;
and combining the foreground mask, the background and the foreground to obtain the generated cartoon image, as sketched below.
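A minimal PyTorch-style sketch of this composition is given below. The module names follow the description (G1attn, e1, d1cnt, d1bg), but their internal architectures are placeholders rather than the layers prescribed by the invention:

```python
import torch
from torch import nn

def generate_cartoon(g1attn: nn.Module, e1: nn.Module,
                     d1cnt: nn.Module, d1bg: nn.Module,
                     x: torch.Tensor) -> torch.Tensor:
    """Blend the foreground and background decoder outputs with the
    soft foreground mask: G1(x) = m*d1cnt(e1(x)) + (1-m)*d1bg(e1(x))."""
    m = g1attn(x)           # (N, 1, H, W) foreground mask in [0, 1]
    code = e1(x)            # high-level semantic code e1(x)
    return m * d1cnt(code) + (1.0 - m) * d1bg(code)
```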
Preferably, judging the converted cartoon image specifically comprises:
feeding the generated cartoon image, together with cartoon images sampled from the cartoon data training set, into a cartoon original-image discrimination network, which judges whether the input image is a real (non-generated) cartoon sample;
and obtaining edge maps of the generated cartoon image and of the cartoon images sampled from the cartoon data training set, and judging the edge maps with an edge discrimination network.
The invention provides a portrait cartoonization method based on a generative adversarial network, which solves two problems common in prior GAN-based face-cartoonization work: low-level semantic information is not preserved, and the generated cartoon avatars are of poor quality. The method makes it possible to generate a portrait cartoon automatically from an input face photo, or to provide a recommended cartoon scheme based on the photo a user inputs, so that the user can select or modify the scheme, saving the time the user would otherwise spend selecting and splicing materials.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention, and other drawings can be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a general flow chart of the portrait cartoonization method according to an embodiment of the present invention;
FIG. 2 is a block diagram of the generative network of an embodiment of the present invention;
FIG. 3 is a block diagram of the discriminative network of an embodiment of the present invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Fig. 1 is a general flow chart of the portrait cartoonization method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
S1, acquiring a face data training set and a cartoon data training set;
S2, preprocessing the face data training set to obtain a hair mask, a face mask and a facial-features mask;
S3, constructing a generative network, converting face images in the face data training set into cartoon images and converting cartoon images in the cartoon data training set into face images;
S4, constructing a discriminative network to judge the converted cartoon images and converted face images respectively;
S5, calculating loss function values and optimizing the generative network and the discriminative network according to the masks generated in S2, the face and cartoon images generated in S3 and the discrimination results obtained in S4;
S6, repeating S3 to S5, iterating for a number of rounds to obtain a trained cartoon-image generative network;
S7, inputting the face image to be processed into the finally obtained cartoon generative network to obtain a corresponding cartoon image with personal characteristics.
Step S1 is specifically as follows:
Aligned face data are obtained from the public aligned face dataset CelebA to construct the face training set, and cartoon avatar pictures obtained by a web crawler or from a public dataset serve as the cartoon training set; this embodiment uses the cartoon dataset released by Google.
Step S2 is specifically as follows:
S2-1, the input face picture is fed into a pre-trained semantic segmentation network to obtain rough hair and face masks. The pre-trained network in this embodiment uses a semantic segmentation model proposed by Google, with training data from a public human-part parsing dataset;
S2-2, facial feature points of the input face are extracted with an active shape model (Active Shape Model, ASM), and the convex hulls of the eyebrow, eye, nose and mouth feature points are computed respectively as the masks of those features;
S2-3, because many cartoon styles deform the eyes, nose, mouth and similar regions considerably relative to a real face, the convex hull of the eye, nose and mouth feature points is computed to obtain the face-structure-change area, denoted Aface1; the area of the rough face mask from S2-1 with Aface1 removed is denoted Aface2, which is usually the part whose structure need not change but whose color should be preserved; finally, the hair area obtained in S2-1 is denoted Ahair. A sketch of this mask construction follows.
Step S3 is specifically as follows:
The generative model contains two symmetric network structures: a generative network that converts the face into a personalized cartoon image, and a generative network that converts the cartoon image into a face. The following mainly describes the generation flow from a face image to a cartoon image; the network structure is shown in fig. 2. Obtaining the output personalized cartoon image from the input face image x requires the following steps:
S3-1, the input face image is denoted x, and the cartoon image used when training the GAN is denoted y. x is the input of a segmentation network G1attn that segments the foreground to obtain a foreground mask. The output G1attn(x) is a mask whose size matches the input picture, with 1 channel; each pixel takes a value between 0 and 1, where values closer to 0 indicate a higher likelihood that the pixel is background, and values closer to 1 a higher likelihood that it is face or hair. The background mask can therefore be represented as 1-G1attn(x);
S3-2, meanwhile, the input face image x is also fed into a face-feature encoding network (denoted e1) to obtain a high-level semantic information code vector e1(x);
S3-3, the high-level semantic code obtained in S3-2 is used as the input of a cartoon background decoding network (denoted d1bg); d1bg is intended to focus on generating a background similar to the backgrounds in the cartoon dataset, not on generating cartoon faces;
S3-4, the high-level semantic code obtained in S3-2 is used as the input of a cartoon face decoding network (denoted d1cnt); d1cnt is intended to focus on generating the faces in the cartoon dataset, not its backgrounds;
S3-5, the finally generated cartoon image is obtained from the face-and-hair mask and background mask of S3-1 together with the cartoon background and cartoon face generated in S3-3 and S3-4:
G1(x)=G1attn(x)⊙d1cnt(e1(x))+(1-G1attn(x))⊙d1bg(e1(x))
where ⊙ denotes pixel-wise multiplication;
s3-6, the structure of the discrimination model from cartoon to face is similar, and the generated model G2 comprises a segmentation network G2attn, a coding network e2, a background decoding network d2bg and a face decoding network d2cnt, which are not described in detail herein.
Step S4, as shown in fig. 3, is specifically as follows:
S4-1, the generated cartoon image obtained in S3-5 and cartoon images sampled from the cartoon data training set are fed into a cartoon original-image discrimination network D1, which judges whether the input image is a real (non-generated) cartoon sample.
S4-2, in addition, the colors of the generated cartoons are constrained by the color distribution of the cartoon dataset: because the discrimination network classifies pictures partly according to that color distribution, the colors produced by the generative network are pushed toward those of the cartoon dataset, and colors absent from it are hard to generate. Therefore, during training, the influence weight of the cartoon original-image discriminator is reduced and a new cartoon discrimination network D1Edg is added, which takes as input the cartoon edge map produced by an edge-extraction network (EdgeExtraNet). In this embodiment, the EdgeExtraNet network consists of two convolution layers, where the first convolution layer obtains a grayscale map and the other performs edge extraction; its convolution kernel is an edge operator from the image-processing field, with the following parameters:
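The kernel values themselves did not survive the document conversion, so the sketch below substitutes assumed parameters: fixed luma weights for the grayscale layer and a standard 3x3 Laplacian as the edge operator. These are stand-ins, not the patent's published values:

```python
import torch
from torch import nn

class EdgeExtraNet(nn.Module):
    """Two fixed (non-trainable) conv layers: RGB -> grayscale, then an
    edge operator. Kernel values are assumptions (luma weights and a
    3x3 Laplacian), not the parameters given in the specification."""
    def __init__(self):
        super().__init__()
        self.to_gray = nn.Conv2d(3, 1, kernel_size=1, bias=False)
        self.to_gray.weight.data = torch.tensor(
            [[[[0.299]], [[0.587]], [[0.114]]]])
        self.edge = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
        self.edge.weight.data = torch.tensor(
            [[[[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]]]])
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.edge(self.to_gray(x))
```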
s4-3, the face image discriminator network is composed of an original image discriminating network, and unlike the cartoon image discriminator, an edge discriminating network is not needed.
Step S5 is specifically as follows:
S5-1, calculate the color loss. When a person is converted into a cartoon, in order to keep low-level semantic information such as hair color and skin color, the input image x and the generated cartoon obtained in S3-5 are fed into a smoothing network (SmoothNet). In the invention, the smoothing network is simply built from 20 convolution layers with identical parameters, the convolution kernel parameters being as follows:
From the data preprocessing module of S2 we have obtained the rough hair mask Ahair and the face part Aface2 with the nose, eye and mouth areas removed; these two areas need little change in structure or color, so the loss is computed over them (a sketch follows):
L_color = ||SmoothNet(x) ⊙ (Ahair + Aface2) - SmoothNet(G1(x)) ⊙ (Ahair + Aface2)||_1
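In the sketch below, SmoothNet is stood in for by 20 applications of a depthwise box blur (the patent's actual kernel is not reproduced here), the masks are assumed to be {0, 1} tensors broadcastable over the channel dimension, and mean-reduced L1 replaces the strict ||·||_1 norm, the scale difference being absorbed into the loss weight:

```python
import torch
import torch.nn.functional as F

def smooth(x: torch.Tensor, iters: int = 20, k: int = 3) -> torch.Tensor:
    """Stand-in for SmoothNet: 20 identical smoothing convolutions,
    here a k x k depthwise box blur (assumed kernel)."""
    c = x.shape[1]
    kernel = torch.full((c, 1, k, k), 1.0 / (k * k), device=x.device)
    for _ in range(iters):
        x = F.conv2d(x, kernel, padding=k // 2, groups=c)
    return x

def color_loss(x, g1x, ahair, aface2):
    """L_color over the color-preserving regions Ahair + Aface2."""
    m = ahair + aface2
    return F.l1_loss(smooth(x) * m, smooth(g1x) * m)
```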
s5-2, calculating the cycle consistency loss, referring to the CycleGAN, hopefully generating a cartoon graph G1 (x) through G1, and recovering the cartoon graph to x after the cartoon graph G2 is blocked to a human body. In the application of changing a human face into a cartoon, when the human face is changed into a cartoon, that is, x is converted into G1 (x), usually a lot of detail information, especially a human face photo with relatively more hairs and relatively larger background parts, is lost, and when the human face photo is converted into a cartoon picture, the texture of the hair is lost, and a lot of information is lost after the background of the human face photo is converted. Therefore, when the face photo is restored by using G2 (x), the information is almost impossible and unnecessary to be completely restored, and for this purpose, in the cycle that the person becomes cartoon and is restored back to the person, the cycle consistency loss expression of the invention is as follows:
L cyc1 =||x⊙(Aface1+Aface2)-G2(G1(x))⊙(Aface1+Aface2)|| 1 +||SmoothNet(x)⊙Ahair-SmoothNet(G2(G1(x)))⊙Ahair|| 1
for cartoon input diagram y, we pay more attention to the restoration of human face and hair color in this cycle of the person changing to cartoon and then reverting back to person. In the cycle of converting the cartoon into the human and then recovering the cartoon, as the cartoon is usually converted into the human, a certain amount of background information, hair textures and the like are added, and the human face recovery is realized only by deleting the added information, the cycle consistency loss of the CycleGAN can be used:
L cyc2 =||y-G1(G2(y))|| 1
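Both cycle terms in the same sketch style; smooth is the SmoothNet stand-in from the color-loss sketch above, passed in explicitly so the function stays self-contained:

```python
import torch.nn.functional as F

def cycle_losses(x, y, G1, G2, aface1, aface2, ahair, smooth):
    """L_cyc1 (masked person->cartoon->person) and L_cyc2 (plain
    CycleGAN cartoon->person->cartoon)."""
    face = aface1 + aface2
    rec_x = G2(G1(x))                  # person -> cartoon -> person
    l_cyc1 = (F.l1_loss(x * face, rec_x * face)
              + F.l1_loss(smooth(x) * ahair, smooth(rec_x) * ahair))
    l_cyc2 = F.l1_loss(G1(G2(y)), y)   # cartoon -> person -> cartoon
    return l_cyc1, l_cyc2
```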
s5-3, calculating mask circulation loss, wherein the foreground region of the face photo is consistent with the foreground region of the cartoon in the cartoon process, so that the mask circulation loss is defined as follows:
L cycattn1 =||G1attn(x)-G2attn(G1(x))|| 1
mask cycle consistency L during cartoon human change cycattn2 And L is equal to cycattn1 The definition is similar.
S5-4, calculate the mask supervision loss. The rough masks obtained by the preprocessing module of S2 provide a degree of supervision for the predicted mask (both mask losses are sketched below):
L_msksup1 = ||G1attn(x) - (Ahair + Aface2 + Aface1)||_1
Similarly, if a method is available to semantically segment the foreground and background of the cartoon dataset, a mask supervision loss can also be added on the cartoon side, which helps the segmentation network generate correct masks.
S5-5, calculate the GAN adversarial loss. The generative model should produce pictures that can fool the discriminative model, while the discriminative model should correctly distinguish which images were generated. Let X be the face-image distribution and Y the cartoon-image distribution; the optimization target is:
min_{θG1} max_{θD1} E_{y~Y}[log D1(y)] + E_{x~X}[log(1 - D1(G1(x)))]
where θG1 represents the parameters of the G1 model and θD1 the parameters of the D1 model; the corresponding GAN adversarial loss from cartoon to face is similar.
S5-6, calculate the GAN edge-map (gradient-map) adversarial loss. In the person-to-cartoon direction, a cartoon edge-map discriminator is added so that the generator model focuses more on the structure of the cartoon than on its colors; the optimization target is (a sketch of both adversarial losses follows):
min_{θG1} max_{θD1Edg} E_{y~Y}[log D1Edg(EdgeExtraNet(y))] + E_{x~X}[log(1 - D1Edg(EdgeExtraNet(G1(x))))]
where θD1Edg denotes the parameters of the discrimination model D1Edg.
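A sketch of both adversarial objectives in the non-saturating binary-cross-entropy form; the specification states the minimax targets abstractly, so BCE-on-logits and the detach-based split between discriminator and generator updates are implementation assumptions. Passing edge_net=EdgeExtraNet() turns the same function into the edge-map loss for D1Edg:

```python
import torch
import torch.nn.functional as F

def gan_losses(D, G1, x, y, edge_net=None):
    """Returns (d_loss, g_loss) for one discriminator D against G1.
    With edge_net set, real and fake images are first mapped to edge
    maps, as for the D1Edg discriminator."""
    fake = G1(x)
    if edge_net is not None:
        y, fake = edge_net(y), edge_net(fake)
    real_logit = D(y)
    fake_logit = D(fake.detach())      # no generator gradients here
    d_loss = (F.binary_cross_entropy_with_logits(
                  real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(
                  fake_logit, torch.zeros_like(fake_logit)))
    gen_logit = D(fake)                # generator tries to fool D
    g_loss = F.binary_cross_entropy_with_logits(
        gen_logit, torch.ones_like(gen_logit))
    return d_loss, g_loss
```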
S5-7, the final loss function value is a linear combination of the results of S5-1 to S5-6. The generative network is first fixed while the discriminator networks are optimized by back-propagation; then the discriminator networks are fixed while the generative network is optimized. A sketch of one such alternating update follows.
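One such alternating update in sketch form; losses(x, y) is assumed to return the two linear combinations (L_G, L_D) built from the terms of S5-1 to S5-6, and the optimizers (e.g. Adam over the generator-side and discriminator-side parameters respectively) are likewise assumptions:

```python
def train_step(x, y, losses, opt_g, opt_d):
    # 1) generators fixed in effect (fakes are detached in the
    #    discriminator loss): update only the discriminator parameters.
    _, l_d = losses(x, y)
    opt_d.zero_grad()
    l_d.backward()
    opt_d.step()
    # 2) discriminators fixed in effect (opt_g holds only generator
    #    parameters): update the generative networks.
    l_g, _ = losses(x, y)
    opt_g.zero_grad()
    l_g.backward()
    opt_g.step()
```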
Step S6 is specifically as follows:
S3 to S5 are repeated; in this embodiment, 200 rounds of loop iteration yield the trained cartoon-image generative network.
The portrait cartoonization method based on a generative adversarial network of the invention solves two problems common in prior GAN-based face-cartoonization work: low-level semantic information is not preserved, and the generated cartoon avatars are of poor quality. The method makes it possible to generate a portrait cartoon automatically from an input face photo, or to provide a recommended cartoon scheme based on the photo a user inputs, so that the user can select or modify the scheme, saving the time the user would otherwise spend selecting and splicing materials.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk, optical disk, and the like.
The foregoing describes in detail a portrait cartoonization method based on a generative adversarial network. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may vary the specific embodiments and the application scope according to the ideas of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.
Claims (1)
1. A portrait cartoonization method based on a generative adversarial network, characterized in that the method comprises the following steps:
step one, acquiring a face data training set and a cartoon data training set;
step two, preprocessing the face data training set to obtain a hair mask, a face mask and a facial-features mask;
step three, constructing a generative network, converting face images in the face data training set into cartoon images and converting cartoon images in the cartoon data training set into face images;
step four, constructing a discriminative network to judge the converted cartoon images and converted face images respectively;
step five, calculating loss function values and optimizing the generative network and the discriminative network according to the masks generated in step two, the face and cartoon images generated in step three and the discrimination results obtained in step four;
step six, repeating step three to step five, iterating for a number of rounds to obtain a trained cartoon-image generative network;
step seven, inputting the face image to be processed into the finally obtained cartoon generative network to obtain a corresponding cartoon image with personal characteristics;
the constructing of the generative network to convert face images in the face data training set into cartoon images is specifically as follows:
the input face image is denoted x, and the cartoon image used in training the GAN is denoted y; x is the input of a segmentation network G1attn used to segment the foreground and obtain a foreground mask; the output G1attn(x) is a mask whose size is consistent with the size of the input picture, with 1 channel, each pixel taking a value between 0 and 1, where values closer to 0 indicate a higher probability that the pixel is background and values closer to 1 a higher probability that it is face or hair; the background mask is thus denoted 1-G1attn(x);
meanwhile, the face image x is also input into a face-feature encoding network e1 to obtain a high-level semantic information code vector e1(x);
the high-level semantic information code vector e1(x) is used as the input of a cartoon background decoding network d1bg, where d1bg focuses on generating a background similar to the backgrounds in the cartoon dataset rather than on generating cartoon faces;
the high-level semantic information code vector e1(x) is used as the input of a cartoon face decoding network d1cnt, where d1cnt focuses on generating the faces in the cartoon dataset rather than its backgrounds;
the finally generated cartoon image is obtained from the face-and-hair mask and background mask given by G1attn(x) together with the cartoon background and cartoon face generated by d1bg and d1cnt:
G1(x)=G1attn(x)⊙d1cnt(e1(x))+(1-G1attn(x))⊙d1bg(e1(x));
where ⊙ denotes pixel-wise multiplication;
the constructing of the discriminative network to respectively judge the converted cartoon image and the converted face image is specifically as follows:
the generated cartoon image and cartoon images sampled from the cartoon data training set are fed into a cartoon original-image discrimination network D1, which judges whether the input image is a real (non-generated) cartoon sample;
during training, the influence weight of the cartoon original-image discrimination model is reduced and a new cartoon discrimination network D1Edg is added, which takes as input the cartoon edge map produced by the edge-extraction network EdgeExtraNet; the EdgeExtraNet network consists of two convolution layers, where the first convolution layer obtains a grayscale map and the other performs edge extraction, its convolution kernel being an edge operator from the image-processing field with the following parameters:
the face-image discriminator consists of an original-image discrimination network only; unlike the cartoon-image discriminator, no edge discrimination network is needed;
the calculating of loss function values and the optimizing of the generative network and the discriminative network are specifically as follows:
calculating the color loss: when a person is converted into a cartoon, in order to keep low-level semantic information, the input image x and the generated cartoon are fed into a smoothing network SmoothNet, built from 20 convolution layers with identical parameters, the convolution kernel parameters being as follows:
from the rough hair mask Ahair obtained by the preprocessing and the face part Aface2 with the nose, eye and mouth regions removed, two areas whose structure and color need little change, the loss is computed:
L_color = ||SmoothNet(x) ⊙ (Ahair + Aface2) - SmoothNet(G1(x)) ⊙ (Ahair + Aface2)||_1;
calculating the cycle-consistency loss: in the person-to-cartoon-to-person cycle, the cycle-consistency loss expression is as follows:
L_cyc1 = ||x ⊙ (Aface1 + Aface2) - G2(G1(x)) ⊙ (Aface1 + Aface2)||_1 + ||SmoothNet(x) ⊙ Ahair - SmoothNet(G2(G1(x))) ⊙ Ahair||_1
for a cartoon input image y, in the cartoon-to-person-to-cartoon cycle, the CycleGAN cycle-consistency loss is used:
L_cyc2 = ||y - G1(G2(y))||_1;
calculating the mask cycle loss: to keep the foreground region of the face photo consistent with the foreground region of the generated cartoon, the mask cycle loss is defined as follows:
L_cycattn1 = ||G1attn(x) - G2attn(G1(x))||_1;
the mask cycle-consistency loss L_cycattn2 for the cartoon-to-person cycle is defined similarly to L_cycattn1;
calculating the mask supervision loss:
L_msksup1 = ||G1attn(x) - (Ahair + Aface2 + Aface1)||_1;
calculating the GAN adversarial loss: the generative model should generate pictures that can fool the discriminative model, while the discriminative model should correctly distinguish which images were generated; let X be the face-image distribution and Y the cartoon-image distribution; the optimization target is:
min_{θG1} max_{θD1} E_{y~Y}[log D1(y)] + E_{x~X}[log(1 - D1(G1(x)))]
where θG1 represents the parameters of the G1 model and θD1 the parameters of the D1 model, and the corresponding GAN adversarial loss from cartoon to face is similar;
calculating the GAN edge-map (gradient-map) adversarial loss: in the person-to-cartoon direction, a cartoon edge-map discriminator is added so that the generator model focuses more on the structure of the cartoon than on its colors; the optimization target is:
min_{θG1} max_{θD1Edg} E_{y~Y}[log D1Edg(EdgeExtraNet(y))] + E_{x~X}[log(1 - D1Edg(EdgeExtraNet(G1(x))))]
where θD1Edg denotes the parameters of the discrimination model D1Edg;
the final loss function value is a linear combination of all the loss results; the generative network is first fixed and the discriminator networks are optimized by back-propagation, and then the discriminator networks are fixed and the generative network is optimized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910235651.1A CN110070483B (en) | 2019-03-26 | 2019-03-26 | Portrait cartoonization method based on a generative adversarial network
Publications (2)
Publication Number | Publication Date |
---|---|
CN110070483A CN110070483A (en) | 2019-07-30 |
CN110070483B true CN110070483B (en) | 2023-10-20 |
Family
ID=67366782
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910235651.1A Active CN110070483B (en) | 2019-03-26 | 2019-03-26 | Portrait cartoon method based on generation type countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110070483B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112419328B (en) * | 2019-08-22 | 2023-08-04 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110517200B (en) * | 2019-08-28 | 2022-04-12 | 厦门美图之家科技有限公司 | Method, device and equipment for obtaining facial sketch and storage medium |
CN110796593A (en) * | 2019-10-15 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Image processing method, device, medium and electronic equipment based on artificial intelligence |
CN110852942B (en) * | 2019-11-19 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Model training method, and media information synthesis method and device |
CN111046763B (en) * | 2019-11-29 | 2024-03-29 | 广州久邦世纪科技有限公司 | Portrait cartoon method and device |
CN111275784B (en) * | 2020-01-20 | 2023-06-13 | 北京百度网讯科技有限公司 | Method and device for generating image |
CN111260545B (en) * | 2020-01-20 | 2023-06-20 | 北京百度网讯科技有限公司 | Method and device for generating image |
CN111368796B (en) * | 2020-03-20 | 2024-03-08 | 北京达佳互联信息技术有限公司 | Face image processing method and device, electronic equipment and storage medium |
CN111833240B (en) * | 2020-06-03 | 2023-07-25 | 北京百度网讯科技有限公司 | Face image conversion method and device, electronic equipment and storage medium |
CN112001939B (en) | 2020-08-10 | 2021-03-16 | 浙江大学 | Image foreground segmentation algorithm based on edge knowledge conversion |
CN112102153B (en) * | 2020-08-20 | 2023-08-01 | 北京百度网讯科技有限公司 | Image cartoon processing method and device, electronic equipment and storage medium |
CN112381709B (en) * | 2020-11-13 | 2022-06-21 | 北京字节跳动网络技术有限公司 | Image processing method, model training method, device, equipment and medium |
CN112508991B (en) * | 2020-11-23 | 2022-05-10 | 电子科技大学 | Panda photo cartoon method with separated foreground and background |
WO2022116161A1 (en) * | 2020-12-04 | 2022-06-09 | 深圳市优必选科技股份有限公司 | Portrait cartooning method, robot, and storage medium |
CN112529978B (en) * | 2020-12-07 | 2022-10-14 | 四川大学 | Man-machine interactive abstract picture generation method |
CN112581358B (en) * | 2020-12-17 | 2023-09-26 | 北京达佳互联信息技术有限公司 | Training method of image processing model, image processing method and device |
CN112561786A (en) * | 2020-12-22 | 2021-03-26 | 作业帮教育科技(北京)有限公司 | Online live broadcast method and device based on image cartoonization and electronic equipment |
CN112308770B (en) * | 2020-12-29 | 2021-03-30 | 北京世纪好未来教育科技有限公司 | Portrait conversion model generation method and portrait conversion method |
CN112907708B (en) * | 2021-02-05 | 2023-09-19 | 深圳瀚维智能医疗科技有限公司 | Face cartoon method, equipment and computer storage medium |
CN113222058B (en) * | 2021-05-28 | 2024-05-10 | 芯算一体(深圳)科技有限公司 | Image classification method, device, electronic equipment and storage medium |
CN113570689B (en) * | 2021-07-28 | 2024-03-01 | 杭州网易云音乐科技有限公司 | Portrait cartoon method, device, medium and computing equipment |
CN113838159B (en) * | 2021-09-14 | 2023-08-04 | 上海任意门科技有限公司 | Method, computing device and storage medium for generating cartoon images |
CN114170065B (en) * | 2021-10-21 | 2024-08-02 | 河南科技大学 | Cartoon loss-based cartoon-like method for generating countermeasure network |
CN113822798B (en) * | 2021-11-25 | 2022-02-18 | 北京市商汤科技开发有限公司 | Method and device for training generation countermeasure network, electronic equipment and storage medium |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08251404A (en) * | 1995-03-13 | 1996-09-27 | Minolta Co Ltd | Method and device for discriminating attribute of image area |
CN1458791A (en) * | 2002-04-25 | 2003-11-26 | 微软公司 | Sectioned layered image system |
WO2015014131A1 (en) * | 2013-08-02 | 2015-02-05 | 成都品果科技有限公司 | Method for converting picture into cartoon |
CN107577985A (en) * | 2017-07-18 | 2018-01-12 | 南京邮电大学 | The implementation method of the face head portrait cartooning of confrontation network is generated based on circulation |
CN108596839A (en) * | 2018-03-22 | 2018-09-28 | 中山大学 | A kind of human-face cartoon generation method and its device based on deep learning |
CN109377448A (en) * | 2018-05-20 | 2019-02-22 | 北京工业大学 | A kind of facial image restorative procedure based on generation confrontation network |
CN109376582A (en) * | 2018-09-04 | 2019-02-22 | 电子科技大学 | A kind of interactive human face cartoon method based on generation confrontation network |
Non-Patent Citations (1)
Title |
---|
Auto-painter: Cartoon image generation from sketch by using conditional Wasserstein generative adversarial networks; Yifan Liu et al.; Neurocomputing; 22 May 2018; vol. 311; full text *
Also Published As
Publication number | Publication date |
---|---|
CN110070483A (en) | 2019-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110070483B (en) | Portrait cartoonization method based on a generative adversarial network | |
CN111489287B (en) | Image conversion method, device, computer equipment and storage medium | |
Dolhansky et al. | Eye in-painting with exemplar generative adversarial networks | |
Wang et al. | Nerf-art: Text-driven neural radiance fields stylization | |
Shi et al. | Warpgan: Automatic caricature generation | |
Hou et al. | Improving variational autoencoder with deep feature consistent and generative adversarial training | |
Upchurch et al. | Deep feature interpolation for image content changes | |
WO2022078041A1 (en) | Occlusion detection model training method and facial image beautification method | |
CN110852941B (en) | Neural network-based two-dimensional virtual fitting method | |
CN111862294B (en) | Hand-painted 3D building automatic coloring network device and method based on ArcGAN network | |
CN111127309B (en) | Portrait style migration model training method, portrait style migration method and device | |
CN113705290A (en) | Image processing method, image processing device, computer equipment and storage medium | |
WO2024109374A1 (en) | Training method and apparatus for face swapping model, and device, storage medium and program product | |
Yuan et al. | Line art colorization with concatenated spatial attention | |
Di et al. | Facial synthesis from visual attributes via sketch using multiscale generators | |
Huang et al. | Real-world automatic makeup via identity preservation makeup net | |
Li et al. | Detailed 3D human body reconstruction from multi-view images combining voxel super-resolution and learned implicit representation | |
CN111612687B (en) | Automatic makeup method for face image | |
CN112825188A (en) | Occlusion face completion algorithm for generating confrontation network based on deep convolution | |
Peng et al. | Difffacesketch: High-fidelity face image synthesis with sketch-guided latent diffusion model | |
Li et al. | High-quality face sketch synthesis via geometric normalization and regularization | |
CN112862672B (en) | Bangs (hair fringe) generation method, device, computer equipment and storage medium | |
Kim et al. | Game effect sprite generation with minimal data via conditional GAN | |
CN113947520A (en) | Method for realizing face makeup conversion based on generation of confrontation network | |
CN118015110A (en) | Face image generation method and device, computer readable storage medium and terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |