Character-level seq2seq model for translation and beam search. #56337

Closed
VallabhMahajan1 opened this issue Jun 2, 2022 · 5 comments
Labels: comp:apis, stale, stat:awaiting response, type:docs-feature, type:feature

Comments

VallabhMahajan1 commented Jun 2, 2022

Issue Type

Documentation Feature Request

Source

source

Tensorflow Version

tf 2.8

Custom Code

Yes

OS Platform and Distribution

Colab GPU

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current Behaviour?

I was trying to implement a character-level seq2seq translation model with beam search by following the TensorFlow documentation:

https://colab.research.google.com/github/tensorflow/addons/blob/master/docs/tutorials/networks_seq2seq_nmt.ipynb

For this, I changed only one parameter, char_level=True, in the tf.keras Tokenizer. Training ran without issues, but I get an error at inference time.

Standalone code to reproduce the issue

# Step 3 and Step 4
    def tokenize(self, lang):
        # lang = list of sentences in a language

        # print(len(lang), "example sentence: {}".format(lang[0]))
        lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', char_level=True)
        lang_tokenizer.fit_on_texts(lang)

        ## tf.keras.preprocessing.text.Tokenizer.texts_to_sequences converts a string (w1, w2, ..., wn)
        ## to a list of corresponding integer ids (id_w1, id_w2, ..., id_wn)
        tensor = lang_tokenizer.texts_to_sequences(lang)

        ## tf.keras.preprocessing.sequence.pad_sequences takes a list of integer id sequences
        ## and pads each sequence to the length of the longest sequence in the input
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

        return tensor, lang_tokenizer
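
For context, a minimal sketch (with a made-up two-sentence corpus) of what char_level=True does to the vocabulary: word_index ends up keyed by single characters, so looking up the multi-character token '<start>' raises a KeyError.

import tensorflow as tf

# Hypothetical corpus, preprocessed the same way as in the tutorial.
corpus = ['<start> hace frio <end>', '<start> hola <end>']

tok = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', char_level=True)
tok.fit_on_texts(corpus)

print(sorted(tok.word_index))        # '<OOV>' plus single characters: ' ', '<', '>', 'a', ...
print('<start>' in tok.word_index)   # False -- hence the KeyError in the log below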

Relevant log output

translate(u'hace mucho frio aqui.')

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-19-a7e085e16f9f> in <module>()
----> 1 translate(u'hace mucho frio aqui.')

2 frames
<ipython-input-17-d0c0d138384e> in <listcomp>(.0)
      2   sentence = dataset_creator.preprocess_sentence(sentence)
      3 
----> 4   inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
      5   inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
      6                                                           maxlen=max_length_input,

KeyError: '<start>'
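
A plausible fix for the failing line, sketched under the assumption that inp_lang and max_length_input are the tutorial's inference-time variables: let the same char-level tokenizer encode the sentence instead of looking word tokens up by hand. The decoder's handling of the start/end markers would need the same character-level treatment.

# Sketch only: replaces the manual word_index lookup shown in the traceback.
sentence = dataset_creator.preprocess_sentence(u'hace mucho frio aqui.')
inputs = inp_lang.texts_to_sequences([sentence])   # character ids, matching training
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, maxlen=max_length_input, padding='post')
inputs = tf.convert_to_tensor(inputs)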
tilakrayal (Contributor) commented Jun 3, 2022
Hi @VallabhMahajan1,
I was able to execute the given code without any issues. Kindly find the gist of it here.

Could you share a reproducible code that supports your statement so that the issue can be easily understood? Thank you!

VallabhMahajan1 (Author) commented Jun 3, 2022

I changed only this one parameter, char_level=True, in the tf.keras Tokenizer. The full class is below:

import io
import re
import unicodedata

import tensorflow as tf
from sklearn.model_selection import train_test_split
# download_nmt() is defined earlier in the tutorial notebook.

class NMTDataset:

    def __init__(self, problem_type='en-spa'):
        self.problem_type = 'en-spa'
        self.inp_lang_tokenizer = None
        self.targ_lang_tokenizer = None

    def unicode_to_ascii(self, s):
        return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

    ## Step 1 and Step 2
    def preprocess_sentence(self, w):
        w = self.unicode_to_ascii(w.lower().strip())

        # creating a space between a word and the punctuation following it
        # eg: "he is a boy." => "he is a boy ."
        # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
        w = re.sub(r"([?.!,¿])", r" \1 ", w)
        w = re.sub(r'[" "]+', " ", w)

        # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
        w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

        w = w.strip()

        # adding a start and an end token to the sentence
        # so that the model knows when to start and stop predicting.
        w = '<start> ' + w + ' <end>'
        return w

    def create_dataset(self, path, num_examples):
        # path : path to the spa-eng.txt file
        # num_examples : limit the total number of training examples for faster training
        #                (set num_examples = len(lines) to use the full data)
        lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
        word_pairs = [[self.preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]

        return zip(*word_pairs)

    def convert_list_to_string(self, sentences):
        text = ""
        for s in sentences:
            text += s + " "
        return text

    # Step 3 and Step 4
    def tokenize(self, lang):
        # lang = list of sentences in a language

        print(len(lang), "example sentence: {}".format(lang[0]))
        lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', char_level=True)
        lang_tokenizer.fit_on_texts(lang)

        ## tf.keras.preprocessing.text.Tokenizer.texts_to_sequences converts a string (w1, w2, ..., wn)
        ## to a list of corresponding integer ids (id_w1, id_w2, ..., id_wn)
        tensor = lang_tokenizer.texts_to_sequences(lang)

        ## tf.keras.preprocessing.sequence.pad_sequences takes a list of integer id sequences
        ## and pads each sequence to the length of the longest sequence in the input
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

        return tensor, lang_tokenizer

    def load_dataset(self, path, num_examples=None):
        # creating cleaned input, output pairs
        targ_lang, inp_lang = self.create_dataset(path, num_examples)

        input_tensor, inp_lang_tokenizer = self.tokenize(inp_lang)
        target_tensor, targ_lang_tokenizer = self.tokenize(targ_lang)

        return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

    def call(self, num_examples, BUFFER_SIZE, BATCH_SIZE):
        file_path = download_nmt()
        input_tensor, target_tensor, self.inp_lang_tokenizer, self.targ_lang_tokenizer = self.load_dataset(file_path, num_examples)

        input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

        train_dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train))
        train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

        val_dataset = tf.data.Dataset.from_tensor_slices((input_tensor_val, target_tensor_val))
        val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)

        return train_dataset, val_dataset, self.inp_lang_tokenizer, self.targ_lang_tokenizer
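
For completeness, a hypothetical driver in the shape of the tutorial's own usage (the hyperparameter values here are illustrative):

BUFFER_SIZE = 32000
BATCH_SIZE = 64
num_examples = 30000

dataset_creator = NMTDataset('en-spa')
train_dataset, val_dataset, inp_lang, targ_lang = dataset_creator.call(num_examples, BUFFER_SIZE, BATCH_SIZE)

example_input_batch, example_target_batch = next(iter(train_dataset))
print(example_input_batch.shape, example_target_batch.shape)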

chunduriv (Contributor) commented Jun 10, 2022

@VallabhMahajan1,

The tutorial you linked describes Neural Machine Translation (NMT) with a word-level sequence-to-sequence model using TF Addons. Please refer to this tutorial for a character-level sequence-to-sequence model using TensorFlow.
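
The key difference in the character-level tutorial is that the start and end markers are single characters, so no multi-character token lookup is needed at inference. A minimal sketch of that convention:

# In the Keras character-level seq2seq tutorial, '\t' marks sequence start
# and '\n' marks sequence end -- each marker is itself a single character.
target_text = 'hace frio'
target_text = '\t' + target_text + '\n'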

google-ml-butler bot commented Jun 17, 2022

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler bot commented

Closing as stale. Please reopen if you'd like to work on this further.
