Character-level seq2seq model for translation and beam search. #56337

Closed
VallabhMahajan1 opened this issue Jun 2, 2022 · 5 comments
Labels: comp:apis, stale, stat:awaiting response, type:docs-feature, type:feature

Comments

VallabhMahajan1 commented Jun 2, 2022

Issue Type

Documentation Feature Request

Source

source

Tensorflow Version

tf 2.8

Custom Code

Yes

OS Platform and Distribution

Colab GPU

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current Behaviour?

I was trying to implement a character-level seq2seq translation model with beam search by following the TensorFlow documentation:

https://colab.research.google.com/github/tensorflow/addons/blob/master/docs/tutorials/networks_seq2seq_nmt.ipynb

For this, I changed only one parameter, char_level=True, in the tf.keras Tokenizer. Training ran without issues, but I get an error at inference time.

Standalone code to reproduce the issue

# Step 3 and Step 4
    def tokenize(self, lang):
        # lang = list of sentences in a language

        # print(len(lang), "example sentence: {}".format(lang[0]))
        lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', char_level=True)
        lang_tokenizer.fit_on_texts(lang)

        ## tf.keras.preprocessing.text.Tokenizer.texts_to_sequences converts a string (w1, w2, ..., wn)
        ## to a list of corresponding integer ids (id_w1, id_w2, ..., id_wn)
        tensor = lang_tokenizer.texts_to_sequences(lang)

        ## tf.keras.preprocessing.sequence.pad_sequences takes a list of integer id sequences
        ## and pads each sequence to the length of the longest sequence in the input
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

        return tensor, lang_tokenizer
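
For context, a minimal sketch (with a made-up two-sentence corpus) of what char_level=True does to the vocabulary: word_index ends up keyed by single characters, so looking up the multi-character token '<start>' raises a KeyError.

import tensorflow as tf

# Hypothetical corpus, preprocessed the same way as in the tutorial.
corpus = ['<start> hace frio <end>', '<start> hola <end>']

tok = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', char_level=True)
tok.fit_on_texts(corpus)

print(sorted(tok.word_index))        # '<OOV>' plus single characters: ' ', '<', '>', 'a', ...
print('<start>' in tok.word_index)   # False -- hence the KeyError in the log below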

Relevant log output

translate(u'hace mucho frio aqui.')

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-19-a7e085e16f9f> in <module>()
----> 1 translate(u'hace mucho frio aqui.')

2 frames
<ipython-input-17-d0c0d138384e> in <listcomp>(.0)
      2   sentence = dataset_creator.preprocess_sentence(sentence)
      3 
----> 4   inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
      5   inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
      6                                                           maxlen=max_length_input,

KeyError: '<start>'
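
A plausible fix for the failing line, sketched under the assumption that inp_lang and max_length_input are the tutorial's inference-time variables: let the same char-level tokenizer encode the sentence instead of looking word tokens up by hand. The decoder's handling of the start/end markers would need the same character-level treatment.

# Sketch only: replaces the manual word_index lookup shown in the traceback.
sentence = dataset_creator.preprocess_sentence(u'hace mucho frio aqui.')
inputs = inp_lang.texts_to_sequences([sentence])   # character ids, matching training
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, maxlen=max_length_input, padding='post')
inputs = tf.convert_to_tensor(inputs)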
tilakrayal (Contributor) commented Jun 3, 2022
Hi @VallabhMahajan1,
I was able to execute the given code without any issues. Kindly find the gist of it here.

Could you share a reproducible code that supports your statement so that the issue can be easily understood? Thank you!

VallabhMahajan1 (Author) commented Jun 3, 2022

I changed only this one parameter, char_level=True, in the tf.keras Tokenizer. The full class is below:

import io
import re
import unicodedata

import tensorflow as tf
from sklearn.model_selection import train_test_split
# download_nmt() is defined earlier in the tutorial notebook.

class NMTDataset:

    def __init__(self, problem_type='en-spa'):
        self.problem_type = 'en-spa'
        self.inp_lang_tokenizer = None
        self.targ_lang_tokenizer = None

    def unicode_to_ascii(self, s):
        return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

    ## Step 1 and Step 2
    def preprocess_sentence(self, w):
        w = self.unicode_to_ascii(w.lower().strip())

        # creating a space between a word and the punctuation following it
        # eg: "he is a boy." => "he is a boy ."
        # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
        w = re.sub(r"([?.!,¿])", r" \1 ", w)
        w = re.sub(r'[" "]+', " ", w)

        # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
        w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

        w = w.strip()

        # adding a start and an end token to the sentence
        # so that the model knows when to start and stop predicting.
        w = '<start> ' + w + ' <end>'
        return w

    def create_dataset(self, path, num_examples):
        # path : path to the spa-eng.txt file
        # num_examples : limit the total number of training examples for faster training
        #                (set num_examples = len(lines) to use the full data)
        lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
        word_pairs = [[self.preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]

        return zip(*word_pairs)

    def convert_list_to_string(self, sentences):
        text = ""
        for s in sentences:
            text += s + " "
        return text

    # Step 3 and Step 4
    def tokenize(self, lang):
        # lang = list of sentences in a language

        print(len(lang), "example sentence: {}".format(lang[0]))
        lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', char_level=True)
        lang_tokenizer.fit_on_texts(lang)

        ## tf.keras.preprocessing.text.Tokenizer.texts_to_sequences converts a string (w1, w2, ..., wn)
        ## to a list of corresponding integer ids (id_w1, id_w2, ..., id_wn)
        tensor = lang_tokenizer.texts_to_sequences(lang)

        ## tf.keras.preprocessing.sequence.pad_sequences takes a list of integer id sequences
        ## and pads each sequence to the length of the longest sequence in the input
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

        return tensor, lang_tokenizer

    def load_dataset(self, path, num_examples=None):
        # creating cleaned input, output pairs
        targ_lang, inp_lang = self.create_dataset(path, num_examples)

        input_tensor, inp_lang_tokenizer = self.tokenize(inp_lang)
        target_tensor, targ_lang_tokenizer = self.tokenize(targ_lang)

        return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

    def call(self, num_examples, BUFFER_SIZE, BATCH_SIZE):
        file_path = download_nmt()
        input_tensor, target_tensor, self.inp_lang_tokenizer, self.targ_lang_tokenizer = self.load_dataset(file_path, num_examples)

        input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

        train_dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train))
        train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

        val_dataset = tf.data.Dataset.from_tensor_slices((input_tensor_val, target_tensor_val))
        val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)

        return train_dataset, val_dataset, self.inp_lang_tokenizer, self.targ_lang_tokenizer
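
For completeness, a hypothetical driver in the shape of the tutorial's own usage (the hyperparameter values here are illustrative):

BUFFER_SIZE = 32000
BATCH_SIZE = 64
num_examples = 30000

dataset_creator = NMTDataset('en-spa')
train_dataset, val_dataset, inp_lang, targ_lang = dataset_creator.call(num_examples, BUFFER_SIZE, BATCH_SIZE)

example_input_batch, example_target_batch = next(iter(train_dataset))
print(example_input_batch.shape, example_target_batch.shape)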

chunduriv (Contributor) commented Jun 10, 2022

@VallabhMahajan1,

The tutorial you linked describes Neural Machine Translation (NMT) with a word-level sequence-to-sequence model using TF Addons. Please refer to this tutorial for a character-level sequence-to-sequence model using TensorFlow.
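
The key difference in the character-level tutorial is that the start and end markers are single characters, so no multi-character token lookup is needed at inference. A minimal sketch of that convention:

# In the Keras character-level seq2seq tutorial, '\t' marks sequence start
# and '\n' marks sequence end -- each marker is itself a single character.
target_text = 'hace frio'
target_text = '\t' + target_text + '\n'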

google-ml-butler bot commented Jun 17, 2022

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler bot commented

Closing as stale. Please reopen if you'd like to work on this further.
