[QUESTION] Getting tools/preprocess_data.py to work is painful #892

sambar1729 · 2024-06-26T19:34:58Z

Your question
Can tools/preprocess_data.py be simplified?

I am using:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab-file gpt2-vocab.json \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

Right now, it requires nltk, torch, transformer_engine, and apex.
Installing transformer_engine does not work out of the box -- I had to install it out of box (on an A100).
Installing apex has similar problems when following https://github.com/NVIDIA/apex?tab=readme-ov-file#linux

Given that the repo does not ship any sample .idx and .bin files, one would expect running preprocess_data.py to be relatively simple. Could this process be simplified?
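
To see which of the four packages listed above are actually missing in a given environment before running the script, something like the following may help. It is a minimal sketch: the module list is taken from this issue rather than from the script's imports, and the helper name `missing_modules` is made up for illustration.

```python
import importlib.util

# Modules this issue reports as required by tools/preprocess_data.py.
# (Assumption: taken from the issue text, not read from the script itself.)
REQUIRED = ["nltk", "torch", "transformer_engine", "apex"]


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    missing = []
    for name in names:
        try:
            # find_spec locates a top-level module without importing it.
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_modules(REQUIRED):
        print(f"missing: {name}")
```

This only checks that each package can be found; it does not confirm that compiled extensions (for example the apex CUDA extensions) were built correctly.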

Installing apex

$ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

gives

....
....
  If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
    warnings.warn(
  Emitting ninja build file /home/megauser/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/2] c++ -MMD -MF /home/megauser/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp.o.d -pthread -B /home/megauser/.conda/envs/pre/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/megauser/.conda/envs/pre/include -fPIC -O2 -isystem /home/megauser/.conda/envs/pre/include -fPIC -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/TH -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/megauser/.conda/envs/pre/include/python3.10 -c -c /home/megauser/apex/csrc/mlp.cpp -o /home/megauser/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp.o -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=mlp_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /home/megauser/apex/csrc/mlp.cpp: In function ‘std::vector<at::Tensor> mlp_forward(int, int, std::vector<at::Tensor>)’:
  /home/megauser/apex/csrc/mlp.cpp:57:21: warning: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Wsign-compare]
     57 |   for (int i = 0; i < num_layers; i++) {
        |                   ~~^~~~~~~~~~~~
  /home/megauser/apex/csrc/mlp.cpp:64:77: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
     64 |   auto out = at::empty({batch_size, output_features.back()}, inputs[0].type());
        |                                                                             ^
  In file included from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/ATen/core/Tensor.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/ATen/Tensor.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/function_hook.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/cpp_hook.h:2,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/variable.h:6,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/autograd.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/autograd.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/all.h:7,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/extension.h:5,
                   from /home/megauser/apex/csrc/mlp.cpp:1:
  /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:225:30: note: declared here
    225 |   DeprecatedTypeProperties & type() const {
        |                              ^~~~
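
The log above names two environment variables, TORCH_CUDA_ARCH_LIST and MAX_JOBS. The sketch below shows one way to set them before rerunning the same pip command from the apex checkout. The values are assumptions: "8.0" is the compute capability of an A100, and the worker cap of 4 is arbitrary. The helper names are made up for illustration, and nothing in the log shown so far indicates that these settings fix a failure.

```python
import os
import subprocess

# The same pip invocation as in this issue, split into an argument list.
APEX_INSTALL_CMD = [
    "pip", "install", "-v", "--disable-pip-version-check", "--no-cache-dir",
    "--no-build-isolation",
    "--config-settings", "--build-option=--cpp_ext",
    "--config-settings", "--build-option=--cuda_ext",
    "./",
]


def apex_build_env(arch_list="8.0", max_jobs=4, base_env=None):
    """Copy the environment and set the two variables named in the build log.

    arch_list="8.0" assumes an A100; max_jobs=4 is an arbitrary example value.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    env["MAX_JOBS"] = str(max_jobs)
    return env


def run_apex_install():
    """Run the install command; call this from inside the apex checkout."""
    subprocess.run(APEX_INSTALL_CMD, env=apex_build_env(), check=True)
```

Calling `run_apex_install()` from the apex directory runs the build with the adjusted environment; the function is deliberately not invoked on import.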

