[QUESTION] Getting tools/preprocess_data.py to work is painful #892

Open
sambar1729 opened this issue Jun 26, 2024 · 0 comments

sambar1729 commented Jun 26, 2024

Your question
Can tools/preprocess_data.py be simplified?

I am running the following command:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab-file gpt2-vocab.json \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod
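
For reference, the `--input` file is, as far as I understand it, expected to be loose JSON: one JSON object per line, with the document text under a `text` key (the default for the script's `--json-keys` option). A minimal sketch for producing such a file follows. The helper name `write_corpus` and the sample sentences are made up for illustration and are not part of the repo.

```python
import json
from pathlib import Path


def write_corpus(path, texts):
    """Write one JSON object per line, each holding a single "text" field.

    Hypothetical helper: it only illustrates the assumed input layout
    (JSON Lines with a "text" key) and is not taken from Megatron-LM.
    """
    with Path(path).open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each sample stays on one line.
            f.write(json.dumps({"text": text}) + "\n")
    return path
```

Calling `write_corpus("my-corpus.json", ["First sample document.", "Second sample document."])` would produce a two-line file that can be passed to `--input`.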

Right now, the script requires nltk, torch, transformer_engine, and apex.
Installing transformer_engine does not work out of the box; I had to install it manually (on an A100).
Installing apex has similar problems when following https://github.com/NVIDIA/apex?tab=readme-ov-file#linux

Given that the repo does not ship any sample .idx and .bin files, one would expect running preprocess_data.py to be relatively simple. Could this process be simplified?
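
To see which of the four dependencies are actually visible in the current environment before running the script, a quick check like the one below may help. This is only a sketch: the package list is taken from the description above, the function name `missing_packages` is made up, and `find_spec` only tells you whether a package can be located, not whether its compiled extensions load correctly.

```python
import importlib.util

# Packages reported above as required by tools/preprocess_data.py.
REQUIRED = ["nltk", "torch", "transformer_engine", "apex"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located on the import path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Treat lookup errors the same as "not installed".
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing:", missing_packages() or "none")
```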

Installing apex

$ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

gives the following output (truncated):

....
....
  If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
    warnings.warn(
  Emitting ninja build file /home/megauser/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/2] c++ -MMD -MF /home/megauser/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp.o.d -pthread -B /home/megauser/.conda/envs/pre/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/megauser/.conda/envs/pre/include -fPIC -O2 -isystem /home/megauser/.conda/envs/pre/include -fPIC -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/TH -I/home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/megauser/.conda/envs/pre/include/python3.10 -c -c /home/megauser/apex/csrc/mlp.cpp -o /home/megauser/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp.o -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=mlp_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /home/megauser/apex/csrc/mlp.cpp: In function ‘std::vector<at::Tensor> mlp_forward(int, int, std::vector<at::Tensor>)’:
  /home/megauser/apex/csrc/mlp.cpp:57:21: warning: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Wsign-compare]
     57 |   for (int i = 0; i < num_layers; i++) {
        |                   ~~^~~~~~~~~~~~
  /home/megauser/apex/csrc/mlp.cpp:64:77: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
     64 |   auto out = at::empty({batch_size, output_features.back()}, inputs[0].type());
        |                                                                             ^
  In file included from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/ATen/core/Tensor.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/ATen/Tensor.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/function_hook.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/cpp_hook.h:2,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/variable.h:6,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/autograd.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/autograd.h:3,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/all.h:7,
                   from /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/torch/extension.h:5,
                   from /home/megauser/apex/csrc/mlp.cpp:1:
  /home/megauser/.conda/envs/pre/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:225:30: note: declared here
    225 |   DeprecatedTypeProperties & type() const {
        |                              ^~~~
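
The output above mentions two environment variables, `TORCH_CUDA_ARCH_LIST` and `MAX_JOBS`. Below is a sketch of setting both before rerunning the same pip command. The value `"8.0"` assumes an A100 (compute capability 8.0), `MAX_JOBS=4` is an arbitrary choice, and the helper names are made up. The excerpt above shows only warnings, so it is not clear that either variable addresses the underlying failure.

```python
import os
import subprocess

# Same pip invocation as quoted in the issue, split into an argument list.
PIP_CMD = [
    "pip", "install", "-v", "--disable-pip-version-check", "--no-cache-dir",
    "--no-build-isolation",
    "--config-settings", "--build-option=--cpp_ext",
    "--config-settings", "--build-option=--cuda_ext",
    "./",
]


def apex_build_env(arch_list="8.0", max_jobs=4, base_env=None):
    """Copy the environment and set the two variables named in the build log.

    arch_list="8.0" is an assumption for an A100; max_jobs=4 is arbitrary.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    env["MAX_JOBS"] = str(max_jobs)
    return env


def build_apex(apex_dir):
    """Run the pip command inside the apex checkout and return its exit code."""
    result = subprocess.run(PIP_CMD, cwd=apex_dir, env=apex_build_env(), check=False)
    return result.returncode
```

`build_apex("/home/megauser/apex")` would then run the install from the checkout directory that appears in the log.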