I am using an IntelAI node to train a deep network in PyTorch.
However, I get an out-of-memory error when I run the program. My training data is 1 GB in size and is loaded in batches of 3. I have no way to reduce the memory requirements further. Please help.
10 replies
Hi Jhilik,
Could you please attach a screenshot of the error? Also, please confirm whether you are using Jupyter Hub or a PuTTY/SSH terminal.
Regards,
Anju
This is the error I get. I am using SSH to connect.

Last login: Tue Sep 4 03:54:13 2018 from 10.5.0.7
[u19304@c009 ~]$ source activate en
(en) [u19304@c009 ~]$ cd bum
(en) [u19304@c009 bum]$ python3 bumcpu.py
Traceback (most recent call last):
  File "bumcpu.py", line 211, in <module>
    training_set = DLibdata(train=True)
  File "/home/u19304/bum/loaddata.py", line 46, in __init__
    self.train_data = torch.load('trn.pt')
  File "/home/u19304/.conda/envs/en/lib/python3.6/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/u19304/.conda/envs/en/lib/python3.6/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
  File "/home/u19304/.conda/envs/en/lib/python3.6/site-packages/torch/serialization.py", line 508, in persistent_load
    data_type(size), location)
RuntimeError: $ Torch: not enough memory: you tried to allocate 2GB. Buy new RAM! at /opt/conda/conda-bld/pytorch-cpu_1532576596369/work/aten/src/TH/THGeneral.cpp:204
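The traceback shows the failure happening inside torch.load('trn.pt'), before any batching takes place, so the batch size has no effect on it. A minimal sketch of how one might check, on whichever machine the script runs, how much memory the kernel reports as available compared with the size of the file being loaded. The function names are hypothetical, it assumes a Linux system that exposes a MemAvailable field in /proc/meminfo, and the on-disk size is only a rough guide to the memory the loaded tensors will need.

```python
import os


def parse_mem_available(meminfo_text):
    """Return the MemAvailable value in bytes from /proc/meminfo-style text.

    Returns None when the field is missing (older kernels do not report it).
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The line looks like: "MemAvailable:    2048000 kB"
            return int(line.split()[1]) * 1024
    return None


def memory_report(data_path, meminfo_path="/proc/meminfo"):
    """Compare the size of a data file with the memory currently available.

    Both arguments are illustrative defaults; 'data_path' would be the
    file passed to torch.load in the traceback above.
    """
    with open(meminfo_path) as handle:
        available = parse_mem_available(handle.read())
    file_size = os.path.getsize(data_path)
    return {
        "file_bytes": file_size,
        "available_bytes": available,
        # None means "unknown" because MemAvailable was not reported.
        "file_fits": None if available is None else file_size < available,
    }
```

Calling memory_report('trn.pt') from the directory that holds the data file, once on the login node and once on a compute node, would show whether the two machines differ in available memory.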
cd340823 posted on 2018-11-14 10:18
Hi Jhilik,
You are trying to run the program on the login node. Login nodes are not designed to take heavy workloads. All compute-intensive jobs have to be run on compute nodes. To do this, you can use either of the following options:
1. Type qsub -I. This will give you an interactive terminal on one of the compute nodes. You can execute your program there.
2. Wrap all your bash commands in a script file (say "job.sh") and run "qsub job.sh". This will submit your job to the scheduler, which will take the script and execute it on a compute node.
For more details, please refer to the following documents:
https://communities.intel.com/docs/DOC-112425
https://communities.intel.com/docs/DOC-112294
https://communities.intel.com/docs/DOC-112293
https://communities.intel.com/docs/DOC-112422
https://communities.intel.com/thread/127653
Regards,
Anju
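Option 2 above can be sketched as follows. This is an illustration, not the cluster's documented procedure: the helper names are hypothetical, the environment name (en), directory (bum) and script (bumcpu.py) are taken from the session shown earlier in the thread, and whether "source activate" works inside a batch job depends on how the job's shell is configured.

```python
import subprocess
from pathlib import Path


def build_job_script(env_name, workdir, command):
    """Return the text of a bash script that mirrors the interactive session."""
    lines = [
        "#!/bin/bash",
        "source activate {}".format(env_name),  # same activation as in the thread
        "cd {}".format(workdir),
        command,
    ]
    return "\n".join(lines) + "\n"


def write_job_script(path, text):
    """Write the script to disk and return its path."""
    script = Path(path)
    script.write_text(text)
    return script


def submit(script_path):
    """Submit the script with qsub and return whatever qsub prints.

    Must be run on the login node, where qsub is available.
    """
    result = subprocess.run(
        ["qsub", str(script_path)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()
```

For example, write_job_script("job.sh", build_job_script("en", "~/bum", "python3 bumcpu.py")) followed by submit("job.sh") would hand the training run to the scheduler instead of running it on the login node.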
jerry1978 posted on 2018-11-14 10:35
Hi Jhilik,
Could you please confirm whether the solution provided helped?
Regards,
Anju
Thank you for your help. When I connect to the node using qsub, I no longer get the memory error. However, I still cannot run the code. I am using the cv2 package, which I installed in a virtual environment named en. When I use it from the normal terminal it works (snapshot attached):

Last login: Thu Sep 6 23:50:21 2018 from 10.5.0.7
[u19304@c009 ~]$ source activate en
(en) [u19304@c009 ~]$ python3
Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
>>> import cv2
>>>

However, when I enter a node using qsub and try to use it, this is what I get:

(en) [u19304@c009-n014 bum]$ python3
Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /lib64/libpangoft2-1.0.so.0: undefined symbol: hb_buffer_set_cluster_level
>>>

I tried reinstalling from the node, but it says the packages are already installed.

Also, when I use tmux, leave the program running, detach the session and log out from the server, I can reattach the session after a few minutes and find the code still running. However, if I come back after an hour, it tells me there is no tmux session. I want to leave the job running and log out.
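Since the same import succeeds on the login node (c009) and fails on a compute node (c009-n014), it can help to record what differs between the two sessions. A small sketch, assuming only the standard library; the function name is hypothetical, and CONDA_DEFAULT_ENV and LD_LIBRARY_PATH are read as plain environment variables that may or may not be set on a given node.

```python
import os
import socket
import sys


def environment_snapshot(environ=None):
    """Collect details that commonly differ between a login and a compute node."""
    env = os.environ if environ is None else environ
    return {
        "hostname": socket.gethostname(),
        "python": sys.executable,  # which interpreter is actually running
        "conda_env": env.get("CONDA_DEFAULT_ENV", ""),
        # Directories searched for shared libraries ahead of the system defaults.
        "ld_library_path": [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p],
    }
```

Printing environment_snapshot() in both sessions and comparing the two results side by side is one way to narrow down why a shared library resolves differently on the compute node.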
Hi Jhilik,
Regarding the import error, we could not reproduce the issue. The imports are working fine here. Could you please share the exact steps to reproduce the problem? Also, kindly attach the output of "conda list" after activating the environment (en).
Regarding the tmux part of the query: if you would like to leave the job running, log out and come back later to check the progress, we strongly recommend that you use qsub instead of qsub -I. To do this, wrap all your commands in a script file and submit it with the "qsub" command on the login node.
To check the status of the job, use "qstat".
To check the real-time logs of the job, use "qpeek -e" and "qpeek -o".
Once the job is completed, the logs can be found in the ".o" and ".e" files.
Regards,
Anju
cd340823 posted on 2018-11-14 10:58
Thank you. This is the list after I use qsub -I to log in to the node:

[u19304@c009-n061 ~]$ source activate en
(en) [u19304@c009-n061 ~]$ conda list
# packages in environment at /home/u19304/.conda/envs/en:
#
blas  1.0  mkl
bzip2  1.0.6  intel_14  [intel]  intel
cairo  1.14.10  intel_0  [intel]  intel
cffi  1.11.5  py36_intel_1  [intel]  intel
cycler  0.10.0  py36_intel_5  [intel]  intel
dbus  1.13.0  h3a4f0e9_0  conda-forge
expat  2.2.5  hfc679d8_2  conda-forge
ffmpeg  3.2.4  hf82bc7d_4  conda-forge
fontconfig  2.12.5  intel_0  [intel]  intel
freetype  2.8  intel_0  [intel]  intel
giflib  5.1.4  h470a237_1  conda-forge
glib  2.53.6  h5d9569c_2
graphite2  1.3.12  hfc679d8_1  conda-forge
gst-plugins-base  1.12.4  h33fb286_0
gstreamer  1.12.4  hb53b477_0
harfbuzz  0.9.39  1
hdf5  1.10.1  intel_0  [intel]  intel
icc_rt  2018.0.3  intel_0  [intel]  intel
icu  59.1  intel_0  [intel]  intel
intelpython  2018.0.3  0  intel
ipp  2018.0.3  intel_0  intel
jasper  1.900.1  hff1ad4c_5  conda-forge
jpeg  9c  intel_0  [intel]  intel
kiwisolver  1.0.1  py36_1  intel
libffi  3.2.1  intel_8  [intel]  intel
libgcc-ng  8.2.0  hdf63c60_1
libgfortran  3.0.0  1  conda-forge
libiconv  1.15  h470a237_3  conda-forge
libpng  1.6.34  intel_1  [intel]  intel
libstdcxx-ng  8.2.0  hdf63c60_1
libtiff  4.0.9  intel_2  [intel]  intel
libwebp  0.5.2  7  conda-forge
libxcb  1.13  h470a237_2  conda-forge
libxml2  2.9.5  intel_0  [intel]  intel
matplotlib  2.1.1  np114py36_intel_2  [intel]  intel
mkl  2018.0.3  intel_1  intel
mkl_fft  1.0.2  np114py36_intel_0  [intel]  intel
mkl_random  1.0.1  np114py36_intel_0  [intel]  intel
ninja  1.8.2  py36h6bb024c_1
numpy  1.14.3  py36_intel_0  [intel]  intel
olefile  0.44  py36_intel_0  [intel]  intel
openblas  0.2.20  8  conda-forge
opencv  3.1.0  np114py36_intel_8  [intel]  intel
opencv3  3.1.0  py36_0  menpo
openmp  2018.0.3  intel_0  intel
openssl  1.0.2o  intel_0  [intel]  intel
pcre  8.41  hfc679d8_3  conda-forge
pillow  4.2.1  py36_intel_0  [intel]  intel
pip  9.0.1  py36_intel_0  [intel]  intel
pixman  0.34.0  intel_0  [intel]  intel
protobuf  3.5.2  py36_intel_0  [intel]  intel
pthread-stubs  0.4  h470a237_1  conda-forge
pycparser  2.18  py36_intel_0  [intel]  intel
pyparsing  2.2.0  py36_intel_0  [intel]  intel
python  3.6.3  intel_12  [intel]  intel
python-dateutil  2.6.0  py36_intel_3  [intel]  intel
pytorch  0.4.1  py36_py35_py27__9.0.176_7.1.2_2  pytorch
pytorch-cpu  0.4.1  py36_cpu_1  pytorch
pytz  2018.4  py36_intel_0  [intel]  intel
qt  4.8.7  3
setuptools  27.2.0  py36_intel_0  [intel]  intel
six  1.10.0  py36_intel_8  [intel]  intel
sqlite  3.23.1  intel_0  [intel]  intel
tbb  2018.0.1  py36_intel_4  [intel]  intel
tcl  8.6.4  intel_19  [intel]  intel
tk  8.6.4  intel_26  [intel]  intel
torchvision  0.2.1  py36_1  pytorch
tqdm  4.25.0  py36h28b3542_0
wheel  0.31.0  py36_intel_0  [intel]  intel
x264  1!152.20180717  h470a237_0  conda-forge
xorg-libxau  1.0.8  h470a237_6  conda-forge
xorg-libxdmcp  1.1.2  h470a237_7  conda-forge
xz  5.2.3  intel_0  [intel]  intel
zlib  1.2.11  intel_3  [intel]  intel
(en) [u19304@c009-n061 ~]$ python3
Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /lib64/libpangoft2-1.0.so.0: undefined symbol: hb_buffer_set_cluster_level
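The listing mixes packages from several channels (intel, conda-forge, menpo, pytorch, plus rows with no channel shown), and it contains overlapping packages such as opencv and opencv3, and pytorch and pytorch-cpu. A rough sketch for grouping such a listing by channel so the overlaps are easier to spot. The parsing rule is an assumption based on the rows above: name, version and build come first, bracketed tokens such as [intel] are ignored, and the last remaining token, if any, is treated as the channel.

```python
from collections import defaultdict


def group_by_channel(conda_list_text):
    """Group package names from 'conda list' output by the channel column."""
    groups = defaultdict(list)
    for line in conda_list_text.splitlines():
        tokens = line.split()
        # Skip blank lines, comment lines and rows too short to be packages.
        if len(tokens) < 3 or tokens[0].startswith("#"):
            continue
        name = tokens[0]
        # Anything after name/version/build, minus bracketed feature tags.
        extra = [t for t in tokens[3:] if not t.startswith("[")]
        channel = extra[-1] if extra else "(none shown)"
        groups[channel].append(name)
    return {channel: sorted(names) for channel, names in groups.items()}
```

Feeding the output of "conda list" into group_by_channel gives a per-channel view of the environment. One possible reading of the error, not confirmed anywhere in the thread, is that the system's libpangoft2 ends up loading the environment's harfbuzz 0.9.39 instead of the system copy.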
Hi Jhilik,
We were able to reproduce your issue after installing some of the packages listed in your environment. Running the following command solved the issue for us:
conda clean --package
Please wait for some time until the cleaning is completed, and let us know whether it works for you.
Also, we are not sure why you installed a GPU version of pytorch. It is advisable to keep the conda environment clean, with only the packages that you need.
Regards,
Anju
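After running the cleanup, one way to confirm the result is to attempt the import in a fresh interpreter process, so that nothing already loaded in the current session hides the outcome. A minimal sketch using only the standard library; the function name is hypothetical.

```python
import subprocess
import sys


def can_import(module_name, python=sys.executable):
    """Try 'import <module_name>' in a new interpreter.

    Returns (succeeded, stderr_text) so a failure message such as an
    undefined-symbol ImportError can be inspected.
    """
    result = subprocess.run(
        [python, "-c", "import {}".format(module_name)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode == 0, result.stderr.strip()
```

Running can_import("cv2") on the compute node, with the en environment active, before and after the cleanup would show whether the undefined-symbol error is gone.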
jerry1978 posted on 2018-11-14 11:24
Hi Jhilik,
Please confirm whether the solution provided worked for you.
Regards,
Dona
Hi Jhilik,
We are closing this thread since there has been no response. Feel free to open another thread for further queries.
Regards,
Anju