
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

1 vote
0 answers
29 views
Hadoop + slow block-receive warnings from data-node machines
We have a Hadoop cluster with 487 data-node machines (each data-node machine also runs the node-manager service). All machines are physical DELL servers, and the OS is RHEL 7.9. Each data-node machine has 12 disks, each 12 TB in size. The Hadoop cluster was installed from HDP packages (previously under Hortonworks, now under Cloudera).

Users are complaining about slowness of the Spark applications that run on the data-node machines, and after investigation we saw the following warnings in the data-node logs:

    2024-03-18 17:41:30,230 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 401ms (threshold=300ms), downstream DNs=[172.87.171.24:50010, 172.87.171.23:50010]
    2024-03-18 17:41:49,795 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 410ms (threshold=300ms), downstream DNs=[172.87.171.26:50010, 172.87.171.31:50010]
    2024-03-18 18:06:29,585 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 303ms (threshold=300ms), downstream DNs=[172.87.171.34:50010, 172.87.171.22:50010]
    2024-03-18 18:18:55,931 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 729ms (threshold=300ms), downstream DNs=[172.87.11.27:50010]

In the log above we can see the warning `Slow BlockReceiver write packet to mirror took xxms`, together with the downstream data-node machines (172.87.171.23, 172.87.171.24, etc.). From my understanding, the `Slow BlockReceiver write packet to mirror` warning may indicate a delay in writing the block to the OS cache or disk.

So I am trying to collect the possible reasons for this warning:

1. Delay in writing the block to the OS cache or disk.
2. The cluster is at or near its resource limits (memory, CPU or disk).
3. Network issues between machines.

From my verification I do not see a **disk**, **CPU** or **memory** problem; we checked all machines. From the network side I do not see any special issues relevant to the machines themselves, and we also used `iperf3` to check the bandwidth between one machine and another.

Here is an example between data-node01 and data-node03 (from my understanding, and please correct me if I am wrong, the bandwidth looks OK).

From data-node01:

    iperf3 -i 10 -s
    [ ID] Interval           Transfer     Bandwidth
    [  5]   0.00-10.00  sec  7.90 GBytes  6.78 Gbits/sec
    [  5]  10.00-20.00  sec  8.21 GBytes  7.05 Gbits/sec
    [  5]  20.00-30.00  sec  7.25 GBytes  6.23 Gbits/sec
    [  5]  30.00-40.00  sec  7.16 GBytes  6.15 Gbits/sec
    [  5]  40.00-50.00  sec  7.08 GBytes  6.08 Gbits/sec
    [  5]  50.00-60.00  sec  6.27 GBytes  5.39 Gbits/sec
    [  5]  60.00-60.04  sec  35.4 MBytes  7.51 Gbits/sec
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bandwidth
    [  5]   0.00-60.04  sec  0.00 Bytes   0.00 bits/sec   sender
    [  5]   0.00-60.04  sec  43.9 GBytes  6.28 Gbits/sec  receiver

From data-node03:

    iperf3 -i 1 -t 60 -c 172.87.171.84
    [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
    [  4]   0.00-1.00   sec   792 MBytes  6.64 Gbits/sec    0   3.02 MBytes
    [  4]   1.00-2.00   sec   834 MBytes  6.99 Gbits/sec   54   2.26 MBytes
    [  4]   2.00-3.00   sec   960 MBytes  8.05 Gbits/sec    0   2.49 MBytes
    [  4]   3.00-4.00   sec   896 MBytes  7.52 Gbits/sec    0   2.62 MBytes
    [  4]   4.00-5.00   sec   790 MBytes  6.63 Gbits/sec    0   2.70 MBytes
    [  4]   5.00-6.00   sec   838 MBytes  7.03 Gbits/sec    4   1.97 MBytes
    [  4]   6.00-7.00   sec   816 MBytes  6.85 Gbits/sec    0   2.17 MBytes
    [  4]   7.00-8.00   sec   728 MBytes  6.10 Gbits/sec    0   2.37 MBytes
    [  4]   8.00-9.00   sec   692 MBytes  5.81 Gbits/sec   47   1.74 MBytes
    [  4]   9.00-10.00  sec   778 MBytes  6.52 Gbits/sec    0   1.91 MBytes
    [  4]  10.00-11.00  sec   785 MBytes  6.58 Gbits/sec   48   1.57 MBytes
    [  4]  11.00-12.00  sec   861 MBytes  7.23 Gbits/sec    0   1.84 MBytes
    [  4]  12.00-13.00  sec   844 MBytes  7.08 Gbits/sec    0   1.96 MBytes

Note - the NIC cards are 10G (we checked this with `ethtool`). We also checked the firmware version of the NIC card:

    ethtool -i p1p1
    driver: i40e
    version: 2.8.20-k
    firmware-version: 8.40 0x8000af82 20.5.13
    expansion-rom-version:
    bus-info: 0000:3b:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: yes

We also checked the kernel messages (`dmesg`) but did not see anything special. From `dmesg`, about the CPU:

    dmesg | grep CPU
    [ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
    [ 0.000000] smpboot: Ignoring 160 unusable CPUs in ACPI table
    [ 0.000000] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:2
    [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=2
    [ 0.000000] RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=32.
    [ 0.184771] CPU0: Thermal monitoring enabled (TM1)
    [ 0.184943] TAA: Vulnerable: Clear CPU buffers attempted, no microcode
    [ 0.184944] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
    [ 0.324340] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (fam: 06, model: 4f, stepping: 01)
    [ 0.327772] smpboot: CPU 1 Converting physical 0 to logical die 1
    [ 0.408126] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    [ 0.436824] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
    [ 0.436828] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
    [ 0.464933] Brought up 32 CPUs
    [ 3.223989] acpi LNXCPU:7e: hash matches
    [ 49.145592] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
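One possible way to narrow this down is to count how often each downstream data node appears in the slow-mirror warnings: if a few IPs dominate, the delay is more likely local to those machines (a slow disk or NIC) than cluster-wide. A minimal sketch, assuming a typical datanode log path (the `LOG` path is an assumption; adjust it to the actual location):

```bash
#!/usr/bin/env bash
# Count how often each downstream DN shows up in "Slow BlockReceiver" warnings.
# The log path below is an assumption -- point it at your real datanode log.
LOG=/var/log/hadoop/hdfs/hadoop-hdfs-datanode-$(hostname).log

grep 'Slow BlockReceiver write packet to mirror' "$LOG" \
  | grep -o 'downstream DNs=\[[^]]*\]' \
  | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' \
  | sort | uniq -c | sort -rn | head -20
```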
yael (13936 rep)
Mar 19, 2024, 02:03 PM • Last activity: Mar 19, 2024, 04:40 PM
1 vote
0 answers
333 views
Clear RAM memory cache and buffers on a production Hadoop cluster with HDFS filesystem
We have a Hadoop cluster with 265 Linux RHEL machines. Of those 265 machines, 230 are data-node machines with the HDFS filesystem. The total memory on each data node is 128G, and we run many Spark applications on these machines.

Last month we added another Spark application, so the processes take more memory from the data-node machines. We noticed that cache memory is a very important part, and when more processes run on a machine, the obvious conclusion is to add more RAM. Since we can't upgrade the memory to 256G in the next 5-6 months, we are thinking about how to improve the performance of the RHEL machines and the memory cache as far as possible. From our experience, the memory cache is very important for application stability.

One option is to clear the RAM memory cache and buffers as follows:

1. Clear PageCache only: `sync; echo 1 > /proc/sys/vm/drop_caches`
2. Clear dentries and inodes: `sync; echo 2 > /proc/sys/vm/drop_caches`
3. Clear PageCache, dentries and inodes: `sync; echo 3 > /proc/sys/vm/drop_caches`

and run them from cron as follows (from https://www.wissenschaft.com.ng/blog/how-to-clear-ram-memory-cache-buffer-and-swap-space-on-linux/):

    #!/bin/bash
    # Note, we are using "echo 3", but it is not recommended in production; instead use "echo 1"
    echo "echo 3 > /proc/sys/vm/drop_caches"

Set execute permission on the clearcache.sh file:

    # chmod 755 clearcache.sh

Now you may call the script whenever you need to clear the RAM cache. Then set a cron job to clear the RAM cache every day at 2am. Open crontab for editing:

    # crontab -e

Append the line below, then save and exit, to run it at 2am daily:

    0 2 * * * /path/to/clearcache.sh

But since we are talking about production data-node machines, I am not sure the settings above are safe, or whether they really provide a solution until we can increase the memory from 128G to 256G.

Can I get your ideas about what I wrote, and whether clearing the RAM memory cache is the right temporary solution until the memory upgrade?
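Before scheduling `drop_caches` on production nodes, it may be worth confirming the nodes are actually short of reclaimable memory: `MemAvailable` in `/proc/meminfo` already accounts for page cache the kernel can reclaim on demand, so if it stays healthy, dropping caches is unlikely to help. A small check script (a sketch only; it reads standard `/proc/meminfo` fields):

```bash
#!/usr/bin/env bash
# Report how much memory the kernel considers available vs. what sits in page cache.
awk '
  /^MemTotal:/     { total  = $2 }
  /^MemAvailable:/ { avail  = $2 }
  /^Cached:/       { cached = $2 }
  END {
    printf "total      : %.1f GiB\n", total  / 1048576
    printf "available  : %.1f GiB\n", avail  / 1048576
    printf "page cache : %.1f GiB\n", cached / 1048576
    printf "available %%: %.0f%%\n",  100 * avail / total
  }
' /proc/meminfo
```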
yael (13936 rep)
Mar 9, 2023, 07:34 PM
0 votes
1 answer
1341 views
CPU load average + how to deal with processes in D state
We can see on our RHEL 7.6 server (kernel version 3.10.0-957.el7.x86_64) that the following processes are in `D` state (they run as the HDFS user).

Note - *the D state code means that the process is in uninterruptible sleep.*

    ps -eo s,user,cmd | grep ^[RD]
    D hdfs     du -sk /grid/sdj/hadoop/hdfs/data/current/BP-1018134753-10.3.6.170-1530088122990
    D hdfs     du -sk /grid/sdm/hadoop/hdfs/data/current/BP-1018134753-10.3.6.170-1530088122990
    R root     ps -eo s,user,cmd

Notes: the disks sdj and sdm are 3 TB in size; the `du -sk` also happens on other disks such as sdd, sdf, etc., and the disks use the ext4 file system.

We suspect that our high CPU load average is caused by the `du -sk` commands that actually run on the disks, so I was thinking about what we can do regarding this behavior.

One option may be to disable the `du -sk` verification in HDFS, but I have no clue how to do that.

A second option is to work out what actually causes the D state. I am not sure, but maybe upgrading the kernel version would help to avoid the D state? Or something else (like disabling CPU threads), etc.?

More details:

    lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                48
    On-line CPU(s) list:   0-47
    Thread(s) per core:    2
    Core(s) per socket:    12
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6

and the CPU load average is around ~42-45 (for the 15-minute average).

References:
https://community.cloudera.com/t5/Support-Questions/Does-hadoop-run-dfs-du-automatically-when-a-new-job-starts/td-p/231297
https://community.cloudera.com/t5/Support-Questions/Can-hdfs-dfsadmin-and-hdfs-dsfs-du-be-taxing-on-my-cluster/m-p/182402
https://community.pivotal.io/s/article/Dealing-with-Processes-in-State-D---Uninterruptible-Sleep-Usually-IO?language=en_US
https://www.golinuxhub.com/2018/05/how-to-disable-or-enable-hyper/
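Since D state normally means the process is blocked in the kernel waiting for I/O, one way to confirm what the `du -sk` processes are waiting on is to look at their kernel wait channel and stack. A sketch using standard `ps` and `/proc` interfaces:

```bash
#!/usr/bin/env bash
# List D-state processes with the kernel function they are blocked in (wchan);
# this usually confirms whether they are stuck waiting on disk I/O.
ps -eo state,pid,user,wchan:32,cmd --no-headers \
  | awk '$1 == "D"' \
  | while read -r state pid user wchan cmd; do
      echo "PID $pid ($user) blocked in: $wchan"
      echo "  cmd: $cmd"
      # Kernel stack of the blocked task (readable as root on most systems).
      [ -r "/proc/$pid/stack" ] && sed 's/^/  /' "/proc/$pid/stack"
    done
```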
yael (13936 rep)
Nov 28, 2021, 02:16 PM • Last activity: Nov 28, 2021, 02:43 PM
1 vote
2 answers
609 views
Convert a list of HDF5 files to NetCDF files with the same names using shell scripting
I have a list of datasets containing satellite data arranged in monthly folders as follows:

    01 02 03 04 05 06 07 08 09 10 11 12

These folders are further divided into daily data folders; for example, for the first month 01, the daily files are arranged in folders as:

    01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

which eventually contain the datasets:

    OMI-Aura_L2-PROFOZ_2016m0101t0059-o60970_v003-2016m0107t153711.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t1410-o60978_v003-2016m0107t153715.he5
    OMI-Aura_L2-PROFOZ_2016m0101t0237-o60971_v003-2016m0107t153714.he5
    OMI-Aura_L2-PROFOZ_2016m0101t1410-o60978_v003-2016m0107t153715.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t0237-o60971_v003-2016m0107t153714.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t1549-o60979_v003-2016m0107t153713.he5
    OMI-Aura_L2-PROFOZ_2016m0101t0416-o60972_v003-2016m0107t153715.he5
    OMI-Aura_L2-PROFOZ_2016m0101t1549-o60979_v003-2016m0107t153713.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t0416-o60972_v003-2016m0107t153715.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t1727-o60980_v003-2016m0107t153718.he5
    OMI-Aura_L2-PROFOZ_2016m0101t0555-o60973_v003-2016m0107t153709.he5
    OMI-Aura_L2-PROFOZ_2016m0101t1727-o60980_v003-2016m0107t153718.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t0555-o60973_v003-2016m0107t153709.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t1906-o60981_v003-2016m0107t153716.he5
    OMI-Aura_L2-PROFOZ_2016m0101t0734-o60974_v003-2016m0107t153717.he5
    OMI-Aura_L2-PROFOZ_2016m0101t1906-o60981_v003-2016m0107t153716.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t0734-o60974_v003-2016m0107t153717.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t2045-o60982_v003-2016m0107t153719.he5
    OMI-Aura_L2-PROFOZ_2016m0101t0913-o60975_v003-2016m0107t153711.he5
    OMI-Aura_L2-PROFOZ_2016m0101t2045-o60982_v003-2016m0107t153719.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t0913-o60975_v003-2016m0107t153711.he5.met
    OMI-Aura_L2-PROFOZ_2016m0101t2224-o60983_v003-2016m0107t153717.he5

I want to select only the files with the .he5 extension and convert them to NetCDF files using the following command, where the file name is preserved:

    ncks inputfile.he5 inputfile.nc

I am trying to process every file, so I wrote a shell script as follows:

    shopt -s globstar
    for f in ./**; do
      echo "$f" | grep -v .met | grep ".he5"
      echo "$f" | grep -v .met | ncks $(grep".he5") $(echo $(grep -o 'OMI-Aura_L2-PROFOZ_[0-9]\{4\}m[0-9]\{4\}t[0-9]\{4\}-o[0-9]\{5\}_v003-[0-9]\{4\}m[0-9]\{4\}t[0-9]\{6\}').nc)
    done

It is able to extract the file names, but I am not getting the output. How can I convert all of the files?
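A common way to restructure this kind of loop is to match the wanted files directly with a glob instead of piping `echo` through `grep`, and to derive the output name with shell parameter expansion. A possible rewrite (a sketch; it assumes `ncks` accepts an `.he5` input and an `.nc` output exactly as in the question):

```bash
#!/usr/bin/env bash
# Convert every *.he5 file (the glob does not match *.he5.met) to NetCDF,
# keeping the original base name.
shopt -s globstar nullglob

for f in ./**/*.he5; do
    out="${f%.he5}.nc"        # e.g. OMI-..._v003-2016m0107t153711.he5 -> .nc
    echo "converting: $f -> $out"
    ncks "$f" "$out"
done
```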
Mala Pokhrel (13 rep)
Oct 1, 2021, 08:31 AM • Last activity: Oct 1, 2021, 08:56 AM
0 votes
0 answers
132 views
How to move the last n files in HDFS
I have a folder in HDFS containing 830000 files, and I want to move the last 8797 files to another folder in HDFS. I tried using xargs but it didn't work well. Any other ideas?

Here is the exact split point between the files; I want to move the files after "2021-03-09 15:15":

    -rw-rw-r--+  3 talend_user talend_group  102013 2021-03-09 15:14 /user/file_1
    -rw-rw-r--+  3 talend_user talend_group    9360 2021-03-09 15:15 /user/file_2
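One approach is to let `hdfs dfs -ls` print the listing, keep only the paths whose modification timestamp is at or after the cut-off, and move those with `hdfs dfs -mv`. A sketch with placeholder source and destination paths (worth testing on a small subset first):

```bash
#!/usr/bin/env bash
# Move HDFS files modified at or after a cut-off timestamp into another directory.
SRC=/user/source_dir        # placeholder paths -- adjust to your layout
DST=/user/target_dir
CUTOFF="2021-03-09 15:15"

hdfs dfs -ls "$SRC" \
  | awk -v cutoff="$CUTOFF" '
      # -ls columns: perms repl owner group size date time path
      NF >= 8 && ($6 " " $7) >= cutoff { print $8 }
    ' \
  | while read -r path; do
      hdfs dfs -mv "$path" "$DST/"
    done
```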
Omar AlSaghier (101 rep)
Jun 20, 2021, 11:56 AM • Last activity: Jun 20, 2021, 12:52 PM
0 votes
0 answers
137 views
How to run a script as the hdfs user without a password
We created the following script on RHEL 7.6:

    /home/run_tasks

and in `visudo` we configured:

    %sudo   ALL=(ALL:ALL) ALL
    root    ALL=(ALL) ALL
    hdfs    ALL=(ALL) ALL
    hdfs    ALL=(root) NOPASSWD: /home/run_tasks

and:

    ls -ltr /home/run_tasks
    -rwxrwxrwx 1 hdfs hdfs 6377 Sep 11 2019 /home/run_tasks

So when we run the script as:

    su hdfs -c "sudo /home/run_tasks"

we get:

    sudo: sorry, you must have a tty to run sudo

and after we commented out the following lines (from `visudo`):

    #Defaults    requiretty
    #Defaults    !visiblepw

we get:

    su hdfs -c "sudo /home/run_tasks"
    ls: Permission denied: user=root, access=EXECUTE, inode="/../../..":hdfs:hdfs:drwxr-x---
    ls: Permission denied: user=root, access=EXECUTE, inode="/../../..":hdfs:hdfs:drwxr-x---
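The `requiretty` default can also be relaxed for just the hdfs user instead of globally, keeping the NOPASSWD rule limited to the one script. A sketch written as a sudoers drop-in file (the file name is an example, not from the question):

```bash
# Create a drop-in rule: no tty requirement for hdfs, passwordless only for this script.
sudo tee /etc/sudoers.d/hdfs-run-tasks >/dev/null <<'EOF'
Defaults:hdfs !requiretty
hdfs ALL=(root) NOPASSWD: /home/run_tasks
EOF
sudo chmod 0440 /etc/sudoers.d/hdfs-run-tasks
sudo visudo -c -f /etc/sudoers.d/hdfs-run-tasks   # syntax-check the new file
```

The remaining `Permission denied ... inode ...` errors come from HDFS itself (the script apparently lists HDFS paths while running as root), which is a separate permissions question from the sudo/tty part.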
yael (13936 rep)
Sep 13, 2020, 02:09 PM • Last activity: Sep 13, 2020, 03:39 PM
2 votes
0 answers
1438 views
ssh: connect to host localhost port 22: Connection refused
I have installed `hadoop` and `ssh`. `hadoop` was working fine, then today I am getting the error below when I run the command `sbin/start-dfs.sh`:

    Starting namenodes on [localhost]
    localhost: ssh: connect to host localhost port 22: Connection refused
    Starting datanodes
    localhost: ssh: connect to host localhost port 22: Connection refused
    Starting secondary namenodes [chbpc-VirtualBox]
    chbpc-VirtualBox: ssh: connect to host chbpc-virtualbox port 22: Connection refused

I've checked the ssh status, and I have the following error:

    ssh.service - OpenBSD Secure Shell server
       Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
       Active: failed (Result: start-limit-hit) since Tue 2020-02-04 15:34:10 +04; 3h 35min ago
      Process: 946 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=255)

    Feb 04 15:34:09 chbpc-VirtualBox systemd: ssh.service: Control process exited, code=exited status=255
    Feb 04 15:34:09 chbpc-VirtualBox systemd: Failed to start OpenBSD Secure Shell server.
    Feb 04 15:34:09 chbpc-VirtualBox systemd: ssh.service: Unit entered failed state.
    Feb 04 15:34:09 chbpc-VirtualBox systemd: ssh.service: Failed with result 'exit-code'.
    Feb 04 15:34:10 chbpc-VirtualBox systemd: ssh.service: Service hold-off time over, scheduling restart.
    Feb 04 15:34:10 chbpc-VirtualBox systemd: Stopped OpenBSD Secure Shell server.
    Feb 04 15:34:10 chbpc-VirtualBox systemd: ssh.service: Start request repeated too quickly.
    Feb 04 15:34:10 chbpc-VirtualBox systemd: Failed to start OpenBSD Secure Shell server.
    Feb 04 15:34:10 chbpc-VirtualBox systemd: ssh.service: Unit entered failed state.
    Feb 04 15:34:10 chbpc-VirtualBox systemd: ssh.service: Failed with result 'start-limit-hit'.

How can I fix this?
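`ExecStartPre=/usr/sbin/sshd -t` exiting with status 255 means the sshd configuration test failed before the daemon was even started, so a reasonable first step is to run that test by hand and read its complaint. A short diagnostic sketch (standard OpenSSH and systemd commands; regenerating host keys only applies if the test reports them missing):

```bash
# Show why the config test fails (bad option, missing host keys, ...).
sudo /usr/sbin/sshd -t

# If the complaint is about missing host keys, recreate the default set.
sudo ssh-keygen -A

# Clear the start-limit-hit state and start the service again.
sudo systemctl reset-failed ssh
sudo systemctl restart ssh
systemctl status ssh
```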
Sanaya (31 rep)
Feb 4, 2020, 03:10 PM • Last activity: Feb 4, 2020, 09:44 PM
1 vote
0 answers
322 views
master: ssh: connect to host master port 22: Connection refused
I am trying to start my Hadoop cluster using the command `start-dfs.sh`, but I am getting the errors shown below:

    Starting namenodes on [master]
    master: ssh: connect to host master port 22: Connection refused
    Starting datanodes
    master: ssh: connect to host master port 22: Connection refused

I've checked the ssh status; it returns:

    ssh.service - OpenBSD Secure Shell server
       Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enab
       Active: failed (Result: start-limit-hit) since Tue 2020-02-04 14:15:01 +04; 2
      Process: 5017 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=255)

    Feb 04 14:15:00 hadoop-HP-Pro3500-Series systemd: ssh.service: Unit entered f
    Feb 04 14:15:00 hadoop-HP-Pro3500-Series systemd: ssh.service: Failed with re
    Feb 04 14:15:01 hadoop-HP-Pro3500-Series systemd: ssh.service: Service hold-o
    Feb 04 14:15:01 hadoop-HP-Pro3500-Series systemd: Stopped OpenBSD Secure Shel
    Feb 04 14:15:01 hadoop-HP-Pro3500-Series systemd: ssh.service: Start request
    Feb 04 14:15:01 hadoop-HP-Pro3500-Series systemd: Failed to start OpenBSD Sec
    Feb 04 14:15:01 hadoop-HP-Pro3500-Series systemd: ssh.service: Unit entered f

How can I fix this?
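This looks like the same failure pattern as the previous question: the `sshd -t` config test in `ExecStartPre` fails, so nothing ever listens on port 22. Once sshd starts again, it may also be worth confirming that the name `master` resolves to the intended address. A small check, assuming nothing about the cluster beyond the hostname used in the question:

```bash
# Is anything listening on port 22 locally?
sudo ss -tlnp | grep ':22 '

# Does "master" resolve, and to which address?
getent hosts master

# Re-run the same config test systemd runs, then restart the service if it passes.
sudo /usr/sbin/sshd -t && sudo systemctl restart ssh
```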
Sanaya (31 rep)
Feb 4, 2020, 10:33 AM • Last activity: Feb 4, 2020, 10:52 AM
-2 votes
1 answer
95 views
Hadoop cluster + designing the number of disks per data-node machine and minimum requirements
We are using HDP version 2.6.5, and the HDFS block replication factor is 3.

We are trying to understand the minimum disk requirements per data node for production mode, given that the block replication factor is 3.

Since we are talking about a production cluster, and with HDFS replication = 3, what should be the minimum number of disks per data-node machine?
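Replication factor 3 does not by itself dictate a disk count per node (HDFS places the three replicas on different nodes, not on different disks of one node), but it does fix how much raw capacity the cluster needs, which is usually what drives the disk count. A back-of-the-envelope sketch with made-up example numbers:

```bash
#!/usr/bin/env bash
# Rough raw-capacity estimate for an HDFS cluster (all numbers are examples).
logical_tb=500        # data you actually want to store
replication=3         # HDFS block replication factor
overhead=1.25         # headroom for temp data, non-DFS usage, imbalance
disk_tb=12            # size of one data disk
nodes=20              # number of data nodes

raw_tb=$(awk -v l="$logical_tb" -v r="$replication" -v o="$overhead" \
             'BEGIN { printf "%.0f", l * r * o }')
disks=$(awk -v raw="$raw_tb" -v d="$disk_tb" -v n="$nodes" \
             'BEGIN { x = raw / (d * n); print (x == int(x)) ? x : int(x) + 1 }')

echo "raw capacity needed    : ${raw_tb} TB"
echo "minimum disks per node : ${disks} (${disk_tb} TB disks, ${nodes} nodes)"
```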
yael (13936 rep)
Jan 19, 2020, 08:01 PM • Last activity: Jan 19, 2020, 09:06 PM
1 vote
1 answer
3407 views
How to find the owning user and group as the HDFS user
We can grant permissions as the hdfs user for hive as follows:

    su hdfs
    $ hdfs dfs -chown hive:2098

but how do we do it the opposite way, in order to verify the owning user and group?
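The owning user and group of an HDFS path can be read back either from a plain long listing or with the `stat` sub-command's format string. A short sketch (the warehouse path is a placeholder):

```bash
# Long listing of the directory itself shows owner and group.
hdfs dfs -ls -d /apps/hive/warehouse

# Or print just the owning user and group.
hdfs dfs -stat "%u %g" /apps/hive/warehouse
```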
yael (13936 rep)
Dec 2, 2019, 08:28 AM • Last activity: Dec 2, 2019, 08:55 AM