You are viewing the RapidMiner Radoop documentation for version 9.9 - Check here for latest version
Connecting to a 3.0.1+ Hortonworks Sandbox
As of this writing, the latest available version of Hortonworks Data Platform (HDP) on the Hortonworks Sandbox VM is 3.0.1, and this guide was written for that version.
Start and configure the Sandbox VM
Download the Hortonworks Sandbox VM for VirtualBox from the Download website.
Import the OVA-packaged VM into your virtualization environment (VirtualBox is covered in this guide).
Start the VM. After powering it on, you have to select the first option from the boot menu, then wait for the boot to complete.
Log in to the VM. You can do this by switching to the login console (Alt+F5) or, even better, via SSH on localhost port 2122. Note that the VM exposes 2 SSH ports: one belongs to the VM itself (2122), while the other (2222) belongs to a Docker container running inside the VM. The username is root and the password is hadoop for both.
Edit /sandbox/proxy/generate-proxy-deploy-script.sh to include the following ports in the tcpPortsHDP array: 8025, 8030, 8050, 10020, 50010.
- Open the script in an editor:
vi /sandbox/proxy/generate-proxy-deploy-script.sh
- Find the tcpPortsHDP variable and, leaving the other values in place, add the following to the hashtable assignment:
[8025]=8025 [8030]=8030 [8050]=8050 [10020]=10020 [50010]=50010
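As a rough illustration of the edit, assuming tcpPortsHDP is declared as a bash associative array (the text calls it a hashtable), the new entries would sit next to the ones the script already contains. The existing entries are deliberately not reproduced here, and the loop is only a hypothetical helper for double-checking what was added.

```shell
#!/usr/bin/env bash
# Sketch only: the real script already defines tcpPortsHDP with its own
# entries, which must stay in place and are not shown here.
# Added ports: 8025, 8030, 8050, 10020 and 50010.
declare -A tcpPortsHDP=(
  [8025]=8025
  [8030]=8030
  [8050]=8050
  [10020]=10020
  [50010]=50010
)

# Print the added mappings in numeric order to double-check the edit.
for port in $(printf '%s\n' "${!tcpPortsHDP[@]}" | sort -n); do
  echo "${port} -> ${tcpPortsHDP[$port]}"
done
```

Each key and value are identical here because every port is exposed on the VM under the same number it has inside the container.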
Run the edited script:
/sandbox/proxy/generate-proxy-deploy-script.sh
- This re-creates the /sandbox/proxy/proxy-deploy.sh script along with the config files in /sandbox/proxy/conf.d and /sandbox/proxy/conf.stream.d, thus exposing the additional ports added to the tcpPortsHDP hashtable in the previous step.
Run the /sandbox/proxy/proxy-deploy.sh script:
/sandbox/proxy/proxy-deploy.sh
- Running the docker ps command afterwards shows an instance named sandbox-proxy and the ports it has exposed. The values added to the tcpPortsHDP hashtable should appear in the output as mappings such as 0.0.0.0:10020->10020/tcp.
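Scanning the docker ps output by eye is easy to get wrong, so the check can be scripted. The following is a minimal sketch: missing_ports is a hypothetical helper name, and it assumes each mapping appears in the 0.0.0.0:PORT->PORT/tcp form quoted above.

```shell
# Print the newly added ports that do NOT appear as mappings in the
# given `docker ps` output (passed as a single string argument).
# An empty result means all five mappings were found.
missing_ports() {
  local ps_output="$1" port missing=""
  for port in 8025 8030 8050 10020 50010; do
    case "$ps_output" in
      *"0.0.0.0:${port}->${port}/tcp"*) ;;                 # mapping present
      *) missing="${missing}${missing:+ }${port}" ;;       # mapping absent
    esac
  done
  printf '%s\n' "$missing"
}

# Possible usage inside the VM (not executed here):
#   missing_ports "$(docker ps --filter name=sandbox-proxy --format '{{.Ports}}')"
```

If the function prints any port numbers, re-check the tcpPortsHDP edit and run both scripts again.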
These changes only made sure that the referenced ports of the Docker container are accessible on the respective ports of the VM. Since the network adapter of the VM is attached to NAT, these ports are not accessible from your local machine. To make them available you have to add the port forwarding rules listed below to the VM. In VirtualBox you can find these settings under Machine / Settings / Network / Adapter 1 / Advanced / Port Forwarding.
Name               Protocol  Host IP    Host Port  Guest IP  Guest Port
resourcetracker    TCP       127.0.0.1  8025                 8025
resourcescheduler  TCP       127.0.0.1  8030                 8030
resourcemanager    TCP       127.0.0.1  8050                 8050
jobhistory         TCP       127.0.0.1  10020                10020
datanode           TCP       127.0.0.1  50010                50010
Edit your local hosts file (on your host operating system, not inside the VM) and add sandbox.hortonworks.com and sandbox-hdp.hortonworks.com to your localhost entry. At the end it should look something like this:
127.0.0.1 localhost sandbox.hortonworks.com sandbox-hdp.hortonworks.com
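The port forwarding rules from the table above can also be added from the host's command line instead of the VirtualBox settings dialog. The sketch below only prints the VBoxManage commands so they can be reviewed first. print_natpf_commands is a hypothetical helper, and the VM name is a placeholder that has to be replaced with the name reported by VBoxManage list vms. It assumes the forwarding applies to Adapter 1 (hence natpf1) and that the VM is powered off, since modifyvm changes the stored configuration.

```shell
# Print one VBoxManage command per forwarding rule from the table.
# Rule format: name,protocol,host IP,host port,guest IP (left empty),guest port
print_natpf_commands() {
  local vm="$1" rule name port
  for rule in resourcetracker:8025 resourcescheduler:8030 \
              resourcemanager:8050 jobhistory:10020 datanode:50010; do
    name="${rule%%:*}"   # text before the colon
    port="${rule##*:}"   # text after the colon
    printf 'VBoxManage modifyvm "%s" --natpf1 "%s,tcp,127.0.0.1,%s,,%s"\n' \
      "$vm" "$name" "$port" "$port"
  done
}

# Possible usage on the host ("my-sandbox-vm" is a placeholder name):
#   print_natpf_commands "my-sandbox-vm"
```

After reviewing the printed commands, they can be run one by one or piped to a shell.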
Reset Ambari access. Use an SSH client to log in to localhost as root, this time using port 2222! (For example, on OS X or Linux, use the command ssh root@localhost -p 2222, password: hadoop.)
- (At first login you have to set a new root password; do so and remember it.)
- Run ambari-admin-password-reset as the root user.
- Provide a new admin password for Ambari.
- Run ambari-agent restart.
Open the Ambari website: http://sandbox.hortonworks.com:8080
- Log in with admin and the password you chose in the previous step.
- Navigate to the YARN / Configs / Memory configuration page.
- Edit the Memory Node setting to at least 7 GB and click Override.
- You will be prompted to create a new "YARN Configuration Group"; enter a new name.
- On the "Save Configuration Group" dialog, click the Manage Hosts button.
- On the "Manage YARN Configuration Groups" page, take the node in the "Default" group and add it to the group you named in the "YARN Configuration Group" step.
- A "Warning" dialog will open asking you to add notes; click the Save button.
- A "Dependent Configurations" dialog will open, in which Ambari may recommend modifying some related properties automatically. If so, untick tez.runtime.io.sort.mb to keep its original value. Click the Ok button.
- Ambari may open a "Configurations" page with further suggestions. Reviewing them is out of the scope of this document, so just click Proceed Anyway.
- Navigate to the Hive / Configs / Advanced configuration page.
In the Custom hiveserver2-site section, add the hive.security.authorization.sqlstd.confwhitelist.append property via Add Property... and set it to the following value (it must not contain whitespace):
radoop\.operation\.id|mapred\.job\.name|hive\.warehouse\.subdir\.inherit\.perms|hive\.exec\.max\.dynamic\.partitions|hive\.exec\.max\.dynamic\.partitions\.pernode|spark\.app\.name|hive\.remove\.orderby\.in\.subquery
Save the configuration and restart all affected services. More details on hive.security.authorization.sqlstd.confwhitelist.append can be found in the Hadoop Security / Configuring Apache Hive SQL Standard-based authorization section.
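The whitelist value above is a regular expression: property names joined with |, with every dot escaped. Typing it by hand invites stray spaces or unescaped dots, so the following sketch shows one way to assemble such a value from plain property names. build_whitelist and escape_dots are hypothetical helper names, and only three of the properties are used to keep the example short.

```shell
# Escape every "." in a property name so it matches literally in a regex.
escape_dots() {
  printf '%s' "$1" | sed 's/\./\\./g'
}

# Join the escaped property names with "|" and no surrounding whitespace.
build_whitelist() {
  local out="" prop
  for prop in "$@"; do
    out="${out}${out:+|}$(escape_dots "$prop")"
  done
  printf '%s\n' "$out"
}

value="$(build_whitelist radoop.operation.id mapred.job.name spark.app.name)"

# Refuse to print a value containing whitespace, since the setting must not have any.
case "$value" in
  *[[:space:]]*) echo "value contains whitespace" >&2 ;;
  *) printf '%s\n' "$value" ;;
esac
```

Passing all seven property names from the step above, in the same order, should reproduce the full value.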
Set up the connection in RapidMiner Studio
Click the New Connection button and choose the Import from Cluster Manager option to create the connection directly from the configuration retrieved from Ambari.
On the Import Connection from Cluster Manager dialog, enter:
- Cluster Manager URL: http://sandbox-hdp.hortonworks.com:8080
- Username: admin
- Password: the password set in the Reset Ambari access step.
Click Import Configuration
The Hadoop Configuration Import dialog will open.
- If the import succeeds, click the Next button and the Connection Settings dialog will open.
- If it fails, click the Back button and review the steps above and the logs to resolve the issue(s).
Complete the following settings on the Connection Settings dialog, which opens when the Next button is clicked in the step above.
The Connection Name can keep its default value or be changed.
Global tab
- Hadoop Version should be Hortonworks HDP 3.x
- Set Hadoop username to hadoop.
Hadoop tab
- NameNode Address should be sandbox-hdp.hortonworks.com
- NameNode Port should be 8020
- Resource Manager Address should be sandbox-hdp.hortonworks.com
- Resource Manager Port should be 8050
- JobHistory Server Address should be sandbox-hdp.hortonworks.com
- JobHistory Server Port should be 10020
Under Advanced Hadoop Parameters, add the following parameters:
Key                               Value
dfs.client.use.datanode.hostname  true
(This parameter is not required when using the Import Hadoop Configuration Files option):
Key                      Value
mapreduce.map.java.opts  -Xmx256m
Spark tab
- For Spark Version, select Spark 2.3 (HDP)
- Check Use default Spark path
Hive tab
- Hive Version should be HiveServer3 (Hive 3 or newer)
- Hive High Availability should be checked
- ZooKeeper Quorum should be sandbox-hdp.hortonworks.com:2181
- ZooKeeper Namespace should be hiveserver2
- Database Name should be default
- JDBC URL Postfix should be empty
- Username should be hive
- Password should be empty
- UDFs are installed manually and Use custom database for UDFs should both be unchecked
- Hive on Spark/Tez container reuse should be checked
Click the OK button; the Connection Settings dialog will close.
To test the new connection, select it on the Manage Radoop Connections page and click the Quick Test and Full Test... buttons.
If errors occur during testing, confirm that the necessary components are started correctly at http://localhost:8080/#/main/hosts/sandbox-hdp.hortonworks.com/summary.
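When a test fails, it can also help to confirm from the host machine that the ports used in this guide accept connections at all. The following is a minimal sketch, assuming a host with bash (for its /dev/tcp redirection) and the coreutils timeout command. port_open and check_sandbox_ports are hypothetical helper names, and the port list simply collects the ports mentioned in this guide.

```shell
# Host name and ports taken from this guide; adjust if your setup differs.
SANDBOX_HOST="sandbox-hdp.hortonworks.com"
RADOOP_PORTS="2181 8020 8025 8030 8050 8080 10020 50010"

# Return 0 if a TCP connection to host ($1) and port ($2) can be opened
# within 3 seconds. The connection attempt runs in a child bash process,
# where $0 is the host and $1 is the port.
port_open() {
  timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$1" "$2" 2>/dev/null
}

# Print one status line per port used in this guide.
check_sandbox_ports() {
  local port
  for port in $RADOOP_PORTS; do
    if port_open "$SANDBOX_HOST" "$port"; then
      echo "open    $SANDBOX_HOST:$port"
    else
      echo "closed  $SANDBOX_HOST:$port"
    fi
  done
}

# Possible usage on the host (not executed here):
#   check_sandbox_ports
```

An open port only shows that something accepted the connection, which may be the forwarding layer rather than the service itself, so the Ambari summary page remains the place to confirm that the components are running. A closed port usually points back to the hosts file entry, the VirtualBox port forwarding rules, or the sandbox-proxy port list.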