- Fork and clone the repo
- Run
./setup.sh
- Run
./get_data.sh
If you encounter any problems during the setup or during the labs, check our troubleshooting guide.
- An IDE (IntelliJ is recommended)
- Run a bash shell in the container:
docker run -v "$(pwd)/data":/usr/local/data/ -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash
cd $HADOOP_PREFIX
- Execute the commands listed in the lab.
- Define SPARK_HOME:
export SPARK_HOME=$(pwd)/spark-2.3.1-bin-hadoop2.7
- Activate the virtual environment:
source .venv_data_eng_bootcamp/bin/activate
- To start Spark in the PySpark shell, run:
$SPARK_HOME/bin/pyspark --master local
- To start Spark in a Jupyter notebook, run:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook $SPARK_HOME/bin/pyspark --master local
- To deactivate the virtual environment, run:
deactivate
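
A common stumbling block with the Spark steps above is exporting SPARK_HOME from the wrong directory, so that `$SPARK_HOME/bin/pyspark` does not exist. The following is a minimal sketch of a pre-flight check, not part of the repo: the function name `check_spark_home` is hypothetical, and it only assumes that a valid Spark directory contains an executable `bin/pyspark`.

```shell
# Hypothetical helper (not provided by the repo): succeeds only if the
# given directory contains an executable bin/pyspark launcher.
check_spark_home() {
  dir="$1"
  # An empty argument means SPARK_HOME was never exported.
  if [ -z "$dir" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  # The pyspark launcher must exist and be executable.
  if [ ! -x "$dir/bin/pyspark" ]; then
    echo "No executable pyspark found under $dir/bin" >&2
    return 1
  fi
  return 0
}
```

One way to use it is to guard the launch command: `check_spark_home "$SPARK_HOME" && "$SPARK_HOME/bin/pyspark" --master local`. If the check fails, re-run the `export SPARK_HOME=...` step from the directory that contains the unpacked Spark distribution.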