How to use Kaggle API with Azure Machine Learning Service
Kaggle has been called an Airbnb for data scientists: it is where they spend their nights and weekends. It is a crowd-sourced platform to attract, nurture, train, and challenge data scientists from around the world to solve data science, machine learning, and predictive analytics problems (Usmani).
Azure Machine Learning is a cloud-based environment you can use to train, deploy, automate, manage, and track ML models (overview). Azure Machine Learning can be used for any kind of machine learning, from classical machine learning to deep learning, and from supervised to unsupervised learning. Whether you prefer to write Python or R code with the SDK or work with no-code/low-code options in the studio, you can build, train, and track machine learning and deep learning models in an Azure Machine Learning workspace.
So there is a real need for integration when working with a Kaggle dataset on Azure Machine Learning Service. I have seen it done by downloading the Kaggle data to a local machine and manually uploading it to the Azure Machine Learning workspace, and there is no end-to-end reference on the internet that covers the integration and its potential pitfalls. This post describes an easy way of integrating the two. You can take the codebase and improve it, but the raw version works.
Fundamentally, the basis of the integration is the Kaggle API, accessible through a command line tool implemented in Python 3 (kaggle-api). The rest of the post is a step-by-step guide to implementing the integration using Python notebooks in an Azure Machine Learning workspace.
Step1:
Install Kaggle.
Command: pip install kaggle
Step2:
Setup Directory Structure.
Command:
import os
data_folder = os.path.join(os.getcwd(), 'data')
# Create the data directory
os.makedirs(data_folder, exist_ok=True)
kaggle_folder = os.path.join(os.getcwd(), '.kaggle')
# Create the .kaggle staging directory
os.makedirs(kaggle_folder, exist_ok=True)
kaggle_key_folder = '/home/azureuser/.kaggle'
# Create the .kaggle key directory
os.makedirs(kaggle_key_folder, exist_ok=True)
Step3:
Generate a new API token from your Kaggle account page. Ref: https://github.com/Kaggle/kaggle-api#api-credentials
This generates and downloads kaggle.json to your local machine; the file contains your Kaggle username and the token.
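If you prefer not to upload the downloaded file by hand (as in Step 4 below), you can also write kaggle.json programmatically once you have the token. A minimal sketch, with a helper name and placeholder credentials of my own choosing that you would replace with the values from your downloaded kaggle.json:

```python
import json
import os

def write_kaggle_credentials(username, key, folder):
    """Write a kaggle.json credentials file into the given folder and return its path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, 'kaggle.json')
    with open(path, 'w') as f:
        # kaggle.json is a flat JSON object with a username and a key
        json.dump({"username": username, "key": key}, f)
    return path

# Example (placeholder values):
# write_kaggle_credentials('your-kaggle-username', 'your-api-token', '/home/azureuser/.kaggle')
```

Keep in mind the token is a secret: avoid committing the file or hard-coding real values in a notebook you intend to share.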
Step4:
Manually upload the kaggle.json generated by Kaggle (containing your username and key) into the .kaggle folder, then copy it to '/home/azureuser/.kaggle'. Once copied, remove the file you uploaded.
Command:
import shutil
import os
kaggle_file = os.path.join(kaggle_folder, 'kaggle.json')
shutil.copy(kaggle_file, kaggle_key_folder)
os.remove(kaggle_file)
Step5:
chmod the file so that no other user can read kaggle.json, then import the kaggle module.
Command:
!chmod 600 /home/azureuser/.kaggle/kaggle.json
import kaggle
!kaggle --version
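If you prefer to set the permissions from Python rather than the shell, os.chmod with stat flags is equivalent to chmod 600; a minimal sketch with a helper name of my own choosing:

```python
import os
import stat

def restrict_to_owner(path):
    """Equivalent of chmod 600: owner read/write only. Returns the resulting mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)

# Example:
# restrict_to_owner('/home/azureuser/.kaggle/kaggle.json')
```

The Kaggle CLI warns (or refuses to run) if the credentials file is readable by other users, so this step is not optional.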
Step6:
Test the API.
Command:
!kaggle --version
# List the competition files
!kaggle competitions files -c zillow-prize-1
Step7:
Download the competition zipped file to the current working directory.
Command:
!kaggle competitions download -c zillow-prize-1
Wait, what? You get a 403! You have to accept the competition's terms and conditions on the Kaggle website before you can download the data; once you do, the error goes away.
Step7a:
Accept the terms and conditions of the data usage at the Kaggle Website (if you wish to :)).
Step7b:
Download the competition zipped file to the current working directory.
Command:
!kaggle competitions download -c zillow-prize-1
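Before extracting in the next step, it can be worth checking that the download is actually a well-formed zip; a failed or partial download can leave a non-zip file (such as an HTML error page) on disk. A small sketch, with a helper name of my own choosing:

```python
import zipfile

def is_valid_zip(path):
    """Return True if the file at path is a well-formed zip archive."""
    try:
        return zipfile.is_zipfile(path)
    except OSError:
        # Missing file or unreadable path
        return False

# Example:
# is_valid_zip('zillow-prize-1.zip')
```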
Step8:
Extract the zipped file and list.
Command:
# Extract files into the data folder
import zipfile
with zipfile.ZipFile("zillow-prize-1.zip", "r") as zip_ref:
    zip_ref.extractall(data_folder)
# List the folder structure and the files
for root, directories, files in os.walk(data_folder, topdown=True):
    for name in files:
        print(os.path.join(root, name))
Lastly, you can read the file using pandas or any other Python library.
Command:
import pandas as pd
properties_file = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-cpu-ds12-v2/code/Users/rabiswas/HomeValuePrediction/data/properties_2017.csv'
df_properties = pd.read_csv(properties_file)
# It is easier to view the data if we transpose
df_properties.head(3).transpose()
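The properties file is large, and on a small compute instance a single read_csv call can be slow or memory-heavy. One common option is to read the CSV in chunks and, if you only need some columns, restrict to those. A minimal sketch, with a helper name of my own choosing (the column subset would come from the actual file):

```python
import pandas as pd

def read_csv_in_chunks(path, usecols=None, chunksize=100_000):
    """Read a large CSV in chunks and concatenate the pieces into one DataFrame,
    optionally keeping only a subset of columns."""
    chunks = pd.read_csv(path, usecols=usecols, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)

# Example (illustrative column names):
# df = read_csv_in_chunks(properties_file, usecols=['parcelid'])
```

Chunked reading keeps peak memory closer to one chunk at a time during parsing, which helps when the compute instance is modestly sized.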
You can import and just execute the notebook from my GitHub repo as well: AMLS_With_Kaggle/Get_Zillow_Properties_Data.ipynb at main · RajdeepBiswas/AMLS_With_Kaggle (github.com)
Hope this helps, please let me know if you have any questions.
References
kaggle-api. (n.d.). Retrieved from github.com: https://github.com/Kaggle/kaggle-api
overview. (n.d.). Retrieved from docs.microsoft.com: https://docs.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-ml
Usmani, Z.-u.-h. (n.d.). getting-started. Retrieved from www.kaggle.com: https://www.kaggle.com/getting-started/44916