How to use Kaggle API with Azure Machine Learning Service

Rajdeep Biswas
4 min read · Apr 28, 2021


Kaggle is an Airbnb for Data Scientists — this is where they spend their nights and weekends. It is a crowd-sourced platform to attract, nurture, train, and challenge data scientists from all around the world to solve data science, machine learning and predictive analytics problems (Usmani).

Azure Machine Learning is a cloud-based environment you can use to train, deploy, automate, manage, and track ML models (overview). It can be used for any kind of machine learning, from classical ML to deep learning, supervised or unsupervised. Whether you prefer to write Python or R code with the SDK or work with no-code/low-code options in the studio, you can build, train, and track machine learning and deep-learning models in an Azure Machine Learning workspace.

So there is a real need for an integration when you work with a Kaggle dataset on Azure Machine Learning Service. I have seen it done by downloading the Kaggle data to a local machine and manually uploading it to the Azure Machine Learning workspace. I could not find a reference on the internet describing an end-to-end way of integrating the two that addresses all the potential pitfalls. Hence this post walks through an easy way of integrating the two. You can take the codebase and make it better, but the raw version works.

Fundamentally, the integration is built on the Kaggle API, which is accessible through a command-line tool implemented in Python 3 (kaggle-api). The rest of the post is a step-by-step guide to implementing the integration using Python notebooks in an Azure Machine Learning workspace.

Step1:

Install Kaggle.

Command:

pip install kaggle
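If you rerun the notebook often, you may want to skip the install when the package is already present. Here is a minimal sketch of that idea; the helper names `is_installed` and `ensure_installed` are mine and not part of the Kaggle tooling, and it assumes the pip package and the importable module are both called `kaggle`.

```python
import importlib.util
import subprocess
import sys


def is_installed(package_name: str) -> bool:
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(package_name) is not None


def ensure_installed(package_name: str) -> bool:
    """Install the package with pip only when it is missing.

    Returns True if an install was run, False if the package was already there.
    """
    if is_installed(package_name):
        return False
    # Use the interpreter behind the notebook kernel so the package
    # lands in the same environment the notebook imports from.
    subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
    return True


# Usage in a notebook cell:
# ensure_installed("kaggle")
```

Checking with `find_spec` avoids importing `kaggle` before the credentials are in place.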

Step2:

Set up the directory structure.

Command:

import os

#Create the data directory
data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)

#Create the .kaggle upload directory in the working directory
kaggle_folder = os.path.join(os.getcwd(), '.kaggle')
os.makedirs(kaggle_folder, exist_ok=True)

#Create the .kaggle key directory
kaggle_key_folder = '/home/azureuser/.kaggle'
os.makedirs(kaggle_key_folder, exist_ok=True)
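The key folder above is hardcoded for the `azureuser` account. If your compute instance runs under a different user, a sketch like the following derives the location instead. It assumes the behaviour described in the kaggle-api README: the client looks in `~/.kaggle` by default, and the `KAGGLE_CONFIG_DIR` environment variable overrides that. The function names are hypothetical.

```python
import os
from pathlib import Path


def kaggle_config_dir() -> Path:
    """Return the folder where the Kaggle client is expected to find kaggle.json."""
    # KAGGLE_CONFIG_DIR, when set, takes priority over the default location.
    override = os.environ.get("KAGGLE_CONFIG_DIR")
    if override:
        return Path(override)
    return Path.home() / ".kaggle"


def make_project_dirs(base: Path) -> dict:
    """Create the data folder and the upload folder under base."""
    paths = {"data": base / "data", "upload": base / ".kaggle"}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Usage in a notebook cell:
# dirs = make_project_dirs(Path.cwd())
# kaggle_config_dir().mkdir(parents=True, exist_ok=True)
```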

Step3:

Generate a new API token from your Kaggle account page. Ref: https://github.com/Kaggle/kaggle-api#api-credentials

This generates kaggle.json and downloads it to your local machine. The file contains a single line with your Kaggle username and the API key.

Create New API Token
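Before moving the token around, it can help to confirm the file looks right. The sketch below assumes kaggle.json is a small JSON object with a `username` field and a `key` field, which matches the description above; `load_kaggle_token` is a hypothetical helper, not something the Kaggle package provides.

```python
import json


def load_kaggle_token(path):
    """Read kaggle.json and check that both expected fields are present."""
    with open(path, "r", encoding="utf-8") as handle:
        token = json.load(handle)
    # Collect any field that is absent or empty.
    missing = [field for field in ("username", "key") if not token.get(field)]
    if missing:
        raise ValueError(f"kaggle.json is missing: {', '.join(missing)}")
    return token


# Usage, once the file has been uploaded:
# token = load_kaggle_token(".kaggle/kaggle.json")
# print("Token belongs to", token["username"])
```

Avoid printing the `key` value in a notebook you plan to share.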

Step4:

Manually upload the kaggle.json generated by Kaggle, which contains the username and the key, to the .kaggle folder in your working directory, and then copy it to '/home/azureuser/.kaggle'. Once it is copied, remove the file you uploaded.

Command:

import os
import shutil

#Copy the uploaded token to the key directory, then delete the upload
kaggle_file = os.path.join(kaggle_folder, 'kaggle.json')
shutil.copy(kaggle_file, kaggle_key_folder)
os.remove(kaggle_file)
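The copy and remove can also be folded into a single move with an explicit check that the upload actually happened. This is only a sketch of that variation; `install_token` is a name I made up for illustration.

```python
import shutil
from pathlib import Path


def install_token(uploaded_file, key_folder):
    """Move kaggle.json from the upload folder into the key folder."""
    source = Path(uploaded_file)
    if not source.is_file():
        raise FileNotFoundError(f"Upload kaggle.json first: {source} not found")
    destination_dir = Path(key_folder)
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / source.name
    # shutil.move falls back to copy-then-delete when a plain rename
    # is not possible, for example across file systems.
    shutil.move(str(source), str(destination))
    return destination


# Usage with the folders from Step 2:
# install_token(os.path.join(kaggle_folder, "kaggle.json"), kaggle_key_folder)
```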

Step5:

Run chmod so that no other user can read kaggle.json, and then import the kaggle package.

Command:

!chmod 600 /home/azureuser/.kaggle/kaggle.json
import kaggle
!kaggle --version
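If you would rather stay in Python than shell out, the same permission change can be sketched with the standard library. This assumes a Linux compute instance; `restrict_to_owner` is a hypothetical helper name.

```python
import os
import stat


def restrict_to_owner(path):
    """Set mode 600: read and write for the owner, nothing for anyone else."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)


# Usage:
# mode = restrict_to_owner("/home/azureuser/.kaggle/kaggle.json")
# print(oct(mode))
```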

Step6:

Test the API.

Command:

!kaggle --version

#List the competition files

!kaggle competitions files -c zillow-prize-1

List files
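The `!` prefix only works inside a notebook. If you want the same call from a plain Python script, or want to capture the output for later checks, one option is to wrap the CLI with `subprocess`. This is a rough sketch; the two function names are mine, and the command list simply mirrors the CLI call shown above.

```python
import subprocess


def kaggle_files_command(competition):
    """Build the CLI call that lists the files of a competition."""
    return ["kaggle", "competitions", "files", "-c", competition]


def run_command(command):
    """Run a command and return (exit_code, stdout and stderr combined)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


# Usage, once the kaggle CLI is installed and the token is in place:
# code, output = run_command(kaggle_files_command("zillow-prize-1"))
# print(output)
```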

Step7:

Download the zipped competition file to the current working directory.

Command:

!kaggle competitions download -c zillow-prize-1

403 — Forbidden

Wait, what? You get a 403! You have to accept the competition's terms and conditions on the Kaggle website before you can download the data; once you do, the error should go away.
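If you capture the CLI output in Python, you can turn that bare status code into a friendlier hint. The sketch below only inspects text, so it makes no call to Kaggle itself. The 403 branch follows the error described above; the 401 branch is my own assumption about how an invalid token would show up, and `explain_download_failure` is a hypothetical name.

```python
def explain_download_failure(output):
    """Map captured CLI output to a short hint, or None when nothing matches."""
    if "403" in output:
        return ("403 Forbidden: accept the competition terms on the Kaggle "
                "website, then run the download again.")
    if "401" in output:
        # Assumption: a missing or invalid token is reported as 401.
        return "401 Unauthorized: check the username and key in kaggle.json."
    return None


# Usage with output captured from the download command:
# hint = explain_download_failure(captured_output)
# if hint:
#     print(hint)
```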

Step7a:

Accept the terms and conditions of the data usage at the Kaggle Website (if you wish to :)).

Accept the terms and conditions if you want to proceed.

Step7b:

Download the zipped competition file to the current working directory again.

Command:

!kaggle competitions download -c zillow-prize-1

Extract zipped file

Step8:

Extract the zipped file and list.

Command:

#Extract files in the data folder
import zipfile

with zipfile.ZipFile("zillow-prize-1.zip", "r") as zip_ref:
    zip_ref.extractall(data_folder)

#List the folder structure and the files
for root, directories, files in os.walk(data_folder, topdown=True):
    for name in files:
        print(os.path.join(root, name))

Extract and List
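The extract and list steps can be wrapped in one function that also checks the download really is a zip archive, which catches the case where an earlier step failed quietly. A minimal sketch, with `extract_archive` as a made-up helper name:

```python
import os
import zipfile


def extract_archive(zip_path, target_folder):
    """Extract a zip archive and return the sorted paths of the extracted files."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a zip archive (did the download fail?)")
    os.makedirs(target_folder, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_folder)
    # Walk the target folder to collect every file, including nested ones.
    extracted = []
    for root, _directories, files in os.walk(target_folder):
        for name in files:
            extracted.append(os.path.join(root, name))
    return sorted(extracted)


# Usage with the names from the steps above:
# for path in extract_archive("zillow-prize-1.zip", data_folder):
#     print(path)
```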

Lastly you can read the file using Pandas or any other libraries in Python.

Command:

import pandas as pd

#Path on the author's compute instance; adjust it for your own workspace
properties_file = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-cpu-ds12-v2/code/Users/rabiswas/HomeValuePrediction/data/properties_2017.csv'
df_properties = pd.read_csv(properties_file)

#It is easier to view the data if we transpose
df_properties.head(3).transpose()

Read File
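If you only want a quick look at the columns before loading a whole file, the standard `csv` module is one of those other libraries that does the job. A small sketch, where `preview_csv` is a hypothetical helper and the path in the usage comment reuses `data_folder` from Step 2:

```python
import csv
from itertools import islice


def preview_csv(path, rows=3):
    """Return the header and the first few rows without reading the whole file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        first_rows = list(islice(reader, rows))
    return header, first_rows


# Usage:
# header, sample = preview_csv(os.path.join(data_folder, "properties_2017.csv"))
# print(len(header), "columns")
```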

You can import and just execute the notebook from my GitHub repo as well: AMLS_With_Kaggle/Get_Zillow_Properties_Data.ipynb at main · RajdeepBiswas/AMLS_With_Kaggle (github.com)

Hope this helps. Please let me know if you have any questions.

References

kaggle-api. (n.d.). Retrieved from github.com: https://github.com/Kaggle/kaggle-api

overview. (n.d.). Retrieved from docs.microsoft.com: https://docs.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-ml

Usmani, Z.-u.-h. (n.d.). getting-started. Retrieved from www.kaggle.com: https://www.kaggle.com/getting-started/44916
