This blog was written on Microsoft.com by Brad Severston, Gopi Kumar, Paul Shealy, and C.J. Gronlund. To read the original post, click here.
The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server and on Linux. We offer the Windows edition of the DSVM on Windows Server 2016 and Windows Server 2012. We offer the Linux edition of the DSVM on Ubuntu 16.04 LTS and on the OpenLogic 7.2 CentOS-based Linux distribution.
This topic discusses what you can do with the Data Science VM, outlines some of the key scenarios for using the VM, itemizes the key features available on the Windows and Linux versions, and provides instructions on how to get started using them.
What can I do with the Data Science Virtual Machine?
The goal of the Data Science Virtual Machine is to provide data professionals at all skill levels and roles with a friction-free data science environment. This VM saves you the considerable time you would otherwise spend rolling out a comparable environment on your own. Instead, you can start your data science project immediately in a newly created VM instance.
The Data Science VM is designed and configured for working with a broad range of usage scenarios. You can scale your environment up or down as your project needs change. You are able to use your preferred language to program data science tasks. You can install other tools and customize the system for your exact needs.
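Scaling a VM up or down can be scripted. The following is a minimal sketch, assuming the Azure CLI (`az`) is installed and signed in. The resource group name, VM name, and target size are hypothetical placeholders, and `build_resize_command` is an illustrative helper, not part of any Azure SDK. It only builds the `az vm resize` command so that you can review it before running it.

```python
import shlex
from typing import List


def build_resize_command(resource_group: str, vm_name: str, size: str) -> List[str]:
    """Build an `az vm resize` command for the given VM (does not run it)."""
    if not (resource_group and vm_name and size):
        raise ValueError("resource_group, vm_name, and size are all required")
    return [
        "az", "vm", "resize",
        "--resource-group", resource_group,
        "--name", vm_name,
        "--size", size,
    ]


# Hypothetical names: replace them with your own resource group, VM, and size.
command = build_resize_command("my-dsvm-rg", "my-dsvm", "Standard_DS3_v2")
print(shlex.join(command))
```

Running the printed command in a shell, or passing the list to `subprocess.run`, would request the resize. Which sizes are available depends on your region and subscription.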
This section suggests some key scenarios for which the Data Science VM can be deployed.
Preconfigured analytics desktop in the cloud
The Data Science VM provides a baseline configuration for data science teams looking to replace their local desktops with a managed cloud desktop. This baseline ensures that all the data scientists on a team have a consistent setup with which to verify experiments and promote collaboration. It also lowers costs by reducing the sysadmin burden and saving on the time needed to evaluate, install, and maintain the various software packages needed to do advanced analytics.
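One way to check that two team members' VMs really share the same baseline is to compare installed Python package versions. The sketch below is illustrative only: `snapshot_packages` and `diff_snapshots` are hypothetical helper names, the sample versions are made up, and the code assumes Python 3.8 or later (for `importlib.metadata`), which may be newer than the Python versions shipped on the VM image.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def snapshot_packages() -> Dict[str, str]:
    """Map each installed distribution name (lowercased) to its version."""
    snapshot = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # Skip distributions with broken or missing metadata.
            snapshot[name.lower()] = dist.version
    return snapshot


def diff_snapshots(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return packages whose versions differ or that exist on only one side."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up sample snapshots standing in for two VMs.
vm_a = {"numpy": "1.13.3", "pandas": "0.20.3"}
vm_b = {"numpy": "1.13.3", "pandas": "0.21.0", "scikit-learn": "0.19.1"}
print(diff_snapshots(vm_a, vm_b))
```

In practice, each VM would save the result of `snapshot_packages()` (for example as JSON), and the saved snapshots would be compared with `diff_snapshots`.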
Data science training and education
Enterprise trainers and educators who teach data science classes usually provide a virtual machine image to ensure that their students have a consistent setup and that the samples work predictably. The Data Science VM creates an on-demand environment with a consistent setup that eases support and incompatibility challenges. Cases where these environments need to be built frequently, especially for shorter training classes, benefit substantially.
On-demand elastic capacity for large-scale projects
Data science hackathons, competitions, and large-scale data modeling and exploration require scaled-out hardware capacity, typically for a short duration. The Data Science VM can help replicate the data science environment quickly, on demand, on scaled-out servers, so that experiments requiring high-powered computing resources can be run.
Short-term experimentation and evaluation
The Data Science VM can be used to evaluate or learn tools such as Microsoft ML Server, SQL Server, Visual Studio tools, Jupyter, deep learning / ML toolkits, and new tools popular in the community with minimal setup effort. Because the Data Science VM can be set up quickly, it also suits other short-term scenarios such as replicating published experiments, running demos, and following walkthroughs in online sessions or conference tutorials.
Deep learning
The Data Science VM can be used for training models using deep learning algorithms on GPU-based (graphics processing unit) hardware. By using the VM scaling capabilities of the Azure cloud, the DSVM lets you use GPU-based hardware in the cloud as needed. You can switch to a GPU-based VM when you are training large models or need high-speed computation, while keeping the same OS disk. The Windows Server 2016 edition of the DSVM comes pre-installed with GPU drivers, frameworks, and GPU versions of the deep learning algorithms. On Linux, deep learning on GPU is enabled only on the Data Science Virtual Machine for Linux (Ubuntu) edition. You can also deploy the Ubuntu or Windows Server 2016 edition of the Data Science VM to a non-GPU-based Azure virtual machine, in which case all the deep learning frameworks fall back to CPU mode. Earlier, we published a deep learning toolkit for Windows Server 2012, but we now recommend Windows Server 2016 for Windows-based deep learning workloads. The CentOS-based Linux edition of the DSVM contains only the CPU builds of some of the deep learning tools (Microsoft Cognitive Toolkit, TensorFlow, MXNet) and does not come preinstalled with the GPU drivers and frameworks.
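Because the same image can run on GPU and non-GPU VM sizes, a script sometimes needs to know which mode it is in. The following is a minimal sketch, assuming that the NVIDIA driver's `nvidia-smi` utility is on the `PATH` of GPU-enabled VMs. It is a heuristic, not the mechanism the deep learning frameworks themselves use to detect GPUs, and `detect_compute_mode` is a hypothetical helper name.

```python
import shutil
import subprocess


def detect_compute_mode(timeout_seconds: float = 10.0) -> str:
    """Return "gpu" if `nvidia-smi -L` lists at least one GPU, else "cpu"."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return "cpu"  # No NVIDIA tooling found; assume a CPU-only VM.
    try:
        result = subprocess.run(
            [executable, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"  # The tool exists but could not be run in time.
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    lists_gpu = any(line.startswith("GPU ") for line in result.stdout.splitlines())
    return "gpu" if result.returncode == 0 and lists_gpu else "cpu"


print(detect_compute_mode())
```

A training script could use the returned value to pick a batch size or to log a warning when a large model is about to train in CPU mode.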
What’s included in the Data Science VM?
The Data Science Virtual Machine has many popular data science and deep learning tools already installed and configured. It also includes tools that make it easy to work with various Azure data and analytics products. You can explore and build predictive models on large-scale data sets using the Microsoft ML Server (R, Python) or using SQL Server 2017. A host of other tools from the open source community and from Microsoft are also included, as well as sample code and notebooks. The following table itemizes and compares the main components included in the Windows and Linux editions of the Data Science Virtual Machine.
| Tool | Windows Edition | Linux Edition |
|---|---|---|
| Microsoft R Open with popular packages pre-installed | Y | Y |
| Microsoft ML Server (R, Python) Developer Edition, which includes: RevoScaleR/revoscalepy parallel and distributed high-performance framework (R & Python); MicrosoftML, new state-of-the-art ML algorithms from Microsoft; R and Python Operationalization | | |
| Microsoft Office Pro-Plus with shared activation (Excel, Word, and PowerPoint) | Y | N |
| Anaconda Python 2.7, 3.5 with popular packages pre-installed | Y | Y |
| JuliaPro with popular packages for Julia language pre-installed | Y | Y |
| Relational Databases | SQL Server 2017 | |
| Database tools | SQL Server Management Studio; SQL Server Integration Services; ODBC/JDBC drivers | SQuirreL SQL (querying tool); bcp, sqlcmd; ODBC/JDBC drivers |
| Scalable in-database analytics with SQL Server ML services (R, Python) | Y | N |
| Jupyter Notebook Server with the following kernels: | Y | Y |
| * Python 2.7 & 3.5 | Y | Y |
| * Sparkmagic | N | Y (Ubuntu only) |
| JupyterHub (multi-user notebook server) | N | Y |
| Development tools, IDEs and code editors | | |
| * Visual Studio 2017 (Community Edition) with Git Plugin, Azure HDInsight (Hadoop), Data Lake, SQL Server Data tools, Node.js, Python, and R Tools for Visual Studio (RTVS) | Y | N |
| * Visual Studio Code | Y | Y |
| * Juno (Julia IDE) | Y | Y |
| * Vim and Emacs | Y | Y |
| * Git and GitBash | Y | Y |
| * .Net Framework | Y | N |
| SDKs to access Azure and Cortana Intelligence Suite of services | Y | Y |
| Data movement and management tools | | |
| * Azure Storage Explorer | Y | Y |
| * Azure Powershell | Y | N |
| * Adlcopy (Azure Data Lake Storage) | Y | N |
| * DocDB Data Migration Tool | Y | N |
| * Microsoft Data Management Gateway: move data between OnPrem and Cloud | Y | N |
| * Unix/Linux Command-Line Utilities | Y | Y |
| Apache Drill for data exploration | Y | Y |
| Machine Learning Tools | | |
| * Integration with Azure Machine Learning (R, Python) | Y | Y |
| * LightGBM | N | Y (Ubuntu only) |
| * H2O | N | Y (Ubuntu only) |
| GPU-based Deep Learning Tools | Windows Server 2016 edition | Ubuntu edition |
| * Microsoft Cognitive Toolkit (formerly known as CNTK) | Y | Y |
| * Caffe & Caffe2 | N | Y |
| * CUDA, CUDNN, Nvidia Driver | Y | Y |
| Big Data Platform (Devtest only) | | |
| * Local Hadoop (HDFS, YARN) | N | Y |
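To confirm which of the command-line tools from the table are actually reachable on a given VM, a quick check of the `PATH` is often enough. The sketch below is an illustration under assumptions: the executable names (`python`, `jupyter`, `git`, `julia`, `vim`, `emacs`) are the conventional ones and may differ on a particular image, and `find_tools` is a hypothetical helper name.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Conventional executable names (assumed, not taken from the VM image).
tools = find_tools(["python", "jupyter", "git", "julia", "vim", "emacs"])
for name, path in tools.items():
    print("{:<8} {}".format(name, path if path else "not found"))
```

GUI applications and services such as SQL Server will not show up in a `PATH` check, so a missing entry here does not by itself mean a tool is absent from the image.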
Getting started with the Windows Data Science VM
- Create an instance of the desired Windows DSVM edition by navigating to one of the following:
  - Windows Server 2016 based DSVM
  - Windows Server 2012 based DSVM
- Click the GET IT NOW button.
- Sign in to the VM from your remote desktop using the credentials you specified when you created the VM.
- To discover and launch the tools available, click the Start menu.
For the Windows Data Science VM
- For more information on how to run specific tools available on the Windows version, see Provision the Microsoft Data Science Virtual Machine.
- For more information on how to perform various tasks needed for your data science project on the Windows VM, see Ten things you can do on the Data science Virtual Machine.
For the Linux Data Science VM
- For more information on how to run specific tools available on the Linux version, see Provision the Linux Data Science Virtual Machine.
- For a walkthrough that shows you how to perform several common data science tasks with the Linux VM, see Data science on the Linux Data Science Virtual Machine.