If you are a Data Scientist, Data Analyst or just an enthusiast, you should not miss some extremely popular and useful libraries for Python.
In this article, 15 Python libraries are listed and briefly introduced. I believe you are already familiar with most of them, but if not, I highly recommend checking them out yourself.
These libraries are grouped into several categories, namely
- Data Gathering
- Data Cleansing and Transformation
- Data Visualisation
- Data Modelling
- Audio and Image Recognition
Data Gathering
Most Data Analytics projects start with data gathering and extraction. Sometimes the dataset is given to you, for example when you work for a company to solve an existing problem. However, the data might not be ready-made, and you may need to collect it yourself. The most common scenario is that you need to crawl the data from the Internet.
1. Scrapy
Scrapy is probably the most popular Python library for writing a crawler that extracts information from websites. For example, you could use it to extract all the reviews for all the restaurants in a city, or to collect all the comments for a certain category of products on an e-commerce website.
The typical usage is to identify the patterns in which the information of interest appears on web pages, both in terms of URL patterns and XPath patterns. Once these patterns are figured out, Scrapy can automatically extract all the needed information and organise it into a structured format such as a table or JSON.
You can easily install Scrapy using
pip install scrapy
2. Beautiful Soup
Beautiful Soup is another Python library for scraping Web content. It is generally accepted that it has a gentler learning curve compared with Scrapy.
Beautiful Soup is also a better choice for smaller-scale problems and/or one-off jobs. Unlike Scrapy, where you have to develop your own “spider” and go back to the command line to run it, Beautiful Soup allows you to import its functions and use them in-line. Therefore, you can even use it in your Jupyter notebooks.
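As a rough illustration of this in-line style, the sketch below parses a small HTML string directly, so it can be pasted into a notebook cell. The HTML snippet and its class names are made up for the example; in practice the HTML would come from a downloaded page.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page.
html = """
<html><body>
  <div class="review"><span class="author">Alice</span><p>Great noodles.</p></div>
  <div class="review"><span class="author">Bob</span><p>Too salty.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Turn every review block into a small dictionary.
reviews = [
    {
        "author": div.find("span", class_="author").get_text(strip=True),
        "text": div.find("p").get_text(strip=True),
    }
    for div in soup.find_all("div", class_="review")
]

print(reviews)
```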
3. Selenium
Selenium was originally developed as an automated Web testing framework. However, developers found that it is also quite convenient to use as a Web scraper.
It is important to note that Selenium is much slower than the dedicated scraping libraries. This is because it actually launches a web browser such as Chrome and then simulates all the actions defined in the code.
Therefore, when the data can be reached through URL patterns and XPaths, do use Scrapy or Beautiful Soup. Only choose Selenium if you have to.
Data Cleansing and Transformation
There is no need to argue how important data cleansing and transformation are in data analytics and data science. There are also many outstanding Python libraries that do these well. I’ll pick some that you must know as a Data Scientist or Analyst.
4. Pandas
Listing Pandas here is almost unnecessary. As long as you work with data in Python, you have most likely used Pandas.
With Pandas, you can manipulate data in a Pandas DataFrame. There are a huge number of built-in functions that help you transform your data.
It doesn’t need many words. If you want to learn Python, this is a must-learn library.
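For readers who have not used it yet, here is a small sketch of a typical cleansing and transformation chain. The column names and values are invented for the example.

```python
import pandas as pd

# Invented raw data with typical problems: inconsistent names and missing values.
raw = pd.DataFrame(
    {
        "restaurant": ["Alpha", "alpha ", "Beta", "Beta", None],
        "rating": [4.0, 5.0, 3.0, None, 2.0],
    }
)

clean = (
    raw.dropna(subset=["restaurant"])  # drop rows without a restaurant name
    .assign(restaurant=lambda d: d["restaurant"].str.strip().str.title())  # normalise names
    .assign(rating=lambda d: d["rating"].fillna(d["rating"].mean()))  # fill missing ratings
)

# Aggregate: average rating per restaurant.
summary = clean.groupby("restaurant", as_index=False)["rating"].mean()
print(summary)
```

Filling missing ratings with the mean is only one possible choice; dropping those rows or using the median may suit other datasets better.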
5. Numpy
Similarly, Numpy is another must-learn library for Python users, and not only for Data Scientists and Analysts.
It extends the idea of Python lists into comprehensive multi-dimensional arrays. There is also a huge number of built-in mathematical functions that cover almost all your calculation needs. Typically, you can use Numpy arrays as matrices, and Numpy will allow you to perform matrix calculations.
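A short sketch of what using arrays as matrices looks like in practice. The numbers are arbitrary and only serve as an illustration.

```python
import numpy as np

# A 2x2 matrix and a vector, both stored as Numpy arrays.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

product = A @ A.T             # matrix multiplication
x = np.linalg.solve(A, b)     # solve the linear system A x = b
elementwise = np.sqrt(A) * 2  # built-in functions apply to every element

print(product)
print(x)
print(elementwise)
```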
I believe many Data Scientists start their Python scripts as follows
import numpy as np
import pandas as pd
So these two libraries are probably the most popular ones in the Python community.
6. Spacy
Spacy is probably not as famous as the previous ones. While Numpy and Pandas deal with numeric and structured data, Spacy helps us convert free text into structured data.
Spacy is one of the most popular NLP (Natural Language Processing) libraries for Python. Imagine that you have scraped a lot of product reviews from an e-commerce website: you have to extract useful information from this free text before you can analyse it. Spacy has numerous built-in features to assist, such as a word tokeniser, named entity recognition, and part-of-speech tagging.
Spacy also supports many different human languages. Its official site claims support for more than 55 languages.
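Below is a minimal sketch of turning free text into structured rows. It uses a blank English pipeline, which only tokenises; named entity recognition and part-of-speech tagging would additionally need a trained pipeline such as `en_core_web_sm` to be downloaded. The review sentence is invented.

```python
import spacy

# A blank pipeline contains only the tokeniser, so no model download is needed.
nlp = spacy.blank("en")

doc = nlp("The pasta was great but the service was slow.")

# One structured row per token.
rows = [
    {"text": token.text, "is_alpha": token.is_alpha, "is_punct": token.is_punct}
    for token in doc
]

tokens = [row["text"] for row in rows]
print(tokens)
```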
Data Visualisation
Data Visualisation is an essential need in Data Analytics. We need to visualise the results and outcomes and tell the data story that we have found.
7. Matplotlib
I have written another article to introduce Matplotlib. Check it out if you want to read more about it.
An Introduction to Python Matplotlib with 40 Basic Examples (levelup.gitconnected.com)
8. Plotly
Honestly, although I believe Matplotlib is a must-learn library for visualisation, most of the time I prefer to use Plotly because it enables us to create the fanciest graphs in the fewest lines of code.
Whether you want to build a 3D surface plot, a map-based scatter plot or an interactive animated plot, Plotly can fulfil the requirements in a short time.
It also provides Chart Studio, where you can upload your visualisations to an online repository that supports further editing and persistence.
Data Modelling
When data analytics comes to modelling, we usually refer to it as Advanced Analytics. Nowadays, machine learning is no longer a novel concept, and Python is considered the most popular language for it. Of course, there are a lot of outstanding libraries supporting this.
9. Scikit Learn
Before you dive into “deep learning”, Scikit Learn should be the Python library with which you start your path in machine learning.
Scikit Learn has 6 major modules that do
- Data Pre-Processing
- Dimensionality Reduction
- Classification
- Regression
- Clustering
- Model Selection
I’m sure that a Data Scientist who has mastered Scikit Learn can already be considered a good Data Scientist.
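To show how several of these modules fit together, here is a minimal sketch that chains pre-processing, dimensionality reduction and a classifier, then evaluates the whole chain with cross-validation. The choice of the bundled iris dataset, two principal components and logistic regression is just an assumption for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset that ships with Scikit Learn.
X, y = load_iris(return_X_y=True)

pipeline = make_pipeline(
    StandardScaler(),                    # data pre-processing
    PCA(n_components=2),                 # dimensionality reduction
    LogisticRegression(max_iter=1000),   # classification
)

# Model selection: 5-fold cross-validated accuracy of the whole pipeline.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Wrapping the steps in one pipeline means the scaler and PCA are fitted only on the training folds, which avoids leaking information from the validation fold.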
10. PyTorch
PyTorch was developed by Facebook and open-sourced as a machine learning framework for Python.
Compared to Tensorflow, PyTorch is more “pythonic” in terms of its syntax, which also makes PyTorch a bit easier to learn and start using.
Finally, as a deep-learning-focused library, PyTorch has a very rich API and many built-in functions that help Data Scientists quickly train their deep learning models.
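The "pythonic" feel is easiest to see in a training loop, which is plain Python. The sketch below fits a single linear layer to synthetic data generated from y = 2x + 1; the data, learning rate and number of steps are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic data following y = 2x + 1.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * x + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(200):
    optimizer.zero_grad()        # reset gradients from the previous step
    loss = loss_fn(model(x), y)  # forward pass and loss
    loss.backward()              # back-propagation
    optimizer.step()             # parameter update
    losses.append(loss.item())

print(losses[0], losses[-1])
```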
11. Tensorflow
Tensorflow is another machine learning library for Python, open-sourced by Google.
One of the most popular features of Tensorflow is the Data Flow Graphs on Tensorboard. The latter is an automatically generated Web-based dashboard that visualises the machine learning flows and outcomes, which is extremely helpful for debugging and presentation purposes.
Audio and Image Recognition
Machine learning is not only about numbers; it can also help with audio and images (videos are treated as a series of image frames). When we deal with such multimedia data, the machine learning libraries above will not be enough. Here are some popular audio and image recognition libraries for Python.
12. Librosa
Librosa is a very powerful Python library for audio and voice processing. It can be used to extract various kinds of features from audio segments, such as the rhythm, beats and tempo.
With Librosa, even extremely complicated algorithms such as Laplacian segmentation can be implemented in a few lines of code.
13. OpenCV
OpenCV is the most widely used library for image and video recognition. It is no exaggeration to say that OpenCV enables Python to replace Matlab in terms of image and video recognition.
It provides various APIs, supports not only Python but also Java and Matlab, and offers outstanding performance, which has earned it much appreciation in both industry and academic research.
Don’t forget that Python was commonly used in Web Development before it became popular in the data science area. So there are also a lot of excellent libraries for web development.
14. Django
If you want to use Python to develop a Web service backend, Django is often the best choice. It is designed as a high-level framework that can build a website in very few lines of code.
It directly supports most of the popular databases, which saves you time on setting up connections and developing the data model. Because Django is a database-driven framework, you can focus on the business logic and never worry about CRUD operations.
15. Flask
Flask is a light-weight Web development framework in Python. Its most valuable feature is that it is easy and flexible to customise for any specific requirement.
Because Flask is so light-weight, a lot of other famous Python libraries and tools that provide a Web UI are built on it, such as Plotly Dash and Airflow.
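To give a feeling for how little code a Flask service needs, here is a minimal sketch with two JSON endpoints. The routes `/health` and `/square/<n>` are made up for the example.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/health")
def health():
    # Simple liveness endpoint.
    return jsonify(status="ok")


@app.route("/square/<int:n>")
def square(n):
    # The <int:n> converter parses the URL segment into an integer.
    return jsonify(n=n, square=n * n)
```

If the code is saved as `app.py` (an assumed file name), `flask --app app run` should start a local development server.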
Indeed, there are more prominent Python libraries that deserve to be listed here. It is always exciting to see Python’s community thriving like this. If more libraries become must-knows for Data Scientists and Analysts, it may be necessary to cover them in another article.
Life is short, so I love Python!
Written by Christopher Tao, Principal Consultant@AtoBI Australia
Follow Christopher on LinkedIn Here