Data Science & Python: Part 1: An Overview of Our Toolbox (unfinished)


Genesis 1:29-31 (NIV)

“Then God said, ‘I give you every seed-bearing plant on the face of the whole earth and every tree that has fruit with seed in it. They will be yours for food.

And to all the beasts of the earth and all the birds in the sky and all the creatures that move along the ground–everything that has the breath of life in it–I give every green plant for food.’

And it was so.

God saw all that He had made, and it was very good. And there was evening, and there was morning–the sixth day.”


The Tools Data Scientists Utilize


This multi-part post will focus on providing an overview of the Python portion of these tools: pandas, NumPy, Matplotlib, seaborn, scikit-learn, Keras, TensorFlow, PyTorch, spaCy, NLTK, Hugging Face Transformers, and OpenCV.


But first, here is an overview of the Data Life Cycle. A list of the tools follows it.


Five Stages of the Data Life Cycle


The UC Berkeley School of Information provides this information about their five-stage approach to the Data Life Cycle:

  • Stage 1: Capture
    • Data acquisition, data entry, signal reception, data extraction
  • Stage 2: Maintain
    • Data warehousing, data cleansing, data staging, data processing, data architecture
  • Stage 3: Process
    • Data mining, clustering/classification, data modeling, data summarization
  • Stage 4: Analyze
    • Exploratory/confirmatory, predictive analysis, regression, text mining, qualitative analysis
  • Stage 5: Communicate
    • Data reporting, data visualization, business intelligence, decision making

Data Science Tools In The Python Toolbox


(1) Pandas


“Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures & data analysis tools” for Python. Pandas helps data practitioners “explore, clean, and process.”

Pandas is built on top of NumPy & integrates well with other third-party libraries in the scientific computing community.

The creators believe pandas is helpful for data scientists in the following stages: loading & cleaning data, analyzing/modeling it, and organizing the results of the analysis into a plot or tabular display.

Pandas is also a dependency of statsmodels & has been utilized extensively in the production of financial applications.


Pandas can be installed with Anaconda, Miniconda, or pip (from PyPI).


Pandas is well suited to tabular data stored in spreadsheets & databases. It has two basic data structures:

(1) Series: a one-dimensional labeled array holding data of any type (integers, strings, Python objects).

(2) DataFrame: a two-dimensional data structure that holds data like a two-dimensional array OR a table with rows & columns.
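
Here is a minimal sketch of both structures. The names & values (temperatures, people, and so on) are made up for illustration & are not from the pandas documentation.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (the labels here are invented).
temperatures = pd.Series(
    [21.5, 23.0, 19.8], index=["mon", "tue", "wed"], name="temp_c"
)

# A DataFrame: two-dimensional, with rows & named columns.
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy"],
        "age": [34, 28, 45],
    }
)

print(temperatures["tue"])   # look up a value by its label
print(people.shape)          # (number of rows, number of columns)
print(type(people["age"]))   # a single column of a DataFrame is a Series
```

One way to think about it: a DataFrame behaves like a collection of Series that share the same row labels.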


Pandas reads & writes many file formats (CSV, Excel, SQL, JSON, Parquet) through two families of functions:

Importing data = read_* 

Storing data = to_*
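
As a small sketch of the read_* / to_* pattern, the example below reads CSV text & writes it back out. It uses an in-memory string instead of a real file so it can run anywhere; the city data is invented for illustration.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (made-up data).
csv_text = "city,population\nOslo,700000\nBergen,290000\n"

# read_csv accepts a file path or any file-like object.
cities = pd.read_csv(io.StringIO(csv_text))

# to_csv returns the CSV as a string when no path is given.
round_trip = cities.to_csv(index=False)

# The same table can be stored in another format, here JSON.
as_json = cities.to_json(orient="records")

print(cities)
print(as_json)
```

With a real file you would pass a path instead, for example pd.read_csv("cities.csv") & cities.to_csv("cities_clean.csv", index=False), where both filenames are hypothetical.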

Pandas also allows you to plot your data out of the box, using Matplotlib behind the scenes, through its plotting methods:

.plot.*  
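
A rough sketch of one of these methods, .plot.bar(), is below. It assumes Matplotlib is installed alongside pandas. The sales figures are made up, & the "Agg" backend is chosen only so the example runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened

import pandas as pd

# Made-up monthly sales data.
sales = pd.DataFrame(
    {"month": ["Jan", "Feb", "Mar"], "units": [120, 135, 160]}
)

# Each .plot.* method returns a Matplotlib Axes that can be customized further.
ax = sales.plot.bar(x="month", y="units", legend=False)
ax.set_ylabel("units sold")

# Render the figure as PNG bytes in memory; pass a filename to save to disk instead.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```

Other members of the same family include .plot.line(), .plot.scatter(), .plot.hist(), & .plot.box().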

Pandas easily calculates basic statistics such as mean, median, min, max, & counts.

Custom aggregations can be applied: on the entire data set, on a sliding window, or grouped into categories (split/apply/combine approach).
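
The three kinds of aggregation mentioned above can be sketched like this. The orders table & its column names are assumptions made for the example.

```python
import pandas as pd

# Made-up order data.
orders = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Entire data set: basic statistics on one column.
overall_mean = orders["amount"].mean()
overall_median = orders["amount"].median()

# Sliding window: mean of each value & the one before it.
# The first entry is NaN because it has no previous value.
rolling_mean = orders["amount"].rolling(window=2).mean()

# Split/apply/combine: split rows by region, aggregate each group,
# & combine the results into one table.
per_region = orders.groupby("region")["amount"].agg(["mean", "max", "count"])

print(overall_mean, overall_median)
print(rolling_mean)
print(per_region)
```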

The data tables can be reshaped with commands like:

melt() = to change your data table from wide to long/tidy form

pivot() = to change your data table from long to wide format. 
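
A small sketch of a round trip between the two shapes follows. The student scores are invented for illustration.

```python
import pandas as pd

# Wide format: one row per student, one column per subject (made-up scores).
wide = pd.DataFrame(
    {
        "student": ["Ana", "Ben"],
        "math": [90, 75],
        "art": [85, 95],
    }
)

# Wide -> long: one row per (student, subject) pair.
long = wide.melt(id_vars="student", var_name="subject", value_name="score")

# Long -> wide: the subjects become columns again.
wide_again = long.pivot(index="student", columns="subject", values="score")

print(long)
print(wide_again)
```

Note that pivot() expects each index/column pair to appear only once; when there are duplicates, pivot_table() with an aggregation function is the usual alternative.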

Pandas can handle time series data, & it can also clean & extract information from textual data.
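
Both abilities can be sketched on one small table. The sensor log below is made up, & the cleaning steps are just one possible approach.

```python
import pandas as pd

# Made-up sensor log with messy text & dates stored as strings.
log = pd.DataFrame(
    {
        "timestamp": ["2024-01-01", "2024-01-02", "2024-01-08"],
        "note": ["  Sensor A: OK ", "sensor b: FAIL", "Sensor A: ok"],
        "reading": [1.0, 3.0, 5.0],
    }
)

# Time series: parse the strings into datetimes, then use the .dt accessor.
log["timestamp"] = pd.to_datetime(log["timestamp"])
log["weekday"] = log["timestamp"].dt.day_name()

# Text: the .str accessor applies string methods to every row.
log["note"] = log["note"].str.strip().str.lower()
log["status"] = log["note"].str.split(": ").str[1]

# Time series: average the readings per calendar week (weeks ending on Sunday).
weekly = log.set_index("timestamp")["reading"].resample("W").mean()

print(log)
print(weekly)
```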

Data does not need to be labeled to be placed into a pandas data structure.


SQL commands like SELECT, GROUP BY, & JOIN have equivalents in pandas.
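
Here is a rough side-by-side sketch, with the SQL written as comments above one pandas equivalent. The employees & departments tables are invented for illustration, & there is often more than one way to express the same query in pandas.

```python
import pandas as pd

# Made-up tables.
employees = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy", "Dee"],
        "dept_id": [1, 2, 1, 2],
        "salary": [50, 60, 70, 80],
    }
)
departments = pd.DataFrame({"dept_id": [1, 2], "dept": ["data", "ops"]})

# SELECT name, salary FROM employees WHERE salary > 55;
selected = employees.loc[employees["salary"] > 55, ["name", "salary"]]

# SELECT dept_id, AVG(salary) FROM employees GROUP BY dept_id;
avg_salary = employees.groupby("dept_id")["salary"].mean()

# SELECT * FROM employees JOIN departments USING (dept_id);
joined = employees.merge(departments, on="dept_id", how="inner")

print(selected)
print(avg_salary)
print(joined)
```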

A short & quick tutorial for pandas can be found here . . . .


(2) NumPy



(3) Matplotlib



(4) seaborn



(5) scikit-learn



(6) Keras



(7) TensorFlow



(8) PyTorch



(9) spaCy



(10) NLTK



(11) Hugging Face Transformers



(12) OpenCV


