Using NumPy and Pandas to Load and Organize Data
Data comes from many kinds of sources, including plain text files, CSV files, SQL databases, and Excel files. In the last chapter, we learned how to handle some of
these sources with Python's built-in tools, but pandas is one of the most widely used Python packages for data preparation. In this chapter, we will learn how to use
the pandas library, an essential tool for data scientists. We'll cover:
Loading and saving data from a variety of data sources
Some exploratory data analysis (EDA) with pandas plotting
Data preparation and cleaning, including outlier detection and imputation (filling in missing values)
Crucial data manipulation tools including groupby, replace, and filtering
Data manipulation and analysis of iTunes data
In data science, the terms "data wrangling" and "data munging" are frequently used to describe cleaning and preparing data for later applications like modeling and analytics.
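As a small preview of the tools listed above, the following sketch applies replace, imputation, filtering, and groupby to a toy table. The DataFrame and its column names (genre, price) are invented for illustration and are not part of the Chinook data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sales data; the column names are invented for illustration.
sales = pd.DataFrame(
    {
        "genre": ["Rock", "rock", "Jazz", "Jazz", "Rock"],
        "price": [0.99, np.nan, 0.99, 1.99, 0.99],
    }
)

# Replace: normalize an inconsistent label.
sales["genre"] = sales["genre"].replace({"rock": "Rock"})

# Imputation: fill the missing price with the column median.
sales["price"] = sales["price"].fillna(sales["price"].median())

# Filtering: keep only the rows that match a boolean condition.
rock_sales = sales[sales["genre"] == "Rock"]

# Groupby: compute the mean price per genre.
mean_price = sales.groupby("genre")["price"].mean()
print(mean_price)
```

Filling with the median is only one possible imputation choice; the right strategy depends on the data.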
Now let's work with the Chinook iTunes dataset.
Apart from supplying an engine (the library used to read the data), the call to read_excel is similar to the call to read_csv.
At the time of writing, pandas reads Excel files with the xlrd library by default. We specify the openpyxl library instead of xlrd because of an issue
that exists at the time of writing.
To read Excel files with pandas, you must install the xlrd or openpyxl library using conda or pip.
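A minimal sketch of such a call is shown below. The helper name load_excel is hypothetical and exists only for this illustration, and the sketch assumes the openpyxl package is installed.

```python
import pandas as pd


def load_excel(path, sheet_name=0):
    """Read one sheet of an Excel workbook into a DataFrame using openpyxl."""
    # engine="openpyxl" tells pandas which library to use for reading;
    # it requires the openpyxl package to be installed.
    return pd.read_excel(path, sheet_name=sheet_name, engine="openpyxl")
```

It could be used as excel_df = load_excel('some_file.xlsx'), where the file name is a placeholder for your own workbook.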
First, we establish the connection using a with block. Similar to with open('filename') as f: from Chapter 3, SQL and Built-in File Handling Modules in Python,
this structure automatically closes the connection when the with block finishes (that is, when the code is no longer indented under the with statement).
As we can see, the two required arguments to the read_sql_query function are our query text and the SQLAlchemy connection.
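The pattern described above might look like the following sketch. It assumes SQLAlchemy is installed and uses an in-memory SQLite database with a tiny, made-up tracks table as a stand-in for the real Chinook database file, so the table contents here are illustrative only.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: an in-memory SQLite database stands in for the Chinook file.
engine = create_engine("sqlite://")

with engine.connect() as connection:
    # Build a tiny stand-in table so the example runs without the real database.
    connection.execute(text("CREATE TABLE tracks (TrackId INTEGER, Name TEXT)"))
    connection.execute(
        text("INSERT INTO tracks VALUES (1, 'Track A'), (2, 'Track B')")
    )

    # The two required arguments: the query and the SQLAlchemy connection.
    query = "SELECT * FROM tracks;"
    sql_df = pd.read_sql_query(text(query), connection)

# The connection has been closed automatically at this point.
print(sql_df)
```

With the real database, the engine URL would point at the SQLite file instead, and the CREATE and INSERT statements would not be needed.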
Understanding the DataFrame structure and combining multiple DataFrames
DataFrames are composed of an index and one or more columns that hold data. The index is one way to access the data. It can be inspected as follows:
sql_df.index
This prints the following:
RangeIndex(start=0, stop=3503, step=1)
This index is created automatically when the DataFrame is loaded. We can also set an index with the index_col argument in several of the pandas read functions,
like read_csv('filename', index_col='index_col_name'). The columns attribute, sql_df.columns, shows which columns we have.
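The difference between the automatic index and one set through index_col can be sketched as follows. The CSV text and its column names are hypothetical stand-ins for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "TrackId,Name,Milliseconds\n1,Track A,343719\n2,Track B,342562\n"

# Without index_col, pandas builds a RangeIndex automatically.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.index)  # RangeIndex(start=0, stop=2, step=1)

# With index_col, the named column becomes the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="TrackId")
print(indexed_df.index)

# The remaining columns no longer include the index column.
print(indexed_df.columns)
```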
If you don't already have pandas installed, you can install it with pip install pandas or conda install -c conda-forge pandas -y. Most users import pandas with the alias pd (import pandas as pd). Since we won't repeat the import in the remaining examples in this chapter, pd refers to the pandas library whenever you see it. After importing pandas, we load the data with read_csv() and then use df.head() to examine the first five rows.
As you can see in Figure 4.1, Jupyter Notebook offers some lovely formatting for us in the output of the head() function.
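The import, load, and inspect steps can be sketched as follows. Because the original file name is not given here, the sketch reads hypothetical CSV text from memory; read_csv accepts a file path in the same position.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the columns are invented.
csv_text = "Track,Artist,Genre\n" + "".join(
    f"Track {i},Artist {i},Rock\n" for i in range(8)
)

# Load the data; a file path such as 'some_file.csv' would also work here.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
first_rows = df.head()
print(first_rows)
```

In a Jupyter Notebook, leaving df.head() as the last expression in a cell displays the formatted table without needing print.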
Using pandas to load and save data
In our first example, we work for Apple in the iTunes analytics division. Our first objective is to find any valuable insights in a set of music sales data that could help the iTunes business. We will once again use the Chinook dataset, a sample of iTunes data used in Chapter 3, SQL and Built-in File Handling Modules in Python.
About the Author: Joseph P Fanning
Joe studied at Harvard. He owns Joepfanning.com and blogs a lot about computational physics.
Phone - 201 334 8743
Suffolk County LI New York 11772 Bergen County NJ Programmer