Data Analysis with Pandas

About

Level: Beginner to Intermediate
Lectures: 30 hours
Self-study: 15 hours
Exercises: 67
Lines of Code to write: 266
Format: e-learning + weekly online teleconference with instructor
Language: English or Polish

Description

This practical course teaches data analysis and visualization with Pandas, the primary tabular-data library in the Python ecosystem. Following the syllabus, students work hands-on with the Series and DataFrame APIs to load, clean, transform, aggregate, and visualize real-world datasets. Topics include I/O (CSV, JSON, Parquet, Excel, SQL), indexing and time series, grouping and aggregations, merges and joins, reshaping, and integration with NumPy, Matplotlib, and Jupyter for interactive analysis and reporting.
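
To give a flavor of the load, clean, transform, and aggregate workflow described above, here is a minimal sketch. It is an illustration only, not an excerpt from the course materials: the inline CSV text, the column names, and the city values are hypothetical sample data.

```python
import io

import pandas as pd

# Hypothetical sample data; a real analysis would read from a file or a database.
CSV_TEXT = """date,city,temperature
2024-01-01,Warsaw,-2.5
2024-01-01,Krakow,
2024-01-02,Warsaw,0.5
2024-01-02,Krakow,1.0
"""

# Load: parse the 'date' column as datetimes while reading.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["date"])

# Clean: drop rows where the temperature reading is missing.
clean = df.dropna(subset=["temperature"])

# Transform: add a Fahrenheit column using a vectorized expression.
clean = clean.assign(temperature_f=clean["temperature"] * 9 / 5 + 32)

# Aggregate: mean temperature (in Celsius) per city.
mean_by_city = clean.groupby("city")["temperature"].mean()
print(mean_by_city)
```

Using `assign` returns a new DataFrame instead of modifying `clean` in place, which keeps each step of the pipeline easy to inspect on its own.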

Advantages

Participants will prototype analyses faster and more reliably by using vectorized operations, proper indexing and time-series techniques, and best practices for data cleaning and merges. The course emphasizes performance and reproducibility, covering memory-aware techniques, the Parquet and Feather formats, and scalable tools such as Dask and Xarray. As a result, learners can build maintainable pipelines, produce clear visualizations, and work with datasets that exceed available RAM.
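
As a rough illustration of two of the ideas mentioned above, vectorized operations and memory-aware data types, consider the sketch below. The dataset is randomly generated and the column names are hypothetical; the exact memory figures depend on the installed Pandas version, so none are quoted here.

```python
import numpy as np
import pandas as pd

# Hypothetical, randomly generated dataset (fixed seed for repeatability).
rng = np.random.default_rng(seed=0)
n = 10_000
df = pd.DataFrame({
    "price": rng.uniform(1, 100, size=n),
    "quantity": rng.integers(1, 10, size=n),
    "city": rng.choice(["Warsaw", "Krakow", "Gdansk"], size=n),
})

# Row-by-row loop: works, but runs Python-level code for every row.
loop_total = [row.price * row.quantity for row in df.itertuples(index=False)]

# Vectorized: a single expression over whole columns.
vector_total = df["price"] * df["quantity"]

# Both approaches yield the same numbers.
same_result = np.allclose(loop_total, vector_total)

# Memory-aware dtype: a text column with few distinct values
# can be stored as a categorical.
as_text = df["city"].memory_usage(deep=True)
as_category = df["city"].astype("category").memory_usage(deep=True)

print(f"same result: {same_result}")
print(f"text column: {as_text} bytes, categorical column: {as_category} bytes")
```

The categorical version stores each distinct city name once plus a small integer code per row, which is why it tends to pay off when a column repeats a handful of values many times.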

Target Audience

Data scientists and analysts who perform data cleaning, analysis, and visualization.
Machine learning engineers who preprocess and feature-engineer datasets.
Researchers and academics working with tabular or time-series data.
Software engineers and ETL authors building data pipelines and services.
Business analysts and BI engineers who create reports and dashboards.
Advanced students seeking reproducible, production-oriented data analysis skills.

Format

The course is delivered as a blended learning experience, comprising numerous short videos that progressively introduce concepts and techniques through a series of practical examples. The course format combines e-learning modules with weekly online teleconferences with the instructor for Q&A, discussions, and code reviews.

During the self-study phase, students complete practical exercises that apply the learned techniques. Each exercise is designed to have 100% test coverage, allowing students to verify their solutions. Additionally, students will have access to a spreadsheet to track their progress.

Students will also receive downloadable resources, including code samples, exercise templates, and reference materials to support their learning journey. Since 2015, we have refined our materials based on student feedback to ensure clarity, engagement, and practical relevance. All code listings undergo automatic testing (over 28,000 tests) to ensure accuracy and reliability. All materials, code listings, exercises, and assignments are handcrafted by our trainers without the use of AI. All case studies and examples are based on real-world scenarios drawn from our extensive experience in software engineering.

The working language of the course is either English or Polish.

Course Outline

Introduction:
- Pandas and its place in the SciPy ecosystem
- Changes in Pandas 3.0
- Pandas architecture
- Configuration options
- Core Pandas data types: Series, DataFrame, Interval, Categorical, Index
- Indexes: numeric, string, time-based (time series)
Data loading and export:
- CSV, JSON, XML, HTML, SQL
- Feather, Parquet, Pickle
- Excel, Word, PDF
- Format configuration, date parsing, user-agent modification
Working with Series:
- Creation, conversion, data types, attributes
- Indexing, selection, sampling, filtering, slicing
- Handling missing values
- Data replacement, structural changes, sorting
- Arithmetic, vectorized operations, broadcasting
- Statistics, grouping, data normalization
- Mapping: map vs apply
Working with DataFrame:
- Creation, conversion, attributes, data types
- Working with columns, index, multi-index
- Selection, sampling, access (at, iat), slicing (loc, iloc), queries (query)
- Data categorization, cleaning and normalization, regular expressions
- Working with dates: date and time formatting, time zones, conversion, time shifts
- Working with time series: frequency, Timestamp, date range, business time
- Structural changes, data replacement, filling missing values, sorting
- Statistics, data grouping, rolling and resample operations, aggregations
- Mapping: map vs apply
- Combining data: merge vs join vs concat
Data visualization:
- Principles of cooperation between Pandas and Matplotlib
- Chart types and data binding
- Charts: line, bar, box, density, others
- Chart styling and color schemes
- Changing chart titles and axis labels, axis label adjustments (rotation, formatting, frequency)
- Legend placement, grid, arrows, labels, annotations
- Charts, subplots, multiple charts in a single figure
- Export to various formats
Case studies:
- Case studies
- Retrieving data from various sources
- Data cleaning
- Selection of relevant information
- Using NumPy, Pandas, and Matplotlib methods
- Preparing data for analysis
- Data visualization
Summary:
- Pandas vs Polars vs DuckDB vs Dask
- Alternatives to Matplotlib (Bokeh, Seaborn)
- Techniques for working with data larger than available RAM
- Performance optimization tips
- Future development plan
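
Several outline items contrast closely related tools: `loc` vs `iloc` vs `at`, `map` vs `apply`, and `merge` vs `join` vs `concat`. The sketch below shows one plausible way to see the differences side by side. It is not taken from the course materials, and the two small tables (`people`, `teams`) with their names and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample tables.
people = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "age": [30, 41, 25], "team_id": [1, 2, 1]},
    index=["a", "b", "c"],
)
teams = pd.DataFrame({"team_id": [1, 2], "team": ["Apollo", "Gemini"]})

# Selection
by_label = people.loc["b", "age"]         # label-based: row "b", column "age"
by_position = people.iloc[1, 1]           # position-based: second row, second column
scalar = people.at["c", "name"]           # single-value access by label
over_28 = people.query("age > 28")        # filter rows with an expression string

# Mapping
name_lengths = people["name"].map(len)    # element-wise function on a Series
value_ranges = people[["age", "team_id"]].apply(
    lambda column: column.max() - column.min()
)                                         # column-wise function on a DataFrame

# Combining
merged = people.merge(teams, on="team_id", how="left")            # match on a column
joined = people.join(teams.set_index("team_id"), on="team_id")    # match against an index
stacked = pd.concat([people, people], ignore_index=True)          # stack rows

print(merged)
```

Note that `merge` returns a fresh integer index, while `join` keeps the index of the calling DataFrame; which behavior is preferable depends on whether the original row labels carry meaning.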

Our Experience

AATC trainers have been teaching software engineering since 2015. We have already delivered over 11,000 (eleven thousand) hours of software engineering training to more than 32,000 (thirty-two thousand) students worldwide.

Requirements

Basic knowledge of Python programming
Familiarity with using an IDE (e.g., PyCharm, VSCode)
Familiarity with using version control systems (e.g., Git)
Basic understanding of AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT)

Setup

Latest stable version of Python
IDE of your choice (e.g., PyCharm, VSCode)
Git installed and configured
GitHub account
Web browser (e.g., Chrome, Firefox, Safari, etc.)

Apply

If you are interested in taking this course, please contact us at info@astronaut.center