Data Science Bookcamp: Five real-world Python projects

by Leonard Apeltsin

eBook

$43.99 

Overview

Learn data science with Python by building five real-world projects! Experiment with card-game predictions, disease-outbreak tracking, and more as you build a flexible and intuitive understanding of data science.

In Data Science Bookcamp you will learn:

- Techniques for computing and plotting probabilities
- Statistical analysis using SciPy
- How to organize datasets with clustering algorithms
- How to visualize complex multi-variable datasets
- How to train a decision tree machine learning algorithm
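To give a flavor of the first topic, here is a minimal sketch (not taken from the book) of computing a probability both by random simulation with NumPy and exactly from the binomial distribution; the specific question posed is an illustrative assumption:

```python
import numpy as np
from math import comb

# Estimate the probability of at least 16 heads in 20 fair coin flips
# by simulating 100,000 trials.
np.random.seed(0)
flips = np.random.randint(0, 2, size=(100_000, 20))  # 0 = tails, 1 = heads
head_counts = flips.sum(axis=1)
estimate = (head_counts >= 16).mean()

# The exact probability, for comparison, from the binomial distribution.
exact = sum(comb(20, k) for k in range(16, 21)) / 2**20

print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```

With enough trials the simulated estimate converges on the exact value, which is the kind of sanity check the book's case studies encourage.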

In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable datasets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
A data science project has a lot of moving parts, and it takes practice and skill to get all the code, algorithms, datasets, formats, and visualizations working together harmoniously. This unique book guides you through five realistic projects, including tracking disease outbreaks from news headlines, analyzing social networks, and finding relevant patterns in ad click data.

About the book
Data Science Bookcamp doesn’t stop with surface-level theory and toy examples. As you work through each project, you’ll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don’t quite fit the model you’re building. You’ll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you’ll be confident in your skills because you can see the results.

What's inside

- Web scraping
- Organizing datasets with clustering algorithms
- Visualizing complex multi-variable datasets
- Training a decision tree machine learning algorithm
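As a small taste of the last two list items, the sketch below (an illustration, not code from the book, assuming scikit-learn is installed) groups synthetic 2D points with K-means and then trains a decision tree to reproduce the discovered cluster labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two well-separated synthetic 2D blobs of points.
blob_a = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Organize the dataset into two groups with K-means clustering.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Train a decision tree classifier on the cluster assignments.
tree = DecisionTreeClassifier(random_state=0).fit(points, kmeans.labels_)
accuracy = tree.score(points, kmeans.labels_)
print(f"tree accuracy on cluster labels: {accuracy:.2f}")
```

The book's case studies work through the same pipeline idea — unsupervised grouping feeding a supervised model — on real datasets rather than synthetic blobs.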

About the reader
For readers who know the basics of Python. No prior data science or machine learning skills required.

About the author
Leonard Apeltsin is the Head of Data Science at Anomaly, where his team applies advanced analytics to uncover healthcare fraud, waste, and abuse.

Table of Contents
CASE STUDY 1 FINDING THE WINNING STRATEGY IN A CARD GAME
1 Computing probabilities using Python
2 Plotting probabilities using Matplotlib
3 Running random simulations in NumPy
4 Case study 1 solution
CASE STUDY 2 ASSESSING ONLINE AD CLICKS FOR SIGNIFICANCE
5 Basic probability and statistical analysis using SciPy
6 Making predictions using the central limit theorem and SciPy
7 Statistical hypothesis testing
8 Analyzing tables using Pandas
9 Case study 2 solution
CASE STUDY 3 TRACKING DISEASE OUTBREAKS USING NEWS HEADLINES
10 Clustering data into groups
11 Geographic location visualization and analysis
12 Case study 3 solution
CASE STUDY 4 USING ONLINE JOB POSTINGS TO IMPROVE YOUR DATA SCIENCE RESUME
13 Measuring text similarities
14 Dimension reduction of matrix data
15 NLP analysis of large text datasets
16 Extracting text from web pages
17 Case study 4 solution
CASE STUDY 5 PREDICTING FUTURE FRIENDSHIPS FROM SOCIAL NETWORK DATA
18 An introduction to graph theory and network analysis
19 Dynamic graph theory techniques for node ranking and social network analysis
20 Network-driven supervised machine learning
21 Training linear classifiers with logistic regression
22 Training nonlinear classifiers with decision tree techniques
23 Case study 5 solution

Product Details

ISBN-13: 9781638352303
Publisher: Manning
Publication date: 12/07/2021
Sold by: SIMON & SCHUSTER
Format: eBook
Pages: 704
File size: 20 MB

About the Author

Leonard Apeltsin is a senior data scientist and engineering lead at Primer AI, a startup that specializes in using advanced Natural Language Processing techniques to extract insight from terabytes of unstructured text data. His PhD research focused on bioinformatics, which required analyzing millions of sequenced DNA patterns to uncover genetic links in deadly diseases.

Table of Contents

Preface xvii

Acknowledgments xix

About this book xxi

About the author xxv

About the cover illustration xxvi

Case Study 1 Finding the winning strategy in a card game 1

1 Computing probabilities using Python 3

1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes 4

Analyzing a biased coin 7

1.2 Computing nontrivial probabilities 8

Problem 1 Analyzing a family with four children 8

Problem 2 Analyzing multiple die rolls 10

Problem 3 Computing die-roll probabilities using weighted sample spaces 11

1.3 Computing probabilities over interval ranges 13

Evaluating extremes using interval analysis 13

2 Plotting probabilities using Matplotlib 17

2.1 Basic Matplotlib plots 17

2.2 Plotting coin-flip probabilities 22

Comparing multiple coin-flip probability distributions 26

3 Running random simulations in NumPy 33

3.1 Simulating random coin flips and die rolls using NumPy 34

Analyzing biased coin flips 36

3.2 Computing confidence intervals using histograms and NumPy arrays 38

Binning similar points in histogram plots 41

Deriving probabilities from histograms 43

Shrinking the range of a high confidence interval 46

Computing histograms in NumPy 49

3.3 Using confidence intervals to analyze a biased deck of cards 51

3.4 Using permutations to shuffle cards 54

4 Case study 1 solution 58

4.1 Predicting red cards in a shuffled deck 59

Estimating the probability of strategy success 60

4.2 Optimizing strategies using the sample space for a 10-card deck 64

Case Study 2 Assessing online ad clicks for significance 69

Problem statement 69

Dataset description 70

Overview 70

5 Basic probability and statistical analysis using SciPy 71

5.1 Exploring the relationships between data and probability using SciPy 72

5.2 Mean as a measure of centrality 76

Finding the mean of a probability distribution 83

5.3 Variance as a measure of dispersion 85

Finding the variance of a probability distribution 90

6 Making predictions using the central limit theorem and SciPy 94

6.1 Manipulating the normal distribution using SciPy 95

Comparing two sampled normal curves 99

6.2 Determining the mean and variance of a population through random sampling 103

6.3 Making predictions using the mean and variance 107

Computing the area beneath a normal curve 109

Interpreting the computed probability 112

7 Statistical hypothesis testing 114

7.1 Assessing the divergence between sample mean and population mean 115

7.2 Data dredging: Coming to false conclusions through oversampling 121

7.3 Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown 124

7.4 Permutation testing: Comparing means of samples when the population parameters are unknown 132

8 Analyzing tables using Pandas 137

8.1 Storing tables using basic Python 138

8.2 Exploring tables using Pandas 138

8.3 Retrieving table columns 141

8.4 Retrieving table rows 143

8.5 Modifying table rows and columns 145

8.6 Saving and loading table data 148

8.7 Visualizing tables using Seaborn 149

9 Case study 2 solution 154

9.1 Processing the ad-click table in Pandas 155

9.2 Computing p-values from differences in means 157

9.3 Determining statistical significance 161

9.4 41 shades of blue: A real-life cautionary tale 162

Case Study 3 Tracking disease outbreaks using news headlines 165

Problem statement 165

Dataset description 165

Overview 166

10 Clustering data into groups 167

10.1 Using centrality to discover clusters 168

10.2 K-means: A clustering algorithm for grouping data into K central groups 174

K-means clustering using scikit-learn 175

Selecting the optimal K using the elbow method 177

10.3 Using density to discover clusters 181

10.4 DBSCAN: A clustering algorithm for grouping data based on spatial density 185

Comparing DBSCAN and K-means 186

Clustering based on non-Euclidean distance 187

10.5 Analyzing clusters using Pandas 191

11 Geographic location visualization and analysis 194

11.1 The great-circle distance: A metric for computing the distance between two global points 195

11.2 Plotting maps using Cartopy 198

Manually installing GEOS and Cartopy 199

Utilizing the Conda package manager 199

Visualizing maps 201

11.3 Location tracking using GeoNamesCache 211

Accessing country information 212

Accessing city information 215

Limitations of the GeoNamesCache library 219

11.4 Matching location names in text 221

12 Case study 3 solution 226

12.1 Extracting locations from headline data 227

12.2 Visualizing and clustering the extracted location data 233

12.3 Extracting insights from location clusters 238

Case Study 4 Using online job postings to improve your data science resume 245

Problem statement 245

Dataset description 246

Overview 247

13 Measuring text similarities 249

13.1 Simple text comparison 250

Exploring the Jaccard similarity 255

Replacing words with numeric values 257

13.2 Vectorizing texts using word counts 262

Using normalization to improve TF vector similarity 264

Using unit vector dot products to convert between relevance metrics 272

13.3 Matrix multiplication for efficient similarity calculation 274

Basic matrix operations 277

Computing all-by-all matrix similarities 285

13.4 Computational limits of matrix multiplication 287

14 Dimension reduction of matrix data 292

14.1 Clustering 2D data in one dimension 293

Reducing dimensions using rotation 297

14.2 Dimension reduction using PCA and scikit-learn 309

14.3 Clustering 4D data in two dimensions 315

Limitations of PCA 320

14.4 Computing principal components without rotation 323

Extracting eigenvectors using power iteration 327

14.5 Efficient dimension reduction using SVD and scikit-learn 336

15 NLP analysis of large text datasets 340

15.1 Loading online forum discussions using scikit-learn 341

15.2 Vectorizing documents using scikit-learn 343

15.3 Ranking words by both post frequency and count 350

Computing TFIDF vectors with scikit-learn 356

15.4 Computing similarities across large document datasets 358

15.5 Clustering texts by topic 363

Exploring a single text cluster 368

15.6 Visualizing text clusters 372

Using subplots to display multiple word clouds 377

16 Extracting text from web pages 385

16.1 The structure of HTML documents 386

16.2 Parsing HTML using Beautiful Soup 394

16.3 Downloading and parsing online data 401

17 Case study 4 solution 404

17.1 Extracting skill requirements from job posting data 405

Exploring the HTML for skill descriptions 406

17.2 Filtering jobs by relevance 412

17.3 Clustering skills in relevant job postings 422

Grouping the job skills into 15 clusters 425

Investigating the technical skill clusters 431

Investigating the soft-skill clusters 434

Exploring clusters at alternative values of K 436

Analyzing the 700 most relevant postings 440

17.4 Conclusion 443

Case Study 5 Predicting future friendships from social network data 445

Problem statement 445

Introducing the friend-of-a-friend recommendation algorithm 446

Predicting user behavior 446

Dataset description 447

The Profiles table 447

The Observations table 448

The Friendships table 449

Overview 449

18 An introduction to graph theory and network analysis 451

18.1 Using basic graph theory to rank websites by popularity 452

Analyzing web networks using NetworkX 455

18.2 Utilizing undirected graphs to optimize the travel time between towns 465

Modeling a complex network of towns and counties 467

Computing the fastest travel time between nodes 473

19 Dynamic graph theory techniques for node ranking and social network analysis 482

19.1 Uncovering central nodes based on expected traffic in a network 483

Measuring centrality using traffic simulations 486

19.2 Computing travel probabilities using matrix multiplication 489

Deriving PageRank centrality from probability theory 492

Computing PageRank centrality using NetworkX 496

19.3 Community detection using Markov clustering 498

19.4 Uncovering friend groups in social networks 513

20 Network-driven supervised machine learning 518

20.1 The basics of supervised machine learning 519

20.2 Measuring predicted label accuracy 527

Scikit-learn's prediction measurement functions 536

20.3 Optimizing KNN performance 537

20.4 Running a grid search using scikit-learn 539

20.5 Limitations of the KNN algorithm 544

21 Training linear classifiers with logistic regression 548

21.1 Linearly separating customers by size 549

21.2 Training a linear classifier 554

Improving perceptron performance through standardization 562

21.3 Improving linear classification with logistic regression 565

Running logistic regression on more than two features 572

21.4 Training linear classifiers using scikit-learn 574

Training multiclass linear models 576

21.5 Measuring feature importance with coefficients 579

21.6 Linear classifier limitations 582

22 Training nonlinear classifiers with decision tree techniques 586

22.1 Automated learning of logical rules 587

Training a nested if/else model using two features 593

Deciding which feature to split on 599

Training if/else models with more than two features 608

22.2 Training decision tree classifiers using scikit-learn 614

Studying cancerous cells using feature importance 621

22.3 Decision tree classifier limitations 624

22.4 Improving performance using random forest classification 626

22.5 Training random forest classifiers using scikit-learn 630

23 Case study 5 solution 634

23.1 Exploring the data 635

Examining the profiles 635

Exploring the experimental observations 638

Exploring the Friendships linkage table 641

23.2 Training a predictive model using network features 645

23.3 Adding profile features to the model 652

23.4 Optimizing performance across a steady set of features 657

23.5 Interpreting the trained model 659

Why are generalizable models so important? 662

Index 665
