Understanding these terms will help you grasp the fundamental concepts and procedures of data science.
1. Data Wrangling (or Data Munging)
The process of cleaning and unifying complex data sets so that they can be easily accessed and analyzed.
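For a concrete feel, here is a minimal pandas sketch; the dataset and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical raw data with typical problems: inconsistent casing,
# a missing value, a duplicate, and numbers stored as strings.
raw = pd.DataFrame({
    "name":  [" Alice ", "BOB", "bob", None],
    "spend": ["100", "250.5", "250.5", "75"],
})

clean = (
    raw.dropna(subset=["name"])                      # drop rows missing a key field
       .assign(name=lambda d: d["name"].str.strip().str.title(),
               spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"))
       .drop_duplicates()                            # remove exact duplicate rows
       .reset_index(drop=True)
)
print(clean)
```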
2. Exploratory Data Analysis (EDA)
A technique for visualizing and summarizing the essential aspects of data, sometimes using statistical graphics, in order to identify trends or anomalies.
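As a quick sketch, pandas and matplotlib cover the basics of EDA; the data here is synthetic:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.normal(40, 10, 500).round(),
                   "income": rng.lognormal(10, 0.5, 500)})

print(df.describe())   # summary statistics: mean, std, quartiles
print(df.corr())       # pairwise correlations between variables

df.hist(bins=30)       # histograms reveal skew, outliers, and anomalies
plt.show()
```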
3. Supervised Learning
A type of machine learning in which a model is trained on labeled data, or data with known input-output pairings.
4. Unsupervised Learning
A type of machine learning in which a model learns patterns and structures from unlabeled data.
5. Regression
A statistical method for modeling the relationship between dependent and independent variables, typically to predict a continuous outcome.
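A minimal scikit-learn sketch; the house sizes and prices are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[50], [80], [110], [140]])   # independent variable (m²)
price = np.array([150, 240, 330, 420])        # dependent variable (thousands)

model = LinearRegression().fit(size, price)
print(model.coef_, model.intercept_)          # fitted slope and intercept
print(model.predict([[100]]))                 # predicted price for a 100 m² home
```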
6. Classification
A supervised learning method that assigns data points to predetermined labels or classes.
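For example, a short scikit-learn sketch on the built-in Iris dataset, where the three flower species are the predetermined classes:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # learn from labeled data
print(clf.predict(X_te[:5]))     # assign each new sample to a class
print(clf.score(X_te, y_te))     # accuracy on held-out data
```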
7. Clustering
A method of unsupervised learning (k-means, hierarchical clustering) that groups together similar data points.
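A minimal k-means sketch with scikit-learn on synthetic data; note that no labels are given to the model:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # true labels ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for each point
print(kmeans.cluster_centers_)   # the three learned group centers
```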
8. Overfitting
When a machine learning model learns not only the underlying patterns in the training data but also the noise, resulting in poor generalization to new, unseen data.
9. Bias-Variance Tradeoff
The trade-off between a model’s bias (error from overly simple assumptions, which causes underfitting) and its variance (error from sensitivity to the training data, which causes overfitting); reducing one tends to increase the other.
10. Cross-Validation
A technique for assessing a model’s performance by repeatedly splitting the data into training and validation sets; k-fold cross-validation is the most common form.
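A minimal 5-fold example with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, validate on the 5th, rotating through all folds.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```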
11. Dimensionality Reduction
The process of reducing the number of variables in a data set while preserving as much information as possible (e.g., using PCA or t-SNE).
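A short PCA sketch with scikit-learn, compressing the 4-feature Iris data down to 2 components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                    # 150 samples, 4 features
pca = PCA(n_components=2)               # keep the 2 directions of greatest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance each component retains
```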
12. Feature Engineering
The process of creating new input features from raw data to improve model performance.
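A small pandas sketch; the raw columns and the derived features are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-02 09:15", "2023-01-07 22:40"]),
    "price": [19.99, 5.00],
    "quantity": [3, 10],
})

df["total"] = df["price"] * df["quantity"]        # combine two raw columns
df["hour"] = df["timestamp"].dt.hour              # extract a time-of-day signal
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
print(df)
```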
13. Big Data
Refers to extremely large and complex data sets that require specialized processing frameworks, such as Hadoop and Spark, to store and analyze.
14. Artificial Neural Networks (ANN)
A machine learning model inspired by the human brain, used in deep learning for tasks such as image recognition and natural language processing.
15. Natural Language Processing (NLP)
A field that studies the interface of computers and human language, allowing for the processing and analysis of text and speech data.
16. Hyperparameters
Settings that are defined before the learning process begins and that control how a model is trained (for example, the learning rate and batch size).
17. A/B Testing
A statistical method for comparing two variations of a variable (for example, web page design or marketing plan) to see which works better.
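One common way to analyze A/B results is a chi-square test on the conversion counts; the numbers below are invented:

```python
from scipy.stats import chi2_contingency

# Hypothetical outcome: variant A shown to 1,000 users (120 conversions),
# variant B shown to 1,000 users (150 conversions).
table = [[120, 880],   # A: converted, did not convert
         [150, 850]]   # B: converted, did not convert

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value = {p_value:.4f}")   # a small p-value suggests a real difference
```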
18. ROC Curve (Receiver Operating Characteristic Curve)
A graphical plot used to assess the performance of binary classifiers, displaying the trade-off between the true positive rate and the false positive rate.
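A minimal sketch computing the ROC points and the area under the curve (AUC) with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]          # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_te, scores)    # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))        # area under the ROC curve
```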
19. Confusion Matrix
A table used to assess the effectiveness of a classification model, displaying the number of true positives, true negatives, false positives, and false negatives.
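For example, with scikit-learn (toy labels invented for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model's predictions

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```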
20. Precision and Recall
Precision: The proportion of true positives among all positive predictions.
Recall: The proportion of true positives among all actual positives.
21. F1 Score
The harmonic mean of precision and recall, used to balance these two measures in classification problems.
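Using the same toy labels as in the confusion matrix sketch above, scikit-learn computes all three metrics directly:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r)
print(p, r, f1)                       # here all three equal 0.75
```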
22. Gradient Descent
An optimization algorithm that minimizes the cost function of a machine learning model by iteratively adjusting its parameters.
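A bare-bones sketch of the idea, fitting a one-parameter line by gradient descent (the data values are made up):

```python
import numpy as np

# Model: y ≈ w * x.  Cost: J(w) = mean((w*x - y)**2).
# Gradient: dJ/dw = 2 * mean(x * (w*x - y)).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2x

w, learning_rate = 0.0, 0.05
for step in range(200):
    grad = 2 * np.mean(x * (w * x - y))   # gradient of the cost at the current w
    w -= learning_rate * grad             # step against the gradient
print(w)   # converges near 2.0
```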
23. Ensemble Learning
Combining the predictions of multiple models to improve accuracy (for example, random forests and boosting).
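A short scikit-learn comparison of a single decision tree against a random forest (an ensemble of trees) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```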
24. Time Series Analysis
A technique for analyzing data points collected or recorded at regular intervals in order to identify trends, seasonal patterns, and cycles.
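A minimal pandas sketch that separates a synthetic daily series into a trend and a weekly seasonal pattern using a rolling mean:

```python
import numpy as np
import pandas as pd

# Synthetic series: an upward trend plus a 7-day cycle.
idx = pd.date_range("2023-01-01", periods=120, freq="D")
values = np.arange(120) * 0.5 + 10 * np.sin(2 * np.pi * np.arange(120) / 7)
ts = pd.Series(values, index=idx)

trend = ts.rolling(window=7, center=True).mean()   # smooths out the weekly cycle
seasonal = ts - trend                              # what remains is the seasonality
print(trend.dropna().head())
print(seasonal.dropna().head())
```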
25. Deep Learning
A type of machine learning that employs neural networks with multiple layers to represent complicated patterns in data, particularly effective in applications such as image and speech recognition.