Data Preprocessing Questions
The common techniques used for data sampling are:
1. Random Sampling: This technique involves selecting a random subset of data from the entire dataset. It ensures that each data point has an equal chance of being selected.
2. Stratified Sampling: In this technique, the dataset is divided into homogeneous subgroups or strata based on certain characteristics. Then, a random sample is taken from each stratum to ensure representation from each subgroup.
3. Cluster Sampling: This technique involves dividing the dataset into clusters or groups and randomly selecting a few clusters. Then, all the data points within the selected clusters are included in the sample.
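4. Oversampling: This technique is used when the dataset is imbalanced, meaning one class or category has significantly fewer samples than others. It balances the dataset by adding minority-class instances, either by duplicating existing ones (random oversampling) or by generating synthetic ones (for example, SMOTE, which interpolates between neighboring minority-class points).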
5. Undersampling: This technique is also used for imbalanced datasets but involves reducing the number of instances from the majority class to balance the dataset.
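6. Systematic Sampling: In this technique, data points are selected from the dataset at a fixed interval, usually starting from a randomly chosen position. For example, every 10th data point can be selected to form the sample. This works well unless the ordering of the data has a periodic pattern that lines up with the interval.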
7. Stratified Random Sampling: This is stratified sampling (item 2) in which the draw within each stratum is a simple random sample. The dataset is divided into strata, and samples are then randomly selected from each stratum, often in proportion to the stratum's size.
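These techniques are used to ensure that the selected sample is representative of the entire dataset and to reduce bias in the analysis. Oversampling and undersampling are the exception: they deliberately change the class proportions so that the majority class does not dominate the analysis or model.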
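
The techniques listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names, the toy rows dataset, and its label and region fields are all assumptions made up for the example. Oversampling here is plain duplication with replacement, not a synthetic method.

```python
import random
from collections import defaultdict


def _group(rows, key):
    """Group rows into lists keyed by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups


def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; every row is equally likely."""
    return random.Random(seed).sample(rows, n)


def systematic_sample(rows, step, seed=0):
    """Take every step-th row, starting from a random offset in [0, step)."""
    start = random.Random(seed).randrange(step)
    return rows[start::step]


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction at random from each stratum (at least one row)."""
    rng = random.Random(seed)
    sample = []
    for members in _group(rows, key).values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(rows, key, n_clusters, seed=0):
    """Pick n_clusters whole clusters at random and keep all of their rows."""
    clusters = _group(rows, key)
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return [row for c in chosen for row in clusters[c]]


def rebalance(rows, key, mode, seed=0):
    """Equalize class sizes: mode="over" duplicates rows of smaller classes,
    mode="under" discards rows of larger classes."""
    rng = random.Random(seed)
    classes = _group(rows, key)
    sizes = [len(members) for members in classes.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out = []
    for members in classes.values():
        if len(members) >= target:
            # Already at or above the target: keep a random subset of that size.
            out.extend(rng.sample(members, target))
        else:
            # Below the target: keep everything and pad with random duplicates.
            out.extend(members + rng.choices(members, k=target - len(members)))
    return out


# Hypothetical toy dataset: 100 rows, 10 "pos" and 90 "neg", spread over 5 regions.
rows = [
    {"id": i, "label": "pos" if i % 10 == 0 else "neg", "region": i % 5}
    for i in range(100)
]

by_label = lambda r: r["label"]
by_region = lambda r: r["region"]

random_rows = simple_random_sample(rows, 10)
every_tenth = systematic_sample(rows, 10)
stratified = stratified_sample(rows, by_label, 0.2)
two_regions = cluster_sample(rows, by_region, 2)
oversampled = rebalance(rows, by_label, "over")
undersampled = rebalance(rows, by_label, "under")
```

With this toy data, the stratified sample keeps the original 1:9 ratio of pos to neg rows, while the oversampled and undersampled sets both end up with equal class counts. In practice, libraries such as pandas, scikit-learn, or imbalanced-learn provide tested versions of these operations; the sketch only shows the logic behind each technique.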