MindMap Gallery medical statistics
This is a mind map about medical statistics, including statistical descriptions of quantitative data, statistical descriptions of qualitative data, etc.
Edited at 2023-11-20 10:40:27
medical statistics
introduction
1. Basic concepts:
Population: the complete set of values of a given variable for all study subjects of the same or similar nature, as defined by the purpose of the research.
Sample: the set of variable values from a subset of individuals randomly selected from the population.
Population parameter: an indicator that describes a characteristic of the population, referred to simply as a parameter. It is a fixed constant and is generally unknown.
Statistic: an indicator that describes a characteristic of a sample. It is calculated from the sample observations and does not contain any unknown parameters.
Sampling error: the difference between a sample statistic and the corresponding population parameter caused by random sampling.
Frequency: if event A occurs m times in n independent repeated trials, m is called the frequency (count) of A, and m/n is called the relative frequency of A in the n trials.
Probability: the constant around which the relative frequency stabilizes as the number of trials increases.
Statistical description: Use appropriate statistical indicators (sample statistics), statistical charts, and statistical tables to characterize and describe the quantitative characteristics and distribution patterns of the data.
Statistical inference: comprises parameter estimation and hypothesis testing. Parameter estimation uses sample statistics to infer the corresponding population parameters. Hypothesis testing uses the difference between samples, or between a sample and a population, to infer whether the underlying populations differ.
2. Sample characteristics: sufficient sample size, reliability, and representativeness.
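The relationship between parameter, statistic, and sampling error defined above can be illustrated with a minimal sketch. The population (the integers 1 to 1000), the sample size, and the seed are all hypothetical choices made for the illustration.

```python
import random
import statistics

# Hypothetical finite population: the integers 1..1000.
population = list(range(1, 1001))
mu = statistics.mean(population)       # population parameter (fixed constant)

rng = random.Random(0)                 # fixed seed so the run is repeatable
sample = rng.sample(population, 100)   # simple random sample, n = 100
x_bar = statistics.mean(sample)        # sample statistic

# The sampling error is nonzero only because the sample was drawn at random.
sampling_error = x_bar - mu
print(f"mu = {mu}, x_bar = {x_bar}, sampling error = {sampling_error:+.2f}")
```

In practice mu is unknown; it is computable here only because the whole population was constructed by hand.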
3. Data types:
(1) Quantitative data: measurement data
(2) Categorical data:
① Count data (unordered categories):
Binary: observation units are classified by two opposing attributes; the two categories are mutually exclusive.
Multi-category: observation units are classified by multiple mutually exclusive attributes.
② Ordinal (ranked) data
Group design and paired design
Paired design
1. Two matched subjects each receive one of two different treatments.
2. The same subject receives two different treatments.
3. The same subject is compared before and after treatment (self-pairing).
4. Two sites on the same subject receive different treatments.
Group design
Subjects are randomly assigned to two treatment groups, and each group receives one treatment.
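Why the design matters for analysis can be sketched as follows. The two functions use the standard paired t statistic and the pooled-variance two-sample t statistic (which assumes equal variances); the blood pressure readings are hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic and degrees of freedom for a paired design (mean difference vs 0)."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    se = statistics.stdev(d) / math.sqrt(n)
    return statistics.mean(d) / se, n - 1

def two_sample_t(x, y):
    """t statistic and degrees of freedom for a group design, pooled variance."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / se, n1 + n2 - 2

# Hypothetical readings from the same five subjects before and after treatment.
before = [120, 118, 130, 125, 122]
after = [122, 122, 136, 129, 126]

t_paired, df_paired = paired_t(before, after)
t_group, df_group = two_sample_t(after, before)   # wrongly ignoring the pairing
print(f"paired: t = {t_paired:.3f}, df = {df_paired}")
print(f"group:  t = {t_group:.3f}, df = {df_group}")
```

With these numbers the paired analysis gives a much larger t than the group analysis of the same values, because pairing removes between-subject variation.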
Statistical description of quantitative data
Role of frequency tables and histograms: they summarize a large number of observations and give an intuitive view of the characteristics and type of the data distribution.
Indicators of central tendency and dispersion and their scope of application
central tendency
Arithmetic mean: suitable for symmetric distributions; not suitable for skewed distributions or data with extreme values.
Geometric mean G: suitable for data in a multiplicative (geometric) progression or with a log-normal distribution, typically positively skewed; not suitable when the observations include 0 or both positive and negative values.
Median M: suitable for skewed distributions in large samples, data whose distribution is unknown, and data with indeterminate (open-ended) values.
Percentile Px: several percentiles used together describe the distribution comprehensively; used to determine medical reference ranges for data with a skewed or unknown distribution.
Mode M0: suitable for large samples; a rough measure.
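A short sketch of these measures using Python's statistics module. Both data sets are hypothetical: antibody titers that double at each step, and lengths of stay with one extreme value.

```python
import statistics

# Hypothetical antibody titers in a multiplicative progression.
titers = [10, 20, 40, 80, 160]
arith = statistics.mean(titers)            # pulled upward by the large titers
geo = statistics.geometric_mean(titers)    # suited to multiplicative data
med = statistics.median(titers)
print(f"titers: mean = {arith}, geometric mean = {geo:.1f}, median = {med}")

# Hypothetical lengths of stay (days) with one extreme value.
stay = [2, 3, 3, 4, 5, 6, 40]
print(f"stay: mean = {statistics.mean(stay)}, "
      f"median = {statistics.median(stay)}, mode = {statistics.mode(stay)}")
```

For the titers the geometric mean coincides with the median (40), while the arithmetic mean is 62. For the lengths of stay the single extreme value raises the mean to 9, while the median stays at 4.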
Dispersion
Range R
Advantages: simple, clear, and easy to use.
Disadvantages:
① It reflects only the difference between the maximum and minimum values and cannot reflect the variation of the other observations.
② The larger the sample size, the larger the range tends to be.
③ The range has a large sampling error and is unstable.
Interquartile range Q: used in determining medical reference ranges and, together with the median, describes the variation of data with a skewed distribution.
Variance and standard deviation S: together with the mean, describe symmetric distributions, especially a normal or approximately normal distribution.
Coefficient of variation CV: ① compares the variability of data measured in different units; ② compares the variability of data with widely different means; ③ is a common indicator of experimental precision and stability.
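The dispersion indicators above can be computed as in this sketch. The data are hypothetical, and the quartiles use the "inclusive" interpolation method of statistics.quantiles; other quartile conventions give slightly different values.

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9, 6]          # hypothetical measurements

r = max(data) - min(data)                # range R
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                            # interquartile range Q
var = statistics.variance(data)          # sample variance (n - 1 denominator)
sd = statistics.stdev(data)              # sample standard deviation S
cv = sd / statistics.mean(data) * 100    # coefficient of variation, in percent

print(f"R = {r}, Q = {iqr}, variance = {var}, S = {sd}, CV = {cv:.1f}%")
```

Because CV is a ratio of S to the mean, it has no unit, which is what allows comparisons across different units or very different means.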
Frequency distribution characteristics
The peak is in the middle and the two sides are roughly symmetrical: a symmetric distribution. Mean = median = mode
The peak lies toward the small values (left side) with a long tail to the right: a positively skewed (right-skewed) distribution. Mean > median > mode
The peak lies toward the large values (right side) with a long tail to the left: a negatively skewed (left-skewed) distribution. Mean < median < mode
Mean & standard deviation: normal or approximately normal distribution
Median & interquartile range: skewed distribution
Geometric mean & logarithmic standard deviation: log-normal distribution
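The ordering of mean, median, and mode under skewness can be checked on small data sets. The data and the skew_hint helper are hypothetical; comparing mean and median is only a rough heuristic, not a formal skewness test.

```python
import statistics

def skew_hint(data):
    """Rough heuristic (assumption of this sketch): compare mean and median."""
    m, med = statistics.mean(data), statistics.median(data)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "symmetric"

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]    # long tail to the right
left_skewed = [16 - x for x in right_skewed]      # mirror image

for name, d in (("right", right_skewed), ("left", left_skewed)):
    print(name, statistics.mean(d), statistics.median(d),
          statistics.mode(d), skew_hint(d))
```

For the right-skewed set the mean (4.6) exceeds the median (3), which exceeds the mode (2); the mirrored set reverses the order.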
Statistical description of qualitative data
Commonly used relative numbers
Rate: describes the frequency or intensity with which a phenomenon occurs. (The case fatality rate is not the same as the mortality rate.)
Composition ratio (proportion): describes the share or distribution of the internal components of a phenomenon, often expressed as a percentage.
Relative ratio: also called simply the ratio; the ratio of two related indicators A and B, expressing A as a multiple or percentage of B. The two indicators may be of the same or of different nature.
Standardized rate
Used when comparing the prevalence, incidence, mortality, etc. of two groups, to eliminate the effect of differences in their internal composition (age, sex, length of service, disease duration, severity of illness) on the rates
Its only function is comparison; it cannot be used to reflect the actual level.
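A minimal sketch of direct standardization, one common way to obtain a standardized rate. The two groups, their age strata, and the choice of the combined population as the standard are hypothetical.

```python
def crude_rate(strata):
    """Crude rate: total cases divided by total population."""
    return sum(c for _, c in strata.values()) / sum(n for n, _ in strata.values())

def direct_standardized_rate(strata, standard):
    """Apply each stratum-specific rate to a common standard population."""
    expected = sum(standard[k] * cases / n for k, (n, cases) in strata.items())
    return expected / sum(standard.values())

# Hypothetical data: stratum -> (population, cases).
group_a = {"young": (1000, 10), "old": (4000, 200)}
group_b = {"young": (4000, 60), "old": (1000, 60)}
# Standard population: the two groups combined (an assumption of this sketch).
standard = {"young": 5000, "old": 5000}

for name, g in (("A", group_a), ("B", group_b)):
    print(name, f"crude = {crude_rate(g):.4f}",
          f"standardized = {direct_standardized_rate(g, standard):.4f}")
```

Group A has the higher crude rate only because it contains more old subjects; after standardization group B is higher, matching its higher rate in every stratum. The standardized values depend on the chosen standard, which is why they serve for comparison only.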
Precautions:
1. The denominator for calculating relative numbers should not be too small;
2. A composition ratio must not be used in place of a rate in analysis;
3. For several rates based on different numbers of observation units, the average rate cannot be obtained by simply adding the rates and averaging them;
4. When comparing relative numbers, attention should be paid to their comparability;
5. Comparisons of sample rates (or composition ratios) should be based on random sampling and require hypothesis testing.
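Precaution 3 can be illustrated with a short sketch: the combined rate is total cases over total observation units, not the mean of the individual rates. The counts are hypothetical.

```python
def pooled_rate(groups):
    """Combined rate from (cases, n) pairs: total cases / total units."""
    cases = sum(c for c, _ in groups)
    total = sum(n for _, n in groups)
    return cases / total

groups = [(10, 100), (30, 1000)]   # hypothetical (cases, observation units)

naive = sum(c / n for c, n in groups) / len(groups)   # incorrect: ignores group sizes
pooled = pooled_rate(groups)                          # correct: weights by group size

print(f"naive average = {naive:.4f}, pooled rate = {pooled:.4f}")
```

The naive average (0.065) overstates the pooled rate (about 0.036) because the small group with the high rate is given the same weight as the large group.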
Statistical tables and charts
Statistical table
Structure: consists of a title, headings, lines, and numbers.
Requirements for preparing statistical tables:
① Title: summarizes the content of the table and is centered above it; the time and place should be indicated;
② Headings: the subject and predicate are given in the row and column headings respectively. The wording should be concise and the hierarchy clear. Row headings are on the left side of the table and usually name the things being studied; column headings are at the top of the table and are the statistical indicators that describe the row headings.
③ Lines: generally only the top line, the bottom line, the line under the column headings, and the line above the totals are kept; all other lines are omitted. The top and bottom lines should be slightly thicker, and no diagonal line should be used in the upper left corner of the table.
④ Numbers: written in Arabic numerals. Values of the same indicator should have the same number of decimal places and be aligned. Missing data and cells with no applicable number are shown as "..." and "-" respectively. A value of 0 should be recorded as "0", and cells should not be left blank. Totals should be given to facilitate verification and analysis.
⑤ Remarks: generally not placed inside the table. If necessary, mark them with "*" and list them below the table.
Statistical charts
Histogram: represents the frequency distribution of continuous data; the area of each rectangle represents the frequency of the corresponding group
Line graph: for continuous data; shows how something develops over time, or how one phenomenon changes with another;
Semi-logarithmic line graph: used to study the rate of change of an indicator
Box plot
Compare the average level and degree of variation of two or more sets of data
Each data set shows its average level, the interquartile range Q (box length), the maximum and minimum values, the median (middle horizontal line), and P75 and P25 (the two ends of the box)
The longer the box, the greater the data dispersion.
Mainly suitable for describing data with skewed distributions
Error bar chart: used for data in [mutual comparison];
Pie chart and percentage bar chart: suitable for [percentage composition ratio data]; show the [proportion or composition] of each component of a whole;
Scatter plot: suitable for linear correlation analysis; shows the quantitative relationship and trend between two variables.
Numerical Estimation and Hypothesis Testing
Parameter Estimation
standard error
central limit theorem
t distribution
graphic features
1. A family of unimodal curves, symmetric about 0
2. The shape depends on n (more precisely, on the degrees of freedom ν).
The smaller the degrees of freedom ν, the flatter the t distribution curve.
The larger the degrees of freedom ν, the closer the t distribution curve is to the standard normal distribution (u distribution) curve.
As ν tends to infinity, the t distribution becomes the standard normal distribution.
t distribution critical value table (P230)
The entries in the table are critical values of t.
Comparison between t distribution and normal distribution
① They are all unimodal and symmetrical distributions
② The t distribution has a lower peak and heavier (higher) tails
③As the degree of freedom increases, the t distribution approaches the standard normal distribution; when ν tends to ∞, the limit distribution of the t distribution is the standard normal distribution.
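These properties can be checked numerically with the standard density formula of the t distribution. This is a sketch using only the standard library; the function name t_pdf and the chosen degrees of freedom are illustrative.

```python
import math
from statistics import NormalDist

def t_pdf(t, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

z = NormalDist()   # standard normal (u) distribution
print("nu      density at 0   density at 3")
for nu in (1, 5, 30, 1000):
    print(f"{nu:<7} {t_pdf(0, nu):.4f}         {t_pdf(3, nu):.4f}")
print(f"normal  {z.pdf(0):.4f}         {z.pdf(3):.4f}")
```

For small ν the density at 0 is below the normal value (lower peak) and the density at 3 is above it (heavier tail); both approach the normal values as ν grows.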
Confidence interval: interval estimate of the population mean μ
Two elements
Accuracy (reliability): determined by the confidence level 1-α; the larger 1-α, the higher the accuracy.
Precision: determined by the interval length; the shorter the interval, the higher the precision.
The 99% confidence interval is more accurate than the 95% confidence interval, while the 95% confidence interval is more precise.
e.g. Distinguish the reference range from the 95% confidence interval of the population mean
[95% confidence interval for the population mean] Intervals constructed by this method include the population mean μ with probability 95%.
If 100 samples are drawn and 100 confidence intervals are calculated, on average 95 of them include μ (the estimate is correct) and 5 do not (the estimate is wrong).
The 95% [confidence] is the probability that the estimate is correct.
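The repeated-sampling interpretation can be sketched with a small simulation. For simplicity it assumes a normal population with known σ, so the interval uses the u critical value; the parameters and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

rng = random.Random(42)                  # fixed seed for repeatability
mu, sigma, n, reps = 100.0, 15.0, 25, 2000
u_crit = NormalDist().inv_cdf(0.975)     # about 1.96
half_width = u_crit * sigma / math.sqrt(n)

covered = 0
for _ in range(reps):
    x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
    # Does this sample's 95% interval include the true mean?
    if x_bar - half_width <= mu <= x_bar + half_width:
        covered += 1

coverage = covered / reps
print(f"{covered} of {reps} intervals include mu (coverage = {coverage:.3f})")
```

The proportion of intervals that include μ should be close to 0.95; any single interval either includes μ or does not.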
Interval estimation of [population mean] (interval estimation of a single normal population mean μ)
Interval estimation of [difference between two population means]
If the confidence interval for the difference includes 0, the difference is not statistically significant.
e.g. Used to compare the efficacy of two drugs
Interval estimation of [difference between two population rates]
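A sketch of the first two interval estimates, using the usual t-based formulas (pooled variance for the difference, assuming equal variances). The summary statistics are hypothetical, and the critical value 2.064 is the two-sided 0.05 value of t for 24 degrees of freedom, supplied by hand as if read from a critical value table.

```python
import math

def ci_mean(x_bar, s, n, t_crit):
    """Confidence interval for one population mean: x_bar +/- t * s / sqrt(n)."""
    se = s / math.sqrt(n)
    return x_bar - t_crit * se, x_bar + t_crit * se

def ci_diff_means(x1, s1, n1, x2, s2, n2, t_crit):
    """Confidence interval for mu1 - mu2 with pooled variance."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    d = x1 - x2
    return d - t_crit * se, d + t_crit * se

T_24 = 2.064   # two-sided critical t value, alpha = 0.05, df = 24

lo, hi = ci_mean(120, 10, 25, T_24)                    # n = 25, df = 24
print(f"95% CI for mu: ({lo:.3f}, {hi:.3f})")

lo_d, hi_d = ci_diff_means(12.0, 4, 13, 10.0, 4, 13, T_24)   # df = 13 + 13 - 2
print(f"95% CI for mu1 - mu2: ({lo_d:.3f}, {hi_d:.3f})")
print("includes 0:", lo_d <= 0 <= hi_d)
```

In the second example the interval includes 0, so the observed difference of 2 is not statistically significant at α = 0.05.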
Hypothesis testing
(1) Basic idea
(2) Basic steps
1. Establish hypotheses and determine test levels
H0: null hypothesis; there is no difference between the populations, and any observed difference is due to sampling error (the difference is not statistically significant)
H1: alternative hypothesis; the populations differ (the difference is statistically significant)
Two-sided: concerned only with whether the two are equal
One-sided: concerned with whether one is greater (or smaller) than the other
Comparison of a sample mean with a known population mean μ0
Comparison of two sample means, whose population means are μ1 and μ2
Test level/significance level α
2. Assuming the null hypothesis is true, select a statistical method and calculate the test statistic. (The observed difference is provisionally attributed to sampling error.)
The test methods here are parametric tests, including the u test, the t test, and analysis of variance, each with its own formula.
For two-sample data, attention should be paid to distinguishing between the data types of [group design] and [paired design].
3. Make statistical inferences based on P values
Determine the P value: compare the statistic with the critical value, look up the critical value table to find the range of P, or calculate P with software
Statistics: the u test yields the u statistic (u value); the t test yields the t statistic (t value); analysis of variance yields the F statistic (F value)
Comparing the absolute value of the statistic with the critical value determines the P value.
If P > α, do not reject H0; if P ≤ α, reject H0 and accept H1.
When α=0.05,
The absolute value of u is compared with the two-sided critical value 1.96 to determine the P value.
If |u| < 1.96, then P > 0.05.
Conversely, if |u| > 1.96, then P < 0.05.
The t value is compared with the t critical value for the relevant degrees of freedom to determine the P value.
If |t| < t critical value, then P > 0.05.
When P > 0.05, the null hypothesis is not rejected; the difference is considered not statistically significant, that is, the data do not show a real difference between the two.
Conversely, if |t| > t critical value, then P < 0.05.
When P < 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted; the difference is considered statistically significant, that is, the two are considered to differ in nature.
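The steps above can be sketched for the simplest case, a two-sided u test of a sample mean against a known population mean μ0 with known σ. The function name u_test and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def u_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-sided u test: returns the u statistic, the P value, and the decision."""
    u = (x_bar - mu0) / (sigma / math.sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(u)))   # two-sided P value
    return u, p, p < alpha                   # True means "reject H0"

# H0: mu = 100; hypothetical sample of n = 100 with sigma = 10.
u1, p1, reject1 = u_test(103, 100, 10, 100)
u2, p2, reject2 = u_test(101, 100, 10, 100)
print(f"u = {u1:.2f}, P = {p1:.4f}, reject H0: {reject1}")
print(f"u = {u2:.2f}, P = {p2:.4f}, reject H0: {reject2}")
```

In the first case |u| = 3 exceeds 1.96, so P < 0.05 and H0 is rejected; in the second |u| = 1 is below 1.96, so P > 0.05 and H0 is not rejected.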
One-sided tests reach significance more easily
That is, for the same data and α, a result that is significant in a one-sided test may not be significant in a two-sided test, but a result that is significant in a two-sided test is also significant in a one-sided test in the observed direction.
However, even P < 0.01 or P < 0.001 does not mean that the difference is large. It only means that we are more confident that a difference exists.
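Both points can be illustrated with the u statistic. This is a sketch under the assumption of a known σ; the values of u, the difference, and the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def p_two_sided(u):
    """Two-sided P value for a u statistic."""
    return 2 * (1 - Z.cdf(abs(u)))

def p_one_sided_upper(u):
    """One-sided P value for H1: the mean is greater than mu0."""
    return 1 - Z.cdf(u)

# Same statistic, different alternatives.
u = 1.8
p_one, p_two = p_one_sided_upper(u), p_two_sided(u)
print(f"u = {u}: one-sided P = {p_one:.4f}, two-sided P = {p_two:.4f}")

# Same small difference (0.5 units, sigma = 10), different sample sizes.
for n in (100, 10_000):
    u_n = 0.5 / (10 / math.sqrt(n))
    print(f"n = {n}: u = {u_n:.1f}, two-sided P = {p_two_sided(u_n):.2e}")
```

With u = 1.8 the one-sided test is significant at α = 0.05 while the two-sided test is not. In the second part the difference stays at 0.5 units, yet P falls far below 0.001 when n is large, so a small P reflects the strength of evidence, not the size of the difference.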
two types of errors
[Test Level] [Probability of Type I Error]
Represented by α
It can be one-tailed or two-tailed; usually 0.05 or 0.10
The test level describes the probability that the test will make a Type I error.
[Type II error probability]
Represented by β
β is one-tailed only; its value can be calculated once the size of the true difference is specified
[Power of the test] 1-β
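How β and the power 1-β can be calculated is sketched below for a one-sided u test with known σ. The true difference delta, σ, and the sample sizes are hypothetical inputs that must be specified before β can be computed.

```python
import math
from statistics import NormalDist

def power_one_sided_u(delta, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided u test when the true difference is delta."""
    z = NormalDist()
    u_alpha = z.inv_cdf(1 - alpha)   # one-sided critical value, about 1.645
    return 1 - z.cdf(u_alpha - delta * math.sqrt(n) / sigma)

for n in (25, 100):
    power = power_one_sided_u(delta=5, sigma=10, n=n)
    print(f"n = {n}: power = {power:.4f}, beta = {1 - power:.4f}")
```

When delta is 0 the function returns α itself, because rejecting H0 is then a Type I error. For a fixed true difference, increasing n raises the power and lowers β.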
Why P<α means rejecting H0
A Type I error is rejecting a null hypothesis that is actually true. The significance level α is the maximum probability of a Type I error that we are willing to tolerate when rejecting H0. The P value is the smallest significance level at which the observed data would lead to rejecting H0. If P > α, the smallest level at which H0 could be rejected exceeds the largest level we tolerate, so H0 cannot be rejected. If P < α, rejecting H0 carries a Type I error risk within the tolerated limit, so H0 is rejected.