|
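Many people use the terms bar chart, histogram, and column chart interchangeably and find it hard to say how they differ. In ggplot2 there are really only two functions involved, geom_bar() and geom_histogram(), which correspond to the bar chart (some prefer to call it a column chart) and the histogram, respectively. So what is the difference between the two functions?

The diamonds dataset shipped with the ggplot2 package serves as test data: https://ggplot2./reference/diamonds.html
A dataset containing the prices and other attributes of almost 54,000 diamonds.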
pacman::p_load(ggplot2)  # p_load() comes from the pacman package; library(ggplot2) works too
data(diamonds)
Plot the price distribution of the first 10 diamonds:
TestData <- diamonds[1:10, ]  # use the first 10 rows

Bar chart (Barplot)
In ggplot2, the layer function geom_bar() draws bar charts: https://ggplot2./reference/geom_bar.html
geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead.
ggplot(TestData, aes(x = price)) + geom_bar()

In the geom_bar() plot of the first 10 diamonds, the x axis is price and the y axis is the number of diamonds at each price, so the bar heights sum to 10. Now plot the full diamonds data:
nrow(diamonds) # 53940
ggplot(diamonds, aes(x = price)) + geom_bar()

The prices along the x axis are packed far too densely, because every distinct price gets its own bar. The reason:
geom_bar() uses stat_count() by default: it counts the number of cases at each x position.
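One way to see what stat_count() computed is to pull the layer data out of the plot with layer_data(). This is only a sketch for inspection; the variable names `p_bar` and `bars` are illustrative.

```r
library(ggplot2)

p_bar <- ggplot(diamonds, aes(x = price)) + geom_bar()

# Data frame that stat_count() produced for the first (and only) layer
bars <- layer_data(p_bar)

nrow(bars)                      # number of bars drawn
length(unique(diamonds$price))  # number of distinct prices
sum(bars$count)                 # total number of diamonds counted
```

The number of bars matching the number of distinct prices is what makes the x axis so crowded.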
Looking at the source of geom_bar():
function (mapping = NULL, data = NULL, stat = 'count', position = 'stack',
..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE)
So the default is stat = 'count': geom_bar() counts the cases at each distinct x value (each price). To group the prices into intervals instead, for example to count the diamonds in each price range of width 100, the binwidth parameter can be set:
ggplot(diamonds, aes(x = price)) + geom_bar(binwidth = 100)
The plot is produced as expected: 
But it comes with the following warning:
Warning message: geom_bar() no longer has a binwidth parameter. Please use geom_histogram() instead.
In other words, geom_bar() is dropping support for the binwidth parameter. It still works in the ggplot2 version used here, but geom_histogram() is the recommended way to draw plots that bin the data into ranges.

Histogram
https://ggplot2./reference/geom_histogram.html
geom_histogram() for continuous data.
Visualise the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin.
geom_histogram() is an alias for geom_bar() plus stat_bin():
stat_bin(), which bins data in ranges and counts the cases in each range. It differs from stat_count(), which counts the number of cases at each x position (without binning into ranges). stat_bin() requires continuous x data, whereas stat_count() can be used for both discrete and continuous x data.
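The contrast between counting at each x position and counting within ranges can be sketched in base R, without any plotting. The bin edges at multiples of 100 are an illustrative assumption and need not coincide with the edges stat_bin() picks by default.

```r
library(ggplot2)  # only needed here for the diamonds dataset

# stat_count()-style: one count per distinct price
per_price <- table(diamonds$price)

# stat_bin()-style: one count per price range of width 100
# (edges at multiples of 100 are an assumption made for this sketch)
upper <- ceiling(max(diamonds$price) / 100) * 100
breaks <- seq(0, upper, by = 100)
per_bin <- table(cut(diamonds$price, breaks = breaks))

length(per_price)  # number of distinct prices
length(per_bin)    # number of bins
sum(per_bin)       # every diamond falls into exactly one bin
```

Both tables account for the same diamonds; they differ only in how finely the x axis is divided.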
So, unlike the stat_count() that geom_bar() uses by default, which counts at each individual x value, geom_histogram() uses stat_bin() by default: it splits the data into ranges and counts the observations in each range (bin). The source of geom_histogram() shows that the default is stat = 'bin', and that the data are split into 30 bins by default:
ggplot(diamonds, aes(x = price)) + geom_histogram()
stat_bin() using bins=30. Pick better value with binwidth

Likewise, the bin width can be set to 100:
ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 100)
The resulting plot matches the earlier geom_bar(binwidth = 100) result. Conversely, setting stat = 'count' in geom_histogram() gives the same plot as geom_bar():
ggplot(diamonds, aes(x = price)) + geom_histogram(stat = 'count')

|