pconline_1.0.0: 商品爬取策略.html

File 商品爬取策略.html, 18.5 KB (added by kongfanqi, 8 years ago)

Product Crawling Strategy

1<!doctype html>
2<html>
3<head>
4<meta charset='UTF-8'><meta name='viewport' content='width=device-width initial-scale=1'>
5<title>商品爬取策略.md</title><link href='http://fonts.googleapis.com/css?family=Open+Sans:400italic,700italic,700,400&subset=latin,latin-ext' rel='stylesheet' type='text/css' /><style type='text/css'>html {overflow-x: initial !important;}:root { --bg-color: #ffffff; --text-color: #333333; --code-block-bg-color: inherit; }
6html { font-size: 14px; background-color: var(--bg-color); color: var(--text-color); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; -webkit-font-smoothing: antialiased; }
7body { margin: 0px; padding: 0px; height: auto; bottom: 0px; top: 0px; left: 0px; right: 0px; font-size: 1rem; line-height: 1.42857; overflow-x: hidden; background: inherit; }
8a:active, a:hover { outline: 0px; }
9.in-text-selection, ::selection { background: rgb(181, 214, 252); text-shadow: none; }
10#write { margin: 0px auto; height: auto; width: inherit; word-break: normal; word-wrap: break-word; position: relative; padding-bottom: 70px; white-space: pre-wrap; overflow-x: visible; contain: layout paint; }
11.for-image #write { padding-left: 8px; padding-right: 8px; }
12body.typora-export { padding-left: 30px; padding-right: 30px; }
13@media screen and (max-width: 500px) {
14  body.typora-export { padding-left: 0px; padding-right: 0px; }
15  .CodeMirror-sizer { margin-left: 0px !important; }
16  .CodeMirror-gutters { display: none !important; }
17}
18.typora-export #write { margin: 0px auto; }
19#write > p:first-child, #write > ul:first-child, #write > ol:first-child, #write > pre:first-child, #write > blockquote:first-child, #write > div:first-child, #write > table:first-child { margin-top: 30px; }
20#write li > table:first-child { margin-top: -20px; }
21img { max-width: 100%; vertical-align: middle; }
22input, button, select, textarea { color: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; font-size: inherit; line-height: inherit; font-family: inherit; }
23input[type="checkbox"], input[type="radio"] { line-height: normal; padding: 0px; }
24::before, ::after, * { box-sizing: border-box; }
25#write p, #write h1, #write h2, #write h3, #write h4, #write h5, #write h6, #write div, #write pre { width: inherit; }
26#write p, #write h1, #write h2, #write h3, #write h4, #write h5, #write h6 { position: relative; }
27h1 { font-size: 2rem; }
28h2 { font-size: 1.8rem; }
29h3 { font-size: 1.6rem; }
30h4 { font-size: 1.4rem; }
31h5 { font-size: 1.2rem; }
32h6 { font-size: 1rem; }
33p { -webkit-margin-before: 1rem; -webkit-margin-after: 1rem; -webkit-margin-start: 0px; -webkit-margin-end: 0px; }
34.typora-export p { white-space: normal; }
35.mathjax-block { margin-top: 0px; margin-bottom: 0px; -webkit-margin-before: 0rem; -webkit-margin-after: 0rem; }
36.hidden { display: none; }
37.md-blockmeta { color: rgb(204, 204, 204); font-weight: bold; font-style: italic; }
38a { cursor: pointer; }
39sup.md-footnote { padding: 2px 4px; background-color: rgba(238, 238, 238, 0.7); color: rgb(85, 85, 85); border-radius: 4px; }
40#write input[type="checkbox"] { cursor: pointer; width: inherit; height: inherit; margin: 4px 0px 0px; }
41tr { break-inside: avoid; break-after: auto; }
42thead { display: table-header-group; }
43table { border-collapse: collapse; border-spacing: 0px; width: 100%; overflow: auto; break-inside: auto; text-align: left; }
44table.md-table td { min-width: 80px; }
45.CodeMirror-gutters { border-right: 0px; background-color: inherit; }
46.CodeMirror { text-align: left; }
47.CodeMirror-placeholder { opacity: 0.3; }
48.CodeMirror pre { padding: 0px 4px; }
49.CodeMirror-lines { padding: 0px; }
50div.hr:focus { cursor: none; }
51pre { white-space: pre-wrap; }
52.CodeMirror-gutters { margin-right: 4px; }
53.md-fences { font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; overflow: visible; white-space: pre; background: var(--code-block-bg-color); position: relative !important; }
54.md-diagram-panel { width: 100%; margin-top: 10px; text-align: center; padding-top: 0px; padding-bottom: 8px; overflow-x: auto; }
55.md-fences .CodeMirror.CodeMirror-wrap { top: -1.6em; margin-bottom: -1.6em; }
56.md-fences.mock-cm { white-space: pre-wrap; }
57.show-fences-line-number .md-fences { padding-left: 0px; }
58.show-fences-line-number .md-fences.mock-cm { padding-left: 40px; }
59.footnotes { opacity: 0.8; font-size: 0.9rem; padding-top: 1em; padding-bottom: 1em; }
60.footnotes + .footnotes { margin-top: -1em; }
61.md-reset { margin: 0px; padding: 0px; border: 0px; outline: 0px; vertical-align: top; background: transparent; text-decoration: none; text-shadow: none; float: none; position: static; width: auto; height: auto; white-space: nowrap; cursor: inherit; -webkit-tap-highlight-color: transparent; line-height: normal; font-weight: normal; text-align: left; box-sizing: content-box; direction: ltr; }
62li div { padding-top: 0px; }
63blockquote { margin: 1rem 0px; }
64li p, li .mathjax-block { margin: 0.5rem 0px; }
65li { margin: 0px; position: relative; }
66blockquote > :last-child { margin-bottom: 0px; }
67blockquote > :first-child { margin-top: 0px; }
68.footnotes-area { color: rgb(136, 136, 136); margin-top: 0.714rem; padding-bottom: 0.143rem; }
69@media print {
70  html, body { border: 1px solid transparent; height: 99%; break-after: avoid; break-before: avoid; }
71  .typora-export * { -webkit-print-color-adjust: exact; }
72  h1, h2, h3, h4, h5, h6 { break-after: avoid-page; orphans: 2; }
73  p { orphans: 4; }
74  html.blink-to-pdf { font-size: 13px; }
75  .typora-export #write { padding-left: 1cm; padding-right: 1cm; padding-bottom: 0px; break-after: avoid; }
76  .typora-export #write::after { height: 0px; }
77  @page { margin: 20mm 0mm; }
78}
79.footnote-line { margin-top: 0.714em; font-size: 0.7em; }
80a img, img a { cursor: pointer; }
81pre.md-meta-block { font-size: 0.8rem; min-height: 2.86rem; white-space: pre-wrap; background: rgb(204, 204, 204); display: block; overflow-x: hidden; }
82p .md-image:only-child { display: inline-block; width: 100%; text-align: center; }
83#write .MathJax_Display { margin: 0.8em 0px 0px; }
84.mathjax-block { white-space: pre; overflow: hidden; width: 100%; }
85p + .mathjax-block { margin-top: -1.143rem; }
86.mathjax-block:not(:empty)::after { display: none; }
87[contenteditable="true"]:active, [contenteditable="true"]:focus { outline: none; box-shadow: none; }
88.task-list { list-style-type: none; }
89.task-list-item { position: relative; padding-left: 1em; }
90.task-list-item input { position: absolute; top: 0px; left: 0px; }
91.math { font-size: 1rem; }
92.md-toc { min-height: 3.58rem; position: relative; font-size: 0.9rem; border-radius: 10px; }
93.md-toc-content { position: relative; margin-left: 0px; }
94.md-toc::after, .md-toc-content::after { display: none; }
95.md-toc-item { display: block; color: rgb(65, 131, 196); }
96.md-toc-item a { text-decoration: none; }
97.md-toc-inner:hover { }
98.md-toc-inner { display: inline-block; cursor: pointer; }
99.md-toc-h1 .md-toc-inner { margin-left: 0px; font-weight: bold; }
100.md-toc-h2 .md-toc-inner { margin-left: 2em; }
101.md-toc-h3 .md-toc-inner { margin-left: 4em; }
102.md-toc-h4 .md-toc-inner { margin-left: 6em; }
103.md-toc-h5 .md-toc-inner { margin-left: 8em; }
104.md-toc-h6 .md-toc-inner { margin-left: 10em; }
105@media screen and (max-width: 48em) {
106  .md-toc-h3 .md-toc-inner { margin-left: 3.5em; }
107  .md-toc-h4 .md-toc-inner { margin-left: 5em; }
108  .md-toc-h5 .md-toc-inner { margin-left: 6.5em; }
109  .md-toc-h6 .md-toc-inner { margin-left: 8em; }
110}
111a.md-toc-inner { font-size: inherit; font-style: inherit; font-weight: inherit; line-height: inherit; }
112.footnote-line a:not(.reversefootnote) { color: inherit; }
113.md-attr { display: none; }
114.md-fn-count::after { content: "."; }
115.md-tag { opacity: 0.5; }
116.md-comment { color: rgb(162, 127, 3); opacity: 0.8; font-family: monospace; }
117code { text-align: left; }
118h1 .md-tag, h2 .md-tag, h3 .md-tag, h4 .md-tag, h5 .md-tag, h6 .md-tag { font-weight: initial; opacity: 0.35; }
119a.md-print-anchor { border-width: initial !important; border-style: none !important; border-color: initial !important; display: inline-block !important; position: absolute !important; width: 1px !important; right: 0px !important; outline: none !important; background: transparent !important; text-decoration: initial !important; text-shadow: initial !important; }
120.md-inline-math .MathJax_SVG .noError { display: none !important; }
121.mathjax-block .MathJax_SVG_Display { text-align: center; margin: 1em 0em; position: relative; text-indent: 0px; max-width: none; max-height: none; min-height: 0px; min-width: 100%; width: auto; display: block !important; }
122.MathJax_SVG_Display, .md-inline-math .MathJax_SVG_Display { width: auto; margin: inherit; display: inline-block !important; }
123.MathJax_SVG .MJX-monospace { font-family: monospace; }
124.MathJax_SVG .MJX-sans-serif { font-family: sans-serif; }
125.MathJax_SVG { display: inline; font-style: normal; font-weight: normal; line-height: normal; zoom: 90%; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; }
126.MathJax_SVG * { transition: none; }
127.md-diagram-panel > svg { max-width: 100%; }
128[lang="flow"] svg, [lang="mermaid"] svg { max-width: 100%; }
129
130
131:root { --side-bar-bg-color: #fafafa; --control-text-color: #777; }
132@font-face { font-family: "Open Sans"; font-style: normal; font-weight: normal; src: local("Open Sans Regular"), url("./github/400.woff") format("woff"); }
133@font-face { font-family: "Open Sans"; font-style: italic; font-weight: normal; src: local("Open Sans Italic"), url("./github/400i.woff") format("woff"); }
134@font-face { font-family: "Open Sans"; font-style: normal; font-weight: bold; src: local("Open Sans Bold"), url("./github/700.woff") format("woff"); }
135@font-face { font-family: "Open Sans"; font-style: italic; font-weight: bold; src: local("Open Sans Bold Italic"), url("./github/700i.woff") format("woff"); }
136html { font-size: 16px; }
137body { font-family: "Microsoft YaHei", "Open Sans", "Clear Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; color: rgb(51, 51, 51); line-height: 1.6; }
138#write { max-width: 860px; margin: 0px auto; padding: 20px 30px 100px; }
139#write > ul:first-child, #write > ol:first-child { margin-top: 30px; }
140body > :first-child { margin-top: 0px !important; }
141body > :last-child { margin-bottom: 0px !important; }
142a { color: rgb(65, 131, 196); }
143h1, h2, h3, h4, h5, h6 { position: relative; margin-top: 1rem; margin-bottom: 1rem; font-weight: bold; line-height: 1.4; cursor: text; }
144h1:hover a.anchor, h2:hover a.anchor, h3:hover a.anchor, h4:hover a.anchor, h5:hover a.anchor, h6:hover a.anchor { text-decoration: none; }
145h1 tt, h1 code { font-size: inherit; }
146h2 tt, h2 code { font-size: inherit; }
147h3 tt, h3 code { font-size: inherit; }
148h4 tt, h4 code { font-size: inherit; }
149h5 tt, h5 code { font-size: inherit; }
150h6 tt, h6 code { font-size: inherit; }
151h1 { padding-bottom: 0.3em; font-size: 2.25em; line-height: 1.2; border-bottom: 1px solid rgb(238, 238, 238); }
152h2 { padding-bottom: 0.3em; font-size: 1.75em; line-height: 1.225; border-bottom: 1px solid rgb(238, 238, 238); }
153h3 { font-size: 1.5em; line-height: 1.43; }
154h4 { font-size: 1.25em; }
155h5 { font-size: 1em; }
156h6 { font-size: 1em; color: rgb(119, 119, 119); }
157p, blockquote, ul, ol, dl, table { margin: 0.8em 0px; }
158li > ol, li > ul { margin: 0px; }
159hr { height: 4px; padding: 0px; margin: 16px 0px; background-color: rgb(231, 231, 231); border-width: 0px 0px 1px; border-style: none none solid; border-top-color: initial; border-right-color: initial; border-left-color: initial; border-image: initial; overflow: hidden; box-sizing: content-box; border-bottom-color: rgb(221, 221, 221); }
160body > h2:first-child { margin-top: 0px; padding-top: 0px; }
161body > h1:first-child { margin-top: 0px; padding-top: 0px; }
162body > h1:first-child + h2 { margin-top: 0px; padding-top: 0px; }
163body > h3:first-child, body > h4:first-child, body > h5:first-child, body > h6:first-child { margin-top: 0px; padding-top: 0px; }
164a:first-child h1, a:first-child h2, a:first-child h3, a:first-child h4, a:first-child h5, a:first-child h6 { margin-top: 0px; padding-top: 0px; }
165h1 p, h2 p, h3 p, h4 p, h5 p, h6 p { margin-top: 0px; }
166li p.first { display: inline-block; }
167ul, ol { padding-left: 30px; }
168ul:first-child, ol:first-child { margin-top: 0px; }
169ul:last-child, ol:last-child { margin-bottom: 0px; }
170blockquote { border-left: 4px solid rgb(221, 221, 221); padding: 0px 15px; color: rgb(119, 119, 119); }
171blockquote blockquote { padding-right: 0px; }
172table { padding: 0px; word-break: initial; }
173table tr { border-top: 1px solid rgb(204, 204, 204); margin: 0px; padding: 0px; }
174table tr:nth-child(2n) { background-color: rgb(248, 248, 248); }
175table tr th { font-weight: bold; border: 1px solid rgb(204, 204, 204); text-align: left; margin: 0px; padding: 6px 13px; }
176table tr td { border: 1px solid rgb(204, 204, 204); text-align: left; margin: 0px; padding: 6px 13px; }
177table tr th:first-child, table tr td:first-child { margin-top: 0px; }
178table tr th:last-child, table tr td:last-child { margin-bottom: 0px; }
179.CodeMirror-gutters { border-right: 1px solid rgb(221, 221, 221); }
180.md-fences, code, tt { border: 1px solid rgb(221, 221, 221); background-color: rgb(248, 248, 248); border-radius: 3px; font-family: "Microsoft YaHei", Consolas, "Liberation Mono", Courier, monospace; padding: 2px 4px 0px; font-size: 0.9em; }
181.md-fences { margin-bottom: 15px; margin-top: 15px; padding: 8px 1em 6px; }
182.task-list { padding-left: 0px; }
183.task-list-item { padding-left: 32px; }
184.task-list-item input { top: 3px; left: 8px; }
185@media screen and (min-width: 914px) {
186}
187@media print {
188  html { font-size: 13px; }
189  table, pre { break-inside: avoid; }
190  pre { word-wrap: break-word; }
191}
192.md-fences { background-color: rgb(248, 248, 248); }
193#write pre.md-meta-block { padding: 1rem; font-size: 85%; line-height: 1.45; background-color: rgb(247, 247, 247); border: 0px; border-radius: 3px; color: rgb(119, 119, 119); margin-top: 0px !important; }
194.mathjax-block > .code-tooltip { bottom: 0.375rem; }
195#write > h3.md-focus::before { left: -1.5625rem; top: 0.375rem; }
196#write > h4.md-focus::before { left: -1.5625rem; top: 0.285714rem; }
197#write > h5.md-focus::before { left: -1.5625rem; top: 0.285714rem; }
198#write > h6.md-focus::before { left: -1.5625rem; top: 0.285714rem; }
199.md-image > .md-meta { border: 1px solid rgb(221, 221, 221); border-radius: 3px; font-family: "Microsoft YaHei", Consolas, "Liberation Mono", Courier, monospace; padding: 2px 4px 0px; font-size: 0.9em; color: inherit; }
200.md-tag { color: inherit; }
201.md-toc { margin-top: 20px; padding-bottom: 20px; }
202.sidebar-tabs { border-bottom: none; }
203#typora-quick-open { border: 1px solid rgb(221, 221, 221); background-color: rgb(248, 248, 248); }
204#typora-quick-open-item { background-color: rgb(250, 250, 250); border-color: rgb(254, 254, 254) rgb(229, 229, 229) rgb(229, 229, 229) rgb(238, 238, 238); border-style: solid; border-width: 1px; }
205#md-notification::before { top: 10px; }
206.on-focus-mode blockquote { border-left-color: rgba(85, 85, 85, 0.12); }
207header, .context-menu, .megamenu-content, footer { font-family: "Microsoft YaHei", "Segoe UI", Arial, sans-serif; }
208.file-node-content:hover .file-node-icon, .file-node-content:hover .file-node-open-state { visibility: visible; }
209.mac-seamless-mode #typora-sidebar { background-color: var(--side-bar-bg-color); }
210.md-lang { color: rgb(180, 101, 77); }
211
212
213
214
215
216
217</style>
218</head>
219<body class='typora-export' >
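220<div  id='write'  class = 'is-node'><h2><a name='header-n0' class='md-header-anchor '></a>Product Crawling Strategy</h2><h4><a name='header-n2' class='md-header-anchor '></a>Step 1: Determine which e-commerce platforms to crawl</h4><ol start='' ><li>At this stage, focus on crawling JD.com, Tmall, Suning, and GOME. For the specific platforms and their weights, see "Product Categories &amp; E-commerce Platforms".</li></ol><h4><a name='header-n7' class='md-header-anchor '></a>Step 2: Determine the internal product categories</h4><ol start='' ><li>Internal categories follow those of SMZDM ("What's Worth Buying"), defined at the first, second, and third levels. See "Product Categories &amp; E-commerce Platforms".</li></ol><h4><a name='header-n12' class='md-header-anchor '></a>Step 3: Crawl products</h4><ol start='' ><li><p>Search each e-commerce platform directly with the third-level category name, collect the product data from the search result pages, and automatically file it under the corresponding category.</p></li><li><p><strong>Issues still to be settled by experiment</strong></p><p>(1) Some platforms map a search keyword to several categories. For example, searching for "menswear" returns the apparel category as well as the books and audio/video category. For now, the first category option is taken as the default.</p><p>(2) E-commerce search relies heavily on smart matching, so the results include weakly related products. Platforms rank such results lower, however, so the page range to crawl will need to be determined later for each category on each platform.</p></li></ol><h4><a name='header-n24' class='md-header-anchor '></a>Step 4: Collect product attributes</h4><ol start='' ><li>Search result pages automatically show the attributes of the matching category as filter options. When collecting product data, also record these attribute names and attribute values, and map them to the corresponding products.</li></ol><h4><a name='header-n29' class='md-header-anchor '></a>Step 5: Collect detailed product information</h4><p>Determine the product fields to crawl. See "Product Crawl Field Table".</p><p>How the App processes and displays the field data</p><h4><a name='header-n34' class='md-header-anchor '></a>Step 6: Identify the same product across platforms</h4><ol start='' ><li><p>Matching rules</p><p>(1) Filter by category first</p><p>(2) Check whether the brand is the same</p><p>(3) Compare the text similarity of the product description (product title)</p></li><li><p>Issues to research</p><p>(1) Preliminary research on word segmentation is needed, covering word splitting, part-of-speech weighting, semantic recognition, and related techniques.</p><p>(2) An e-commerce industry lexicon is needed to resolve the meaning of words whose semantics are already established. Such a lexicon has to be purchased, and no viable source has been found so far.</p></li></ol><h4><a name='header-n52' class='md-header-anchor '></a>Step 7: Update products</h4><ol start='' ><li>The product update frequency is to be determined by the developers.</li></ol><p><br/></p><h2><a name='header-n59' class='md-header-anchor '></a><strong>Current project execution plan (tentative)</strong></h2><ol start='' ><li>After discussion with the developers, they will first trial-crawl data for some popular categories on the key platforms to settle the technical approach, and later apply it to all platforms and all categories.</li><li>Developers have to define rules separately for every category on every platform. There are currently 688 third-level categories, and each category takes at least 1 day from defining its rules to completing the first crawl.</li><li>Before the 3.15 launch, only the popular platforms and popular categories can be handled first, with the rest completed gradually afterwards (a time constraint). Management does not yet understand this point, and communication is difficult.</li></ol></div>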
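<p>The three matching rules in Step 6 (category first, then brand, then title similarity) can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the <code>Product</code> fields, the function names, and the 0.8 threshold are all assumptions, and the character-level ratio from Python's standard <code>difflib</code> only stands in for the word-segmentation approach that still has to be researched.</p>

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Product:
    """Hypothetical record for one crawled listing (field names are assumed)."""
    platform: str   # e.g. "JD.com", "Tmall"
    category: str   # internal third-level category
    brand: str
    title: str


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two titles, in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_same_product(a: Product, b: Product, threshold: float = 0.8) -> bool:
    """Apply the Step 6 rules in order; the threshold is an assumed value."""
    # Only listings from different platforms are candidates for matching.
    if a.platform == b.platform:
        return False
    # Rule 1: filter by category first.
    if a.category != b.category:
        return False
    # Rule 2: the brand must be the same (compared case-insensitively).
    if a.brand.strip().lower() != b.brand.strip().lower():
        return False
    # Rule 3: the titles must be similar enough.
    return title_similarity(a.title, b.title) >= threshold


if __name__ == "__main__":
    jd = Product("JD.com", "Smartphones", "Apple", "Apple iPhone 8 64GB Silver")
    tmall = Product("Tmall", "Smartphones", "apple", "Apple iPhone 8 64GB Silver")
    print(is_same_product(jd, tmall))
```

<p>Checking the cheap category and brand rules before the text comparison keeps the similarity computation limited to plausible candidate pairs. The threshold would need tuning per category once real crawl data is available.</p>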
221</body>
222</html>