Saturday, 15 January 2011

iOS - Issue for Type0 CMap parsing




I am working on iOS PDF scanning using PDFKitten. I am trying to extract text for searching in a PDF that has a Type0 font, but I am not able to extract the text. Some entries in the ToUnicode CMap are missing, and others are misinterpreted. Can there be an issue in the parsing of the CMap? If I don't have a complete CMap, how should I derive it? Can I take external entries for these missing ToUnicode entries?

Thanks

The PDF specification offers hints on how to extract text content in section 9.10.2, "Mapping Character Codes to Unicode Values":

If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

a) Map the character code to a character identifier (CID) according to the font’s CMap.

b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
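The priority order of these three methods can be sketched roughly as follows. This is only an illustration, not a complete implementation: the font is assumed to be available as a plain dictionary, the function names and return labels are made up for this example, and the Differences array and the predefined CMaps of Table 118 are ignored for brevity.

```python
# Hypothetical helper names; a font is modelled as a plain dict here.
PREDEFINED_SIMPLE_ENCODINGS = {"MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"}
CJK_ORDERINGS = {"GB1", "CNS1", "Japan1", "Korea1"}


def ucs2_cmap_name(registry: str, ordering: str) -> str:
    """Step (c): build the name of the CID-to-Unicode CMap."""
    return f"{registry}-{ordering}-UCS2"


def pick_mapping_method(font: dict) -> str:
    """Return a label for the first method of 9.10.2 that applies."""
    # 1. A ToUnicode CMap always takes priority.
    if "ToUnicode" in font:
        return "ToUnicode"
    is_composite = font.get("Subtype") == "Type0"
    # 2. Simple font with a predefined encoding: go through glyph names.
    if not is_composite and font.get("Encoding") in PREDEFINED_SIMPLE_ENCODINGS:
        return "glyph-names"
    # 3. Composite font with one of the Adobe CJK character collections.
    info = font.get("CIDSystemInfo", {})
    if (is_composite and info.get("Registry") == "Adobe"
            and info.get("Ordering") in CJK_ORDERINGS):
        return ucs2_cmap_name(info["Registry"], info["Ordering"])
    # Nothing in the PDF itself helps.
    return "none"
```

A font with an Identity ordering and no ToUnicode entry ends up in the last branch, which is the situation where only sources outside the PDF format can help.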

Furthermore, section 9.10.1 indicates:

An ActualText entry for a structure element or marked-content sequence (see 14.9.4, "Replacement Text") may be used to specify the text content directly.

According to the specification, if these methods fail to produce a Unicode value, there is no way to determine what the character code represents. This is not entirely true; for example, embedded font programs may contain their own mappings to Unicode. Such additional sources of information are beyond the actual PDF format, though.

EDIT

The OP provided the file in question, iphoneconfigurationprofileref-2013-gm.pdf, via a mail service and indicated:

I am getting the problem with every glyph.

The issue is that the ranges present in the PDF are not complete and are different from the Adobe-Identity CMap file.

If I use the CMap embedded in the PDF, there is no mapping for every character, and if I use the Adobe one, the mappings are wrong.

As the OP didn't point to a specific mapping or glyph, let's look at the title page as an example.

The content stream contains these operations relevant to text extraction:

    BT 50 0 0 50 60 669.225 Tm /g1 1 Tf <0025> Tj ET
    BT 50 0 0 50 87.6 669.225 Tm /g1 1 Tf <005100500048004b004900570054> Tj ET
    BT 50 0 0 50 238 669.225 Tm /g1 1 Tf <0043> Tj ET
    BT 50 0 0 50 261.45 669.225 Tm /g1 1 Tf <0056004b00510050> Tj ET
    BT 50 0 0 50 355.4 669.225 Tm /g1 1 Tf <0032> Tj ET
    BT 50 0 0 50 380.75 669.225 Tm /g1 1 Tf <0054> Tj ET
    BT 50 0 0 50 396.55 669.225 Tm /g1 1 Tf <00510048004b004e0047> Tj ET
    BT 50 0 0 50 60 609.225 Tm /g1 1 Tf <0034> Tj ET
    BT 50 0 0 50 86.65 609.225 Tm /g1 1 Tf <00470048> Tj ET
    BT 50 0 0 50 125.05 609.225 Tm /g1 1 Tf <00470054> Tj ET
    BT 50 0 0 50 165.45 609.225 Tm /g1 1 Tf <004700500045> Tj ET
    BT 50 0 0 50 238.9 609.225 Tm /g1 1 Tf <0047> Tj ET
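To get at the character codes programmatically, the hex string operands of the Tj operators have to be collected first. The following is a minimal sketch that assumes the stream only uses hex strings with Tj and two-byte codes, as this page does. A real extractor needs a proper content stream tokenizer (literal strings, TJ arrays, escapes); the function name shown_codes is made up for this example.

```python
import re

# A hex string operand followed by the Tj (show text) operator.
TJ_HEX = re.compile(r"<([0-9A-Fa-f\s]*)>\s*Tj")


def shown_codes(content: str, code_bytes: int = 2) -> list:
    """Return one list of character codes per Tj operation."""
    step = 2 * code_bytes  # hex digits per character code
    runs = []
    for match in TJ_HEX.finditer(content):
        digits = "".join(match.group(1).split())  # drop whitespace inside <...>
        runs.append([int(digits[i:i + step], 16)
                     for i in range(0, len(digits), step)])
    return runs


sample = ("BT 50 0 0 50 87.6 669.225 Tm /g1 1 Tf "
          "<005100500048004b004900570054> Tj ET")
print([hex(code) for code in shown_codes(sample)[0]])
```

The two-byte code width is taken as given here; in general it follows from the codespace ranges of the font's encoding CMap.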

So we need to look at font g1 on page 1. Fortunately, that font has a ToUnicode map:

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <0000><ffff>
    endcodespacerange
    1 beginbfchar
    <000f><002d 2010>
    endbfchar
    15 beginbfrange
    <0002><0002><0020>
    <0004><000c><0022>
    <000e><000e><002c>
    <0010><001d><002e>
    <001f><001f><003d>
    <0022><0032><0040>
    <0034><003d><0052>
    <003f><003f><005d>
    <0041><0041><005f>
    <0043><005c><0061>
    <005e><005e><007c>
    <008a><008a><00a9>
    <00a4><00a4><2014>
    <00a5><00a6><201c>
    <00a8><00a8><2019>
    endbfrange
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
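Since the question is whether the CMap parsing itself may be at fault, here is a rough sketch of how the bfchar and bfrange sections of such a map could be read. It is regex-based and makes simplifying assumptions: it does not handle the array form of bfrange (a range followed by [ ... ]), and it treats every bfrange destination as a single code point. The name parse_tounicode is made up for this example.

```python
import re

HEX = r"<([0-9A-Fa-f\s]+)>"


def _digits(text: str) -> str:
    """Remove whitespace that may appear inside a hex string."""
    return "".join(text.split())


def parse_tounicode(cmap_text: str):
    """Return (singles, ranges) from the bfchar and bfrange sections."""
    singles = {}  # character code -> Unicode string
    ranges = []   # (first code, last code, first Unicode code point)
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            # bfchar destinations are UTF-16BE and may hold several units.
            singles[int(_digits(src), 16)] = (
                bytes.fromhex(_digits(dst)).decode("utf-16-be"))
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            ranges.append((int(_digits(lo), 16),
                           int(_digits(hi), 16),
                           int(_digits(dst), 16)))
    return singles, ranges


sample = ("1 beginbfchar <000f><002d 2010> endbfchar "
          "2 beginbfrange <0022><0032><0040> <0043><005c><0061> endbfrange")
singles, ranges = parse_tounicode(sample)
print(singles, ranges)
```

Each bfrange triple means: codes from the first value to the second map to consecutive Unicode values starting at the third. A parser that reads only the first code of each range, or that mixes up bfchar and bfrange entries, produces exactly the kind of missing or wrong mappings described in the question.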

Trying to apply that map, one gets (based on the explicit beginbfrange...endbfrange entries):

    <0025> Tj                          % "C" = <0043> due to <0022><0032><0040>
    <005100500048004b004900570054> Tj  % "onfigur" = <006f006e00660069006700750072> due to <0043><005c><0061>
    <0043> Tj                          % "a" = <0061> due to <0043><005c><0061>
    <0056004b00510050> Tj              % "tion" = <00740069006f006e> due to <0043><005c><0061>
    <0032> Tj                          % "P" = <0050> due to <0022><0032><0040>
    <0054> Tj                          % "r" = <0072> due to <0043><005c><0061>
    <00510048004b004e0047> Tj          % "ofile" = <006f00660069006c0065> due to <0043><005c><0061>
    <0034> Tj                          % "R" = <0052> due to <0034><003d><0052>
    <00470048> Tj                      % "ef" = <00650066> due to <0043><005c><0061>
    <00470054> Tj                      % "er" = <00650072> due to <0043><005c><0061>
    <004700500045> Tj                  % "enc" = <0065006e0063> due to <0043><005c><0061>
    <0047> Tj                          % "e" = <0065> due to <0043><005c><0061>
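The same lookup can be reproduced in a few lines. This sketch hard-codes only the three bfrange entries needed for the title page and assumes two-byte codes; the names RANGES and decode are made up for this example.

```python
# (first code, last code, first Unicode code point), copied from the
# three bfrange entries of the ToUnicode map that the title page uses.
RANGES = [
    (0x0022, 0x0032, 0x0040),
    (0x0034, 0x003D, 0x0052),
    (0x0043, 0x005C, 0x0061),
]


def decode(hex_string: str, ranges=RANGES) -> str:
    """Map each two-byte code of a hex string through the bfrange entries."""
    out = []
    for i in range(0, len(hex_string), 4):
        code = int(hex_string[i:i + 4], 16)
        for lo, hi, dst in ranges:
            if lo <= code <= hi:
                # The offset inside the range carries over to the target.
                out.append(chr(dst + (code - lo)))
                break
        else:
            out.append("\ufffd")  # no mapping found
    return "".join(out)


title = ["0025", "005100500048004b004900570054", "0043", "0056004b00510050",
         "0032", "0054", "00510048004b004e0047", "0034", "00470048",
         "00470054", "004700500045", "0047"]
print("".join(decode(s) for s in title))  # prints ConfigurationProfileReference
```

The output contains no spaces because the stream contains no space characters. The gaps between the words come only from the positions set by the Tm operations, so a text extractor has to infer them from the coordinates.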

This matches the appearance of the page: the title reads "Configuration Profile Reference".

